1. 11 December 2012 (1 commit)
    • mm: numa: Add fault driven placement and migration · cbee9f88
      Committed by Peter Zijlstra
      NOTE: This patch is based on "sched, numa, mm: Add fault driven
      	placement and migration policy" but as it throws away all the policy
      	to just leave a basic foundation I had to drop the signed-offs-by.
      
      This patch creates a bare-bones method for setting PTEs pte_numa in the
      context of the scheduler; when those PTEs are faulted later, the pages
      will be placed on the node the CPU is running on. In itself this does
      nothing useful, but any placement policy will fundamentally depend on
      receiving hints on placement from fault context and doing something
      intelligent about it.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
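      A minimal sketch of the mechanism, assuming the pte_mknuma() helper this
      series introduces; the fault-side hook below is invented purely to show
      where a placement policy would plug in:

          /* Illustrative sketch, not the patch itself. */
          static void mark_pte_numa(struct mm_struct *mm, unsigned long addr,
                                    pte_t *ptep)
          {
                  /* Make the next access trap into the NUMA hinting fault
                   * path (pte_mknuma() assumed from this series). */
                  set_pte_at(mm, addr, ptep, pte_mknuma(*ptep));
          }

          /* Hypothetical fault-side hook: the hint is simply the node the
           * faulting CPU runs on; nothing intelligent happens yet. */
          static void numa_hinting_fault(struct page *page)
          {
                  int cpu_nid = numa_node_id();  /* node we are running on */

                  /* A placement policy would compare page_to_nid(page)
                   * with cpu_nid and decide whether to migrate the page. */
          }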
  2. 05 October 2012 (2 commits)
    • sched: Update sched_domains_numa_masks[][] when new cpus are onlined · 301a5cba
      Committed by Tang Chen
      Once the array sched_domains_numa_masks[][] is defined, it is never updated.
      
      When a new cpu on a new node is onlined, the corresponding members in
      sched_domains_numa_masks[][] are not initialized, and all the masks are 0.
      As a result, build_overlap_sched_groups() will initialize a NULL
      sched_group for the new cpu on the new node, which leads to a kernel panic:
      
      [ 3189.403280] Call Trace:
      [ 3189.403286]  [<ffffffff8106c36f>] warn_slowpath_common+0x7f/0xc0
      [ 3189.403289]  [<ffffffff8106c3ca>] warn_slowpath_null+0x1a/0x20
      [ 3189.403292]  [<ffffffff810b1d57>] build_sched_domains+0x467/0x470
      [ 3189.403296]  [<ffffffff810b2067>] partition_sched_domains+0x307/0x510
      [ 3189.403299]  [<ffffffff810b1ea2>] ? partition_sched_domains+0x142/0x510
      [ 3189.403305]  [<ffffffff810fcc93>] cpuset_update_active_cpus+0x83/0x90
      [ 3189.403308]  [<ffffffff810b22a8>] cpuset_cpu_active+0x38/0x70
      [ 3189.403316]  [<ffffffff81674b87>] notifier_call_chain+0x67/0x150
      [ 3189.403320]  [<ffffffff81664647>] ? native_cpu_up+0x18a/0x1b5
      [ 3189.403328]  [<ffffffff810a044e>] __raw_notifier_call_chain+0xe/0x10
      [ 3189.403333]  [<ffffffff81070470>] __cpu_notify+0x20/0x40
      [ 3189.403337]  [<ffffffff8166663e>] _cpu_up+0xe9/0x131
      [ 3189.403340]  [<ffffffff81666761>] cpu_up+0xdb/0xee
      [ 3189.403348]  [<ffffffff8165667c>] store_online+0x9c/0xd0
      [ 3189.403355]  [<ffffffff81437640>] dev_attr_store+0x20/0x30
      [ 3189.403361]  [<ffffffff8124aa63>] sysfs_write_file+0xa3/0x100
      [ 3189.403368]  [<ffffffff811ccbe0>] vfs_write+0xd0/0x1a0
      [ 3189.403371]  [<ffffffff811ccdb4>] sys_write+0x54/0xa0
      [ 3189.403375]  [<ffffffff81679c69>] system_call_fastpath+0x16/0x1b
      [ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
      [ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      
      This patch registers a new notifier on the cpu hotplug notifier chain, and
      updates sched_domains_numa_masks[][] every time a cpu is onlined or offlined.
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      [ fixed compile warning ]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1348578751-16904-3-git-send-email-tangchen@cn.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
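      A sketch of the notifier this implies; the loop bodies are inferred from
      the description and the era's sched_init_numa() data structures, not
      quoted from the patch:

          static int sched_domains_numa_masks_update(struct notifier_block *nfb,
                                                     unsigned long action, void *hcpu)
          {
                  int cpu = (long)hcpu;
                  int node = cpu_to_node(cpu);
                  int i, j;

                  switch (action & ~CPU_TASKS_FROZEN) {
                  case CPU_ONLINE:
                          /* Set the new cpu's bit in every mask whose node
                           * lies within that level's distance of its node. */
                          for (i = 0; i < sched_domains_numa_levels; i++)
                                  for (j = 0; j < nr_node_ids; j++)
                                          if (node_distance(j, node) <= sched_domains_numa_distance[i])
                                                  cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
                          break;
                  case CPU_DEAD:
                          for (i = 0; i < sched_domains_numa_levels; i++)
                                  for (j = 0; j < nr_node_ids; j++)
                                          cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
                          break;
                  }
                  return NOTIFY_OK;
          }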
    • sched: Ensure 'sched_domains_numa_levels' is safe to use in other functions · 5f7865f3
      Committed by Tang Chen
      'sched_domains_numa_levels' should be kept at 0 until sched_init_numa()
      has finished. If allocating memory for the sched_domains_numa_masks[][]
      array fails partway through, the array will contain fewer than 'level'
      members. This could be dangerous when the variable is used to iterate
      over sched_domains_numa_masks[][] in other functions.
      
      This patch sets sched_domains_numa_levels to 0 before initializing the
      sched_domains_numa_masks[][] array, and resets it to 'level' once
      sched_domains_numa_masks[][] is fully initialized.
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1348578751-16904-2-git-send-email-tangchen@cn.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
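      The ordering this describes, as a sketch (allocation details abridged;
      the error-path behavior is the point):

          /* Keep the published level count at 0 while the masks are
           * being built, so readers never iterate past what has
           * actually been allocated. */
          sched_domains_numa_levels = 0;

          for (i = 0; i < level; i++) {
                  sched_domains_numa_masks[i] = kzalloc(...);  /* abridged */
                  if (!sched_domains_numa_masks[i])
                          return;  /* the count stays 0: still safe */
                  /* ... fill sched_domains_numa_masks[i][node] ... */
          }

          /* Only now is the whole array valid for iteration. */
          sched_domains_numa_levels = level;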
  3. 01 October 2012 (1 commit)
    • sanitize tsk_is_polling() · 16a80163
      Committed by Al Viro
      Make the default just return 0. The current default (checking
      TIF_POLLING_NRFLAG) is moved into the architectures that need it;
      ones that don't do polling in their idle threads don't need to
      define TIF_POLLING_NRFLAG at all.
      
      ia64 defined both TS_POLLING (used by its tsk_is_polling())
      and TIF_POLLING_NRFLAG (not used at all).  Killed the latter...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
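      In outline, the resulting default looks like this (a sketch of the
      described behavior, not a quote of the patch):

          /* Architectures that poll in their idle loop keep defining
           * TIF_POLLING_NRFLAG and get the flag test; everyone else
           * now gets a plain 0. */
          #ifdef TIF_POLLING_NRFLAG
          #define tsk_is_polling(t) \
                  test_tsk_thread_flag(t, TIF_POLLING_NRFLAG)
          #else
          #define tsk_is_polling(t) 0
          #endif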
  4. 26 September 2012 (3 commits)
    • rcu: Exit RCU extended QS on user preemption · 20ab65e3
      Committed by Frederic Weisbecker
      When an exception or an irq is about to resume userspace and the task
      needs to be rescheduled, the arch low-level code calls schedule()
      directly.
      
      If we call it, it is because we have the TIF_RESCHED flag set:

      - It can be set after random local calls to set_need_resched()
        (RCU, drm, ...)

      - A wake up happened and the CPU needs preemption. This can
        happen in several ways:

          * Remotely: the remote waking CPU has set TIF_RESCHED and sends the
            wakee an IPI to schedule the new task.
          * Remotely enqueued: the remote waking CPU sends an IPI to the target
            and the wake up is made by the target.
          * Locally: waking CPU == wakee CPU and the wakeup is done locally.
            set_need_resched() is called without an IPI.
      
      In the case of local and remotely enqueued wake ups, the tick can
      be restarted when we enqueue the new task, and RCU can exit the
      extended quiescent state at the same time. Then by the time we reach
      the irq exit path and call schedule(), we are no longer in RCU user mode.

      But if we call schedule() only because something called set_need_resched(),
      RCU may still be in user mode when we reach schedule().

      Also, if a wake up is done remotely, the CPU might see the TIF_RESCHED
      flag and call schedule() while the IPI has not yet happened to restart the
      tick and exit RCU user mode.
      
      We need to manually protect against these corner cases.
      
      Create a new API, schedule_user(), that calls schedule() inside an
      rcu_user_exit()-rcu_user_enter() pair in order to protect it. Archs
      will need to rely on it now to implement user preemption safely.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
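      The wrapper this describes, sketched (the body is inferred from the
      text; the real function lives in kernel/sched/core.c):

          asmlinkage void __sched schedule_user(void)
          {
                  /* Leave the RCU extended quiescent state: schedule()
                   * may run RCU read-side critical sections. */
                  rcu_user_exit();
                  schedule();
                  /* Return to user-mode RCU accounting before resuming. */
                  rcu_user_enter();
          }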
    • rcu: Exit RCU extended QS on kernel preemption after irq/exception · 90a340ed
      Committed by Frederic Weisbecker
      When an exception or an irq exits, and we are going to resume into
      interrupted kernel code, the low level architecture code calls
      preempt_schedule_irq() if there is a need to reschedule.
      
      If the interrupt/exception occurred between a call to rcu_user_enter()
      (from syscall exit, exception exit, do_notify_resume exit, ...) and
      a real resume to userspace (iret, ...), preempt_schedule_irq() can be
      called while RCU thinks we are in userspace. But preempt_schedule_irq()
      is going to run kernel code, possibly including RCU read-side critical
      sections. We must exit the userspace extended quiescent state before
      we call it.
      
      To solve this, just call rcu_user_exit() at the beginning of
      preempt_schedule_irq().
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
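      Where the call lands, sketched against the era's shape of the function
      (body abridged; only the rcu_user_exit() placement is the point):

          asmlinkage void __sched preempt_schedule_irq(void)
          {
                  /* The fix: tell RCU we are back in the kernel before
                   * any kernel code (and its read sides) can run. */
                  rcu_user_exit();

                  do {
                          add_preempt_count(PREEMPT_ACTIVE);
                          local_irq_enable();
                          __schedule();
                          local_irq_disable();
                          sub_preempt_count(PREEMPT_ACTIVE);
                  } while (need_resched());
          }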
    • rcu: Switch task's syscall hooks on context switch · 04e7e951
      Committed by Frederic Weisbecker
      Clear the syscalls hook of a task when it's scheduled out so that if
      the task migrates, it doesn't run the syscall slow path on a CPU
      that might not need it.
      
      Also set the syscalls hook on the next task if needed.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
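      A sketch of the switch-time hand-off (the helper and flag names below
      are assumptions for illustration, not quotes from the patch):

          /* Hypothetical shape: move the per-task "run the syscall
           * slow path" marker from prev to next at context switch. */
          static inline void rcu_switch(struct task_struct *prev,
                                        struct task_struct *next)
          {
                  clear_tsk_thread_flag(prev, TIF_NOHZ);  /* assumed flag */
                  if (rcu_user_hooks_needed())            /* hypothetical */
                          set_tsk_thread_flag(next, TIF_NOHZ);
          }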
  5. 25 September 2012 (2 commits)
    • vtime: Consolidate system/idle context detection · a7e1a9e3
      Committed by Frederic Weisbecker
      Move the code that finds out to which context we account the
      cputime into the generic layer.
      
      Archs that consider the whole time spent in the idle task as idle
      time (ia64, powerpc) can rely on the generic vtime_account()
      and implement vtime_account_system() and vtime_account_idle(),
      letting the generic code decide when to call which API.
      
      Archs that have their own meaning of idle time, such as s390
      that only considers the time spent in CPU low power mode as idle
      time, can just override vtime_account().
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
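      The described split, sketched (the dispatch condition is inferred from
      the text; irq-safety details are omitted):

          /* Generic layer decides the context; archs provide the two
           * primitives. Archs with their own notion of idle (e.g. s390)
           * override vtime_account() entirely. */
          void vtime_account(struct task_struct *tsk)
          {
                  if (in_interrupt() || !is_idle_task(tsk))
                          vtime_account_system(tsk);
                  else
                          vtime_account_idle(tsk);
          }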
    • cputime: Use a proper subsystem naming for vtime related APIs · bf9fae9f
      Committed by Frederic Weisbecker
      Use a naming based on vtime as a prefix for the virtual cputime
      accounting APIs:
      
      - account_system_vtime() -> vtime_account()
      - account_switch_vtime() -> vtime_task_switch()
      
      This makes it easier to allow for further variants such
      as vtime_account_system(), vtime_account_idle(), ... if we
      want to find out the context we account to from generic code.

      It also makes it clearer which subsystem these APIs belong to.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
  6. 23 September 2012 (1 commit)
    • sched: Fix load avg vs cpu-hotplug · 5d180232
      Committed by Peter Zijlstra
      Rakib and Paul reported two different issues related to the same few
      lines of code.

      Rakib's issue is that the nr_uninterruptible migration code is wrong in
      that he sees artifacts due to this (Rakib, please do expand in more
      detail).
      
      Paul's issue is that this code as it stands relies on us using
      stop_machine() for unplug; we would all like to remove this assumption
      so that eventually we can remove this stop_machine() usage altogether.
      
      The only reason we'd have to migrate nr_uninterruptible is so that we
      could use for_each_online_cpu() loops in favour of
      for_each_possible_cpu() loops; however, since nr_uninterruptible() is the
      only such loop and it's using possible, let's not bother at all.
      
      The problem Rakib sees is (probably) caused by the fact that by
      migrating nr_uninterruptible we screw up rq->calc_load_active for both
      rqs involved.
      
      So don't bother with fancy migration schemes (meaning we now have to
      keep using for_each_possible_cpu()) and instead fold any nr_active delta
      after we migrate all tasks away to make sure we don't have any skewed
      nr_active accounting.
      
      [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
      miscounting noted by Rakib. ]
      Reported-by: Rakib Mullick <rakib.mullick@gmail.com>
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
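      The "fold the delta" step, sketched from the description (treat the
      helper below as an illustration of the shape, not a quote):

          /* Called from CPU_DEAD, once all tasks are migrated away:
           * fold any remaining nr_active delta into the global count
           * instead of migrating nr_uninterruptible between rqs. */
          static void calc_load_migrate(struct rq *rq)
          {
                  long delta = calc_load_fold_active(rq);

                  if (delta)
                          atomic_long_add(delta, &calc_load_tasks);
          }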
  7. 17 September 2012 (1 commit)
  8. 13 September 2012 (5 commits)
  9. 04 September 2012 (6 commits)
  10. 20 August 2012 (2 commits)
    • cputime: Consolidate vtime handling on context switch · baa36046
      Committed by Frederic Weisbecker
      The archs that implement virtual cputime accounting all
      flush the cputime of a task when it gets descheduled
      and sometimes set up some ground initialization for the
      next task to account its cputime.
      
      These archs all put their own hooks in their context
      switch callbacks and handle the off-case themselves.
      
      Consolidate this by creating a new account_switch_vtime()
      callback, called in generic code right after a context switch,
      that these archs must implement to flush the prev task's
      cputime and initialize the next task's cputime-related state.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
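      Sketched call site (placement inferred from the description; the
      surrounding code is abridged):

          /* Generic context-switch tail: one arch-implemented hook
           * replaces the per-arch hooks in each switch_to(). */
          static void finish_task_switch(struct rq *rq, struct task_struct *prev)
          {
                  /* ... */
                  account_switch_vtime(prev);  /* arch: flush prev's
                                                  cputime, prime next */
                  /* ... */
          }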
    • sched: Move cputime code to its own file · 73fbec60
      Committed by Frederic Weisbecker
      Extract the cputime code from the giant sched/core.c and
      put it in its own file. This makes it easier to deal with
      this particular area and de-bloats core.c a bit more.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
  11. 14 August 2012 (9 commits)
  12. 31 July 2012 (1 commit)
  13. 26 July 2012 (3 commits)
  14. 24 July 2012 (3 commits)
    • sched: Fix race in task_group() · 8323f26c
      Committed by Peter Zijlstra
      Stefan reported a crash on a kernel before a3e5d109 ("sched:
      Don't call task_group() too many times in set_task_rq()"); he
      found the reason to be that the multiple task_group()
      invocations in set_task_rq() returned different values.
      
      Looking at all that I found a lack of serialization and plain
      wrong comments.
      
      The below tries to fix it using an extra pointer which is
      updated under the appropriate scheduler locks. It's not pretty,
      but I can't really see another way given how all the cgroup
      stuff works.
      Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
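      The "extra pointer" approach, sketched (field and accessor shape
      inferred from the description):

          /* Cache the task's group in the task itself. The cache is
           * only updated under the relevant scheduler locks, so all
           * task_group() calls in one operation see the same value. */
          struct task_struct {
                  /* ... */
                  struct task_group *sched_task_group;
          };

          static inline struct task_group *task_group(struct task_struct *p)
          {
                  return p->sched_task_group;
          }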
    • sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task · 88b8dac0
      Committed by Srivatsa Vaddagiri
      The current load balance scheme requires only one cpu in a
      sched_group (balance_cpu) to look at other peer sched_groups for
      imbalance and pull tasks towards itself from a busy cpu. Tasks
      thus pulled by balance_cpu could later get picked up by cpus
      that are in the same sched_group as balance_cpu.
      
      This scheme however fails to pull tasks that are not allowed to
      run on balance_cpu (but are allowed to run on other cpus in its
      sched_group). That can affect fairness and in some worst case
      scenarios cause starvation.
      
      Consider a two core (2 threads/core) system running tasks as
      below:
      
               Core0            Core1
              /     \          /     \
             C0     C1        C2     C3
             |      |         |      |
             v      v         v      v
             F0     T1        F1     [idle]
                              T2
      
       F0 = SCHED_FIFO task (pinned to C0)
       F1 = SCHED_FIFO task (pinned to C2)
       T1 = SCHED_OTHER task (pinned to C1)
       T2 = SCHED_OTHER task (pinned to C1 and C2)
      
      F1 could become a cpu hog, which will starve T2 unless C1 pulls
      it. Between C0 and C1 however, C0 is required to look for
      imbalance between cores, which will fail to pull T2 towards
      Core0. T2 will starve eternally in this case. The same scenario
      can arise in the presence of non-rt tasks as well (say we replace F1
      with a high irq load).
      
      We tackle this problem by having balance_cpu move pinned tasks
      to one of its sibling cpus (where they can run). We first check
      if the load balance goal can be met by ignoring pinned tasks,
      failing which we retry move_tasks() with a new env->dst_cpu.
      
      This patch modifies load balance semantics on who can move load
      towards a given cpu in a given sched_domain.
      
      Before this patch, a given_cpu, or an ilb_cpu acting on behalf of
      an idle given_cpu, is responsible for moving load to given_cpu.
      
      With this patch applied, balance_cpu can in addition decide on
      moving some load to a given_cpu.
      
      There is a remote possibility that excess load could get moved
      as a result of this (balance_cpu and given_cpu/ilb_cpu deciding
      *independently* and at *same* time to move some load to a
      given_cpu). However we should see less of such conflicting
      decisions in practice and moreover subsequent load balance
      cycles should correct the excess load moved to given_cpu.
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/4FE06CDB.2060605@linux.vnet.ibm.com
      [ minor edits ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
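      The retry path, sketched as a fragment (the flag and field names are
      assumptions of shape, not quotes from the patch):

          /* Inside load_balance(), after a move_tasks() pass: */
          if (!ld_moved && (env.flags & LBF_SOME_PINNED) &&
              env.new_dst_cpu != -1) {
                  /* balance_cpu could not take the pinned tasks;
                   * retry on behalf of a sibling that can run them. */
                  env.dst_cpu  = env.new_dst_cpu;
                  env.dst_rq   = cpu_rq(env.new_dst_cpu);
                  env.flags   &= ~LBF_SOME_PINNED;
                  goto more_balance;
          }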
    • sched: Reset loop counters if all tasks are pinned and we need to redo load balance · bbf18b19
      Committed by Prashanth Nageshappa
      While load balancing, if all tasks on the source runqueue are pinned,
      we retry after excluding the corresponding source cpu. However, the loop
      counters env.loop and env.loop_break are not reset before retrying, which
      can lead to failure in moving the tasks. In this patch we reset env.loop
      and env.loop_break to their initial values before we retry.
      Signed-off-by: Prashanth Nageshappa <prashanth@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/4FE06EEF.2090709@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
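      Sketched fix (the restored values are assumptions; the real defaults
      live in the fair-class balancer):

          /* Before retrying with the all-pinned source cpu excluded,
           * restore the scan limits so the retry examines tasks from
           * the start again. */
          env.loop = 0;
          env.loop_break = sched_nr_migrate_break;  /* assumed default */
          goto redo;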