1. 14 Aug 2012, 2 commits
    • sched,cgroup: Fix up task_groups list · 35cf4e50
      Mike Galbraith committed
      With multiple instances of the task_groups list, for_each_rt_rq() is
      a no-op: no task groups were ever added to the list instance in rt.c.
      This renders __enable/disable_runtime() and print_rt_stats() no-ops,
      the user-(in)visible effect being that rt task groups are missing
      from /proc/sched_debug.
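
      The bug pattern, in a minimal sketch (illustrative, not the literal
      upstream diff): a 'static' list head defined in a shared header gives
      every compilation unit its own private, empty copy of the list.

        /* sched.h, before the fix: every .c file including this header
         * gets its OWN empty list, so rt.c iterates a list that core.c
         * never added anything to.
         */
        static LIST_HEAD(task_groups);

        /* After the fix: a single shared list. */
        /* sched.h: */
        extern struct list_head task_groups;
        /* core.c, one definition for the whole kernel: */
        LIST_HEAD(task_groups);
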
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Cc: stable@kernel.org # v3.3+
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: fix divide by zero at {thread_group,task}_times · bea6832c
      Stanislaw Gruszka committed
      On architectures where cputime_t is a 64-bit type, it is possible to
      trigger a divide by zero on the do_div(temp, (__force u32) total)
      line if total is non-zero but has its lower 32 bits all zeroed.
      Removing the cast is not a good solution, since some do_div()
      implementations cast to u32 internally.
      
      This problem can be triggered in practice on very long lived processes:
      
        PID: 2331   TASK: ffff880472814b00  CPU: 2   COMMAND: "oraagent.bin"
         #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
         #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
         #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
         #3 [ffff880472a51cd0] die at ffffffff8100f26b
         #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
         #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
         #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
            [exception RIP: thread_group_times+0x56]
            RIP: ffffffff81056a16  RSP: ffff880472a51eb8  RFLAGS: 00010046
            RAX: bc3572c9fe12d194  RBX: ffff880874150800  RCX: 0000000110266fad
            RDX: 0000000000000000  RSI: ffff880472a51eb8  RDI: 001038ae7d9633dc
            RBP: ffff880472a51ef8   R8: 00000000b10a3a64   R9: ffff880874150800
            R10: 00007fcba27ab680  R11: 0000000000000202  R12: ffff880472a51f08
            R13: ffff880472a51f10  R14: 0000000000000000  R15: 0000000000000007
            ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
         #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
         #8 [ffff880472a51f40] sys_times at ffffffff81088524
         #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
            RIP: 0000003808caac3a  RSP: 00007fcba27ab6d8  RFLAGS: 00000202
            RAX: 0000000000000064  RBX: ffffffff8100b0f2  RCX: 0000000000000000
            RDX: 00007fcba27ab6e0  RSI: 000000000076d58e  RDI: 00007fcba27ab6e0
            RBP: 00007fcba27ab700   R8: 0000000000000020   R9: 000000000000091b
            R10: 00007fcba27ab680  R11: 0000000000000202  R12: 00007fff9ca41940
            R13: 0000000000000000  R14: 00007fcba27ac9c0  R15: 00007fff9ca41940
            ORIG_RAX: 0000000000000064  CS: 0033  SS: 002b
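
      One way out, close in shape to the eventual fix (the helper name is
      illustrative), is to pick the division helper based on the width of
      cputime_t, so a genuinely 64-bit total is never truncated to a zero
      u32 divisor. Callers still guard against total == 0 separately:

        static cputime_t scale_time(cputime_t stime, cputime_t rtime,
                                    cputime_t total)
        {
                u64 temp = (__force u64) rtime;

                temp *= (__force u64) stime;

                if (sizeof(cputime_t) == 4)
                        /* 32-bit cputime_t: total fits in a u32 */
                        temp = div_u64(temp, (__force u32) total);
                else
                        /* 64-bit cputime_t: divide by the full value */
                        temp = div64_u64(temp, (__force u64) total);

                return (__force cputime_t) temp;
        }
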
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  2. 26 Jul 2012, 2 commits
  3. 24 Jul 2012, 4 commits
    • sched: Fix race in task_group() · 8323f26c
      Peter Zijlstra committed
      Stefan reported a crash on a kernel before a3e5d109 ("sched:
      Don't call task_group() too many times in set_task_rq()"); he
      found the reason to be that the multiple task_group()
      invocations in set_task_rq() returned different values.
      
      Looking at all that, I found a lack of serialization and plainly
      wrong comments.

      The patch below tries to fix it using an extra pointer which is
      updated under the appropriate scheduler locks. It's not pretty,
      but I can't really see another way given how all the cgroup
      stuff works.
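
      A minimal sketch of the approach (the cached-pointer idea follows
      the commit's description; the surrounding details are illustrative):

        /* Cache the task's group in task_struct itself; the cache is
         * updated only under the scheduler locks that also move the
         * task between groups.
         */
        struct task_struct {
                /* ... */
        #ifdef CONFIG_CGROUP_SCHED
                struct task_group *sched_task_group;
        #endif
        };

        /* Readers see one stable value per locked section instead of
         * re-deriving the group from cgroup data on every call.
         */
        static inline struct task_group *task_group(struct task_struct *p)
        {
                return p->sched_task_group;
        }
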
      Reported-and-tested-by: Stefan Bader <stefan.bader@canonical.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Improve scalability via 'CPU buddies', which withstand random perturbations · 970e1789
      Mike Galbraith committed
      Traversing an entire package is not only expensive, it also leads to
      tasks bouncing all over a partially idle and possibly quite large
      package. Fix that up by assigning a 'buddy' CPU for each CPU to try
      to motivate. Each CPU may try to motivate that one buddy; if the
      buddy is busy, tough, it may then try the buddy's SMT sibling, but
      that's all this optimization is allowed to cost.
      
      Sibling cache buddies are cross-wired to prevent bouncing.
      
      4 socket 40 core + SMT Westmere box, single 30 sec tbench runs, higher is better:
      
       clients     1       2       4        8       16       32       64      128
       ..........................................................................
       pre        30      41     118      645     3769     6214    12233    14312
       post      299     603    1211     2418     4697     6847    11606    14557
      
      A nice increase in performance.
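
      A heavily simplified sketch of the buddy idea (illustrative only;
      the per-CPU buddy map below is a hypothetical stand-in for however
      the buddy is actually wired up):

        /* One designated peer per CPU bounds the idle search to a
         * single probe instead of a package-wide scan.
         */
        static DEFINE_PER_CPU(int, cpu_buddy);

        static int select_idle_buddy(int cpu)
        {
                int buddy = per_cpu(cpu_buddy, cpu);

                if (idle_cpu(buddy))
                        return buddy;

                return cpu;     /* buddy is busy: tough, stay put */
        }
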
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1339471112.7352.32.camel@marge.simpson.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • cpusets, hotplug: Restructure functions that are invoked during hotplug · 7ddf96b0
      Srivatsa S. Bhat committed
      Separate out the cpuset related handling for CPU/Memory online/offline.
      This also helps us exploit the most obvious and basic level of optimization
      that any notification mechanism (CPU/Mem online/offline) has to offer us:
      "We *know* why we have been invoked. So stop pretending that we are lost,
      and do only the necessary amount of processing!".
      
      And while at it, rename scan_for_empty_cpusets() to
      scan_cpusets_upon_hotplug(), which is more appropriate considering how
      it is restructured.
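
      A sketch of the restructured entry point, using an event code so the
      handler knows why it was invoked (the enum values follow the commit's
      description; the body is illustrative):

        enum hotplug_event {
                CPUSET_CPU_OFFLINE,
                CPUSET_MEM_OFFLINE,
        };

        static void scan_cpusets_upon_hotplug(struct cpuset *root,
                                              enum hotplug_event event)
        {
                switch (event) {
                case CPUSET_CPU_OFFLINE:
                        /* Only fix up cpusets whose cpus_allowed emptied. */
                        break;
                case CPUSET_MEM_OFFLINE:
                        /* Only fix up cpusets whose mems_allowed emptied. */
                        break;
                }
        }
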
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20120524141650.3692.48637.stgit@srivatsabhat.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume · d35be8ba
      Srivatsa S. Bhat committed
      In the event of CPU hotplug, the kernel modifies the cpusets'
      cpus_allowed masks as and when necessary to ensure that the tasks
      belonging to the cpusets have some place (online CPUs) to run on.
      And regular CPU hotplug is destructive in the sense that the kernel
      doesn't remember the original cpuset configurations set by the user
      across hotplug operations.
      
      However, suspend/resume (which uses CPU hotplug) is a special case in which
      the kernel has the responsibility to restore the system (during resume), to
      exactly the same state it was in before suspend.
      
      In order to achieve that, do the following:
      
      1. Don't modify cpusets during suspend/resume. At all.
         In particular, don't move the tasks from one cpuset to another, and
         don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
         during the CPU hotplug operations that are carried out in the
         suspend/resume path.
      
      2. However, cpusets and sched domains are related. We just want to avoid
         altering cpusets alone. So, to keep the sched domains updated, build
         a single sched domain (containing all active cpus) during each of the
         CPU hotplug operations carried out in s/r path, effectively ignoring
         the cpusets' cpus_allowed masks.
      
         (Since userspace is frozen while doing all this, it will go unnoticed.)
      
      3. During the last CPU online operation during resume, build the sched
         domains by looking up the (unaltered) cpusets' cpus_allowed masks.
         That will bring back the system to the same original state as it was in
         before suspend.
      
      Ultimately, this will not only solve the cpuset problem related to
      suspend/resume (i.e., it restores the cpusets to exactly what they
      were before suspend, by not touching them at all), but also speed up
      suspend/resume, because we avoid running the cpuset update code for
      every CPU being offlined/onlined.
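
      A sketch of what the special-casing can look like in the hotplug
      notifier (the _FROZEN action variants distinguish the suspend/resume
      path; the exact call sites and the resume-side bookkeeping are
      illustrative):

        static int cpuset_cpu_inactive(struct notifier_block *nfb,
                                       unsigned long action, void *hcpu)
        {
                switch (action) {
                case CPU_DOWN_PREPARE:
                        /* Normal hotplug: let cpusets react as usual. */
                        cpuset_update_active_cpus(false);
                        break;
                case CPU_DOWN_PREPARE_FROZEN:
                        /*
                         * Suspend path: leave cpusets untouched, just
                         * collapse to one sched domain spanning the
                         * active CPUs.
                         */
                        partition_sched_domains(1, NULL, NULL);
                        break;
                default:
                        return NOTIFY_DONE;
                }
                return NOTIFY_OK;
        }
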
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20120524141611.3692.20155.stgit@srivatsabhat.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 06 Jul 2012, 1 commit
  5. 03 Jul 2012, 1 commit
  6. 06 Jun 2012, 5 commits
  7. 30 May 2012, 7 commits
  8. 23 May 2012, 1 commit
    • Revert "sched, perf: Use a single callback into the scheduler" · ab0cce56
      Jiri Olsa committed
      This reverts commit cb04ff9a ("sched, perf: Use a single
      callback into the scheduler").
      
      Before this change was introduced, the process switch worked
      like this (wrt. to perf event schedule):
      
           schedule (prev, next)
             - schedule out all perf events for prev
             - switch to next
             - schedule in all perf events for current (next)
      
      After the commit, the process switch looks like:
      
           schedule (prev, next)
             - schedule out all perf events for prev
             - schedule in all perf events for (next)
             - switch to next
      
      The problem is that after we schedule the perf events in, the PMU
      is enabled and we can receive events even before we make the
      switch to next, so "current" is still the prev process (SAMPLE
      event data is filled in based on the value of "current").

      That's exactly what we see in the test__PERF_RECORD test: we
      receive SAMPLEs with the PID of the process that our tracee is
      scheduled from.
      
      Discussed with Peter Zijlstra:
      
       > Bah!, yeah I guess reverting is the right thing for now. Sad
       > though.
       >
       > So by having the two hooks we have a black-spot between them
       > where we receive no events at all, this black-spot covers the
       > hand-over of current and we thus don't receive the 'wrong'
       > events.
       >
       > I rather liked we could do away with both that black-spot and
       > clean up the code a little, but apparently people rely on it.
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: acme@redhat.com
      Cc: paulus@samba.org
      Cc: cjashfor@linux.vnet.ibm.com
      Cc: fweisbec@gmail.com
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/20120523111302.GC1638@m.brq.redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 18 May 2012, 1 commit
  10. 17 May 2012, 1 commit
    • sched: Remove stale power aware scheduling remnants and dysfunctional knobs · 8e7fbcbc
      Peter Zijlstra committed
      It's been broken forever (i.e. it's not scheduling in a power
      aware fashion), as reported by Suresh and others sending
      patches, and nobody cares enough to fix it properly ...
      so remove it to make space free for something better.
      
      There's various problems with the code as it stands today, first
      and foremost the user interface which is bound to topology
      levels and has multiple values per level. This results in a
      state explosion which the administrator or distro needs to
      master and almost nobody does.
      
      Furthermore, large configuration state spaces aren't good: it means
      the thing doesn't just work right, because it's either under so many
      impossible-to-meet constraints, or, even if there's an achievable
      state, workloads have to be aware of it precisely and can never meet
      it for dynamic workloads.
      
      So pushing this kind of decision to user-space was a bad idea
      even with a single knob - it's exponentially worse with knobs
      on every node of the topology.
      
      There is a proposal to replace the user interface with a single
      3 state knob:
      
       sched_balance_policy := { performance, power, auto }
      
      where 'auto' would be the preferred default which looks at things
      like Battery/AC mode and possible cpufreq state or whatever the hw
      exposes to show us power use expectations - but there's been no
      progress on it in the past many months.
      
      Aside from that, the actual implementation of the various knobs is
      known to be broken. There have been sporadic attempts at fixing
      things, but these always stop short of reaching a mergeable
      state.
      
      Therefore this wholesale removal with the hopes of spurring
      people who care to come forward once again and work on a
      coherent replacement.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 14 May 2012, 4 commits
  12. 09 May 2012, 4 commits
  13. 03 May 2012, 2 commits
  14. 26 Apr 2012, 2 commits
    • sched: Fix OOPS when build_sched_domains() percpu allocation fails · fb2cf2c6
      he, bo committed
      Under extreme memory pressure, with memory all but used up, percpu
      allocation might fail. We hit it when the system goes to
      suspend-to-ram, causing a kworker panic:
      
       EIP: [<c124411a>] build_sched_domains+0x23a/0xad0
       Kernel panic - not syncing: Fatal exception
       Pid: 3026, comm: kworker/u:3
       3.0.8-137473-gf42fbef #1
      
       Call Trace:
        [<c18cc4f2>] panic+0x66/0x16c
        [...]
        [<c1244c37>] partition_sched_domains+0x287/0x4b0
        [<c12a77be>] cpuset_update_active_cpus+0x1fe/0x210
        [<c123712d>] cpuset_cpu_inactive+0x1d/0x30
        [...]
      
      With this fix applied, build_sched_domains() returns -ENOMEM and
      the suspend attempt fails instead of crashing.
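
      The shape of the fix, as an illustrative sketch (not the literal
      upstream diff): check each allocation in the domain-building path
      and unwind with -ENOMEM rather than oops later on a NULL pointer.

        static int __sdt_alloc_sketch(const struct cpumask *cpu_map)
        {
                struct sched_domain *sd;
                int j;

                for_each_cpu(j, cpu_map) {
                        sd = kzalloc_node(sizeof(struct sched_domain) +
                                          cpumask_size(), GFP_KERNEL,
                                          cpu_to_node(j));
                        if (!sd)
                                return -ENOMEM; /* caller aborts hotplug */

                        /* ... store sd in the percpu slot for CPU j ... */
                }
                return 0;
        }
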
      Signed-off-by: he, bo <bo.he@intel.com>
      Reviewed-by: Zhang, Yanmin <yanmin.zhang@intel.com>
      Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/r/1335355161.5892.17.camel@hebo
      [ So, we fail to deallocate a CPU because we cannot allocate RAM :-/
        I don't like that kind of sad behavior but nevertheless it should
        not crash under high memory load. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • smp: Provide generic idle thread allocation · 29d5e047
      Thomas Gleixner committed
      All SMP architectures have magic to fork the idle task and to store
      it for reuse when cpu hotplug is enabled. Provide a generic
      infrastructure for it.
      
      Create/reinit the idle thread for the cpu which is brought up in the
      generic code and hand the thread pointer to the architecture code via
      __cpu_up().
      
      Note that fork_idle() is called via a workqueue, because this
      guarantees that the idle thread does not get a reference to a user
      space VM. This can happen when the boot process did not bring up all
      possible cpus and a later cpu_up() is initiated via the sysfs
      interface. In that case fork_idle() would be called in the context of
      the user space task and would take a reference on the user space VM.
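
      A condensed sketch of the generic infrastructure (a per-cpu cache of
      idle tasks; error handling and the reinit details are illustrative):

        static DEFINE_PER_CPU(struct task_struct *, idle_threads);

        /* Called by the generic cpu_up() code; the returned pointer is
         * handed to the architecture via __cpu_up().
         */
        struct task_struct *idle_thread_get(unsigned int cpu)
        {
                struct task_struct *tsk = per_cpu(idle_threads, cpu);

                if (!tsk)
                        return ERR_PTR(-ENOMEM);

                init_idle(tsk, cpu);    /* re-initialize on re-plug */
                return tsk;
        }
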
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James E.J. Bottomley <jejb@parisc-linux.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: x86@kernel.org
      Acked-by: Venkatesh Pallipadi <venki@google.com>
      Link: http://lkml.kernel.org/r/20120420124557.102478630@linutronix.de
  15. 08 Apr 2012, 1 commit
  16. 02 Apr 2012, 1 commit
    • cgroup: convert all non-memcg controllers to the new cftype interface · 4baf6e33
      Tejun Heo committed
      Convert the debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio,
      blkio, net_cls and device controllers to use the new cftype-based
      interface. A terminating entry is added to the cftype arrays, and the
      populate callbacks are replaced with cgroup_subsys->base_cftypes
      initializations.

      This is a functionally identical transformation. There shouldn't be
      any visible behavior change.
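
      The converted pattern for one controller, as a sketch (the specific
      file entries shown are illustrative):

        static struct cftype cpu_files[] = {
                {
                        .name = "shares",
                        .read_u64 = cpu_shares_read_u64,
                        .write_u64 = cpu_shares_write_u64,
                },
                { }     /* terminating entry */
        };

        struct cgroup_subsys cpu_cgroup_subsys = {
                .name = "cpu",
                /* replaces the old ->populate() callback */
                .base_cftypes = cpu_files,
                /* ... */
        };
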
      
      memcg is rather special and will be converted separately.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <paul@paulmenage.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Vivek Goyal <vgoyal@redhat.com>
  17. 31 Mar 2012, 1 commit