  1. 14 Aug 2012, 1 commit
    • sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies · a35b6466
      Authored by Peter Zijlstra
      Peter Portante reported that for large cgroup hierarchies (and/or on
      large CPU counts) we get immense lock contention on rq->lock and
      stuff stops working properly.
      
      His workload was a ton of processes, each in their own cgroup,
      everybody idling except for a sporadic wakeup once every so often.
      
      It was found that:
      
        schedule()
          idle_balance()
            load_balance()
              local_irq_save()
              double_rq_lock()
              update_h_load()
                walk_tg_tree(tg_load_down)
                  tg_load_down()
      
      Results in an entire cgroup hierarchy walk under rq->lock for every
      new-idle balance, and since new-idle balance isn't throttled this
      results in a lot of work while holding the rq->lock.
      
      This patch does two things: it removes the work from under rq->lock,
      based on the good principle of race and pray that is widely employed
      in the load-balancer as a whole; and secondly it throttles the
      update_h_load() calculation to at most once per jiffy, as in the
      sketch below.
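      
      A minimal sketch of that once-per-jiffy throttle; the h_load_throttle
      field (assumed to be added to struct rq) and the exact call details
      follow my reading of the patch and are illustrative, not authoritative:
      
        /* Sketch: skip the cgroup-tree walk if we already did it this jiffy.
         * The second walk_tg_tree() visitor is assumed to be tg_nop, as in
         * other callers. */
        static void update_h_load(long cpu)
        {
        	struct rq *rq = cpu_rq(cpu);
        	unsigned long now = jiffies;
        
        	if (rq->h_load_throttle == now)	/* already updated this jiffy */
        		return;
        
        	rq->h_load_throttle = now;
        
        	rcu_read_lock();
        	walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
        	rcu_read_unlock();
        }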
      
      I considered excluding update_h_load() for new-idle balance
      altogether, but purely relying on regular balance passes to update
      this data might not work out under some rare circumstances where the
      new-idle busiest isn't the regular busiest for a while (unlikely, but
      a nightmare to debug if someone hits it and suffers).
      
      Cc: pjt@google.com
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Reported-by: Peter Portante <pportant@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  2. 31 Jul 2012, 1 commit
  3. 24 Jul 2012, 4 commits
  4. 09 Jun 2012, 1 commit
    • sched/fair: fix lots of kernel-doc warnings · cd96891d
      Authored by Randy Dunlap
      Fix lots of new kernel-doc warnings in kernel/sched/fair.c (a sketch
      of the shape of the fix follows the warning list):
      
        Warning(kernel/sched/fair.c:3625): No description found for parameter 'env'
        Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats'
        Warning(kernel/sched/fair.c:3735): No description found for parameter 'env'
        Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest'
        Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest'
        .. more warnings
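      
      For illustration only, this is roughly what such a fix looks like:
      the kernel-doc block above each affected function is updated to
      describe the new 'env' parameter and the stale @sd/@this_cpu entries
      are dropped. The parameter descriptions below are assumed, not
      quoted from fair.c:
      
        /**
         * update_sg_lb_stats - Update sched_group's statistics for load balancing.
         * @env: The load balancing environment (replaces the old @sd and
         *       @this_cpu parameters).
         * @group: sched_group whose statistics are to be updated.
         * @sgs: Variable to hold the statistics for this group.
         *
         * (Remaining parameters elided; wording is illustrative.)
         */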
      Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 06 Jun 2012, 2 commits
  6. 30 May 2012, 3 commits
  7. 17 May 2012, 1 commit
    • sched: Remove stale power aware scheduling remnants and dysfunctional knobs · 8e7fbcbc
      Authored by Peter Zijlstra
      It's been broken forever (i.e. it's not scheduling in a power-aware
      fashion), as reported by Suresh and others sending patches, and
      nobody cares enough to fix it properly ... so remove it to free up
      space for something better.
      
      There are various problems with the code as it stands today, first
      and foremost the user interface, which is bound to topology levels
      and has multiple values per level. This results in a state explosion
      that the administrator or distro needs to master, and almost nobody
      does.
      
      Furthermore, large configuration state spaces aren't good: the thing
      never just works right, because either it sits under so many
      impossible-to-meet constraints, or, even if an achievable state
      exists, workloads have to be aware of it precisely and can never
      meet it for dynamic workloads.
      
      So pushing this kind of decision to user-space was a bad idea
      even with a single knob - it's exponentially worse with knobs
      on every node of the topology.
      
      There is a proposal to replace the user interface with a single
      3 state knob:
      
       sched_balance_policy := { performance, power, auto }
      
      where 'auto' would be the preferred default, which looks at things
      like battery/AC mode and possible cpufreq state or whatever the hw
      exposes to show us power-use expectations - but there has been no
      progress on it over the past several months.
      
      Aside from that, the actual implementation of the various knobs is
      known to be broken. There have been sporadic attempts at fixing
      things, but these always stop short of reaching a mergeable state.
      
      Therefore, remove it wholesale, in the hope of spurring the people
      who care to come forward once again and work on a coherent
      replacement.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 14 May 2012, 3 commits
    • sched/fair: Improve the ->group_imb logic · e44bc5c5
      Authored by Peter Zijlstra
      Group imbalance is meant to deal with situations where affinity
      masks and sched domains don't align well, such as 3 cpus from one
      group and 6 from another. In this case the domain-based balancer
      will want to put an equal number of tasks on each side even though
      they don't have equal cpus.
      
      Currently group_imb is set whenever two cpus of a group have a weight
      difference of at least one avg task and the heaviest cpu has at least
      two tasks. A group with imbalance set will always be picked as busiest
      and a balance pass will be forced.
      
      The problem is that even if there are no affinity masks this logic
      can trigger and cause weird balancing decisions. E.g. the observed
      behaviour was that of 6 cpus, 5 had 2 tasks and 1 had 3; due to the
      difference of 1 avg load (they all had the same weight) and
      nr_running being >1, the group_imbalance logic triggered and did the
      weird thing of pulling more load instead of trying to move the 1
      excess task to the other domain of 6 cpus, which had 5 cpus with 2
      tasks each and 1 cpu with 1 task.
      
      Curb the group_imbalance logic by making the nr_running condition
      weaker: also track min_nr_running and use the spread in nr_running
      over the set instead of the absolute max nr_running, as in the
      sketch below.
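      
      A hedged sketch of the weaker condition inside update_sg_lb_stats();
      the variable names are assumed from context and may not match the
      patch exactly:
      
        /* per-cpu loop over the group: track the spread of nr_running
         * alongside the spread of load (names assumed) */
        min_nr_running = min(min_nr_running, rq->nr_running);
        max_nr_running = max(max_nr_running, rq->nr_running);
        
        /* old condition: max_nr_running > 1
         * new condition: require an actual spread in nr_running across
         * the group, so equal-weight groups with a single excess task no
         * longer trip the imbalance logic */
        if ((max_cpu_load - min_cpu_load) >= avg_load_per_task &&
            (max_nr_running - min_nr_running) > 1)
        	sgs->group_imb = 1;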
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-9s7dedozxo8kjsb9kqlrukkf@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/nohz: Fix rq->cpu_load[] calculations · 556061b0
      Authored by Peter Zijlstra
      While investigating why the load-balancer did funny things I found
      that the rq->cpu_load[] tables were completely screwy; a bit more
      digging revealed that the updates that got through were missing
      ticks, followed by a catch-up of 2 ticks.
      
      The catch-up assumes the cpu was idle during that time (since only
      nohz can cause missed ticks and the machine is idle, etc.). This
      means that especially the higher indices were significantly lower
      than they ought to be.
      
      The reason for this is that it's not correct to compare against
      jiffies on every jiffy on any cpu other than the one that updates
      jiffies.
      
      This patch kludges around it by only doing the catch-up from
      nohz_idle_balance() and doing the regular update unconditionally
      from the tick, as sketched below.
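      
      A hedged sketch of the resulting split; helper and field names
      reflect my understanding of the patch and are not guaranteed to
      match it verbatim:
      
        /*
         * Called from nohz_idle_balance(): the cpu was idle, so catching up
         * the missed ticks by comparing against jiffies is valid here.
         */
        void update_idle_cpu_load(struct rq *this_rq)
        {
        	unsigned long curr_jiffies = jiffies;
        	unsigned long load = this_rq->load.weight; /* ~0 while idle */
        	unsigned long pending_updates;
        
        	if (curr_jiffies == this_rq->last_load_update_tick)
        		return;
        
        	pending_updates = curr_jiffies - this_rq->last_load_update_tick;
        	this_rq->last_load_update_tick = curr_jiffies;
        
        	__update_cpu_load(this_rq, load, pending_updates);
        }
        
        /*
         * Called from the regular scheduler tick: always one unconditional
         * update, no jiffies-based catch-up games.
         */
        void update_cpu_load_active(struct rq *this_rq)
        {
        	this_rq->last_load_update_tick = jiffies;
        	__update_cpu_load(this_rq, this_rq->load.weight, 1);
        }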
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: pjt@google.com
      Cc: Venkatesh Pallipadi <venki@google.com>
      Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Revert sched-domain iteration breakage · 04f733b4
      Authored by Peter Zijlstra
      Patches c22402a2 ("sched/fair: Let minimally loaded cpu balance the
      group") and 0ce90475 ("sched/fair: Add some serialization to the
      sched_domain load-balance walk") are horribly broken so revert them.
      
      The problem is that while it sounds good to have the minimally
      loaded cpu do the pulling of more load, the way we walk the domains
      there is absolutely no guarantee this cpu will actually get to the
      domain. In fact it's very likely it won't. Therefore the higher up
      the tree we get, the less likely it is we'll balance at all.
      
      The first-of-mask approach always walks up; while sucky in that it
      accumulates load on the first cpu and needs extra passes to spread
      it out, it at least guarantees a cpu gets up that far and that
      load-balancing happens at all.
      
      Since it's now always the first cpu, and idle cpus should always be
      able to balance so they get a task as fast as possible, we can also
      do away with the added serialization.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-rpuhs5s56aiv1aw7khv9zkw6@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 09 May 2012, 4 commits
  10. 26 Apr 2012, 1 commit
  11. 23 Mar 2012, 1 commit
  12. 13 Mar 2012, 2 commits
  13. 01 Mar 2012, 3 commits
  14. 24 Feb 2012, 1 commit
    • static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]() · c5905afb
      Authored by Ingo Molnar
      
      So here's a boot-tested patch on top of Jason's series that does all
      the cleanups I talked about and turns jump labels into a facility
      that is more intuitive to use. It should also address the various
      misconceptions and confusions that surround jump labels.
      
      Typical usage scenarios:
      
              #include <linux/static_key.h>
      
              struct static_key key = STATIC_KEY_INIT_TRUE;
      
              if (static_key_false(&key))
                      do unlikely code
              else
                      do likely code
      
      Or:
      
              if (static_key_true(&key))
                      do likely code
              else
                      do unlikely code
      
      The static key is modified via:
      
              static_key_slow_inc(&key);
              ...
              static_key_slow_dec(&key);
      
      The 'slow' prefix makes it abundantly clear that this is an
      expensive operation.
      
      I've updated all in-kernel code to use this everywhere. Note that I
      (intentionally) have not pushed the rename blindly through to the
      lowest levels: the actual jump-label patching arch facility should
      be named like that, so we want to decouple jump labels from the
      static-key facility a bit.
      
      On non-jump-label enabled architectures static keys default to
      likely()/unlikely() branches.
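      
      Pulling the fragments above together into one self-contained,
      illustrative example (do_debug_accounting() and the enable/disable
      hooks are made up for the example; only the static-key calls come
      from the text above):
      
        #include <linux/static_key.h>
        
        /* Off by default: the branch guarded by static_key_false() is the
         * unlikely, out-of-line path until the key is switched on. */
        static struct static_key debug_key = STATIC_KEY_INIT_FALSE;
        
        static void hot_path(void)
        {
        	if (static_key_false(&debug_key))
        		do_debug_accounting();	/* hypothetical slow path */
        
        	/* likely fast path continues here */
        }
        
        /* Toggled from some slow path, e.g. a debugfs write handler; the
         * 'slow' naming signals that flipping the key is expensive. */
        static void debug_enable(void)  { static_key_slow_inc(&debug_key); }
        static void debug_disable(void) { static_key_slow_dec(&debug_key); }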
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Jason Baron <jbaron@redhat.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: a.p.zijlstra@chello.nl
      Cc: mathieu.desnoyers@efficios.com
      Cc: davem@davemloft.net
      Cc: ddaney.cavm@gmail.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.hu
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  15. 22 Feb 2012, 2 commits
  16. 31 Jan 2012, 1 commit
  17. 27 Jan 2012, 2 commits
  18. 12 Jan 2012, 1 commit
  19. 24 Dec 2011, 1 commit
  20. 21 Dec 2011, 5 commits