1. 11 December 2012 (5 commits)
    • mm: numa: Rate limit setting of pte_numa if node is saturated · e14808b4
      Mel Gorman authored
      If there are a large number of NUMA hinting faults and all of them
      are resulting in migrations it may indicate that memory is just
      bouncing uselessly around. NUMA balancing cost is likely exceeding
      any benefit from locality. Rate limit the PTE updates if the node
      is migration rate-limited. As noted in the comments, this distorts
      the NUMA faulting statistics.
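      
      As an illustration only, a user-space C sketch of the check described above;
      struct node_state, node_is_migrate_ratelimited() and numa_scan_range() are
      invented names for this sketch, not the kernel's API:
      
        #include <stdbool.h>
        #include <stdio.h>
        
        /* Hypothetical per-node migration accounting for this sketch. */
        struct node_state {
            unsigned long migrated_bytes;   /* bytes migrated in this window */
            unsigned long ratelimit_bytes;  /* allowed bytes per window      */
        };
        
        static bool node_is_migrate_ratelimited(const struct node_state *ns)
        {
            return ns->migrated_bytes >= ns->ratelimit_bytes;
        }
        
        /* Skip marking PTEs pte_numa while the target node is saturated.
         * Skipping also suppresses the hinting faults that feed the
         * statistics, which is the distortion the changelog mentions. */
        static void numa_scan_range(const struct node_state *ns, unsigned long pages)
        {
            if (node_is_migrate_ratelimited(ns)) {
                printf("node saturated: skipping %lu pages\n", pages);
                return;
            }
            printf("marking %lu pages pte_numa\n", pages);
        }
        
        int main(void)
        {
            struct node_state ns = { .migrated_bytes = 64UL << 20,
                                     .ratelimit_bytes = 32UL << 20 };
            numa_scan_range(&ns, 256);
            return 0;
        }
      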
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      e14808b4
    • mm: sched: numa: Implement slow start for working set sampling · 4b96a29b
      Peter Zijlstra authored
      Add a 1 second delay before starting to scan the working set of
      a task and starting to balance it amongst nodes.
      
      [ note that before the constant per task WSS sampling rate patch
        the initial scan would happen much later still, in effect that
        patch caused this regression. ]
      
      The theory is that short-run tasks benefit very little from NUMA
      placement: they come and go, and they better stick to the node
      they were started on. As tasks mature and rebalance to other CPUs
      and nodes, their NUMA placement has to change and it starts to
      matter more and more.
      
      In practice this change fixes an observable kbuild regression:
      
         # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]
      
         !NUMA:
         45.291088843 seconds time elapsed                                          ( +-  0.40% )
         45.154231752 seconds time elapsed                                          ( +-  0.36% )
      
         +NUMA, no slow start:
         46.172308123 seconds time elapsed                                          ( +-  0.30% )
         46.343168745 seconds time elapsed                                          ( +-  0.25% )
      
         +NUMA, 1 sec slow start:
         45.224189155 seconds time elapsed                                          ( +-  0.25% )
         45.160866532 seconds time elapsed                                          ( +-  0.17% )
      
      and it also fixes an observable perf bench (hackbench) regression:
      
         # perf stat --null --repeat 10 perf bench sched messaging
      
         -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
         +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )
         +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )
      
      The implementation is simple and straightforward; most of the patch
      deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms
      tunable knob.
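      
      As a toy illustration of the slow start, a user-space C sketch; struct
      task_info and task_numa_work_allowed() are made-up names, and only the
      "no scanning before the task is scan_delay_ms old" check is modelled:
      
        #include <stdbool.h>
        #include <stdio.h>
        
        /* Hypothetical per-task state, times in milliseconds. */
        struct task_info {
            unsigned long start_ms;       /* when the task started running      */
            unsigned long scan_delay_ms;  /* numa_balancing_scan_delay_ms value */
        };
        
        /* Working-set scanning is allowed only once the delay has expired,
         * so short-lived tasks never pay the NUMA balancing overhead. */
        static bool task_numa_work_allowed(const struct task_info *t,
                                           unsigned long now_ms)
        {
            return now_ms - t->start_ms >= t->scan_delay_ms;
        }
        
        int main(void)
        {
            struct task_info t = { .start_ms = 0, .scan_delay_ms = 1000 };
            printf("at  200ms: %s\n", task_numa_work_allowed(&t, 200) ? "scan" : "skip");
            printf("at 1500ms: %s\n", task_numa_work_allowed(&t, 1500) ? "scan" : "skip");
            return 0;
        }
      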
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote the changelog, ran measurements, tuned the default. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      4b96a29b
    • sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges · 9f40604c
      Mel Gorman authored
      By accounting the scan against present PTEs rather than virtual
      address ranges, the scanning speed reflects the amount of actually
      present (mapped) memory.
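      
      To illustrate the accounting change, a small user-space C sketch that
      charges the scan budget only for present pages rather than for the
      virtual range walked; the array-based "page table" and scan_pass() are
      made up for the example:
      
        #include <stdio.h>
        
        #define SCAN_BUDGET_PAGES 4 /* pages charged per scan pass (example value) */
        
        /* Fake one-flag-per-page "page table": 1 = PTE present, 0 = hole. */
        static const int present[] = { 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0 };
        
        /* Walk virtual pages from *pos, but only count present PTEs against
         * the budget, so sparse address spaces are not charged for holes. */
        static void scan_pass(unsigned long *pos)
        {
            unsigned long charged = 0;
        
            while (*pos < sizeof(present) / sizeof(present[0]) &&
                   charged < SCAN_BUDGET_PAGES) {
                if (present[*pos]) {
                    printf("scan page %lu\n", *pos);
                    charged++;            /* present PTE: counts against budget */
                }
                (*pos)++;                 /* holes are skipped for free */
            }
        }
        
        int main(void)
        {
            unsigned long pos = 0;
        
            scan_pass(&pos);
            printf("next pass resumes at page %lu\n", pos);
            return 0;
        }
      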
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      9f40604c
    • mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate · 6e5fb223
      Peter Zijlstra authored
      Previously, to probe the working set of a task, we'd use
      a very simple and crude method: mark all of its address
      space PROT_NONE.
      
      That method has various (obvious) disadvantages:
      
       - it samples the working set at dissimilar rates,
         giving some tasks a sampling quality advantage
         over others.
      
       - creates performance problems for tasks with very
         large working sets
      
       - over-samples processes with large address spaces but
         which only very rarely execute
      
      Improve that method by keeping a rotating offset into the
      address space that marks the current position of the scan,
      and advance it by a constant rate (in a CPU cycles execution
      proportional manner). If the offset reaches the last mapped
      address of the mm it starts over at the first address.
      
      The per-task nature of the working set sampling functionality in this tree
      allows such constant rate, per task, execution-weight proportional sampling
      of the working set, with an adaptive sampling interval/frequency that
      goes from once per 100ms up to just once per 8 seconds.  The current
      sampling volume is 256 MB per interval.
      
      As tasks mature and converge their working set, so does the
      sampling rate slow down to just a trickle, 256 MB per 8
      seconds of CPU time executed.
      
      This, beyond being adaptive, also rate-limits rarely
      executing tasks and does not over-sample on overloaded
      systems.
      
      [ In AutoNUMA speak, this patch deals with the effective sampling
        rate of the 'hinting page fault'. AutoNUMA's scanning is
        currently rate-limited, but it is also fundamentally
        single-threaded, executing in the knuma_scand kernel thread,
        so the limit in AutoNUMA is global and does not scale up with
        the number of CPUs, nor does it scan tasks in an execution
        proportional manner.
      
        So the idea of rate-limiting the scanning was first implemented
        in the AutoNUMA tree via a global rate limit. This patch goes
        beyond that by implementing an execution rate proportional
        working set sampling rate that is not implemented via a single
        global scanning daemon. ]
      
      [ Dan Carpenter pointed out a possible NULL pointer dereference in the
        first version of this patch. ]
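      
      A self-contained C model of the rotating-offset scan described above;
      struct mm_scan_state, scan_one_pass() and the page counts are invented
      for the sketch and are not the kernel's data structures:
      
        #include <stdio.h>
        
        /* Illustrative per-mm scan state: a rotating offset plus a fixed
         * per-pass volume, so each pass marks a constant amount and the
         * next pass resumes where the previous one stopped. */
        struct mm_scan_state {
            unsigned long scan_offset;  /* current position in the address space */
            unsigned long mm_size;      /* highest mapped offset (pages)         */
            unsigned long chunk_pages;  /* constant volume marked per pass       */
        };
        
        static void scan_one_pass(struct mm_scan_state *s)
        {
            unsigned long start = s->scan_offset;
            unsigned long end = start + s->chunk_pages;
        
            if (start >= s->mm_size) {      /* wrapped: start over at 0 */
                start = 0;
                end = s->chunk_pages;
            }
            if (end > s->mm_size)
                end = s->mm_size;
        
            printf("mark pages [%lu, %lu) for NUMA hinting faults\n", start, end);
            s->scan_offset = end;
        }
        
        int main(void)
        {
            struct mm_scan_state s = { .scan_offset = 0, .mm_size = 10,
                                       .chunk_pages = 4 };
        
            for (int i = 0; i < 4; i++)   /* passes: [0,4) [4,8) [8,10) then wrap */
                scan_one_pass(&s);
            return 0;
        }
      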
      Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
      Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote changelog and fixed bug. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      6e5fb223
    • mm: numa: Add fault driven placement and migration · cbee9f88
      Peter Zijlstra authored
      NOTE: This patch is based on "sched, numa, mm: Add fault driven
      	placement and migration policy" but as it throws away all the policy
      	to just leave a basic foundation I had to drop the signed-offs-by.
      
      This patch creates a bare-bones method for setting PTEs pte_numa in the
      context of the scheduler; when such a PTE is faulted later, the page will
      be faulted onto the node the CPU is running on. In itself this does
      nothing useful, but any placement policy will fundamentally depend on
      receiving placement hints from fault context and doing something
      intelligent about them.
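      
      As an illustration (not the kernel implementation), a toy C model of the
      foundation: the scanner arms an entry so the next access faults, and the
      fault handler records the node of the faulting CPU as a placement hint.
      The types, mark_pte_numa() and handle_numa_hinting_fault() are invented,
      and the pte_numa/prot_none mechanics are only loosely modelled:
      
        #include <stdbool.h>
        #include <stdio.h>
        
        /* Toy "PTE": a real pte_numa entry is a present PTE with its
         * protections stripped so that the next access traps. */
        struct toy_pte {
            bool numa_hint;   /* armed by the scanner               */
            int  last_node;   /* node observed at the hinting fault */
        };
        
        /* Scanner side: arm the entry so the next touch faults. */
        static void mark_pte_numa(struct toy_pte *pte)
        {
            pte->numa_hint = true;
        }
        
        /* Fault side: clear the hint and note where the task was running.
         * A placement policy built on top would act on this information. */
        static void handle_numa_hinting_fault(struct toy_pte *pte, int faulting_node)
        {
            if (!pte->numa_hint)
                return;
            pte->numa_hint = false;
            pte->last_node = faulting_node;
            printf("hinting fault: page touched from node %d\n", faulting_node);
        }
        
        int main(void)
        {
            struct toy_pte pte = { .numa_hint = false, .last_node = -1 };
        
            mark_pte_numa(&pte);
            handle_numa_hinting_fault(&pte, 1);
            return 0;
        }
      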
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      cbee9f88
  2. 17 September 2012 (1 commit)
  3. 13 September 2012 (2 commits)
  4. 04 September 2012 (2 commits)
  5. 14 August 2012 (4 commits)
  6. 31 July 2012 (1 commit)
  7. 24 July 2012 (4 commits)
  8. 09 June 2012 (1 commit)
    • sched/fair: fix lots of kernel-doc warnings · cd96891d
      Randy Dunlap authored
      Fix lots of new kernel-doc warnings in kernel/sched/fair.c:
      
        Warning(kernel/sched/fair.c:3625): No description found for parameter 'env'
        Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats'
        Warning(kernel/sched/fair.c:3735): No description found for parameter 'env'
        Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest'
        Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest'
        .. more warnings
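      
      The underlying fix pattern is keeping each kernel-doc block in sync with
      the current signature: every parameter gets an @name line and lines for
      removed parameters are dropped. An illustrative (not verbatim) example
      with an invented function name:
      
        struct lb_env;
        struct sg_lb_stats;
        
        /**
         * example_update_stats - illustrative kernel-doc block kept in sync
         * @env: the load balancing environment (documents the parameter that
         *       replaced the old 'sd' and 'this_cpu' arguments)
         * @sgs: variable to hold the statistics for this group
         *
         * Matching the @name lines to the actual parameters is what silences
         * the "No description found" / "Excess function parameter" warnings.
         */
        static void example_update_stats(struct lb_env *env, struct sg_lb_stats *sgs);
      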
      Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd96891d
  9. 06 June 2012 (2 commits)
  10. 30 May 2012 (3 commits)
  11. 17 May 2012 (1 commit)
    • sched: Remove stale power aware scheduling remnants and dysfunctional knobs · 8e7fbcbc
      Peter Zijlstra authored
      It's been broken forever (i.e. it's not scheduling in a power
      aware fashion), as reported by Suresh and others sending
      patches, and nobody cares enough to fix it properly ...
      so remove it to make space free for something better.
      
      There are various problems with the code as it stands today, first
      and foremost the user interface which is bound to topology
      levels and has multiple values per level. This results in a
      state explosion which the administrator or distro needs to
      master and almost nobody does.
      
      Furthermore, large configuration state spaces aren't good: it
      means the thing doesn't just work right, because it's either
      under so many impossible-to-meet constraints, or even if
      there's an achievable state, workloads have to be aware of
      it precisely and can never meet it for dynamic workloads.
      
      So pushing this kind of decision to user-space was a bad idea
      even with a single knob - it's exponentially worse with knobs
      on every node of the topology.
      
      There is a proposal to replace the user interface with a single
      3 state knob:
      
       sched_balance_policy := { performance, power, auto }
      
      where 'auto' would be the preferred default, looking at things
      like Battery/AC mode and possible cpufreq state or whatever the hw
      exposes to show us power use expectations - but there has been no
      progress on it for many months.
      
      Aside from that, the actual implementation of the various knobs
      is known to be broken. There have been sporadic attempts at
      fixing things, but these always stop short of reaching a
      mergeable state.
      
      Therefore remove it wholesale, in the hope of spurring people
      who care to come forward once again and work on a coherent
      replacement.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8e7fbcbc
  12. 14 May 2012 (3 commits)
    • sched/fair: Improve the ->group_imb logic · e44bc5c5
      Peter Zijlstra authored
      Group imbalance is meant to deal with situations where affinity masks
      and sched domains don't align well, such as 3 cpus from one group and
      6 from another. In this case the domain based balancer will want to
      put an equal number of tasks on each side even though they don't have
      an equal number of cpus.
      
      Currently group_imb is set whenever two cpus of a group have a weight
      difference of at least one avg task and the heaviest cpu has at least
      two tasks. A group with imbalance set will always be picked as busiest
      and a balance pass will be forced.
      
      The problem is that even without affinity masks this logic can
      trigger and cause weird balancing decisions. E.g. the observed
      behaviour was that with 6 cpus, 5 had 2 tasks and 1 had 3 tasks;
      due to the difference of 1 avg load (they all had the same weight)
      and nr_running being >1, the group_imbalance logic triggered and
      did the weird thing of pulling more load instead of trying to move
      the 1 excess task to the other domain of 6 cpus, which had 5 cpus
      with 2 tasks and 1 cpu with 1 task.
      
      Curb the group_imbalance logic by weakening the nr_running
      condition: also track min_nr_running and use the difference in
      nr_running over the set instead of the absolute max nr_running.
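      
      A user-space C sketch of the weakened condition: flag group imbalance
      from the gap between the busiest and least-busy cpu rather than from the
      absolute max alone. The field names loosely mirror the changelog and are
      not the exact kernel structures:
      
        #include <stdbool.h>
        #include <stdio.h>
        
        struct group_stats {
            unsigned long max_cpu_load;    /* heaviest cpu's load         */
            unsigned long min_cpu_load;    /* lightest cpu's load         */
            unsigned int  max_nr_running;  /* tasks on the busiest cpu    */
            unsigned int  min_nr_running;  /* tasks on the least-busy cpu */
            unsigned long avg_load_per_task;
        };
        
        /* Imbalance now requires both a load gap of at least one average task
         * AND a spread of more than one task between cpus of the group, so a
         * 2/2/2/2/2/3 distribution no longer trips it. */
        static bool group_imbalanced(const struct group_stats *g)
        {
            return (g->max_cpu_load - g->min_cpu_load) >= g->avg_load_per_task &&
                   (g->max_nr_running - g->min_nr_running) > 1;
        }
        
        int main(void)
        {
            struct group_stats even = { 300, 200, 3, 2, 100 };  /* spread of 1 task  */
            struct group_stats skew = { 400, 100, 4, 1, 100 };  /* spread of 3 tasks */
        
            printf("even group imbalanced: %d\n", group_imbalanced(&even));
            printf("skewed group imbalanced: %d\n", group_imbalanced(&skew));
            return 0;
        }
      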
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-9s7dedozxo8kjsb9kqlrukkf@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e44bc5c5
    • sched/nohz: Fix rq->cpu_load[] calculations · 556061b0
      Peter Zijlstra authored
      While investigating why the load-balancer behaved oddly I found that
      the rq->cpu_load[] tables were completely wrong. A bit more digging
      revealed that the updates that got through were missing ticks followed
      by a catchup of 2 ticks.
      
      The catchup assumes the cpu was idle during that time (since only nohz
      can cause missed ticks and the machine is idle, etc.). This means that
      especially the higher indices were significantly lower than they ought
      to be.
      
      The reason for this is that it's not correct to compare against jiffies
      on every jiffy on any other cpu than the cpu that updates jiffies.
      
      This patch kludges around it by only doing the catch-up from
      nohz_idle_balance() and doing the regular update unconditionally
      from the tick.
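      
      For context, a user-space C model of an rq->cpu_load[]-style decay and of
      why a bogus missed-tick catchup hurts: each missed tick is replayed as if
      the cpu had been idle, so the slow-moving higher indices sink below their
      true value. The (old*(2^i-1)+new)/2^i scheme mirrors the kernel's
      per-index decay; everything else here is illustrative:
      
        #include <stdio.h>
        
        #define CPU_LOAD_IDX 5
        
        /* One decay step per tick: index i tracks load with a time
         * constant of 2^i ticks. */
        static void decay_cpu_load(unsigned long load[CPU_LOAD_IDX],
                                   unsigned long this_load, unsigned int ticks)
        {
            for (unsigned int t = 0; t < ticks; t++) {
                load[0] = this_load;
                for (int i = 1; i < CPU_LOAD_IDX; i++) {
                    unsigned long scale = 1UL << i;
        
                    load[i] = (load[i] * (scale - 1) + this_load) / scale;
                }
            }
        }
        
        int main(void)
        {
            unsigned long load[CPU_LOAD_IDX] = { 1024, 1024, 1024, 1024, 1024 };
        
            /* Wrongly treating 2 busy ticks as missed (idle) ticks drags
             * the higher, slower-moving indices below their true value. */
            decay_cpu_load(load, 0, 2);      /* bogus idle catch-up     */
            decay_cpu_load(load, 1024, 1);   /* then one real busy tick */
        
            for (int i = 0; i < CPU_LOAD_IDX; i++)
                printf("cpu_load[%d] = %lu\n", i, load[i]);
            return 0;
        }
      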
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: pjt@google.com
      Cc: Venkatesh Pallipadi <venki@google.com>
      Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      556061b0
    • sched/fair: Revert sched-domain iteration breakage · 04f733b4
      Peter Zijlstra authored
      Patches c22402a2 ("sched/fair: Let minimally loaded cpu balance the
      group") and 0ce90475 ("sched/fair: Add some serialization to the
      sched_domain load-balance walk") are horribly broken so revert them.
      
      The problem is that while it sounds good to have the minimally loaded
      cpu do the pulling of more load, the way we walk the domains means
      there is absolutely no guarantee this cpu will actually get to the
      domain. In fact it's very likely it won't. Therefore the higher up the
      tree we get, the less likely it is we'll balance at all.
      
      The first-of-mask cpu always walks up; while suboptimal in that it
      accumulates load on the first cpu and needs extra passes to spread it
      out, it at least guarantees a cpu gets up that far and load-balancing
      happens at all.
      
      Since it's now always the first cpu, and idle cpus should always be
      able to balance so they get a task as fast as possible, we can also
      do away with the added serialization.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-rpuhs5s56aiv1aw7khv9zkw6@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      04f733b4
  13. 09 May 2012 (4 commits)
  14. 26 April 2012 (1 commit)
  15. 23 March 2012 (1 commit)
  16. 13 March 2012 (2 commits)
  17. 01 March 2012 (3 commits)