1. 11 December 2012 (5 commits)
    • mm: numa: Rate limit setting of pte_numa if node is saturated · e14808b4
      Mel Gorman authored
      If there are a large number of NUMA hinting faults and all of them
      are resulting in migrations it may indicate that memory is just
      bouncing uselessly around. NUMA balancing cost is likely exceeding
      any benefit from locality. Rate limit the PTE updates if the node
      is migration rate-limited. As noted in the comments, this distorts
      the NUMA faulting statistics.
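      
      As an illustration only, a user-space C sketch of the check described above;
      struct node_state, node_is_migrate_ratelimited() and numa_scan_range() are
      invented names for this sketch, not the kernel's API:
      
        #include <stdbool.h>
        #include <stdio.h>
        
        /* Hypothetical per-node migration accounting for this sketch. */
        struct node_state {
            unsigned long migrated_bytes;   /* bytes migrated in this window */
            unsigned long ratelimit_bytes;  /* allowed bytes per window      */
        };
        
        static bool node_is_migrate_ratelimited(const struct node_state *ns)
        {
            return ns->migrated_bytes >= ns->ratelimit_bytes;
        }
        
        /* Skip marking PTEs pte_numa while the target node is saturated.
         * Skipping also suppresses the hinting faults that feed the
         * statistics, which is the distortion the changelog mentions. */
        static void numa_scan_range(const struct node_state *ns, unsigned long pages)
        {
            if (node_is_migrate_ratelimited(ns)) {
                printf("node saturated: skipping %lu pages\n", pages);
                return;
            }
            printf("marking %lu pages pte_numa\n", pages);
        }
        
        int main(void)
        {
            struct node_state ns = { .migrated_bytes = 64UL << 20,
                                     .ratelimit_bytes = 32UL << 20 };
            numa_scan_range(&ns, 256);
            return 0;
        }
      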
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      e14808b4
    • mm: sched: numa: Implement slow start for working set sampling · 4b96a29b
      Peter Zijlstra authored
      Add a 1 second delay before starting to scan the working set of
      a task and starting to balance it amongst nodes.
      
      [ note that before the constant per task WSS sampling rate patch
        the initial scan would happen much later still, in effect that
        patch caused this regression. ]
      
      The theory is that short-run tasks benefit very little from NUMA
      placement: they come and go, and they better stick to the node
      they were started on. As tasks mature and rebalance to other CPUs
      and nodes, their NUMA placement has to change and it starts to
      matter more and more.
      
      In practice this change fixes an observable kbuild regression:
      
         # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]
      
         !NUMA:
         45.291088843 seconds time elapsed                                          ( +-  0.40% )
         45.154231752 seconds time elapsed                                          ( +-  0.36% )
      
         +NUMA, no slow start:
         46.172308123 seconds time elapsed                                          ( +-  0.30% )
         46.343168745 seconds time elapsed                                          ( +-  0.25% )
      
         +NUMA, 1 sec slow start:
         45.224189155 seconds time elapsed                                          ( +-  0.25% )
         45.160866532 seconds time elapsed                                          ( +-  0.17% )
      
      and it also fixes an observable perf bench (hackbench) regression:
      
         # perf stat --null --repeat 10 perf bench sched messaging
      
         -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
         +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )
         +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )
      
      The implementation is simple and straightforward; most of the patch
      deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms
      tunable knob.
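      
      As a toy illustration of the slow start, a user-space C sketch; struct
      task_info and task_numa_work_allowed() are made-up names, and only the
      "no scanning before the task is scan_delay_ms old" check is modelled:
      
        #include <stdbool.h>
        #include <stdio.h>
        
        /* Hypothetical per-task state, times in milliseconds. */
        struct task_info {
            unsigned long start_ms;       /* when the task started running      */
            unsigned long scan_delay_ms;  /* numa_balancing_scan_delay_ms value */
        };
        
        /* Working-set scanning is allowed only once the delay has expired,
         * so short-lived tasks never pay the NUMA balancing overhead. */
        static bool task_numa_work_allowed(const struct task_info *t,
                                           unsigned long now_ms)
        {
            return now_ms - t->start_ms >= t->scan_delay_ms;
        }
        
        int main(void)
        {
            struct task_info t = { .start_ms = 0, .scan_delay_ms = 1000 };
            printf("at  200ms: %s\n", task_numa_work_allowed(&t, 200) ? "scan" : "skip");
            printf("at 1500ms: %s\n", task_numa_work_allowed(&t, 1500) ? "scan" : "skip");
            return 0;
        }
      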
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote the changelog, ran measurements, tuned the default. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      4b96a29b
    • sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges · 9f40604c
      Mel Gorman authored
      By accounting the scan against present PTEs rather than virtual
      address ranges, the scanning speed reflects the amount of actually
      present (mapped) memory.
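      
      To illustrate the accounting change, a small user-space C sketch that
      charges the scan budget only for present pages rather than for the
      virtual range walked; the array-based "page table" and scan_pass() are
      made up for the example:
      
        #include <stdio.h>
        
        #define SCAN_BUDGET_PAGES 4 /* pages charged per scan pass (example value) */
        
        /* Fake one-flag-per-page "page table": 1 = PTE present, 0 = hole. */
        static const int present[] = { 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0 };
        
        /* Walk virtual pages from *pos, but only count present PTEs against
         * the budget, so sparse address spaces are not charged for holes. */
        static void scan_pass(unsigned long *pos)
        {
            unsigned long charged = 0;
        
            while (*pos < sizeof(present) / sizeof(present[0]) &&
                   charged < SCAN_BUDGET_PAGES) {
                if (present[*pos]) {
                    printf("scan page %lu\n", *pos);
                    charged++;            /* present PTE: counts against budget */
                }
                (*pos)++;                 /* holes are skipped for free */
            }
        }
        
        int main(void)
        {
            unsigned long pos = 0;
        
            scan_pass(&pos);
            printf("next pass resumes at page %lu\n", pos);
            return 0;
        }
      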
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      9f40604c
    • mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate · 6e5fb223
      Peter Zijlstra authored
      Previously, to probe the working set of a task, we'd use
      a very simple and crude method: mark all of its address
      space PROT_NONE.
      
      That method has various (obvious) disadvantages:
      
       - it samples the working set at dissimilar rates,
         giving some tasks a sampling quality advantage
         over others.
      
       - creates performance problems for tasks with very
         large working sets
      
       - over-samples processes with large address spaces but
         which only very rarely execute
      
      Improve that method by keeping a rotating offset into the
      address space that marks the current position of the scan,
      and advance it by a constant rate (in a CPU cycles execution
      proportional manner). If the offset reaches the last mapped
      address of the mm it starts over at the first address.
      
      The per-task nature of the working set sampling functionality in this tree
      allows such constant rate, per task, execution-weight proportional sampling
      of the working set, with an adaptive sampling interval/frequency that
      goes from once per 100ms up to just once per 8 seconds.  The current
      sampling volume is 256 MB per interval.
      
      As tasks mature and converge their working set, so does the
      sampling rate slow down to just a trickle, 256 MB per 8
      seconds of CPU time executed.
      
      This, beyond being adaptive, also rate-limits rarely
      executing tasks and does not over-sample on overloaded
      systems.
      
      [ In AutoNUMA speak, this patch deals with the effective sampling
        rate of the 'hinting page fault'. AutoNUMA's scanning is
        currently rate-limited, but it is also fundamentally
        single-threaded, executing in the knuma_scand kernel thread,
        so the limit in AutoNUMA is global and does not scale up with
        the number of CPUs, nor does it scan tasks in an execution
        proportional manner.
      
        So the idea of rate-limiting the scanning was first implemented
        in the AutoNUMA tree via a global rate limit. This patch goes
        beyond that by implementing an execution rate proportional
        working set sampling rate that is not implemented via a single
        global scanning daemon. ]
      
      [ Dan Carpenter pointed out a possible NULL pointer dereference in the
        first version of this patch. ]
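      
      A self-contained C model of the rotating-offset scan described above;
      struct mm_scan_state, scan_one_pass() and the page counts are invented
      for the sketch and are not the kernel's data structures:
      
        #include <stdio.h>
        
        /* Illustrative per-mm scan state: a rotating offset plus a fixed
         * per-pass volume, so each pass marks a constant amount and the
         * next pass resumes where the previous one stopped. */
        struct mm_scan_state {
            unsigned long scan_offset;  /* current position in the address space */
            unsigned long mm_size;      /* highest mapped offset (pages)         */
            unsigned long chunk_pages;  /* constant volume marked per pass       */
        };
        
        static void scan_one_pass(struct mm_scan_state *s)
        {
            unsigned long start = s->scan_offset;
            unsigned long end = start + s->chunk_pages;
        
            if (start >= s->mm_size) {      /* wrapped: start over at 0 */
                start = 0;
                end = s->chunk_pages;
            }
            if (end > s->mm_size)
                end = s->mm_size;
        
            printf("mark pages [%lu, %lu) for NUMA hinting faults\n", start, end);
            s->scan_offset = end;
        }
        
        int main(void)
        {
            struct mm_scan_state s = { .scan_offset = 0, .mm_size = 10,
                                       .chunk_pages = 4 };
        
            for (int i = 0; i < 4; i++)   /* passes: [0,4) [4,8) [8,10) then wrap */
                scan_one_pass(&s);
            return 0;
        }
      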
      Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
      Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote changelog and fixed bug. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      6e5fb223
    • mm: numa: Add fault driven placement and migration · cbee9f88
      Peter Zijlstra authored
      NOTE: This patch is based on "sched, numa, mm: Add fault driven
      	placement and migration policy" but as it throws away all the policy
      	to just leave a basic foundation I had to drop the signed-offs-by.
      
      This patch creates a bare-bones method for setting PTEs pte_numa in the
      context of the scheduler; when such a PTE is faulted later, the page will
      be faulted onto the node the CPU is running on. In itself this does
      nothing useful, but any placement policy will fundamentally depend on
      receiving placement hints from fault context and doing something
      intelligent about them.
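      
      As an illustration (not the kernel implementation), a toy C model of the
      foundation: the scanner arms an entry so the next access faults, and the
      fault handler records the node of the faulting CPU as a placement hint.
      The types, mark_pte_numa() and handle_numa_hinting_fault() are invented,
      and the pte_numa/prot_none mechanics are only loosely modelled:
      
        #include <stdbool.h>
        #include <stdio.h>
        
        /* Toy "PTE": a real pte_numa entry is a present PTE with its
         * protections stripped so that the next access traps. */
        struct toy_pte {
            bool numa_hint;   /* armed by the scanner               */
            int  last_node;   /* node observed at the hinting fault */
        };
        
        /* Scanner side: arm the entry so the next touch faults. */
        static void mark_pte_numa(struct toy_pte *pte)
        {
            pte->numa_hint = true;
        }
        
        /* Fault side: clear the hint and note where the task was running.
         * A placement policy built on top would act on this information. */
        static void handle_numa_hinting_fault(struct toy_pte *pte, int faulting_node)
        {
            if (!pte->numa_hint)
                return;
            pte->numa_hint = false;
            pte->last_node = faulting_node;
            printf("hinting fault: page touched from node %d\n", faulting_node);
        }
        
        int main(void)
        {
            struct toy_pte pte = { .numa_hint = false, .last_node = -1 };
        
            mark_pte_numa(&pte);
            handle_numa_hinting_fault(&pte, 1);
            return 0;
        }
      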
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      cbee9f88
  2. 17 September 2012 (1 commit)
  3. 13 September 2012 (2 commits)
  4. 04 September 2012 (2 commits)
  5. 14 August 2012 (4 commits)
  6. 31 July 2012 (1 commit)
  7. 24 July 2012 (4 commits)
  8. 09 June 2012 (1 commit)
    • sched/fair: fix lots of kernel-doc warnings · cd96891d
      Randy Dunlap authored
      Fix lots of new kernel-doc warnings in kernel/sched/fair.c:
      
        Warning(kernel/sched/fair.c:3625): No description found for parameter 'env'
        Warning(kernel/sched/fair.c:3625): Excess function parameter 'sd' description in 'update_sg_lb_stats'
        Warning(kernel/sched/fair.c:3735): No description found for parameter 'env'
        Warning(kernel/sched/fair.c:3735): Excess function parameter 'sd' description in 'update_sd_pick_busiest'
        Warning(kernel/sched/fair.c:3735): Excess function parameter 'this_cpu' description in 'update_sd_pick_busiest'
        .. more warnings
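      
      The underlying fix pattern is keeping each kernel-doc block in sync with
      the current signature: every parameter gets an @name line and lines for
      removed parameters are dropped. An illustrative (not verbatim) example
      with an invented function name:
      
        struct lb_env;
        struct sg_lb_stats;
        
        /**
         * example_update_stats - illustrative kernel-doc block kept in sync
         * @env: the load balancing environment (documents the parameter that
         *       replaced the old 'sd' and 'this_cpu' arguments)
         * @sgs: variable to hold the statistics for this group
         *
         * Matching the @name lines to the actual parameters is what silences
         * the "No description found" / "Excess function parameter" warnings.
         */
        static void example_update_stats(struct lb_env *env, struct sg_lb_stats *sgs);
      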
      Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd96891d
  9. 06 June 2012 (2 commits)
  10. 30 May 2012 (3 commits)
  11. 17 May 2012 (1 commit)
    • sched: Remove stale power aware scheduling remnants and dysfunctional knobs · 8e7fbcbc
      Peter Zijlstra authored
      It's been broken forever (i.e. it's not scheduling in a power
      aware fashion), as reported by Suresh and others sending
      patches, and nobody cares enough to fix it properly ...
      so remove it to make space free for something better.
      
      There are various problems with the code as it stands today, first
      and foremost the user interface which is bound to topology
      levels and has multiple values per level. This results in a
      state explosion which the administrator or distro needs to
      master and almost nobody does.
      
      Furthermore, large configuration state spaces aren't good: it
      means the thing doesn't just work right, because it's either
      under so many impossible-to-meet constraints, or even if
      there's an achievable state, workloads have to be aware of
      it precisely and can never meet it for dynamic workloads.
      
      So pushing this kind of decision to user-space was a bad idea
      even with a single knob - it's exponentially worse with knobs
      on every node of the topology.
      
      There is a proposal to replace the user interface with a single
      3 state knob:
      
       sched_balance_policy := { performance, power, auto }
      
      where 'auto' would be the preferred default, looking at things
      like Battery/AC mode and possible cpufreq state or whatever the hw
      exposes to show us power use expectations - but there has been no
      progress on it for many months.
      
      Aside from that, the actual implementation of the various knobs
      is known to be broken. There have been sporadic attempts at
      fixing things, but these always stop short of reaching a
      mergeable state.
      
      Therefore remove it wholesale, in the hope of spurring people
      who care to come forward once again and work on a coherent
      replacement.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8e7fbcbc
  12. 14 May 2012 (3 commits)
    • sched/fair: Improve the ->group_imb logic · e44bc5c5
      Peter Zijlstra authored
      Group imbalance is meant to deal with situations where affinity masks
      and sched domains don't align well, such as 3 cpus from one group and
      6 from another. In this case the domain based balancer will want to
      put an equal number of tasks on each side even though they don't have
      an equal number of cpus.
      
      Currently group_imb is set whenever two cpus of a group have a weight
      difference of at least one avg task and the heaviest cpu has at least
      two tasks. A group with imbalance set will always be picked as busiest
      and a balance pass will be forced.
      
      The problem is that even without affinity masks this logic can
      trigger and cause weird balancing decisions. E.g. the observed
      behaviour was that with 6 cpus, 5 had 2 tasks and 1 had 3 tasks;
      due to the difference of 1 avg load (they all had the same weight)
      and nr_running being >1, the group_imbalance logic triggered and
      did the weird thing of pulling more load instead of trying to move
      the 1 excess task to the other domain of 6 cpus, which had 5 cpus
      with 2 tasks and 1 cpu with 1 task.
      
      Curb the group_imbalance logic by weakening the nr_running
      condition: also track min_nr_running and use the difference in
      nr_running over the set instead of the absolute max nr_running.
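      
      A user-space C sketch of the weakened condition: flag group imbalance
      from the gap between the busiest and least-busy cpu rather than from the
      absolute max alone. The field names loosely mirror the changelog and are
      not the exact kernel structures:
      
        #include <stdbool.h>
        #include <stdio.h>
        
        struct group_stats {
            unsigned long max_cpu_load;    /* heaviest cpu's load         */
            unsigned long min_cpu_load;    /* lightest cpu's load         */
            unsigned int  max_nr_running;  /* tasks on the busiest cpu    */
            unsigned int  min_nr_running;  /* tasks on the least-busy cpu */
            unsigned long avg_load_per_task;
        };
        
        /* Imbalance now requires both a load gap of at least one average task
         * AND a spread of more than one task between cpus of the group, so a
         * 2/2/2/2/2/3 distribution no longer trips it. */
        static bool group_imbalanced(const struct group_stats *g)
        {
            return (g->max_cpu_load - g->min_cpu_load) >= g->avg_load_per_task &&
                   (g->max_nr_running - g->min_nr_running) > 1;
        }
        
        int main(void)
        {
            struct group_stats even = { 300, 200, 3, 2, 100 };  /* spread of 1 task  */
            struct group_stats skew = { 400, 100, 4, 1, 100 };  /* spread of 3 tasks */
        
            printf("even group imbalanced: %d\n", group_imbalanced(&even));
            printf("skewed group imbalanced: %d\n", group_imbalanced(&skew));
            return 0;
        }
      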
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-9s7dedozxo8kjsb9kqlrukkf@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e44bc5c5
    • sched/nohz: Fix rq->cpu_load[] calculations · 556061b0
      Peter Zijlstra authored
      While investigating why the load-balancer behaved oddly I found that
      the rq->cpu_load[] tables were completely wrong. A bit more digging
      revealed that the updates that got through were missing ticks followed
      by a catchup of 2 ticks.
      
      The catchup assumes the cpu was idle during that time (since only nohz
      can cause missed ticks and the machine is idle, etc.). This means that
      especially the higher indices were significantly lower than they ought
      to be.
      
      The reason for this is that it's not correct to compare against jiffies
      on every jiffy on any other cpu than the cpu that updates jiffies.
      
      This patch kludges around it by only doing the catch-up from
      nohz_idle_balance() and doing the regular update unconditionally
      from the tick.
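      
      For context, a user-space C model of an rq->cpu_load[]-style decay and of
      why a bogus missed-tick catchup hurts: each missed tick is replayed as if
      the cpu had been idle, so the slow-moving higher indices sink below their
      true value. The (old*(2^i-1)+new)/2^i scheme mirrors the kernel's
      per-index decay; everything else here is illustrative:
      
        #include <stdio.h>
        
        #define CPU_LOAD_IDX 5
        
        /* One decay step per tick: index i tracks load with a time
         * constant of 2^i ticks. */
        static void decay_cpu_load(unsigned long load[CPU_LOAD_IDX],
                                   unsigned long this_load, unsigned int ticks)
        {
            for (unsigned int t = 0; t < ticks; t++) {
                load[0] = this_load;
                for (int i = 1; i < CPU_LOAD_IDX; i++) {
                    unsigned long scale = 1UL << i;
        
                    load[i] = (load[i] * (scale - 1) + this_load) / scale;
                }
            }
        }
        
        int main(void)
        {
            unsigned long load[CPU_LOAD_IDX] = { 1024, 1024, 1024, 1024, 1024 };
        
            /* Wrongly treating 2 busy ticks as missed (idle) ticks drags
             * the higher, slower-moving indices below their true value. */
            decay_cpu_load(load, 0, 2);      /* bogus idle catch-up     */
            decay_cpu_load(load, 1024, 1);   /* then one real busy tick */
        
            for (int i = 0; i < CPU_LOAD_IDX; i++)
                printf("cpu_load[%d] = %lu\n", i, load[i]);
            return 0;
        }
      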
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: pjt@google.com
      Cc: Venkatesh Pallipadi <venki@google.com>
      Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      556061b0
    • sched/fair: Revert sched-domain iteration breakage · 04f733b4
      Peter Zijlstra authored
      Patches c22402a2 ("sched/fair: Let minimally loaded cpu balance the
      group") and 0ce90475 ("sched/fair: Add some serialization to the
      sched_domain load-balance walk") are horribly broken so revert them.
      
      The problem is that while it sounds good to have the minimally loaded
      cpu do the pulling of more load, the way we walk the domains means
      there is absolutely no guarantee this cpu will actually get to the
      domain. In fact it's very likely it won't. Therefore the higher up the
      tree we get, the less likely it is we'll balance at all.
      
      The first-of-mask cpu always walks up; while suboptimal in that it
      accumulates load on the first cpu and needs extra passes to spread it
      out, it at least guarantees a cpu gets up that far and load-balancing
      happens at all.
      
      Since it's now always the first cpu, and idle cpus should always be
      able to balance so they get a task as fast as possible, we can also
      do away with the added serialization.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/n/tip-rpuhs5s56aiv1aw7khv9zkw6@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      04f733b4
  13. 09 May 2012 (4 commits)
  14. 26 April 2012 (1 commit)
  15. 23 March 2012 (1 commit)
  16. 13 March 2012 (2 commits)
  17. 01 March 2012 (3 commits)