  1. 17 May 2012, 1 commit
    •
      sched: Remove stale power aware scheduling remnants and dysfunctional knobs · 8e7fbcbc
      Authored by Peter Zijlstra
      It's been broken forever (i.e. it's not scheduling in a power
      aware fashion), as reported by Suresh and others sending
      patches, and nobody cares enough to fix it properly ...
      so remove it to make space free for something better.
      
      There are various problems with the code as it stands today, first
      and foremost the user interface which is bound to topology
      levels and has multiple values per level. This results in a
      state explosion which the administrator or distro needs to
      master and almost nobody does.
      
      Furthermore, large configuration state spaces aren't good: it means
      the thing doesn't just work right, because it's either under so many
      impossible-to-meet constraints, or, even if there is an achievable
      state, workloads have to be aware of it precisely and can never meet
      it for dynamic workloads.
      
      So pushing this kind of decision to user-space was a bad idea
      even with a single knob - it's exponentially worse with knobs
      on every node of the topology.
      
      There is a proposal to replace the user interface with a single
      3 state knob:
      
       sched_balance_policy := { performance, power, auto }
      
      where 'auto' would be the preferred default which looks at things
      like Battery/AC mode and possible cpufreq state or whatever the hw
      exposes to show us power use expectations - but there's been no
      progress on it in the past many months.
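
      As a purely illustrative sketch (the enum and its names are
      hypothetical; no such knob was merged), the proposed three-state
      policy could be modeled as:

       /* Hypothetical model of the proposed knob -- an assumption, not kernel code. */
       enum sched_balance_policy {
               SCHED_BALANCE_PERFORMANCE,      /* always spread for throughput */
               SCHED_BALANCE_POWER,            /* pack tasks to keep packages idle */
               SCHED_BALANCE_AUTO,             /* decide from AC/battery, cpufreq, hw hints */
       };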
      
      Aside from that, the actual implementation of the various knobs
      is known to be broken. There have been sporadic attempts at
      fixing things but these always stop short of reaching a mergeable
      state.
      
      Therefore, remove it wholesale, in the hope of spurring the people
      who care to come forward once again and work on a coherent
      replacement.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8e7fbcbc
  2. 09 May 2012, 1 commit
    •
      sched/numa: Rewrite the CONFIG_NUMA sched domain support · cb83b629
      Authored by Peter Zijlstra
      The current code groups up to 16 nodes in a level and then puts an
      ALLNODES domain spanning the entire tree on top of that. This doesn't
      reflect the NUMA topology, and especially for the smaller,
      not-fully-connected machines out there today this might make a
      difference.
      
      Therefore, build a proper numa topology based on node_distance().
      
      Since there are no fixed NUMA layers anymore, the static SD_NODE_INIT
      and SD_ALLNODES_INIT are no longer usable; the new code tries to
      construct something similar and scales some values based on the
      number of CPUs in the domain and/or the node_distance() ratio.
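
      As a minimal sketch of the idea (hypothetical helper name; the real
      construction lives in the scheduler's topology setup code), each
      distinct node_distance() value can define one NUMA topology level:

       /* Hedged sketch: collect the distinct node distances; each distinct
        * value becomes a scheduling-domain level.  Simplified, illustrative. */
       static int collect_numa_levels(int *levels, int nr_nodes)
       {
               int i, j, k, nr_levels = 0;

               for (i = 0; i < nr_nodes; i++) {
                       for (j = 0; j < nr_nodes; j++) {
                               int d = node_distance(i, j);

                               for (k = 0; k < nr_levels; k++)
                                       if (levels[k] == d)
                                               break;
                               if (k == nr_levels)
                                       levels[nr_levels++] = d;
                       }
               }
               return nr_levels;
       }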
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: linux-alpha@vger.kernel.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mips@linux-mips.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-sh@vger.kernel.org
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: sparclinux@vger.kernel.org
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86@kernel.org
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Greg Pearson <greg.pearson@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: bob.picco@oracle.com
      Cc: chris.mason@oracle.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-r74n3n8hhuc2ynbrnp3vt954@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cb83b629
  3. 20 Sep 2011, 1 commit
  4. 16 Jun 2011, 1 commit
    •
      mm: increase RECLAIM_DISTANCE to 30 · 32e45ff4
      Authored by KOSAKI Motohiro
      Recently, Robert Mueller reported (http://lkml.org/lkml/2010/9/12/236)
      that zone_reclaim_mode doesn't work properly on his new NUMA server (Dual
      Xeon E5520 + Intel S5520UR MB).  He is using Cyrus IMAPd and it's built on
      a very traditional single-process model.
      
        * a master process which reads config files and manages the other
          processes
        * multiple imapd processes, one per connection
        * multiple pop3d processes, one per connection
        * multiple lmtpd processes, one per connection
        * periodic "cleanup" processes.
      
      There are thousands of independent processes.  The problem is that
      recent Intel motherboards turn on zone_reclaim_mode by default, and
      traditional prefork-model software doesn't work well with it.
      Unfortunately, such models are still typical even in the 21st
      century.  We can't ignore them.
      
      This patch raises the zone reclaim threshold (RECLAIM_DISTANCE) to 30.
      The value 30 doesn't have any specific meaning, but 20 corresponds to
      one-hop QPI/HyperTransport distances, and such relatively cheap 2-4
      socket machines are often used for traditional servers as above.  The
      intention is that these machines don't use zone_reclaim_mode.
      
      Note: ia64 and Power have arch specific RECLAIM_DISTANCE definitions.
      This patch doesn't change such high-end NUMA machine behavior.
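
      For reference, a simplified sketch of the mechanism being tuned (the
      real check lives in the zonelist-building code; the helper below is
      illustrative only):

       #ifndef RECLAIM_DISTANCE
       #define RECLAIM_DISTANCE 30     /* raised from 20 by this patch */
       #endif

       /* Zone reclaim is only preferred when the remote node is "far"
        * away according to the firmware distance table. */
       static inline bool want_zone_reclaim(int local_node, int remote_node)
       {
               return node_distance(local_node, remote_node) > RECLAIM_DISTANCE;
       }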
      
      Dave Hansen said:
      
      : I know specifically of pieces of x86 hardware that set the information
      : in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
      : behavior which that implies.
      :
      : They've done performance testing and run very large and scary benchmarks
      : to make sure that they _want_ this turned on.  What this means for them
      : is that they'll probably be de-optimized, at least on newer versions of
      : the kernel.
      :
      : If you want to do this for particular systems, maybe _that_'s what we
      : should do.  Have a list of specific configurations that need the
      : defaults overridden either because they're buggy, or they have an
      : unusual hardware configuration not really reflected in the distance
      : table.
      
      And later said:
      
      : The original change in the hardware tables was for the benefit of a
      : benchmark.  Said benchmark isn't going to get run on mainline until the
      : next batch of enterprise distros drops, at which point the hardware where
      : this was done will be irrelevant for the benchmark.  I'm sure any new
      : hardware will just set this distance to another yet arbitrary value to
      : make the kernel do what it wants.  :)
      :
      : Also, when the hardware got _set_ to this initially, I complained.  So, I
      : guess I'm getting my way now, with this patch.  I'm cool with it.
      Reported-by: Robert Mueller <robm@fastmail.fm>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Acked-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32e45ff4
  5. 10 Sep 2010, 1 commit
  6. 10 Aug 2010, 1 commit
  7. 29 Jun 2010, 1 commit
  8. 09 Jun 2010, 1 commit
    •
      sched: Add asymmetric group packing option for sibling domain · 532cb4c4
      Authored by Michael Neuling
      Check whether the group is packed in a sched domain.

      This is primarily intended to be used at the sibling level.  Some cores
      like POWER7 prefer to use lower numbered SMT threads.  In the case of
      POWER7, it can move to lower SMT modes only when higher threads are
      idle.  When in lower SMT modes, the threads will perform better since
      they share less core resources.  Hence when we have idle threads, we
      want them to be the higher ones.
      
      This adds a hook into f_b_g() called check_asym_packing() to check the
      packing.  This packing function is run on idle threads.  It checks to
      see if the busiest CPU in this domain (core in the P7 case) has a
      higher CPU number than the CPU the packing function is being run
      on.  If it does, calculate the imbalance and return the higher, busier
      thread as the busiest group to f_b_g().  Here we are assuming a lower
      CPU number will be equivalent to a lower SMT thread number.
      
      It also creates a new SD_ASYM_PACKING flag to enable this feature at
      any scheduler domain level.
      
      It also creates an arch hook to enable this feature at the sibling
      level.  The default function doesn't enable this feature.
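
      A hedged sketch of the check described above (simplified; not the
      exact fair-scheduler implementation):

       /* With SD_ASYM_PACKING, prefer to pull work onto lower-numbered
        * CPUs (lower SMT threads on POWER7). */
       static int asym_packing_wants_pull(struct sched_domain *sd,
                                          int busiest_cpu, int this_cpu)
       {
               if (!(sd->flags & SD_ASYM_PACKING))
                       return 0;

               /* Only act if we are the lower (preferred) CPU number. */
               if (busiest_cpu <= this_cpu)
                       return 0;

               return 1;       /* report that group as busiest and compute the imbalance */
       }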
      
      Based heavily on patch from Peter Zijlstra.
      Fixes from Srivatsa Vaddagiri.
      Signed-off-by: Michael Neuling <mikey@neuling.org>
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <20100608045702.2936CCC897@localhost.localdomain>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      532cb4c4
  9. 28 May 2010, 2 commits
    •
      numa: introduce numa_mem_id()- effective local memory node id · 7aac7898
      Authored by Lee Schermerhorn
      Introduce numa_mem_id(), based on generic percpu variable infrastructure
      to track "nearest node with memory" for archs that support memoryless
      nodes.
      
      Define the API in <linux/topology.h> when CONFIG_HAVE_MEMORYLESS_NODES
      is defined, else provide stubs.  Architectures will define
      HAVE_MEMORYLESS_NODES if/when they support them.
      
      Archs can override definitions of:
      
      numa_mem_id() - returns node number of "local memory" node
      set_numa_mem() - initialize [this cpu's] per cpu variable 'numa_mem'
      cpu_to_mem()  - return numa_mem for specified cpu; may be used as lvalue
      
      Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
      This will initialize the boot cpu at boot time, and all cpus on change of
      numa_zonelist_order, or when node or memory hot-plug requires zonelist
      rebuild.  Archs that support memoryless nodes will need to initialize
      'numa_mem' for secondary cpus as they're brought on-line.
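
      A hedged usage sketch of the new API (the caller below is
      illustrative; kmalloc_node() is just one possible consumer):

       #include <linux/topology.h>
       #include <linux/slab.h>

       /* Allocate near the current CPU; on memoryless-node systems this
        * resolves to the nearest node that actually has memory. */
       static void *alloc_near_here(size_t size)
       {
               int nid = numa_mem_id();        /* "local memory" node */

               return kmalloc_node(size, GFP_KERNEL, nid);
       }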
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7aac7898
    •
      numa: add generic percpu var numa_node_id() implementation · 72812019
      Authored by Lee Schermerhorn
      Rework the generic version of the numa_node_id() function to use the new
      generic percpu variable infrastructure.
      
      Guard the new implementation with a new config option:
      
              CONFIG_USE_PERCPU_NUMA_NODE_ID.
      
      Archs which support this new implementation will default this option to 'y'
      when NUMA is configured.  This config option could be removed if/when all
      archs switch over to the generic percpu implementation of numa_node_id().
      Arch support involves:
      
        1) converting any existing per cpu variable implementations to use
           this implementation.  x86_64 is an instance of such an arch.
        2) archs that don't use a per cpu variable for numa_node_id() will
           need to initialize the new per cpu variable "numa_node" as cpus
           are brought on-line.  ia64 is an example.
        3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
           when NUMA is configured.  This is required because I have
           retained the old implementation by default to allow archs to
           be modified incrementally, as desired.
      
      Subsequent patches will convert x86_64 and ia64 to use this implementation.
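
      A simplified sketch of the generic per-cpu flavour guarded by the new
      option (accessor details may differ from the exact patch):

       #ifdef CONFIG_USE_PERCPU_NUMA_NODE_ID
       DECLARE_PER_CPU(int, numa_node);

       static inline int numa_node_id(void)
       {
               return __this_cpu_read(numa_node);
       }

       /* Called by arch code as each CPU is brought on-line. */
       static inline void set_numa_node(int node)
       {
               this_cpu_write(numa_node, node);
       }
       #endif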
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72812019
  10. 21 Jan 2010, 1 commit
  11. 14 Oct 2009, 1 commit
  12. 24 Sep 2009, 1 commit
  13. 16 Sep 2009, 2 commits
    •
      sched: Disable wakeup balancing · 182a85f8
      Authored by Peter Zijlstra
      Sysbench thinks SD_BALANCE_WAKE is too aggressive and kbuild doesn't
      really mind too much; SD_BALANCE_NEWIDLE picks up most of the
      slack.
      
      On a dual socket, quad core, dual thread nehalem system:
      
      sysbench (--num_threads=16):
      
       SD_BALANCE_WAKE-: 13982 tx/s
       SD_BALANCE_WAKE+: 15688 tx/s
      
      kbuild (-j16):
      
       SD_BALANCE_WAKE-: 47.648295846  seconds time elapsed   ( +-   0.312% )
       SD_BALANCE_WAKE+: 47.608607360  seconds time elapsed   ( +-   0.026% )
      
      (same within noise)
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      182a85f8
    •
      sched: Add SD_PREFER_LOCAL · 59abf026
      Authored by Peter Zijlstra
      And turn it on for NUMA and MC domains. This improves
      locality in balancing decisions by keeping up to a
      capacity's worth of tasks local before looking for idle
      CPUs (and twice the capacity if SD_POWERSAVINGS_BALANCE
      is set).
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      59abf026
  14. 15 Sep 2009, 5 commits
    •
      sched: Reduce forkexec_idx · b8a543ea
      Authored by Peter Zijlstra
      If we're looking to place a new task, we might as well find the
      idlest position _now_, not 1 tick ago.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b8a543ea
    •
      sched: Improve latencies and throughput · 0ec9fab3
      Authored by Mike Galbraith
      Make the idle balancer more aggressive, to improve an
      x264 encoding workload provided by Jason Garrett-Glaser:
      
       NEXT_BUDDY NO_LB_BIAS
       encoded 600 frames, 252.82 fps, 22096.60 kb/s
       encoded 600 frames, 250.69 fps, 22096.60 kb/s
       encoded 600 frames, 245.76 fps, 22096.60 kb/s
      
       NO_NEXT_BUDDY LB_BIAS
       encoded 600 frames, 344.44 fps, 22096.60 kb/s
       encoded 600 frames, 346.66 fps, 22096.60 kb/s
       encoded 600 frames, 352.59 fps, 22096.60 kb/s
      
       NO_NEXT_BUDDY NO_LB_BIAS
       encoded 600 frames, 425.75 fps, 22096.60 kb/s
       encoded 600 frames, 425.45 fps, 22096.60 kb/s
       encoded 600 frames, 422.49 fps, 22096.60 kb/s
      
      Peter pointed out that this is better done via newidle_idx,
      not via LB_BIAS, newidle balancing should look for where
      there is load _now_, not where there was load 2 ticks ago.
      
      Worst-case latencies are improved as well as no buddies
      means less vruntime spread. (as per prior lkml discussions)
      
      This change improves kbuild-peak parallelism as well.
      Reported-by: Jason Garrett-Glaser <darkshikari@gmail.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1253011667.9128.16.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      0ec9fab3
    •
      sched: Fix some domain tunings · 6bd7821f
      Authored by Peter Zijlstra
      CPU level should have WAKE_AFFINE, whereas ALLNODES is dubious.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6bd7821f
    •
      sched: Tweak wake_idx · 78e7ed53
      Authored by Peter Zijlstra
      When merging select_task_rq_fair() and sched_balance_self() we lost
      the use of wake_idx; restore that and set them to 0 to make wake
      balancing more aggressive.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      78e7ed53
    •
      sched: Merge select_task_rq_fair() and sched_balance_self() · c88d5910
      Authored by Peter Zijlstra
      The problem with wake_idle() is that it doesn't respect things like
      cpu_power, which means it doesn't deal well with SMT nor the recent
      RT interaction.
      
      To cure this, it needs to do what sched_balance_self() does, which
      leads to the possibility of merging select_task_rq_fair() and
      sched_balance_self().
      
      Modify sched_balance_self() to:
      
        - update_shares() when walking up the domain tree,
          (it only called it for the top domain, but it should
           have done this anyway), which allows us to remove
          this ugly bit from try_to_wake_up().
      
        - do wake_affine() on the smallest domain that contains
          both this (the waking) and the prev (the wakee) cpu for
          WAKE invocations.
      
      Then use the top-down balance steps it had to replace wake_idle().
      
      This leads to the disappearance of SD_WAKE_BALANCE and
      SD_WAKE_IDLE_FAR, with SD_WAKE_IDLE replaced with SD_BALANCE_WAKE.
      
      SD_WAKE_AFFINE needs SD_BALANCE_WAKE to be effective.
      
      Touch all topology bits to replace the old with the new SD flags --
      platforms might need re-tuning.  Enabling SD_BALANCE_WAKE
      conditionally on NUMA distance seems like a good additional
      feature; Magny-Cours and small Nehalem systems would want this
      enabled, while systems with slow interconnects would not.
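
      A hedged pseudocode sketch of the wake-time walk described above
      (greatly simplified; the real logic lives in the fair scheduling
      class):

       static int select_task_rq_sketch(struct task_struct *p, int prev_cpu)
       {
               int this_cpu = smp_processor_id();
               struct sched_domain *sd, *affine_sd = NULL;

               for_each_domain(this_cpu, sd) {
                       if (!(sd->flags & SD_BALANCE_WAKE))
                               continue;
                       /* smallest domain covering both waker and wakee */
                       if (!affine_sd && (sd->flags & SD_WAKE_AFFINE) &&
                           cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
                               affine_sd = sd;
               }

               if (affine_sd)          /* plus a wake_affine() cache-hotness check */
                       return this_cpu;

               return prev_cpu;        /* otherwise: top-down idlest-group walk */
       }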
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c88d5910
  15. 08 Sep 2009, 1 commit
  16. 04 Sep 2009, 3 commits
    •
      sched: Turn on SD_BALANCE_NEWIDLE · 840a0653
      Authored by Ingo Molnar
      Start the re-tuning of the balancer by turning on newidle.
      
      It improves hackbench performance and parallelism on a 4x4 box.
      The "perf stat --repeat 10" measurements give us:
      
        domain0             domain1
        .......................................
       -SD_BALANCE_NEWIDLE -SD_BALANCE_NEWIDLE:
         2041.273208  task-clock-msecs         #      9.354 CPUs    ( +-   0.363% )
      
       +SD_BALANCE_NEWIDLE -SD_BALANCE_NEWIDLE:
         2086.326925  task-clock-msecs         #     11.934 CPUs    ( +-   0.301% )
      
       +SD_BALANCE_NEWIDLE +SD_BALANCE_NEWIDLE:
         2115.289791  task-clock-msecs         #     12.158 CPUs    ( +-   0.263% )
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      840a0653
    •
      sched: Clean up topology.h · 47734f89
      Authored by Ingo Molnar
      Re-organize the flag settings so that it's visible at a glance
      which sched-domains flags are set and which are not.
      
      With the new balancer code we'll need to re-tune these details
      anyway, so make it cleaner to make fewer mistakes down the
      road ;-)
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      47734f89
    •
      sched: Add smt_gain · a52bfd73
      Authored by Peter Zijlstra
      The idea is that multi-threading a core yields more work
      capacity than a single thread; provide a way to express a
      static gain for threads.
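
      A hedged sketch of how such a static gain can feed into per-thread
      capacity (illustrative arithmetic, not the exact kernel
      implementation):

       /* With N threads sharing a core, give each thread smt_gain/N
        * capacity instead of a full SCHED_LOAD_SCALE, so the core totals
        * a bit more than one thread but far less than N independent CPUs. */
       static unsigned long smt_thread_capacity(unsigned long smt_gain,
                                                unsigned int nr_threads)
       {
               return smt_gain / nr_threads;   /* e.g. 1178 / 2 threads = 589 each */
       }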
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
      Acked-by: Andreas Herrmann <andreas.herrmann3@amd.com>
      Acked-by: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      LKML-Reference: <20090901083826.073345955@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a52bfd73
  17. 13 Mar 2009, 2 commits
  18. 12 Jan 2009, 1 commit
  19. 19 Dec 2008, 2 commits
  20. 12 Dec 2008, 1 commit
  21. 07 Nov 2008, 2 commits
  22. 06 Nov 2008, 1 commit
    •
      sched: re-tune balancing · 9fcd18c9
      Authored by Ingo Molnar
      Impact: improve wakeup affinity on NUMA systems, tweak SMP systems
      
      Given the fixes+tweaks to the wakeup-buddy code, re-tweak the domain
      balancing defaults on NUMA and SMP systems.
      
      Turn on SD_WAKE_AFFINE which was off on x86 NUMA - there's no reason
      why we would not want to have wakeup affinity across nodes as well.
      (we already do this in the standard NUMA template.)
      
      lat_ctx on a NUMA box is particularly happy about this change:
      
      before:
      
       |   phoenix:~/l> ./lat_ctx -s 0 2
       |   "size=0k ovr=2.60
       |   2 5.70
      
      after:
      
       |   phoenix:~/l> ./lat_ctx -s 0 2
       |   "size=0k ovr=2.65
       |   2 2.07
      
      a 2.75x speedup.
      
      pipe-test is similarly happy about it too:
      
       |  phoenix:~/sched-tests> ./pipe-test
       |   18.26 usecs/loop.
       |   14.70 usecs/loop.
       |   14.38 usecs/loop.
       |   10.55 usecs/loop.              # +WAKE_AFFINE on domain0+domain1
       |   8.63 usecs/loop.
       |   8.59 usecs/loop.
       |   9.03 usecs/loop.
       |   8.94 usecs/loop.
       |   8.96 usecs/loop.
       |   8.63 usecs/loop.
      
      Also:
      
       - disable SD_BALANCE_NEWIDLE on NUMA and SMP domains (keep it for siblings)
       - enable SD_WAKE_BALANCE on SMP domains
      
      Sysbench+postgresql improves all around the board, quite significantly:
      
                 .28-rc3-11474e2c  .28-rc3-11474e2c-tune
      -------------------------------------------------
          1:             571              688    +17.08%
          2:            1236             1206    -2.55%
          4:            2381             2642    +9.89%
          8:            4958             5164    +3.99%
         16:            9580             9574    -0.07%
         32:            7128             8118    +12.20%
         64:            7342             8266    +11.18%
        128:            7342             8064    +8.95%
        256:            7519             7884    +4.62%
        512:            7350             7731    +4.93%
      -------------------------------------------------
        SUM:           55412            59341    +6.62%
      
      So it's a win both for the runup portion, the peak area and the tail.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9fcd18c9
  23. 13 Jun 2008, 1 commit
    •
      cpu topology: always define CPU topology information · c50cbb05
      Authored by Ben Hutchings
      Leaving the topology macros undefined on some architectures can
      result in an empty topology directory in sysfs, and requires
      in-kernel users to protect all uses with #ifdef - see
      <http://marc.info/?l=linux-netdev&m=120639033904472&w=2>.
      
      The documentation of CPU topology specifies what the defaults should be if
      only partial information is available from the hardware.  So we can
      provide these defaults as a fallback.
      
      This patch:
      
      - Adds default definitions of the 4 topology macros to <linux/topology.h>
      - Changes drivers/base/topology.c to use the topology macros unconditionally
        and to cope with definitions that aren't lvalues
      - Updates documentation accordingly
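
      A hedged sketch of what such fallback definitions can look like
      (macro names as used around that era; treat the exact values as
      illustrative):

       #ifndef topology_physical_package_id
       #define topology_physical_package_id(cpu)       (-1)
       #endif
       #ifndef topology_core_id
       #define topology_core_id(cpu)                   (0)
       #endif
       #ifndef topology_thread_siblings
       #define topology_thread_siblings(cpu)           cpumask_of_cpu(cpu)
       #endif
       #ifndef topology_core_siblings
       #define topology_core_siblings(cpu)             cpumask_of_cpu(cpu)
       #endif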
      
      [ From: Andrew Morton <akpm@linux-foundation.org>
        - fold now-duplicated code
        - fix layout
      ]
      Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vegard Nossum <vegard.nossum@gmail.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Chandra Seetharaman <sekharan@us.ibm.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: John Hawkes <hawkes@sgi.com>
      Cc: Zhang, Yanmin <yanmin.zhang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c50cbb05
  24. 29 May 2008, 1 commit
    •
      sched: re-tune NUMA topologies · ea3f01f8
      Authored by Ingo Molnar
      improve the sysbench ramp-up phase and its peak throughput on
      a 16way NUMA box, by turning on WAKE_AFFINE:
      
                   tip/sched   tip/sched+wake-affine
      -------------------------------------------------
          1:             700              830    +15.65%
          2:            1465             1391    -5.28%
          4:            3017             3105    +2.81%
          8:            5100             6021    +15.30%
         16:           10725            10745    +0.19%
         32:           10135            10150    +0.16%
         64:            9338             9240    -1.06%
        128:            8599             8252    -4.21%
        256:            8475             8144    -4.07%
      -------------------------------------------------
        SUM:           57558            57882    +0.56%
      
      this change also improves lat_ctx from 6.69 usecs to 1.11 usec:
      
        $ ./lat_ctx -s 0 2
        "size=0k ovr=1.19
        2 1.11
      
        $ ./lat_ctx -s 0 2
        "size=0k ovr=1.22
        2 6.69
      
      in sysbench it's an overall win with some weakness at the lots-of-clients
      side. That happens because we now under-balance this workload
      a bit. To counter that effect, turn on NEWIDLE:
      
                    wake-idle          wake-idle+newidle
       -------------------------------------------------
           1:             830              834    +0.43%
           2:            1391             1401    +0.65%
           4:            3105             3091    -0.43%
           8:            6021             6046    +0.42%
          16:           10745            10736    -0.08%
          32:           10150            10206    +0.55%
          64:            9240             9533    +3.08%
         128:            8252             8355    +1.24%
         256:            8144             8384    +2.87%
       -------------------------------------------------
         SUM:           57882            58591    +1.21%
      
      as a bonus this not only improves the many-clients case but
      also improves the (more important) rampup phase.
      
      sysbench is a workload that quickly breaks down if the
      scheduler over-balances, so since it showed an improvement
      under NEWIDLE this change is definitely good.
      ea3f01f8
  25. 20 Apr 2008, 1 commit
    •
      cpumask: reduce stack usage in SD_x_INIT initializers · 7c16ec58
      Authored by Mike Travis
        * Remove empty cpumask_t (and all non-zero/non-null) variables
          in SD_*_INIT macros.  Use memset(0) to clear.  Also, don't
          inline the initializer functions to save on stack space in
          build_sched_domains().
      
        * Merge change to include/linux/topology.h that uses the new
          node_to_cpumask_ptr function in the nr_cpus_node macro into
          this patch.
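
      A hedged sketch of the resulting pattern (hypothetical helper; the
      actual change touches the SD_*_INIT users in build_sched_domains()):

       static void sd_init_sketch(struct sched_domain *sd)
       {
               /* Zero the whole struct, which clears the embedded cpumask_t
                * members without materializing temporary cpumasks on the stack. */
               memset(sd, 0, sizeof(*sd));

               /* Then set only the non-zero tunables, for example: */
               sd->busy_factor   = 64;
               sd->imbalance_pct = 125;
       }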
      
      Depends on:
      	[mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
      	[sched-devel]: sched: add new set_cpus_allowed_ptr function
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Mike Travis <travis@sgi.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7c16ec58
  26. 21 Mar 2008, 1 commit
  27. 19 Mar 2008, 1 commit
  28. 26 Jan 2008, 2 commits