1. 07 5月, 2014 1 次提交
    • V
      sched: Rework sched_domain topology definition · 143e1e28
      Vincent Guittot 提交于
      We replace the old way to configure the scheduler topology with a new method
      which enables a platform to declare additionnal level (if needed).
      
      We still have a default topology table definition that can be used by platform
      that don't want more level than the SMT, MC, CPU and NUMA ones. This table can
      be overwritten by an arch which either wants to add new level where a load
      balance make sense like BOOK or powergating level or wants to change the flags
      configuration of some levels.
      
      For each level, we need a function pointer that returns cpumask for each cpu,
      a function pointer that returns the flags for the level and a name. Only flags
      that describe topology, can be set by an architecture. The current topology
      flags are:
      
       SD_SHARE_CPUPOWER
       SD_SHARE_PKG_RESOURCES
       SD_NUMA
       SD_ASYM_PACKING
      
      Then, each level must be a subset on the next one. The build sequence of the
      sched_domain will take care of removing useless levels like those with 1 CPU
      and those with the same CPU span and no more relevant information for
      load balancing than its children.
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Tested-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Reviewed-by: NPreeti U Murthy <preeti@linux.vnet.ibm.com>
      Reviewed-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hanjun Guo <hanjun.guo@linaro.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux390@de.ibm.com
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-s390@vger.kernel.org
      Link: http://lkml.kernel.org/r/1397209481-28542-2-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      143e1e28
  2. 08 4月, 2014 1 次提交
    • C
      mm: use raw_cpu ops for determining current NUMA node · dc322a99
      Christoph Lameter 提交于
      With the preempt checking logic for __this_cpu_ops we will get false
      positives from locations in the code that use numa_node_id.
      
      Before the __this_cpu ops where introduced there were no checks for
      preemption present either.  smp_raw_processor_id() was used.  See
      
        http://www.spinics.net/lists/linux-numa/msg00641.html
      
      Therefore we need to use raw_cpu_read here to avoid false postives.
      
      Note that this issue has been discussed in prior years.  If the process
      changes nodes after retrieving the current numa node then that is
      acceptable since most uses of numa_node etc are for optimization and not
      for correctness.
      
      There were suggestions to implement a raw_numa_node_id in order to do
      preempt checks for numa_node_id as well.  But I think we better defer
      that to another patch since that would mean investigating how
      numa_node_id() is used throughout the kernel which would increase the
      scope of this patchset significantly.  After all preemption was never
      checked before when numa_node_id() was used.
      
      Some sample traces:
      
      __this_cpu_read operation in preemptible [00000000] code: login/1456
      caller is __this_cpu_preempt_check+0x2b/0x2d
      CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3b-dirty #185
      Call Trace:
        dump_stack+0x4e/0x82
        check_preemption_disabled+0xc5/0xe0
        __this_cpu_preempt_check+0x2b/0x2d
        get_task_policy+0x1d/0x49
        get_vma_policy+0x14/0x76
        alloc_pages_vma+0x35/0xff
        handle_mm_fault+0x290/0x73b
        __do_page_fault+0x3fe/0x44d
        do_page_fault+0x9/0xc
        page_fault+0x22/0x30
        generic_file_aio_read+0x38e/0x624
        do_sync_read+0x54/0x73
        vfs_read+0x9d/0x12a
        SyS_read+0x47/0x7e
        cstar_dispatch+0x7/0x23
      
      caller is __this_cpu_preempt_check+0x2b/0x2d
      CPU: 0 PID: 1456 Comm: login Not tainted 3.12.0-rc4-cl-00062-g2fe80d3b-dirty #185
      Call Trace:
        dump_stack+0x4e/0x82
        check_preemption_disabled+0xc5/0xe0
        __this_cpu_preempt_check+0x2b/0x2d
        alloc_pages_current+0x8f/0xbc
        __page_cache_alloc+0xb/0xd
        __do_page_cache_readahead+0xf4/0x219
        ra_submit+0x1c/0x20
        ondemand_readahead+0x28c/0x2b4
        page_cache_sync_readahead+0x38/0x3a
        generic_file_aio_read+0x261/0x624
        do_sync_read+0x54/0x73
        vfs_read+0x9d/0x12a
        SyS_read+0x47/0x7e
        cstar_dispatch+0x7/0x23
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Alex Shi <alex.shi@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc322a99
  3. 20 9月, 2013 2 次提交
  4. 14 8月, 2012 1 次提交
  5. 26 7月, 2012 1 次提交
  6. 17 5月, 2012 1 次提交
    • P
      sched: Remove stale power aware scheduling remnants and dysfunctional knobs · 8e7fbcbc
      Peter Zijlstra 提交于
      It's been broken forever (i.e. it's not scheduling in a power
      aware fashion), as reported by Suresh and others sending
      patches, and nobody cares enough to fix it properly ...
      so remove it to make space free for something better.
      
      There's various problems with the code as it stands today, first
      and foremost the user interface which is bound to topology
      levels and has multiple values per level. This results in a
      state explosion which the administrator or distro needs to
      master and almost nobody does.
      
      Furthermore large configuration state spaces aren't good, it
      means the thing doesn't just work right because it's either
      under so many impossibe to meet constraints, or even if
      there's an achievable state workloads have to be aware of
      it precisely and can never meet it for dynamic workloads.
      
      So pushing this kind of decision to user-space was a bad idea
      even with a single knob - it's exponentially worse with knobs
      on every node of the topology.
      
      There is a proposal to replace the user interface with a single
      3 state knob:
      
       sched_balance_policy := { performance, power, auto }
      
      where 'auto' would be the preferred default which looks at things
      like Battery/AC mode and possible cpufreq state or whatever the hw
      exposes to show us power use expectations - but there's been no
      progress on it in the past many months.
      
      Aside from that, the actual implementation of the various knobs
      is known to be broken. There have been sporadic attempts at
      fixing things but these always stop short of reaching a mergable
      state.
      
      Therefore this wholesale removal with the hopes of spurring
      people who care to come forward once again and work on a
      coherent replacement.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twinsSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8e7fbcbc
  7. 15 5月, 2012 1 次提交
  8. 09 5月, 2012 1 次提交
    • P
      sched/numa: Rewrite the CONFIG_NUMA sched domain support · cb83b629
      Peter Zijlstra 提交于
      The current code groups up to 16 nodes in a level and then puts an
      ALLNODES domain spanning the entire tree on top of that. This doesn't
      reflect the numa topology and esp for the smaller not-fully-connected
      machines out there today this might make a difference.
      
      Therefore, build a proper numa topology based on node_distance().
      
      Since there's no fixed numa layers anymore, the static SD_NODE_INIT
      and SD_ALLNODES_INIT aren't usable anymore, the new code tries to
      construct something similar and scales some values either on the
      number of cpus in the domain and/or the node_distance() ratio.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: linux-alpha@vger.kernel.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mips@linux-mips.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-sh@vger.kernel.org
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: sparclinux@vger.kernel.org
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86@kernel.org
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Greg Pearson <greg.pearson@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: bob.picco@oracle.com
      Cc: chris.mason@oracle.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-r74n3n8hhuc2ynbrnp3vt954@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      cb83b629
  9. 20 9月, 2011 1 次提交
  10. 16 6月, 2011 1 次提交
    • K
      mm: increase RECLAIM_DISTANCE to 30 · 32e45ff4
      KOSAKI Motohiro 提交于
      Recently, Robert Mueller reported (http://lkml.org/lkml/2010/9/12/236)
      that zone_reclaim_mode doesn't work properly on his new NUMA server (Dual
      Xeon E5520 + Intel S5520UR MB).  He is using Cyrus IMAPd and it's built on
      a very traditional single-process model.
      
        * a master process which reads config files and manages the other
          process
        * multiple imapd processes, one per connection
        * multiple pop3d processes, one per connection
        * multiple lmtpd processes, one per connection
        * periodical "cleanup" processes.
      
      There are thousands of independent processes.  The problem is, recent
      Intel motherboard turn on zone_reclaim_mode by default and traditional
      prefork model software don't work well on it.  Unfortunatelly, such models
      are still typical even in the 21st century.  We can't ignore them.
      
      This patch raises the zone_reclaim_mode threshold to 30.  30 doesn't have
      any specific meaning.  but 20 means that one-hop QPI/Hypertransport and
      such relatively cheap 2-4 socket machine are often used for traditional
      servers as above.  The intention is that these machines don't use
      zone_reclaim_mode.
      
      Note: ia64 and Power have arch specific RECLAIM_DISTANCE definitions.
      This patch doesn't change such high-end NUMA machine behavior.
      
      Dave Hansen said:
      
      : I know specifically of pieces of x86 hardware that set the information
      : in the BIOS to '21' *specifically* so they'll get the zone_reclaim_mode
      : behavior which that implies.
      :
      : They've done performance testing and run very large and scary benchmarks
      : to make sure that they _want_ this turned on.  What this means for them
      : is that they'll probably be de-optimized, at least on newer versions of
      : the kernel.
      :
      : If you want to do this for particular systems, maybe _that_'s what we
      : should do.  Have a list of specific configurations that need the
      : defaults overridden either because they're buggy, or they have an
      : unusual hardware configuration not really reflected in the distance
      : table.
      
      And later said:
      
      : The original change in the hardware tables was for the benefit of a
      : benchmark.  Said benchmark isn't going to get run on mainline until the
      : next batch of enterprise distros drops, at which point the hardware where
      : this was done will be irrelevant for the benchmark.  I'm sure any new
      : hardware will just set this distance to another yet arbitrary value to
      : make the kernel do what it wants.  :)
      :
      : Also, when the hardware got _set_ to this initially, I complained.  So, I
      : guess I'm getting my way now, with this patch.  I'm cool with it.
      Reported-by: NRobert Mueller <robm@fastmail.fm>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Acked-by: NDave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      32e45ff4
  11. 10 9月, 2010 1 次提交
  12. 10 8月, 2010 1 次提交
  13. 29 6月, 2010 1 次提交
  14. 09 6月, 2010 1 次提交
    • M
      sched: Add asymmetric group packing option for sibling domain · 532cb4c4
      Michael Neuling 提交于
      Check to see if the group is packed in a sched doman.
      
      This is primarily intended to used at the sibling level.  Some cores
      like POWER7 prefer to use lower numbered SMT threads.  In the case of
      POWER7, it can move to lower SMT modes only when higher threads are
      idle.  When in lower SMT modes, the threads will perform better since
      they share less core resources.  Hence when we have idle threads, we
      want them to be the higher ones.
      
      This adds a hook into f_b_g() called check_asym_packing() to check the
      packing.  This packing function is run on idle threads.  It checks to
      see if the busiest CPU in this domain (core in the P7 case) has a
      higher CPU number than what where the packing function is being run
      on.  If it is, calculate the imbalance and return the higher busier
      thread as the busiest group to f_b_g().  Here we are assuming a lower
      CPU number will be equivalent to a lower SMT thread number.
      
      It also creates a new SD_ASYM_PACKING flag to enable this feature at
      any scheduler domain level.
      
      It also creates an arch hook to enable this feature at the sibling
      level.  The default function doesn't enable this feature.
      
      Based heavily on patch from Peter Zijlstra.
      Fixes from Srivatsa Vaddagiri.
      Signed-off-by: NMichael Neuling <mikey@neuling.org>
      Signed-off-by: NSrivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <20100608045702.2936CCC897@localhost.localdomain>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      532cb4c4
  15. 28 5月, 2010 2 次提交
    • L
      numa: introduce numa_mem_id()- effective local memory node id · 7aac7898
      Lee Schermerhorn 提交于
      Introduce numa_mem_id(), based on generic percpu variable infrastructure
      to track "nearest node with memory" for archs that support memoryless
      nodes.
      
      Define API in <linux/topology.h> when CONFIG_HAVE_MEMORYLESS_NODES
      defined, else stubs.  Architectures will define HAVE_MEMORYLESS_NODES
      if/when they support them.
      
      Archs can override definitions of:
      
      numa_mem_id() - returns node number of "local memory" node
      set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem'
      cpu_to_mem()  - return numa_mem for specified cpu; may be used as lvalue
      
      Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
      This will initialize the boot cpu at boot time, and all cpus on change of
      numa_zonelist_order, or when node or memory hot-plug requires zonelist
      rebuild.  Archs that support memoryless nodes will need to initialize
      'numa_mem' for secondary cpus as they're brought on-line.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7aac7898
    • L
      numa: add generic percpu var numa_node_id() implementation · 72812019
      Lee Schermerhorn 提交于
      Rework the generic version of the numa_node_id() function to use the new
      generic percpu variable infrastructure.
      
      Guard the new implementation with a new config option:
      
              CONFIG_USE_PERCPU_NUMA_NODE_ID.
      
      Archs which support this new implemention will default this option to 'y'
      when NUMA is configured.  This config option could be removed if/when all
      archs switch over to the generic percpu implementation of numa_node_id().
      Arch support involves:
      
        1) converting any existing per cpu variable implementations to use
           this implementation.  x86_64 is an instance of such an arch.
        2) archs that don't use a per cpu variable for numa_node_id() will
           need to initialize the new per cpu variable "numa_node" as cpus
           are brought on-line.  ia64 is an example.
        3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
           when NUMA is configured.  This is required because I have
           retained the old implementation by default to allow archs to
           be modified incrementally, as desired.
      
      Subsequent patches will convert x86_64 and ia64 to use this implemenation.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72812019
  16. 21 1月, 2010 1 次提交
  17. 14 10月, 2009 1 次提交
  18. 24 9月, 2009 1 次提交
  19. 16 9月, 2009 2 次提交
    • P
      sched: Disable wakeup balancing · 182a85f8
      Peter Zijlstra 提交于
      Sysbench thinks SD_BALANCE_WAKE is too agressive and kbuild doesn't
      really mind too much, SD_BALANCE_NEWIDLE picks up most of the
      slack.
      
      On a dual socket, quad core, dual thread nehalem system:
      
      sysbench (--num_threads=16):
      
       SD_BALANCE_WAKE-: 13982 tx/s
       SD_BALANCE_WAKE+: 15688 tx/s
      
      kbuild (-j16):
      
       SD_BALANCE_WAKE-: 47.648295846  seconds time elapsed   ( +-   0.312% )
       SD_BALANCE_WAKE+: 47.608607360  seconds time elapsed   ( +-   0.026% )
      
      (same within noise)
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      182a85f8
    • P
      sched: Add SD_PREFER_LOCAL · 59abf026
      Peter Zijlstra 提交于
      And turn it on for NUMA and MC domains. This improves
      locality in balancing decisions by keeping up to
      capacity amount of tasks local before looking for idle
      CPUs. (and twice the capacity if SD_POWERSAVINGS_BALANCE
      is set.)
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      59abf026
  20. 15 9月, 2009 5 次提交
    • P
      sched: Reduce forkexec_idx · b8a543ea
      Peter Zijlstra 提交于
      If we're looking to place a new task, we might as well find the
      idlest position _now_, not 1 tick ago.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b8a543ea
    • M
      sched: Improve latencies and throughput · 0ec9fab3
      Mike Galbraith 提交于
      Make the idle balancer more agressive, to improve a
      x264 encoding workload provided by Jason Garrett-Glaser:
      
       NEXT_BUDDY NO_LB_BIAS
       encoded 600 frames, 252.82 fps, 22096.60 kb/s
       encoded 600 frames, 250.69 fps, 22096.60 kb/s
       encoded 600 frames, 245.76 fps, 22096.60 kb/s
      
       NO_NEXT_BUDDY LB_BIAS
       encoded 600 frames, 344.44 fps, 22096.60 kb/s
       encoded 600 frames, 346.66 fps, 22096.60 kb/s
       encoded 600 frames, 352.59 fps, 22096.60 kb/s
      
       NO_NEXT_BUDDY NO_LB_BIAS
       encoded 600 frames, 425.75 fps, 22096.60 kb/s
       encoded 600 frames, 425.45 fps, 22096.60 kb/s
       encoded 600 frames, 422.49 fps, 22096.60 kb/s
      
      Peter pointed out that this is better done via newidle_idx,
      not via LB_BIAS, newidle balancing should look for where
      there is load _now_, not where there was load 2 ticks ago.
      
      Worst-case latencies are improved as well as no buddies
      means less vruntime spread. (as per prior lkml discussions)
      
      This change improves kbuild-peak parallelism as well.
      Reported-by: NJason Garrett-Glaser <darkshikari@gmail.com>
      Signed-off-by: NMike Galbraith <efault@gmx.de>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1253011667.9128.16.camel@marge.simson.net>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0ec9fab3
    • P
      sched: Fix some domain tunings · 6bd7821f
      Peter Zijlstra 提交于
      CPU level should have WAKE_AFFINE, whereas ALLNODES is dubious.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      6bd7821f
    • P
      sched: Tweak wake_idx · 78e7ed53
      Peter Zijlstra 提交于
      When merging select_task_rq_fair() and sched_balance_self() we lost
      the use of wake_idx, restore that and set them to 0 to make wake
      balancing more aggressive.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      78e7ed53
    • P
      sched: Merge select_task_rq_fair() and sched_balance_self() · c88d5910
      Peter Zijlstra 提交于
      The problem with wake_idle() is that is doesn't respect things like
      cpu_power, which means it doesn't deal well with SMT nor the recent
      RT interaction.
      
      To cure this, it needs to do what sched_balance_self() does, which
      leads to the possibility of merging select_task_rq_fair() and
      sched_balance_self().
      
      Modify sched_balance_self() to:
      
        - update_shares() when walking up the domain tree,
          (it only called it for the top domain, but it should
           have done this anyway), which allows us to remove
          this ugly bit from try_to_wake_up().
      
        - do wake_affine() on the smallest domain that contains
          both this (the waking) and the prev (the wakee) cpu for
          WAKE invocations.
      
      Then use the top-down balance steps it had to replace wake_idle().
      
      This leads to the dissapearance of SD_WAKE_BALANCE and
      SD_WAKE_IDLE_FAR, with SD_WAKE_IDLE replaced with SD_BALANCE_WAKE.
      
      SD_WAKE_AFFINE needs SD_BALANCE_WAKE to be effective.
      
      Touch all topology bits to replace the old with new SD flags --
      platforms might need re-tuning, enabling SD_BALANCE_WAKE
      conditionally on a NUMA distance seems like a good additional
      feature, magny-core and small nehalem systems would want this
      enabled, systems with slow interconnects would not.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c88d5910
  21. 08 9月, 2009 1 次提交
  22. 04 9月, 2009 3 次提交
    • I
      sched: Turn on SD_BALANCE_NEWIDLE · 840a0653
      Ingo Molnar 提交于
      Start the re-tuning of the balancer by turning on newidle.
      
      It improves hackbench performance and parallelism on a 4x4 box.
      The "perf stat --repeat 10" measurements give us:
      
        domain0             domain1
        .......................................
       -SD_BALANCE_NEWIDLE -SD_BALANCE_NEWIDLE:
         2041.273208  task-clock-msecs         #      9.354 CPUs    ( +-   0.363% )
      
       +SD_BALANCE_NEWIDLE -SD_BALANCE_NEWIDLE:
         2086.326925  task-clock-msecs         #     11.934 CPUs    ( +-   0.301% )
      
       +SD_BALANCE_NEWIDLE +SD_BALANCE_NEWIDLE:
         2115.289791  task-clock-msecs         #     12.158 CPUs    ( +-   0.263% )
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      840a0653
    • I
      sched: Clean up topology.h · 47734f89
      Ingo Molnar 提交于
      Re-organize the flag settings so that it's visible at a glance
      which sched-domains flags are set and which not.
      
      With the new balancer code we'll need to re-tune these details
      anyway, so make it cleaner to make fewer mistakes down the
      road ;-)
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      47734f89
    • P
      sched: Add smt_gain · a52bfd73
      Peter Zijlstra 提交于
      The idea is that multi-threading a core yields more work
      capacity than a single thread, provide a way to express a
      static gain for threads.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: NAndreas Herrmann <andreas.herrmann3@amd.com>
      Acked-by: NAndreas Herrmann <andreas.herrmann3@amd.com>
      Acked-by: NGautham R Shenoy <ego@in.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      LKML-Reference: <20090901083826.073345955@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a52bfd73
  23. 13 3月, 2009 2 次提交
  24. 12 1月, 2009 1 次提交
  25. 19 12月, 2008 2 次提交
  26. 12 12月, 2008 1 次提交
  27. 07 11月, 2008 2 次提交
  28. 06 11月, 2008 1 次提交
    • I
      sched: re-tune balancing · 9fcd18c9
      Ingo Molnar 提交于
      Impact: improve wakeup affinity on NUMA systems, tweak SMP systems
      
      Given the fixes+tweaks to the wakeup-buddy code, re-tweak the domain
      balancing defaults on NUMA and SMP systems.
      
      Turn on SD_WAKE_AFFINE which was off on x86 NUMA - there's no reason
      why we would not want to have wakeup affinity across nodes as well.
      (we already do this in the standard NUMA template.)
      
      lat_ctx on a NUMA box is particularly happy about this change:
      
      before:
      
       |   phoenix:~/l> ./lat_ctx -s 0 2
       |   "size=0k ovr=2.60
       |   2 5.70
      
      after:
      
       |   phoenix:~/l> ./lat_ctx -s 0 2
       |   "size=0k ovr=2.65
       |   2 2.07
      
      a 2.75x speedup.
      
      pipe-test is similarly happy about it too:
      
       |  phoenix:~/sched-tests> ./pipe-test
       |   18.26 usecs/loop.
       |   14.70 usecs/loop.
       |   14.38 usecs/loop.
       |   10.55 usecs/loop.              # +WAKE_AFFINE on domain0+domain1
       |   8.63 usecs/loop.
       |   8.59 usecs/loop.
       |   9.03 usecs/loop.
       |   8.94 usecs/loop.
       |   8.96 usecs/loop.
       |   8.63 usecs/loop.
      
      Also:
      
       - disable SD_BALANCE_NEWIDLE on NUMA and SMP domains (keep it for siblings)
       - enable SD_WAKE_BALANCE on SMP domains
      
      Sysbench+postgresql improves all around the board, quite significantly:
      
                 .28-rc3-11474e2c  .28-rc3-11474e2c-tune
      -------------------------------------------------
          1:             571              688    +17.08%
          2:            1236             1206    -2.55%
          4:            2381             2642    +9.89%
          8:            4958             5164    +3.99%
         16:            9580             9574    -0.07%
         32:            7128             8118    +12.20%
         64:            7342             8266    +11.18%
        128:            7342             8064    +8.95%
        256:            7519             7884    +4.62%
        512:            7350             7731    +4.93%
      -------------------------------------------------
        SUM:           55412            59341    +6.62%
      
      So it's a win both for the runup portion, the peak area and the tail.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      9fcd18c9