1. 27 May 2015 (1 commit)
  2. 05 Jun 2014 (1 commit)
    • mm: disable zone_reclaim_mode by default · 4f9b16a6
      Committed by Mel Gorman
      When it was introduced, zone_reclaim_mode made sense: NUMA distances
      were punishing and workloads were generally partitioned to fit into
      a NUMA node.  NUMA machines are now common, but few of the workloads
      are NUMA-aware, and it is routine to see major performance
      degradation due to zone_reclaim_mode being enabled while relatively
      few users can identify the problem.
      
      Those that require zone_reclaim_mode are likely to be able to detect
      when it needs to be enabled and tune appropriately, so let's have a
      sensible default for the bulk of users.
      
      This patch (of 2):
      
      zone_reclaim_mode causes processes to prefer reclaiming memory from
      the local node instead of spilling over to other nodes.  This made
      sense initially, when NUMA machines were almost exclusively HPC
      systems and the workload was partitioned into nodes.  The NUMA
      penalties were sufficiently high to justify reclaiming the memory
      locally.  On current machines and workloads it is often the case
      that zone_reclaim_mode destroys performance, but not all users know
      how to detect this.  Favour the common case and disable it by
      default.  Users that are sophisticated enough to know they need
      zone_reclaim_mode will detect it.
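      
      A minimal sketch of what the change amounts to (the names are from
      mm/page_alloc.c; the hunk shown is a paraphrase, not the literal
      diff):
      
        /* mm/page_alloc.c (sketch) */
        int zone_reclaim_mode __read_mostly;   /* now simply defaults to 0 */
        
        /*
         * Removed from the zonelist-building path:
         *
         *      if (distance > RECLAIM_DISTANCE)
         *              zone_reclaim_mode = 1;
         *
         * Users who still want node-local reclaim can enable it at
         * runtime via /proc/sys/vm/zone_reclaim_mode.
         */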
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 28 May 2014 (2 commits)
  4. 11 Mar 2014 (1 commit)
  5. 15 Jan 2014 (1 commit)
    • powerpc: Fix the setup of CPU-to-Node mappings during CPU online · d4edc5b6
      Committed by Srivatsa S. Bhat
      On POWER platforms, the hypervisor can notify the guest kernel about
      dynamic changes in the cpu-numa associativity (VPHN topology update).
      Hence the cpu-to-node mappings that we got from the firmware during
      boot may no longer be valid after such updates.  This is handled
      using the arch_update_cpu_topology() hook in the scheduler, and the
      sched-domains are rebuilt according to the new mappings.
      
      But unfortunately, at the moment, CPU hotplug ignores these updated mappings
      and instead queries the firmware for the cpu-to-numa relationships and uses
      them during CPU online. So the kernel can end up assigning wrong NUMA nodes
      to CPUs during subsequent CPU hotplug online operations (after booting).
      
      Further, a particularly problematic scenario can result from this bug:
      On POWER platforms, the SMT mode can be switched between 1, 2, 4 (and even 8)
      threads per core. The switch to Single-Threaded (ST) mode is performed by
      offlining all except the first CPU thread in each core. Switching back to
      SMT mode involves onlining those other threads back, in each core.
      
      Now consider this scenario:
      
      1. During boot, the kernel gets the cpu-to-node mappings from the firmware
         and assigns the CPUs to NUMA nodes appropriately, during CPU online.
      
      2. Later on, the hypervisor updates the cpu-to-node mappings dynamically and
         communicates this update to the kernel. The kernel in turn updates its
         cpu-to-node associations and rebuilds its sched domains. Everything is
         fine so far.
      
      3. Now, the user switches the machine from SMT to ST mode (say, by running
         ppc64_cpu --smt=1). This involves offlining all except 1 thread in each
         core.
      
      4. The user then tries to switch back from ST to SMT mode (say, by running
         ppc64_cpu --smt=4), and this involves onlining those threads back. Since
         CPU hotplug ignores the new mappings, it queries the firmware and tries to
         associate the newly onlined sibling threads to the old NUMA nodes. This
         results in sibling threads within the same core getting associated with
         different NUMA nodes, which is incorrect.
      
         The scheduler's build-sched-domains code gets thoroughly confused
         by this, enters an infinite loop, and causes soft-lockups, as
         explained in detail in commit 3be7db6a (powerpc: VPHN topology
         change updates all siblings).
      
      So to fix this, use the numa_cpu_lookup_table to remember the updated
      cpu-to-node mappings, and use them during CPU hotplug online
      operations.  Further, we also need to ensure that all threads in a
      core are assigned to a common NUMA node, irrespective of whether all
      those threads were online during the topology update.  To achieve
      this, we take care not to use cpu_sibling_mask(), since it is not
      hotplug-invariant.  Instead, we use cpu_first_sibling_thread() and
      set up the mappings manually using the 'threads_per_core' value for
      that particular platform.  This helps us ensure that we don't hit
      this bug with any combination of CPU hotplug and SMT mode switching.
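      
      A hedged sketch of the core of the approach (the helper name is
      illustrative; numa_cpu_lookup_table and threads_per_core are the
      powerpc globals referred to above):
      
        /* Force every hardware thread of 'cpu's core onto node 'nid',
         * whether or not each thread is currently online. */
        static void sketch_update_core_node(int cpu, int nid)
        {
                int first = cpu - (cpu % threads_per_core);
                int t;
        
                for (t = first; t < first + threads_per_core; t++)
                        numa_cpu_lookup_table[t] = nid;
        }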
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  6. 14 Aug 2013 (1 commit)
  7. 26 Apr 2013 (1 commit)
  8. 09 May 2012 (1 commit)
    • sched/numa: Rewrite the CONFIG_NUMA sched domain support · cb83b629
      Committed by Peter Zijlstra
      The current code groups up to 16 nodes in a level and then puts an
      ALLNODES domain spanning the entire tree on top of that.  This
      doesn't reflect the NUMA topology, and especially for the smaller,
      not-fully-connected machines out there today this can make a
      difference.
      
      Therefore, build a proper NUMA topology based on node_distance().
      
      Since there are no fixed NUMA layers anymore, the static SD_NODE_INIT
      and SD_ALLNODES_INIT templates aren't usable anymore; the new code
      tries to construct something similar and scales some values based on
      the number of cpus in the domain and/or the node_distance() ratio.
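      
      A hedged sketch of the idea (the variable names follow the ones this
      rewrite introduced in kernel/sched.c, but the loop is simplified):
      
        static int sched_domains_numa_distance[MAX_NUMNODES];
        static int sched_domains_numa_levels;
        
        /* Collect the distinct node_distance() values; each one then
         * becomes its own sched-domain topology level. */
        static void sketch_init_numa_levels(void)
        {
                int i, j, k;
        
                for_each_online_node(i) {
                        for_each_online_node(j) {
                                int d = node_distance(i, j);
        
                                for (k = 0; k < sched_domains_numa_levels; k++)
                                        if (sched_domains_numa_distance[k] == d)
                                                break;
                                if (k == sched_domains_numa_levels) {
                                        sched_domains_numa_distance[k] = d;
                                        sched_domains_numa_levels++;
                                }
                        }
                }
        }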
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: linux-alpha@vger.kernel.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mips@linux-mips.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-sh@vger.kernel.org
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: sparclinux@vger.kernel.org
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86@kernel.org
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Greg Pearson <greg.pearson@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: bob.picco@oracle.com
      Cc: chris.mason@oracle.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/n/tip-r74n3n8hhuc2ynbrnp3vt954@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 22 Dec 2011 (1 commit)
    • cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystem · 8a25a2fd
      Committed by Kay Sievers
      This moves the 'cpu sysdev_class' over to a regular 'cpu' subsystem
      and converts the devices to regular devices. The sysdev drivers are
      implemented as subsystem interfaces now.
      
      After all sysdev classes are ported to regular driver core entities, the
      sysdev implementation will be entirely removed from the kernel.
      
      Userspace relies on events and generic sysfs subsystem infrastructure
      from sysdev devices, which are made available with this conversion.
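      
      A hedged sketch of the resulting shape (cpu_subsys is the name the
      conversion uses; the interface and callback shown here are
      illustrative):
      
        struct bus_type cpu_subsys = {
                .name     = "cpu",
                .dev_name = "cpu",
        };
        
        /* A former sysdev driver becomes a subsystem interface: its
         * add_dev callback runs for each device on the 'cpu' bus. */
        static int sketch_add_dev(struct device *dev,
                                  struct subsys_interface *sif)
        {
                return 0;   /* per-CPU setup would go here */
        }
        
        static struct subsys_interface sketch_iface = {
                .name    = "sketch",
                .subsys  = &cpu_subsys,
                .add_dev = sketch_add_dev,
        };
        /* registered with subsys_interface_register(&sketch_iface) */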
      
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@amd64.org>
      Cc: Tigran Aivazian <tigran@aivazian.fsnet.co.uk>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <rmk+kernel@arm.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  10. 20 Sep 2011 (4 commits)
  11. 12 Jan 2011 (1 commit)
  12. 11 Jan 2011 (1 commit)
    • powerpc/pseries: Fix VPHN build errors on non-SMP systems · 39bf990e
      Committed by Jesse Larrew
      The header asm/hvcall.h was previously included indirectly via
      smp.h. On non-SMP systems, however, these declarations are excluded
      and the build breaks. This is easily fixed by including asm/hvcall.h
      directly.
      
      The VPHN feature is only meaningful on NUMA systems that implement
      the SPLPAR option, so exclude the VPHN code on systems without
      SPLPAR enabled.
      
      Also, expose unmap_cpu_from_node() on systems with SPLPAR enabled,
      even if CONFIG_HOTPLUG_CPU is disabled.
      
      Lastly, map_cpu_to_node() is now needed by VPHN to manipulate the
      node masks after boot time, so remove the __cpuinit annotation to
      fix a section mismatch.
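      
      A hedged sketch of the shape of the fix in arch/powerpc/mm/numa.c
      (abbreviated):
      
        #include <asm/hvcall.h>   /* explicit: no longer relies on smp.h */
        
        #ifdef CONFIG_PPC_SPLPAR
        /* VPHN code lives here: only meaningful on shared-processor LPARs */
        #endif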
      Signed-off-by: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
  13. 09 Dec 2010 (1 commit)
  14. 30 Jul 2010 (1 commit)
  15. 09 Jul 2010 (1 commit)
    • powerpc/numa: Use form 1 affinity to setup node distance · 41eab6f8
      Committed by Anton Blanchard
      Form 1 affinity allows multiple entries in
      ibm,associativity-reference-points, which represent affinity domains
      in decreasing order of importance.  The Linux concept of a node is
      always the first entry, but using the other values as an input to
      node_distance() allows the memory allocator to make better decisions
      about which node to try next when local memory has been exhausted.
      
      We keep things simple and create an array indexed by NUMA node,
      capped at 4 entries.  Each time we look up an associativity property
      we initialise the array, which is overkill, but since we should only
      hit this path during boot it didn't seem worth adding a per-node
      valid bit.
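      
      A hedged sketch of the data structure and the distance computation
      (the names closely follow arch/powerpc/mm/numa.c, but details are
      abbreviated):
      
        #define MAX_DISTANCE_REF_POINTS 4
        
        static int distance_ref_points_depth;
        static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
        
        static int sketch_node_distance(int a, int b)
        {
                int i, distance = LOCAL_DISTANCE;
        
                /* each affinity domain the two nodes do NOT share
                 * doubles the reported distance */
                for (i = 0; i < distance_ref_points_depth; i++) {
                        if (distance_lookup_table[a][i] ==
                            distance_lookup_table[b][i])
                                break;
                        distance *= 2;
                }
                return distance;
        }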
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  16. 21 May 2010 (1 commit)
    • powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim · 56608209
      Committed by Anton Blanchard
      I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets
      enabled via this:
      
              /*
               * If another node is sufficiently far away then it is better
               * to reclaim pages in a zone before going off node.
               */
              if (distance > RECLAIM_DISTANCE)
                      zone_reclaim_mode = 1;
      
      Since we use the default value of 20 for REMOTE_DISTANCE and 20 for
      RECLAIM_DISTANCE it never kicks in.
      
      The local to remote bandwidth ratios can be quite large on System p
      machines so it makes sense for us to reclaim clean pagecache locally before
      going off node.
      
      The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
      zone reclaim.
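      
      A hedged sketch of the change (the powerpc value below is the one
      this patch chose; the generic fallback is shown for comparison):
      
        /* arch/powerpc/include/asm/topology.h */
        #define RECLAIM_DISTANCE 10
        
        /* generic default, for comparison:
         *
         *      #ifndef RECLAIM_DISTANCE
         *      #define RECLAIM_DISTANCE 20
         *      #endif
         */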
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  17. 06 May 2010 (2 commits)
  18. 07 Apr 2010 (1 commit)
    • powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim · 27f10907
      Committed by Anton Blanchard
      I noticed /proc/sys/vm/zone_reclaim_mode was 0 on a ppc64 NUMA box. It gets
      enabled via this:
      
              /*
               * If another node is sufficiently far away then it is better
               * to reclaim pages in a zone before going off node.
               */
              if (distance > RECLAIM_DISTANCE)
                      zone_reclaim_mode = 1;
      
      Since we use the default value of 20 for REMOTE_DISTANCE and 20 for
      RECLAIM_DISTANCE it never kicks in.
      
      The local to remote bandwidth ratios can be quite large on System p
      machines so it makes sense for us to reclaim clean pagecache locally before
      going off node.
      
      The patch below sets a smaller value for RECLAIM_DISTANCE and thus enables
      zone reclaim.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  19. 09 Feb 2010 (1 commit)
  20. 15 Jan 2010 (1 commit)
    • powerpc: cpumask_of_node() should handle -1 as a node · c81b812a
      Committed by Anton Blanchard
      pcibus_to_node can return -1 if we cannot determine which node a pci bus
      is on. If passed -1, cpumask_of_node will negatively index the lookup array
      and pull in random data:
      
      # cat /sys/devices/pci0000:00/0000:00:01.0/local_cpus
      00000000,00000003,00000000,00000000
      # cat /sys/devices/pci0000:00/0000:00:01.0/local_cpulist
      64-65
      
      Change cpumask_of_node to check for -1 and return cpu_all_mask in this
      case:
      
      # cat /sys/devices/pci0000:00/0000:00:01.0/local_cpus
      ffffffff,ffffffff,ffffffff,ffffffff
      # cat /sys/devices/pci0000:00/0000:00:01.0/local_cpulist
      0-127
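      
      A hedged sketch of the fix in the powerpc topology header (the
      lookup array name is illustrative):
      
        #define cpumask_of_node(node) ((node) == -1 ?   \
                        cpu_all_mask :                  \
                        &node_to_cpumask_map[node])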
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  21. 24 Sep 2009 (3 commits)
  22. 16 Sep 2009 (1 commit)
    • sched: Disable wakeup balancing · 182a85f8
      Committed by Peter Zijlstra
      Sysbench thinks SD_BALANCE_WAKE is too aggressive and kbuild doesn't
      really mind too much; SD_BALANCE_NEWIDLE picks up most of the
      slack.
      
      On a dual-socket, quad-core, dual-thread Nehalem system:
      
      sysbench (--num_threads=16):
      
       SD_BALANCE_WAKE-: 13982 tx/s
       SD_BALANCE_WAKE+: 15688 tx/s
      
      kbuild (-j16):
      
       SD_BALANCE_WAKE-: 47.648295846  seconds time elapsed   ( +-   0.312% )
       SD_BALANCE_WAKE+: 47.608607360  seconds time elapsed   ( +-   0.026% )
      
      (same within noise)
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  23. 15 Sep 2009 (3 commits)
    • sched: Improve latencies and throughput · 0ec9fab3
      Committed by Mike Galbraith
      Make the idle balancer more aggressive, to improve an
      x264 encoding workload provided by Jason Garrett-Glaser:
      
       NEXT_BUDDY NO_LB_BIAS
       encoded 600 frames, 252.82 fps, 22096.60 kb/s
       encoded 600 frames, 250.69 fps, 22096.60 kb/s
       encoded 600 frames, 245.76 fps, 22096.60 kb/s
      
       NO_NEXT_BUDDY LB_BIAS
       encoded 600 frames, 344.44 fps, 22096.60 kb/s
       encoded 600 frames, 346.66 fps, 22096.60 kb/s
       encoded 600 frames, 352.59 fps, 22096.60 kb/s
      
       NO_NEXT_BUDDY NO_LB_BIAS
       encoded 600 frames, 425.75 fps, 22096.60 kb/s
       encoded 600 frames, 425.45 fps, 22096.60 kb/s
       encoded 600 frames, 422.49 fps, 22096.60 kb/s
      
      Peter pointed out that this is better done via newidle_idx,
      not via LB_BIAS: newidle balancing should look for where
      there is load _now_, not where there was load 2 ticks ago.
      
      Worst-case latencies are improved as well, since having no buddies
      means less vruntime spread (as per prior lkml discussions).
      
      This change improves kbuild-peak parallelism as well.
      Reported-by: Jason Garrett-Glaser <darkshikari@gmail.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1253011667.9128.16.camel@marge.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Tweak wake_idx · 78e7ed53
      Committed by Peter Zijlstra
      When merging select_task_rq_fair() and sched_balance_self() we lost
      the use of wake_idx; restore it and set the indices to 0 to make
      wake balancing more aggressive.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Merge select_task_rq_fair() and sched_balance_self() · c88d5910
      Committed by Peter Zijlstra
      The problem with wake_idle() is that it doesn't respect things like
      cpu_power, which means it doesn't deal well with SMT nor with the
      recent RT interaction.
      
      To cure this, it needs to do what sched_balance_self() does, which
      leads to the possibility of merging select_task_rq_fair() and
      sched_balance_self().
      
      Modify sched_balance_self() to:
      
        - update_shares() when walking up the domain tree,
          (it only called it for the top domain, but it should
           have done this anyway), which allows us to remove
          this ugly bit from try_to_wake_up().
      
        - do wake_affine() on the smallest domain that contains
          both this (the waking) and the prev (the wakee) cpu for
          WAKE invocations.
      
      Then use the top-down balance steps it had to replace wake_idle().
      
      This leads to the disappearance of SD_WAKE_BALANCE and
      SD_WAKE_IDLE_FAR, with SD_WAKE_IDLE replaced by SD_BALANCE_WAKE.
      
      SD_WAKE_AFFINE needs SD_BALANCE_WAKE to be effective.
      
      Touch all topology bits to replace the old with the new SD flags --
      platforms might need re-tuning; enabling SD_BALANCE_WAKE
      conditionally on a NUMA distance seems like a good additional
      feature: magny-core and small Nehalem systems would want this
      enabled, while systems with slow interconnects would not.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  24. 30 Mar 2009 (1 commit)
  25. 01 Jan 2009 (1 commit)
  26. 26 Dec 2008 (1 commit)
  27. 26 Nov 2008 (1 commit)
    • sched: convert struct root_domain to cpumask_var_t, fix · 1c391948
      Committed by Ingo Molnar
      Mathieu Desnoyers reported this build failure on powerpc:
      
       kernel/sched.c: In function 'sd_init_NODE':
       kernel/sched.c:7319: error: non-static initialization of a flexible array member
       kernel/sched.c:7319: error: (near initialization for '(anonymous)')
      
      this happens because .span changed to cpumask_var_t, hence
      the static CPU_MASK_NONE initializers in the SD_*_INIT
      templates are not type-correct anymore.
      
      Remove them, as they default to empty anyway.
      
      Also remove them from IA64, MIPS and SH.
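      
      A hedged illustration of the breakage and the fix (the template is
      abbreviated and the remaining field values are representative, not
      quoted): with .span now variable-sized, the static CPU_MASK_NONE
      initializer no longer compiles, and deleting the line is safe
      because omitted members of a designated initializer default to
      zero/empty.
      
        #define SD_NODE_INIT (struct sched_domain) {       \
                /* .span = CPU_MASK_NONE,  <-- removed */   \
                .min_interval   = 8,                        \
                .max_interval   = 32,                       \
        }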
      Reported-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  28. 04 Aug 2008 (1 commit)
  29. 28 Jul 2008 (2 commits)
  30. 20 Apr 2008 (1 commit)
    • asm-generic: add node_to_cpumask_ptr macro · aa6b5446
      Committed by Mike Travis
      Create a simple macro to always return a pointer to the node_to_cpumask(node)
      value.  This relies on compiler optimization to remove the extra indirection:
      
          #define node_to_cpumask_ptr(v, node) 		\
      	    cpumask_t _##v = node_to_cpumask(node), *v = &_##v
      
      For those systems with a large cpumask size, then a true pointer
      to the array element can be used:
      
          #define node_to_cpumask_ptr(v, node)		\
      	    cpumask_t *v = &(node_to_cpumask_map[node])
      
      A node_to_cpumask_ptr_next() macro is provided to access another
      node_to_cpumask value.
      
      The other change is to always include asm-generic/topology.h, moving
      the ifdef CONFIG_NUMA into that same file.
      
      Note: there are no references to either of these new macros in this patch,
      only the definition.
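      
      A hedged usage sketch (the function is illustrative) showing why the
      pointer form matters: on large-cpumask systems the node's mask is
      accessed in place instead of being copied onto the stack.
      
        static int sketch_any_online_cpu_on_node(int node)
        {
                int cpu;
                node_to_cpumask_ptr(mask, node); /* declares cpumask_t *mask */
        
                for_each_cpu_mask(cpu, *mask)
                        if (cpu_online(cpu))
                                return cpu;
                return -1;
        }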
      
      Based on 2.6.25-rc5-mm1
      
      # alpha
      Cc: Richard Henderson <rth@twiddle.net>
      
      # fujitsu
      Cc: David Howells <dhowells@redhat.com>
      
      # ia64
      Cc: Tony Luck <tony.luck@intel.com>
      
      # powerpc
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Anton Blanchard <anton@samba.org>
      
      # sparc
      Cc: David S. Miller <davem@davemloft.net>
      Cc: William L. Irwin <wli@holomorphy.com>
      
      # x86
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Mike Travis <travis@sgi.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>