1. 17 Feb 2011 (11 commits)
    • x86-64, NUMA: Kill numa_nodes[] · 91556237
      Authored by Tejun Heo
      numa_nodes[] doesn't carry any information which isn't present in
      numa_meminfo.  Each entry is simply min/max range of all the memblks
      for the node.  This is not only redundant but also inaccurate when
      memblks for different nodes interleave - for example,
      find_node_by_addr() can return the wrong nodeid.
      
      Kill numa_nodes[] and always use numa_meminfo instead.
      
      * nodes_cover_memory() is renamed to numa_meminfo_cover_memory() and
        now operates on numa_meminfo and returns bool.
      
      * setup_node_bootmem() needs min/max range.  Compute the range on the
        fly (see the sketch below).  setup_node_bootmem() invocation is
        restructured to use an outer loop instead of hardcoding the double
        invocations.
      
      * find_node_by_addr() now operates on numa_meminfo.
      
      * setup_physnodes() builds physnodes[] from memblks.  This will go
        away when emulation code is updated to use struct numa_meminfo.
      
      This patch also makes the following misc changes.
      
      * Clearing of nodes_add[] is converted to memset().
      
      * numa_add_memblk() in amd_numa_init() is moved down a bit for
        consistency.
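
      The on-the-fly min/max computation referenced above might look roughly
      like this (mi and the numa_meminfo fields are as used elsewhere in this
      series; the loop is illustrative rather than the literal patch):

          for_each_node_mask(nid, node_possible_map) {
                  u64 start = (u64)max_pfn << PAGE_SHIFT;
                  u64 end = 0;
                  int i;

                  /* derive the node's range from its memblks */
                  for (i = 0; i < mi->nr_blks; i++) {
                          if (mi->blk[i].nid != nid)
                                  continue;
                          start = min(mi->blk[i].start, start);
                          end = max(mi->blk[i].end, end);
                  }

                  if (start < end)
                          setup_node_bootmem(nid, start, end);
          }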
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      91556237
    • x86-64, NUMA: Add common find_node_by_addr() · a844ef46
      Authored by Tejun Heo
      srat_64.c and amdtopology_64.c had their own versions of
      find_node_by_addr() which were basically the same.  Add common one in
      numa_64.c and remove the duplicates.
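
      A minimal sketch of the shared helper; at this point in the series it
      still walks the per-node ranges in numa_nodes[], and the exact field
      names are assumptions:

          int __init find_node_by_addr(unsigned long addr)
          {
                  int i;

                  for_each_node_mask(i, mem_nodes_parsed) {
                          if (addr >= numa_nodes[i].start &&
                              addr < numa_nodes[i].end)
                                  return i;
                  }
                  return NUMA_NO_NODE;
          }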
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      a844ef46
    • x86-64, NUMA: consolidate and improve memblk sanity checks · 56e827fb
      Authored by Tejun Heo
      memblk sanity check was scattered around and incomplete.  Consolidate
      and improve.
      
      * Conflict detection and cutoff_node() logic are moved to
        numa_cleanup_meminfo().
      
      * numa_cleanup_meminfo() clears the unused memblks before returning.
      
      * Check and warn about invalid input parameters in numa_add_memblk().

      * Check that the maximum number of memblks isn't exceeded in
        numa_add_memblk() (see the sketch after this list).
      
      * numa_cleanup_meminfo() is now called before numa_emulation() so that
        the emulation code also uses the cleaned up version.
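
      A rough sketch of what the consolidated checks in numa_add_memblk()
      amount to (messages and exact structure are illustrative; NR_NODE_MEMBLKS
      and struct numa_meminfo are from this series):

          int __init numa_add_memblk(int nid, u64 start, u64 end)
          {
                  struct numa_meminfo *mi = &numa_meminfo;

                  /* silently ignore zero-length memblks */
                  if (start == end)
                          return 0;

                  /* whine about and ignore invalid input */
                  if (start > end || nid < 0 || nid >= MAX_NUMNODES) {
                          pr_warning("NUMA: invalid memblk node %d (%Lx-%Lx)\n",
                                     nid, start, end);
                          return 0;
                  }

                  /* don't overflow the fixed-size memblk array */
                  if (mi->nr_blks >= NR_NODE_MEMBLKS) {
                          pr_err("NUMA: too many memblk ranges\n");
                          return -EINVAL;
                  }

                  mi->blk[mi->nr_blks].start = start;
                  mi->blk[mi->nr_blks].end = end;
                  mi->blk[mi->nr_blks].nid = nid;
                  mi->nr_blks++;
                  return 0;
          }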
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      56e827fb
    • x86-64, NUMA: make numa_cleanup_meminfo() prettier · 2e756be4
      Authored by Tejun Heo
      * Factor out numa_remove_memblk_from().
      
      * Hole detection doesn't need separate start/end.  Calculate start/end
        once.
      
      * Relocate comment.
      
      * Define iterators at the top and remove unnecessary prefix
        increments.
      
      This prepares for further improvements to the function.
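
      The factored-out helper is essentially an array-element removal; roughly:

          static void __init numa_remove_memblk_from(int idx,
                                                     struct numa_meminfo *mi)
          {
                  mi->nr_blks--;
                  /* close the gap left by the removed memblk */
                  memmove(&mi->blk[idx], &mi->blk[idx + 1],
                          (mi->nr_blks - idx) * sizeof(mi->blk[0]));
          }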
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      2e756be4
    • x86-64, NUMA: Separate out numa_cleanup_meminfo() · f9c60251
      Authored by Tejun Heo
      Separate out numa_cleanup_meminfo() from numa_register_memblks().
      node_possible_map initialization is moved to the top of the split
      numa_register_memblks().
      
      This patch doesn't cause behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      f9c60251
    • x86-64, NUMA: Introduce struct numa_meminfo · 97e7b78d
      Authored by Tejun Heo
      Arrays for memblks and nodeids and their length lived in separate
      variables making things unnecessarily cumbersome.  Introduce struct
      numa_meminfo which contains all memory configuration info.  This patch
      doesn't cause any behavior change.
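
      The structure is roughly as follows (NR_NODE_MEMBLKS bounds the number
      of memblks, as elsewhere in this series):

          struct numa_memblk {
                  u64                     start;
                  u64                     end;
                  int                     nid;
          };

          struct numa_meminfo {
                  int                     nr_blks;
                  struct numa_memblk      blk[NR_NODE_MEMBLKS];
          };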
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      97e7b78d
    • x86-64, NUMA: Remove %NULL @nodeids handling from compute_hash_shift() · 8968dab8
      Authored by Tejun Heo
      numa_emulation() called compute_hash_shift() with %NULL @nodeids which
      meant identity mapping between index and nodeid.  Make
      numa_emulation() build identity array and drop %NULL @nodeids handling
      from populate_memnodemap() and thus from compute_hash_shift().  This
      is to prepare for transition to using memblks instead.
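
      The identity array on the emulation side is trivial to build; a sketch
      (variable names are illustrative):

          static int nids[MAX_NUMNODES] __initdata;
          int i;

          for (i = 0; i < num_nodes; i++)
                  nids[i] = i;

          memnode_shift = compute_hash_shift(nodes, num_nodes, nids);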
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      8968dab8
    • x86-64, NUMA: Kill {acpi|amd|dummy}_scan_nodes() · 5d371b08
      Authored by Tejun Heo
      They are empty now.  Kill them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      5d371b08
    • x86-64, NUMA: Unify the rest of memblk registration · fd0435d8
      Authored by Tejun Heo
      Move the remaining memblk registration logic from acpi_scan_nodes() to
      numa_register_memblks() and initmem_init().
      
      This applies nodes_cover_memory() sanity check, memory node sorting
      and node_online() checking, which were only applied to acpi, to all
      init methods.
      
      As all memblk registration is moved to common code, active range
      clearing is moved to initmem_init() too and removed from bad_srat().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      fd0435d8
    • x86-64, NUMA: Unify use of memblk in all init methods · 43a662f0
      Authored by Tejun Heo
      Make both amd and dummy use numa_add_memblk() to describe the detected
      memory blocks.  This allows initmem_init() to call
      numa_register_memblk() regardless of the init method in use.  Drop the
      custom memory registration code from amd and dummy.
      
      After this change, memblk merge/cleanup in numa_register_memblks() is
      applied to all init methods.
      
      As this makes compute_hash_shift() and numa_register_memblks() used
      only inside numa_64.c, make them static.
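
      After this change the dummy path describes its single fake node the same
      way the real detection paths do; a rough sketch (not the literal code):

          static int __init dummy_numa_init(void)
          {
                  printk(KERN_INFO "Faking a node at %016lx-%016lx\n",
                         0LU, max_pfn << PAGE_SHIFT);

                  node_set(0, cpu_nodes_parsed);
                  node_set(0, mem_nodes_parsed);
                  numa_add_memblk(0, 0, (u64)max_pfn << PAGE_SHIFT);

                  return 0;
          }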
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      43a662f0
    • x86-64, NUMA: Factor out memblk handling into numa_{add|register}_memblk() · ef396ec9
      Authored by Tejun Heo
      Factor out memblk handling from srat_64.c into two functions in
      numa_64.c.  This patch doesn't introduce any behavior change.  The
      next patch will make all init methods use these functions.
      
      - v2: Fixed build failure on 32bit due to misplaced NR_NODE_MEMBLKS.
            Reported by Ingo.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      ef396ec9
  2. 16 Feb 2011 (12 commits)
    • x86-64, NUMA: Kill {acpi|amd}_get_nodes() · 19095548
      Authored by Tejun Heo
      With common numa_nodes[], common code in numa_64.c can access it
      directly.  Copy directly and kill {acpi|amd}_get_nodes().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      19095548
    • x86-64, NUMA: Use common numa_nodes[] · 206e4208
      Authored by Tejun Heo
      ACPI and amd are using separate nodes[] arrays.  Add a common
      numa_nodes[] and use it in all NUMA init methods.  cutoff_node()
      cleanup is moved from srat_64.c to numa_64.c and applied in
      initmem_init() regardless of the init method.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      206e4208
    • x86-64, NUMA: Move apicid to numa mapping initialization from amd_scan_nodes() to amd_numa_init() · 45fe6c78
      Authored by Tejun Heo
      This brings amd initialization behavior closer to that of acpi.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      45fe6c78
    • x86-64, NUMA: Remove local variable found from amd_numa_init() · 99df738c
      Authored by Tejun Heo
      Use the weight of mem_nodes_parsed instead.
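
      That is, the success test presumably becomes a check along these lines:

          if (!nodes_weight(mem_nodes_parsed))
                  return -ENOENT;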
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      99df738c
    • x86-64, NUMA: Use common {cpu|mem}_nodes_parsed · ec8cf29b
      Authored by Tejun Heo
      ACPI and amd are using separate nodes_parsed masks.  Add
      {cpu|mem}_nodes_parsed and use them in all NUMA init methods.
      Initialization of the masks and building node_possible_map are now
      handled commonly by initmem_init().
      
      dummy_numa_init() is updated to set node 0 on both masks.  While at
      it, move the info messages from scan to init.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      ec8cf29b
    • x86-64, NUMA: Restructure initmem_init() · ffe77a46
      Authored by Tejun Heo
      Reorganize initmem_init() such that,
      
      * Different NUMA init methods are iterated in a consistent way.
      
      * Each iteration re-initializes all the parameters and different
        method can be tried after a failure.
      
      * Dummy init is handled the same as other methods.
      
      Apart from how retry after failure is handled, this patch doesn't
      change the behavior.  The call sequences are kept equivalent across the
      conversion.
      
      After the change, bad_srat() doesn't need to clear apic to node
      mapping or worry about numa_off.  Simplified accordingly.
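
      Conceptually the restructured initmem_init() becomes a single loop over
      the available init methods (the names init_func/scan_func and the exact
      reset steps below are illustrative, not the literal code):

          for (i = 0; i < num_init_methods; i++) {
                  /* reset all NUMA state so a failed method leaves nothing behind */
                  nodes_clear(cpu_nodes_parsed);
                  nodes_clear(mem_nodes_parsed);
                  nodes_clear(node_possible_map);

                  if (init_func[i]() < 0)
                          continue;
                  if (scan_func[i]() < 0)
                          continue;
                  return;         /* this method succeeded */
          }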
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      ffe77a46
    • x86, NUMA: Move *_numa_init() invocations into initmem_init() · d8fc3afc
      Authored by Tejun Heo
      There's no reason for these to live in setup_arch().  Move them inside
      initmem_init().
      
      - v2: x86-32 initmem_init() wasn't updated, breaking 32bit builds.
        Fixed.  Found by Ankita.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ankita Garg <ankita@in.ibm.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      d8fc3afc
    • x86-64, NUMA: Wrap acpi_numa_init() so that failure can be indicated by return value · a9aec56a
      Authored by Tejun Heo
      Because of the way ACPI tables are parsed, the generic
      acpi_numa_init() couldn't return failure when error was detected by
      arch hooks.  Instead, the failure state was recorded and later arch
      dependent init hook - acpi_scan_nodes() - would fail.
      
      Wrap acpi_numa_init() with x86_acpi_numa_init() so that failure can be
      indicated as return value immediately.  This is in preparation for
      further NUMA init cleanups.
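
      The wrapper boils down to turning the recorded SRAT failure into an
      immediate error return; roughly:

          int __init x86_acpi_numa_init(void)
          {
                  int ret;

                  ret = acpi_numa_init();
                  if (ret < 0)
                          return ret;
                  return srat_disabled() ? -EINVAL : 0;
          }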
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      a9aec56a
    • x86-64, NUMA: Unify {acpi|amd}_{numa_init|scan_nodes}() arguments and return values · 940fed2e
      Authored by Tejun Heo
      The functions used during NUMA initialization - *_numa_init() and
      *_scan_nodes() - have different arguments and return values.  Unify
      them such that they all take no argument and return 0 on success and
      -errno on failure.  This is in preparation for further NUMA init
      cleanups.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      940fed2e
    • x86, NUMA: Drop @start/last_pfn from initmem_init() · 86ef4dbf
      Authored by Tejun Heo
      initmem_init() extensively accesses and modifies global data
      structures, and the parameters aren't even honored on every path.
      Drop @start/last_pfn and let it deal with
      @max_pfn directly.  This is in preparation for further NUMA init
      cleanups.
      
      - v2: x86-32 initmem_init() wasn't updated, breaking 32bit builds.
        Fixed.  Found by Yinghai.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      86ef4dbf
    • x86-64, NUMA: Simplify hotplug node handling in acpi_numa_memory_affinity_init() · 13081df5
      Authored by Tejun Heo
      Hotplug node handling in acpi_numa_memory_affinity_init() was
      unnecessarily complicated with storing the original nodes[] entry and
      restoring it afterwards.  Simplify it by not modifying the nodes[]
      entry for hotplug nodes from the beginning.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      13081df5
    • x86-64, NUMA: Make dummy node initialization path similar to non-dummy ones · 7d36b7bc
      Authored by Tejun Heo
      Dummy node initialization in initmem_init() didn't initialize apicid
      to node mapping and set cpu to node mapping directly by calling
      numa_set_node(), which is different from non-dummy init paths.
      
      Update it such that they behave similarly.  Initialize apicid to node
      mapping and call numa_init_array().  The actual cpu to node mapping is
      handled by init_cpu_to_node() later.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      7d36b7bc
  3. 15 Feb 2011 (1 commit)
    • x86, amd: Initialize variable properly · 9e81509e
      Authored by Borislav Petkov
      Commit d518573d ("x86, amd: Normalize compute unit IDs on
      multi-node processors") introduced compute unit normalization
      but causes a compiler warning:
      
       arch/x86/kernel/cpu/amd.c: In function 'amd_detect_cmp':
       arch/x86/kernel/cpu/amd.c:268: warning: 'cores_per_cu' may be used uninitialized in this function
       arch/x86/kernel/cpu/amd.c:268: note: 'cores_per_cu' was declared here
      
      The compiler is right - initialize it with a proper value.
      
      Also, fix up a comment while at it.
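
      The fix presumably amounts to giving the variable a sane default at its
      declaration (exact type and value are as declared in amd.c; this is only
      a sketch):

          /* in amd_detect_cmp(): default to one core per compute unit */
          unsigned cores_per_cu = 1;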
      Reported-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
      Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
      LKML-Reference: <20110214171451.GB10076@kryptos.osrc.amd.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9e81509e
  4. 14 Feb 2011 (7 commits)
    • x86, numa: Add error handling for bad cpu-to-node mappings · 14392fd3
      Authored by David Rientjes
      With CONFIG_DEBUG_PER_CPU_MAPS, early_cpu_to_node() may return
      NUMA_NO_NODE when a mapping hasn't been initialized.  In such a
      case, it emits a warning and continues without an issue, but
      callers may try to use the return value to index into an array.
      
      We can catch those errors and fail silently since a warning has
      already been emitted.  No current user of numa_add_cpu()
      requires this error checking to avoid a crash, but it's better
      to be proactive in case a future user happens to have a bug and
      a user tries to diagnose it with CONFIG_DEBUG_PER_CPU_MAPS.
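
      The added check is essentially the following (a sketch of the
      CONFIG_DEBUG_PER_CPU_MAPS variant):

          int node = early_cpu_to_node(cpu);

          /*
           * early_cpu_to_node() has already warned; bail out silently
           * instead of indexing node_to_cpumask_map[] with NUMA_NO_NODE.
           */
          if (node == NUMA_NO_NODE)
                  return;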
      Reported-by: Jesper Juhl <jj@chaosbits.net>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      LKML-Reference: <alpine.DEB.2.00.1102071407250.7812@chino.kir.corp.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      14392fd3
    • x86: Emit "mem=nopentium ignored" warning when not supported · 9a6d44b9
      Authored by Kamal Mostafa
      Emit a warning when "mem=nopentium" is specified on any arch other
      than x86_32 (the only arch that supports it).
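
      A sketch of the intended behavior in the mem= parser (the exact warning
      text is an assumption):

          if (!strcmp(p, "nopentium")) {
          #ifdef CONFIG_X86_32
                  setup_clear_cpu_cap(X86_FEATURE_PSE);
                  return 0;
          #else
                  printk(KERN_WARNING
                         "mem=nopentium ignored! (only supported on x86_32)\n");
                  return -EINVAL;
          #endif
          }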
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
      BugLink: http://bugs.launchpad.net/bugs/553464
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      LKML-Reference: <1296783486-23033-2-git-send-email-kamal@canonical.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>
      9a6d44b9
    • x86: Fix panic when handling "mem={invalid}" param · 77eed821
      Authored by Kamal Mostafa
      Avoid removing all of memory and panicking when "mem={invalid}"
      is specified, e.g. mem=blahblah, mem=0, or mem=nopentium (on
      platforms other than x86_32).
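
      The guard boils down to rejecting a zero result from memparse() instead
      of handing it on to the e820 code; roughly:

          mem_size = memparse(p, &p);
          /* don't remove all of memory when handling "mem={invalid}" */
          if (mem_size == 0)
                  return -EINVAL;

          e820_remove_range(mem_size, ULLONG_MAX - mem_size, E820_RAM, 1);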
      Signed-off-by: Kamal Mostafa <kamal@canonical.com>
      BugLink: http://bugs.launchpad.net/bugs/553464
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: <stable@kernel.org> # .3x: as far back as it applies
      LKML-Reference: <1296783486-23033-1-git-send-email-kamal@canonical.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      77eed821
    • x86: Avoid tlbstate lock if not enough cpus · 7064d865
      Authored by Shaohua Li
      This one isn't related to the previous patch.  If the number of online
      cpus is below NUM_INVALIDATE_TLB_VECTORS, we don't need the lock.  The
      comments in the code declare we don't need the check, but even a hot
      lock still needs an atomic operation and is expensive, so add the
      check here.
      
      Uses nr_cpu_ids here as suggested by Eric Dumazet.
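
      The check is a simple guard around the lock in the IPI send path;
      roughly (f being the per-vector flush state):

          if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
                  raw_spin_lock(&f->tlbstate_lock);

          /* ... send the flush IPI and wait for completion ... */

          if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
                  raw_spin_unlock(&f->tlbstate_lock);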
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      LKML-Reference: <1295232730.1949.710.camel@sli10-conroe>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7064d865
    • x86: Scale up the number of TLB invalidate vectors with NR_CPUs, up to 32 · 70e4a369
      Authored by Shaohua Li
      Make the maximum number of TLB invalidate vectors depend on NR_CPUS
      linearly, with a maximum of 32 vectors.
      
      We currently only have 8 vectors for TLB invalidation and that is clearly
      inadequate.  If we have a lot of CPUs, the CPUs need to share the 8
      vectors, and tlbstate_lock is used to protect them.  flush_tlb_page() is
      heavily used in page reclaim, which causes a lot of lock
      contention for tlbstate_lock.
      
      Andi Kleen suggested increasing the number of vectors to 32, which should
      be enough on current typical systems to reduce tlbstate_lock contention.
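
      The scaling itself is a compile-time clamp; conceptually:

          /* scale with NR_CPUS, but never use more than 32 vectors */
          #if NR_CPUS <= 32
          # define NUM_INVALIDATE_TLB_VECTORS     NR_CPUS
          #else
          # define NUM_INVALIDATE_TLB_VECTORS     32
          #endif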
      
      My test system has 4 sockets, 64G of memory, and 64 CPUs.  My
      workload creates 64 processes.  Each process mmaps and reads a big
      empty sparse file.  The total size of the files is 2*total_mem,
      so this causes a lot of page reclaim.
      
      Below is the result I get from perf call-graph profiling:
      
       without the patch:
       ------------------
      
          24.25%           usemem  [kernel]                                   [k] _raw_spin_lock
                           |
                           --- _raw_spin_lock
                              |
                              |--42.15%-- native_flush_tlb_others
      
       with the patch:
       ------------------
      
          14.96%           usemem  [kernel]                                   [k] _raw_spin_lock
                           |
                           --- _raw_spin_lock
                              |--13.89%-- native_flush_tlb_others
      
      So this heavily reduces the tlbstate_lock contention.
      Suggested-by: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1295232727.1949.709.camel@sli10-conroe>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      70e4a369
    • x86: Allocate 32 tlb_invalidate_interrupt handler stubs · 3a09fb45
      Authored by Shaohua Li
      Add up to 32 invalidate_interrupt handlers. How many handlers are
      added depends on NUM_INVALIDATE_TLB_VECTORS. So if
      NUM_INVALIDATE_TLB_VECTORS is smaller than 32, we reduce code
      size.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      LKML-Reference: <1295232725.1949.708.camel@sli10-conroe>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      3a09fb45
    • x86: Cleanup vector usage · 60f6e65d
      Authored by Shaohua Li
      Clean up the vector usage and make the vectors contiguous where possible.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      LKML-Reference: <1295232722.1949.707.camel@sli10-conroe>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      60f6e65d
  5. 10 Feb 2011 (2 commits)
  6. 08 Feb 2011 (1 commit)
    • x86, amd: Support L3 Cache Partitioning on AMD family 0x15 CPUs · cabb5bd7
      Authored by Hans Rosenfeld
      L3 Cache Partitioning allows selecting which of the 4 L3 subcaches can be used
      for evictions by the L2 cache of each compute unit. By writing a 4-bit
      hexadecimal mask into the sysfs file
      /sys/devices/system/cpu/cpuX/cache/index3/subcaches, the user can set the
      enabled subcaches for a CPU.
      
      The settings are directly read from and written to the hardware, so there is no
      way to have contradicting settings for two CPUs belonging to the same compute
      unit. Writing will always overwrite any previous setting for a compute unit.
      Signed-off-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
      Cc: <Andreas.Herrmann3@amd.com>
      LKML-Reference: <1297098639-431383-1-git-send-email-hans.rosenfeld@amd.com>
      [ -v3: minor style fixes ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cabb5bd7
  7. 07 Feb 2011 (1 commit)
  8. 05 Feb 2011 (1 commit)
  9. 04 Feb 2011 (1 commit)
    • x86, mm: avoid possible bogus tlb entries by clearing prev mm_cpumask after switching mm · 831d52bc
      Authored by Suresh Siddha
      Clearing the cpu in prev's mm_cpumask early will avoid the flush tlb
      IPI's while the cr3 is still pointing to the prev mm.  And this window
      can lead to the possibility of bogus TLB fills resulting in strange
      failures.  One such problematic scenario is mentioned below.
      
       T1. CPU-1 is context switching from mm1 to mm2 context and got a NMI
           etc between the point of clearing the cpu from the mm_cpumask(mm1)
           and before reloading the cr3 with the new mm2.
      
       T2. CPU-2 is tearing down a specific vma for mm1 and will proceed with
           flushing the TLB for mm1.  It doesn't send the flush TLB to CPU-1
           as it doesn't see that cpu listed in the mm_cpumask(mm1).
      
       T3. After the TLB flush is complete, CPU-2 goes ahead and frees the
           page-table pages associated with the removed vma mapping.
      
       T4. CPU-2 now allocates those freed page-table pages for something
           else.
      
       T5. As the CR3 and TLB caches for mm1 are still active on CPU-1, CPU-1
           can potentially speculate and walk through the page-table caches
           and can insert new TLB entries.  As the page-table pages are
           already freed and being used on CPU-2, this page walk can
           potentially insert a bogus global TLB entry depending on the
           (random) contents of the page that is being used on CPU-2.
      
       T6. This bogus TLB entry being global will be active across future CR3
           changes and can result in weird memory corruption etc.
      
      To avoid this issue, for the prev mm that is handing over the cpu to
      another mm, clear the cpu from the mm_cpumask(prev) after the cr3 is
      changed.
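
      In switch_mm() terms, the fix boils down to the following ordering
      (sketch):

          /* re-load the page tables first ... */
          load_cr3(next->pgd);

          /* ... and only then stop flush IPIs for the previous mm */
          cpumask_clear_cpu(cpu, mm_cpumask(prev));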
      
      Marking it for -stable, though we haven't seen any reported failure that
      can be attributed to this.
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: stable@kernel.org	[v2.6.32+]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      831d52bc
  10. 03 Feb 2011 (2 commits)
    • x86, mtrr: Avoid MTRR reprogramming on BP during boot on UP platforms · f7448548
      Authored by Suresh Siddha
      Markus Kohn ran into a hard hang regression on an acer aspire
      1310, when acpi is enabled. git bisect showed the following
      commit as the bad one that introduced the boot regression.
      
      	commit d0af9eed
      	Author: Suresh Siddha <suresh.b.siddha@intel.com>
      	Date:   Wed Aug 19 18:05:36 2009 -0700
      
      	    x86, pat/mtrr: Rendezvous all the cpus for MTRR/PAT init
      
      Because of the UP configuration of that platform,
      native_smp_prepare_cpus() bailed out (in smp_sanity_check())
      before doing the set_mtrr_aps_delayed_init()
      
      Further down the boot path, native_smp_cpus_done() will call the
      delayed MTRR initialization for the AP's (mtrr_aps_init()) with
      mtrr_aps_delayed_init not set. This resulted in the boot
      processor reprogramming its MTRR's to the values seen during the
      start of the OS boot. While this is not needed ideally, this
      shouldn't have caused any side-effects. This is because the
      reprogramming of MTRR's (set_mtrr_state() that gets called via
      set_mtrr()) will check if the live register contents are
      different from what is being asked to write and will do the actual
      write only if they are different.
      
      BP's mtrr state is read during the start of the OS boot and
      typically nothing would have changed when we ask to reprogram it
      on BP again because of the above scenario on an UP platform. So
      on a normal UP platform no reprogramming of BP MTRR MSR's
      happens and all is well.
      
      However, on this platform, bios seems to be modifying the fixed
      mtrr range registers between the start of OS boot and when we
      double check the live registers for reprogramming BP MTRR
      registers. And as the live registers are modified, we end up
      reprogramming the MTRR's to the state seen during the start of
      the OS boot.
      
      During ACPI initialization, something in the bios (probably an smi
      handler?) doesn't like this fact and results in a hard lockup.
      
      We didn't see this boot hang issue on this platform before the
      commit d0af9eed, because only
      the APs (if any) will program their MTRRs to the value that the BP
      had at the start of the OS boot.
      
      Fix this issue by checking mtrr_aps_delayed_init before
      continuing further in mtrr_aps_init().  Now, only the APs (if
      any) will program their MTRRs to the BP values during boot.
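
      The fix is an early bail-out in mtrr_aps_init(); roughly (the use_intel()
      check is pre-existing context, not part of the fix):

          void mtrr_aps_init(void)
          {
                  if (!use_intel())
                          return;

                  /*
                   * Only reprogram MTRRs if set_mtrr_aps_delayed_init() was
                   * actually called earlier in boot; otherwise there is
                   * nothing to do and the BP must not be touched.
                   */
                  if (!mtrr_aps_delayed_init)
                          return;

                  set_mtrr(~0U, 0, 0, 0);
                  mtrr_aps_delayed_init = false;
          }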
      
      Addresses https://bugzilla.novell.com/show_bug.cgi?id=623393
      
        [ By the way, this behavior of the bios modifying MTRR's after the start
          of the OS boot is not common and the kernel is not prepared to
          handle this situation well. Irrespective of this issue, during
          suspend/resume, linux kernel will try to reprogram the BP's MTRR values
          to the values seen during the start of the OS boot. So suspend/resume might
          be already broken on this platform for all linux kernel versions. ]
      Reported-and-bisected-by: Markus Kohn <jabber@gmx.org>
      Tested-by: Markus Kohn <jabber@gmx.org>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Thomas Renninger <trenn@novell.com>
      Cc: Rafael Wysocki <rjw@novell.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: stable@kernel.org # [v2.6.32+]
      LKML-Reference: <1296694975.4418.402.camel@sbsiddha-MOBL3.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f7448548
    • x86, nx: Don't force pages RW when setting NX bits · f12d3d04
      Authored by Matthieu CASTET
      Xen wants page table pages read-only.
      
      But the initial page tables (from head_*.S) live in .data or .bss.
      
      That was broken by 64edc8ed.  There is
      absolutely no reason to force these pages RW after they have already
      been marked RO.
      Signed-off-by: Matthieu CASTET <castet.matthieu@free.fr>
      Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      f12d3d04
  11. 01 Feb 2011 (1 commit)