1. 13 May 2011 (1 commit)
    • x86,xen: introduce x86_init.mapping.pagetable_reserve · 279b706b
      Stefano Stabellini committed
      Introduce a new x86_init hook called pagetable_reserve that is used
      at the end of init_memory_mapping to reserve the range of memory
      addresses used for kernel pagetable pages and to free the remaining
      ones.
      
      On native it just calls memblock_x86_reserve_range, while on Xen it
      also takes care of remapping the spare memory previously allocated
      for kernel pagetable pages from RO to RW, so that it can be used for
      other purposes.
      
      A detailed explanation of the reason why this hook is needed follows.
      
      As a consequence of the commit:
      
      commit 4b239f45
      Author: Yinghai Lu <yinghai@kernel.org>
      Date:   Fri Dec 17 16:58:28 2010 -0800
      
          x86-64, mm: Put early page table high
      
      at some point init_memory_mapping is going to reach the pagetable pages
      area and map those pages too (mapping them as normal memory that falls
      in the range of addresses passed to init_memory_mapping as argument).
      Some of those pages are already pagetable pages (they are in the range
      pgt_buf_start-pgt_buf_end); therefore they are going to be mapped RO
      and everything is fine.
      Some of these pages are not pagetable pages yet (they fall in the range
      pgt_buf_end-pgt_buf_top; for example the page at pgt_buf_end), so they
      are going to be mapped RW.  When these pages become pagetable pages and
      are hooked into the pagetable, Xen will find that the guest already has
      a RW mapping of them somewhere and will fail the operation.
      The reason Xen requires pagetables to be RO is that the hypervisor needs
      to verify that the pagetables are valid before using them. The validation
      operations are called "pinning" (more details in arch/x86/xen/mmu.c).
      
      In order to fix the issue we mark all the pages in the entire range
      pgt_buf_start-pgt_buf_top as RO; however, when the pagetable allocation
      is completed, only the range pgt_buf_start-pgt_buf_end is reserved by
      init_memory_mapping.  Hence the kernel is going to crash as soon as one
      of the pages in the range pgt_buf_end-pgt_buf_top is reused (because
      those pages are mapped RO).
      
      For this reason we need a hook to reserve the kernel pagetable pages we
      used and free the other ones so that they can be reused for other
      purposes.
      On native it just means calling memblock_x86_reserve_range; on Xen it
      also means marking RW the pagetable pages that we allocated earlier
      but never actually used.
      
      Another way to fix this without using the hook would be to add an 'if
      (xen_pv_domain())' check to the init_memory_mapping code and call the
      Xen counterpart from there, but that is just nasty.
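      
      A minimal sketch of the shape of the hook, reconstructed from the
      description above (the declarations are illustrative, not the literal
      diff):
      
        /* arch/x86/include/asm/x86_init.h: the new hook */
        struct x86_init_mapping {
            void (*pagetable_reserve)(u64 start, u64 end);
        };
        
        /* native implementation: just reserve the range actually used */
        void __init native_pagetable_reserve(u64 start, u64 end)
        {
            memblock_x86_reserve_range(start, end, "PGTABLE");
        }
        
        /* at the end of init_memory_mapping(): */
        if (!after_bootmem && pgt_buf_end > pgt_buf_start)
            x86_init.mapping.pagetable_reserve(PFN_PHYS(pgt_buf_start),
                                               PFN_PHYS(pgt_buf_end));
      
      On Xen, the hook additionally remaps the unused pgt_buf_end-pgt_buf_top
      range back to RW before reserving only the used part.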
      Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      Acked-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
  2. 02 May 2011 (1 commit)
    • x86, NUMA: Fix empty memblk detection in numa_cleanup_meminfo() · 2be19102
      Yinghai Lu committed
      numa_cleanup_meminfo() trims each memblk between the low (0) and
      high (max_pfn) limits and discards empty ones.  However, the
      emptiness detection incorrectly used an equality test.  If the
      start of a memblk is higher than max_pfn, it is empty but fails
      the equality test and doesn't get discarded.
      
      The condition triggers when max_pfn is lower than the start of a
      NUMA node and results in memory misconfiguration, leading to
      WARN_ON()s and other funnies.  The bug was discovered in the devel
      branch, where 32bit also uses this code path for NUMA init.  If a
      node is above the addressing limit, max_pfn ends up lower than the
      node, triggering this problem.
      
      The failure hasn't been observed on x86-64 but is still possible
      with broken hardware e820/NUMA info.  As the fix is very low
      risk, it would be better to apply it even for 64bit.
      
      Fix it by using >= instead of ==.
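      
      The fix itself is a one-character change in the discard loop of
      numa_cleanup_meminfo(); a sketch of the surrounding logic:
      
        for (i = 0; i < mi->nr_blks; i++) {
            struct numa_memblk *bi = &mi->blk[i];
        
            /* clamp the block to the [low, high) limits */
            bi->start = max(bi->start, low);
            bi->end = min(bi->end, high);
        
            /* discard empty blocks; >= also catches a start above max_pfn */
            if (bi->start >= bi->end)    /* was: bi->start == bi->end */
                numa_remove_memblk_from(i--, mi);
        }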
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      [ Extracted the actual fix from the original patch and rewrote patch description. ]
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20110501171204.GO29280@htj.dyndns.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  3. 21 April 2011 (1 commit)
  4. 04 April 2011 (1 commit)
    • x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change · 765af22d
      Tejun Heo committed
      Commit d8fc3afc (x86, NUMA: Move *_numa_init() invocations
      into initmem_init()) moved the acpi_numa_init() call into the NUMA
      initmem_init() but forgot to update the 32bit NUMA init path,
      breaking ACPI NUMA configuration for 32bit.
      
      The acpi_numa_init() call was later moved again, to srat_64.c.
      Match it by adding the call to get_memcfg_from_srat() in srat_32.c.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      LKML-Reference: <20110404100645.GE1420@mtj.dyndns.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  5. 24 March 2011 (3 commits)
  6. 20 March 2011 (1 commit)
    • x86: Cleanup highmap after brk is concluded · e5f15b45
      Yinghai Lu committed
      cleanup_highmap is currently done in two steps: one early in head64.c,
      which only clears above _end, and a second one in init_memory_mapping(),
      which tries to clean from _brk_end to _end.
      It should check whether those boundaries are PMD_SIZE aligned, but
      currently does not.
      Also, init_memory_mapping() is called several times for numa or memory
      hotplug, so we really should not handle initial kernel mappings there.
      
      This patch moves cleanup_highmap() down to after _brk_end is settled,
      so we can do everything in one step.
      Also, we honor max_pfn_mapped in the implementation of cleanup_highmap.
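      
      A sketch of the consolidated cleanup_highmap(), assuming it derives its
      bounds from _brk_end and max_pfn_mapped as described:
      
        void __init cleanup_highmap(void)
        {
            unsigned long vaddr = __START_KERNEL_map;
            unsigned long vaddr_end = __START_KERNEL_map +
                                      (max_pfn_mapped << PAGE_SHIFT);
            unsigned long end = roundup((unsigned long)_brk_end, PMD_SIZE) - 1;
            pmd_t *pmd = level2_kernel_pgt;
        
            /* clear kernel-high PMDs outside [_text, _brk_end) */
            for (; vaddr + PMD_SIZE - 1 < vaddr_end; pmd++, vaddr += PMD_SIZE) {
                if (pmd_none(*pmd))
                    continue;
                if (vaddr < (unsigned long)_text || vaddr > end)
                    set_pmd(pmd, __pmd(0));
            }
        }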
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
      LKML-Reference: <alpine.DEB.2.00.1103171739050.3382@kaball-desktop>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
  7. 18 March 2011 (2 commits)
    • x86: Flush TLB if PGD entry is changed in i386 PAE mode · 4981d01e
      Shaohua Li committed
      According to the Intel CPU manual, every time a PGD entry is changed
      in i386 PAE mode, we need to do a full TLB flush.  The current code
      follows this, and there is a comment about it in the code too.
      
      But the current code misses the multi-threaded case.  A changed page
      table might be used by several CPUs, and every such CPU should flush
      its TLB.  Usually this isn't a problem, because we prepopulate all PGD
      entries at process fork.  But when the process does munmap followed by
      a new mmap, this issue is triggered.
      
      When it happens, some CPUs keep doing page faults:
      
        http://marc.info/?l=linux-kernel&m=129915020508238&w=2
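      
      A sketch of the fix in pud_populate()
      (arch/x86/include/asm/pgtable-3level.h), assuming the cross-CPU flush
      replaces a local-only cr3 reload:
      
        static inline void pud_populate(struct mm_struct *mm, pud_t *pudp,
                                        pmd_t *pmd)
        {
            paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
            set_pud(pudp, __pud(__pa(pmd) | _PAGE_PRESENT));
        
            /*
             * A changed PGD entry in PAE mode needs a full TLB flush on
             * every CPU that may be using this mm, not only the local one
             * as "if (mm == current->active_mm) write_cr3(read_cr3());"
             * did.
             */
            flush_tlb_mm(mm);
        }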
      
      Reported-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Tested-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Mallick Asit K <asit.k.mallick@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: stable <stable@kernel.org>
      LKML-Reference: <1300246649.2337.95.camel@sli10-conroe>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86: Fix common misspellings · 0d2eb44f
      Lucas De Marchi committed
      They were generated by 'codespell' and then manually reviewed.
      Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
      Cc: trivial@kernel.org
      LKML-Reference: <1300389856-1099-3-git-send-email-lucas.demarchi@profusion.mobi>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  8. 15 March 2011 (1 commit)
  9. 12 March 2011 (1 commit)
  10. 10 March 2011 (2 commits)
    • x86/mm: Fix pgd_lock deadlock · a79e53d8
      Andrea Arcangeli committed
      It's forbidden to take the page_table_lock with irqs disabled: if
      there is contention, the IPIs (for TLB flushes) sent with the
      page_table_lock held will never run, leading to a deadlock.
      
      Nobody takes the pgd_lock from irq context, so the _irqsave can be
      removed.
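      
      The change itself is mechanical; a sketch:
      
        /* before: irqs disabled across pgd_lock, blocking TLB-flush IPIs */
        unsigned long flags;
        spin_lock_irqsave(&pgd_lock, flags);
        ...
        spin_unlock_irqrestore(&pgd_lock, flags);
        
        /* after: plain spinlock; pgd_lock is never taken from irq context */
        spin_lock(&pgd_lock);
        ...
        spin_unlock(&pgd_lock);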
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@kernel.org>
      LKML-Reference: <201102162345.p1GNjMjm021738@imap1.linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86/mm: Handle mm_fault_error() in kernel space · f8626854
      Andrey Vagin committed
      mm_fault_error() should not invoke the oom-killer if the page fault
      occurred in kernel space, e.g. in copy_from_user()/copy_to_user().
      
      This would happen if we find ourselves in OOM on a copy_to_user(),
      or a copy_from_user() which faults.
      
      Without this patch, the kernel hangs in copy_from_user(), because the
      OOM killer sends SIGKILL to the current process, but the process can't
      handle the signal while in a syscall; the kernel then returns to
      copy_from_user(), re-executes the faulting instruction and provokes
      the page fault again.
      
      With this patch, the kernel returns -EFAULT from copy_from_user().
      
      The code that checks whether the page fault occurred in kernel space
      has been copied from do_sigbus().
      
      This situation is handled the same way on powerpc, xtensa, tile, ...
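      
      A sketch of the kernel-mode check added to mm_fault_error() in
      arch/x86/mm/fault.c (reconstructed from the description; details may
      differ from the exact diff):
      
        if (fault & VM_FAULT_OOM) {
            /* Kernel mode? Handle exceptions or die: */
            if (!(error_code & PF_USER)) {
                up_read(&current->mm->mmap_sem);
                no_context(regs, error_code, address);
                return;
            }
        
            out_of_memory(regs, error_code, address);
        }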
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@kernel.org>
      LKML-Reference: <201103092322.p29NMNPH001682@imap1.linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  11. 04 March 2011 (5 commits)
    • x86-64, NUMA: Don't assume phys node 0 is always online in numa_emulation() · 078a1989
      Tejun Heo committed
      Undetermined entries in emu_nid_to_phys[] are filled with zero,
      assuming that physical node 0 is always online; however, this might
      not be true depending on the hardware configuration.  Find a physical
      node which is actually online and use it instead.
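      
      A sketch of the approach in numa_emulation(), assuming the default
      physical nid is picked from the mappings already established by the
      layout:
      
        int dfl_phys_nid = NUMA_NO_NODE;
        
        /* determine the default phys nid from already-mapped entries */
        for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++) {
            if (emu_nid_to_phys[i] != NUMA_NO_NODE) {
                dfl_phys_nid = emu_nid_to_phys[i];
                break;
            }
        }
        
        /* make sure all emulated nodes map to an actual (online) node */
        for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
            if (emu_nid_to_phys[i] == NUMA_NO_NODE)
                emu_nid_to_phys[i] = dfl_phys_nid;    /* was: 0 */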
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: David Rientjes <rientjes@google.com>
      LKML-Reference: <alpine.DEB.2.00.1103020628210.31626@chino.kir.corp.google.com>
    • x86, numa: Fix numa_emulation code with memory-less node0 · 3b28cf32
      Yinghai Lu committed
      This crash happens on a system that does not have RAM on node0.
      
      When numa_emulation is compiled in, and:
      
       1. we boot the system without numa=fake...
       2. or we boot the system with numa=fake=128 to make emulation fail
      
      we will get:
      
      [    0.076025] ------------[ cut here ]------------
      [    0.080004] kernel BUG at arch/x86/mm/numa_64.c:788!
      [    0.080004] invalid opcode: 0000 [#1] SMP
      [...]
      
      We need to use early_cpu_to_node() directly, because the cpu_to_apicid
      and apicid_to_node path will return node0, which is not onlined.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      LKML-Reference: <4D6ECF72.5010308@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86-64, NUMA: Clean up initmem_init() · c09cedf4
      David Rientjes committed
      This patch cleans up initmem_init() so that it is more readable and
      doesn't use an unnecessary array of function pointers to convolute the
      flow of the code.  It also makes it obvious that dummy_numa_init()
      will always succeed (and documents that requirement) so that the
      existing BUG() is never actually reached.
      
      No functional change.
      
      -tj: Updated comment for dummy_numa_init() slightly.
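      
      The cleaned-up flow looks roughly like this (a sketch; the exact
      #ifdef structure is an assumption):
      
        void __init initmem_init(void)
        {
            int ret;
        
            if (!numa_off) {
        #ifdef CONFIG_ACPI_NUMA
                ret = numa_init(x86_acpi_numa_init);
                if (!ret)
                    return;
        #endif
        #ifdef CONFIG_AMD_NUMA
                ret = numa_init(amd_numa_init);
                if (!ret)
                    return;
        #endif
            }
        
            /* dummy_numa_init() cannot fail, so BUG() is never reached */
            numa_init(dummy_numa_init);
        }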
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • x86-64, NUMA: Fix numa_emulation code with node0 without RAM · 51b361b4
      Yinghai Lu committed
      On a system that does not have RAM on node0, when numa_emulation is
      compiled in, and we either:
      1. boot the system without numa=fake...
      2. or boot the system with numa=fake=128 to make emulation fail
      
      we will get:
      
      [    0.092026] ------------[ cut here ]------------
      [    0.096005] kernel BUG at arch/x86/mm/numa_emulation.c:439!
      [    0.096005] invalid opcode: 0000 [#1] SMP
      [    0.096005] last sysfs file:
      [    0.096005] CPU 0
      [    0.096005] Modules linked in:
      [    0.096005]
      [    0.096005] Pid: 0, comm: swapper Not tainted 2.6.38-rc6-tip-yh-03869-gcb0491d-dirty #684 Sun Microsystems     Sun Fire X4240/Sun Fire X4240
      [    0.096005] RIP: 0010:[<ffffffff81cdc65b>]  [<ffffffff81cdc65b>] numa_add_cpu+0x56/0xcf
      [    0.096005] RSP: 0000:ffffffff82437ed8  EFLAGS: 00010246
      ...
      [    0.096005] Call Trace:
      [    0.096005]  [<ffffffff81cd7931>] identify_cpu+0x2d7/0x2df
      [    0.096005]  [<ffffffff827e54fa>] identify_boot_cpu+0x10/0x30
      [    0.096005]  [<ffffffff827e5704>] check_bugs+0x9/0x2d
      [    0.096005]  [<ffffffff827dceda>] start_kernel+0x3d7/0x3f1
      [    0.096005]  [<ffffffff827dc2cc>] x86_64_start_reservations+0x9c/0xa0
      [    0.096005]  [<ffffffff827dc4ad>] x86_64_start_kernel+0x1dd/0x1e8
      [    0.096005] Code: 74 06 48 8d 04 90 eb 0f 48 c7 c0 30 d9 00 00 48 03 04 d5 90 0f 60 82 8b 00 83 f8 ff 74 0d 0f a3 05 8b 7e 92 00 19 d2 85 d2 75 02 <0f> 0b 48 98 be 00 01 00 00 48 c7 c7 e0 44 60 82 44 8b 2c 85 e0
      [    0.096005] RIP  [<ffffffff81cdc65b>] numa_add_cpu+0x56/0xcf
      [    0.096005]  RSP <ffffffff82437ed8>
      [    0.096026] ---[ end trace a7919e7f17c0a725 ]---
      
      We need to use early_cpu_to_node() directly, because numa_cpu_node()
      will return node0, which is not onlined.
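      
      A sketch of the fix in numa_add_cpu() (numa_emulation.c):
      
        void __cpuinit numa_add_cpu(int cpu)
        {
            int nid;
        
            /*
             * was: nid = numa_cpu_node(cpu); which can return node0
             * even when node0 has no RAM and is not onlined
             */
            nid = early_cpu_to_node(cpu);
            BUG_ON(nid == NUMA_NO_NODE || !node_online(nid));
        
            /* ... map the cpu to every emulated node backed by the
               same physical node ... */
        }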
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • x86-64, NUMA: Revert NUMA affine page table allocation · f8911250
      Tejun Heo committed
      This patch reverts NUMA affine page table allocation added by commit
      1411e0ec (x86-64, numa: Put pgtable to local node memory).
      
      The commit made an undocumented change where the kernel linear mapping
      strictly follows the intersection of the e820 memory map and the NUMA
      configuration.  If the physical memory configuration has holes or the
      NUMA nodes are not properly aligned, this leads to using an
      unnecessarily smaller mapping size, which in turn leads to increased
      TLB pressure.  For details,
      
        http://thread.gmane.org/gmane.linux.kernel/1104672
      
      Patches to fix the problem have been proposed, but the underlying code
      needs more cleanup and the approach itself seems a bit heavy-handed,
      so it has been determined to revert the feature for now and come back
      to it in the next development cycle.
      
        http://thread.gmane.org/gmane.linux.kernel/1105959
      
      As init_memory_mapping_high() callsites have been consolidated since
      the commit, reverting is done manually.  Also, the RED-PEN comment in
      arch/x86/mm/init.c is not restored as the problem no longer exists
      with memblock based top-down early memory allocation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
  12. 02 March 2011 (2 commits)
    • x86-64, NUMA: Better explain numa_distance handling · eb8c1e2c
      Tejun Heo committed
      Handling of out-of-bounds distances and allocation failure can use
      better documentation.  Add it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
    • x86-64, NUMA: Fix distance table handling · ce003330
      Yinghai Lu committed
      NUMA distance table handling has the following problems.
      
      * numa_reset_distance() uses numa_distance_cnt * sizeof(numa_distance[0])
        as the table size when it should be using the square of
        numa_distance_cnt.
      
      * The same size miscalculation occurs when allocating space for
        phys_dist in numa_emulation().
      
      * In numa_emulation(), phys_dist must be reserved; otherwise, the new
        emulated distance table may overlap it.
      
      Fix them and, while at it, take the numa_distance_cnt resetting in
      numa_reset_distance() out of the if block to simplify the code a bit.
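      
      A sketch of the corrected sizing and the added reservation in
      numa_emulation() (the "TMP NUMA DIST" label and MEMBLOCK_ERROR
      handling are assumptions based on the memblock API of the time):
      
        /* the table is NxN: use the square of the node count */
        phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]);
        
        phys = memblock_find_in_range(0, PFN_PHYS(max_pfn_mapped),
                                      phys_size, PAGE_SIZE);
        if (phys == MEMBLOCK_ERROR)
            goto no_emu;
        
        /* reserve it so the emulated distance table can't overlap it */
        memblock_x86_reserve_range(phys, phys + phys_size, "TMP NUMA DIST");
        phys_dist = __va(phys);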
      
      David Rientjes reported incorrect handling of the distance table
      during emulation.
      
      -tj: Edited out numa_alloc_distance() related changes which weren't
           necessary and rewrote the patch description.
      
      -v2: Ingo was unhappy with linebreaks induced by the 80-column limit.
           Let lines run over 80 columns.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Reported-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Acked-by: David Rientjes <rientjes@google.com>
  13. 25 February 2011 (1 commit)
    • x86-64, NUMA: Fix size of numa_distance array · 1f565a89
      David Rientjes committed
      numa_distance should be sized like the SLIT, an NxN matrix where N is
      the highest node id + 1.  This patch fixes the calculation to avoid
      overflowing the array on the subsequent iteration.
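      
      A sketch of the corrected sizing (nodes_parsed here stands for the
      parsed-node mask):
      
        /* find the highest node id that was parsed */
        for_each_node_mask(i, nodes_parsed)
            cnt = i;
        cnt++;                        /* number of nodes, not the last index */
        size = cnt * cnt * sizeof(numa_distance[0]);    /* full NxN matrix */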
      
      -tj: The original patch used the last index to calculate the size.
           Yinghai pointed out that it should be incremented so that the
           size of the table is calculated from the number of elements
           rather than the last index.  Updated accordingly.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  14. 24 February 2011 (1 commit)
    • x86: Rename e820_table_* to pgt_buf_* · d1b19426
      Yinghai Lu committed
      e820_table_{start|end|top}, which are used to buffer page table
      allocation during early boot, are now derived from memblock and don't
      have much to do with e820.  Change the names so that they reflect what
      they're used for.
      
      This patch doesn't introduce any behavior change.
      
      -v2: Ingo found that the earlier patch "x86: Use early pre-allocated
           page table buffer top-down" caused a crash on 32bit and needed
           to be dropped.  This patch was updated to reflect that change.
      
      -tj: Updated commit description.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  15. 22 February 2011 (4 commits)
  16. 21 February 2011 (1 commit)
  17. 17 February 2011 (12 commits)
    • x86-64, NUMA: Put dummy_numa_init() in the init section · 6d496f9f
      Yinghai Lu committed
      dummy_numa_init() is used only during system boot.  Put it in .init
      like other NUMA init functions.
      
      - tj: Description update.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • x86-64, NUMA: Don't call __pa() with invalid address in numa_reset_distance() · 2ca230ba
      Yinghai Lu committed
      Do not call __pa(numa_distance) if it was not allocated before.
      Calling __pa() with an invalid address triggers the VIRTUAL_BUG_ON()
      in __phys_addr() if CONFIG_DEBUG_VIRTUAL is enabled.
      
      Also reported by Ingo.
      
       http://thread.gmane.org/gmane.linux.kernel/1101306/focus=1101785
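      
      A sketch of numa_reset_distance() with the added check (assuming
      numa_distance_cnt is what gets tested):
      
        void __init numa_reset_distance(void)
        {
            size_t size = numa_distance_cnt * numa_distance_cnt *
                          sizeof(numa_distance[0]);
        
            /* only call __pa() if the table was actually allocated */
            if (numa_distance_cnt)
                memblock_x86_free_range(__pa(numa_distance),
                                        __pa(numa_distance) + size);
            numa_distance_cnt = 0;
            numa_distance = NULL;    /* enable allocation retry */
        }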
      
      - v2: Changed to check the existing path, as tj requested.
      - tj: Description update.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Ingo Molnar <mingo@elte.hu>
    • x86-64, NUMA: Unify emulated distance mapping · e23bba60
      Tejun Heo committed
      NUMA emulation needs to update node distance information.  It did this
      by remapping the apicid to PXM mapping, even when amdtopology is being
      used.  There is no reason to go through such convolution.  The generic
      code has all the information necessary to transform the distance table
      into the emulated nid space.
      
      Implement generic distance table transformation in numa_emulation()
      and drop the private implementations in srat_64 and amdtopology_64.
      This makes find_node_by_addr(), fake_physnodes() and related functions
      unnecessary; drop them.
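      
      A sketch of the generic transformation in numa_emulation(): walk every
      pair of emulated nids, look up the distance between their physical
      nodes, and feed it to numa_set_distance() (max_emu_nid and phys_dist
      as used here are assumptions based on the surrounding commits):
      
        /* transform the distance table into the emulated nid space */
        for (i = 0; i < max_emu_nid + 1; i++) {
            for (j = 0; j < max_emu_nid + 1; j++) {
                int physi = emu_nid_to_phys[i];
                int physj = emu_nid_to_phys[j];
                int dist;
        
                if (physi >= numa_dist_cnt || physj >= numa_dist_cnt)
                    dist = physi == physj ?
                           LOCAL_DISTANCE : REMOTE_DISTANCE;
                else
                    dist = phys_dist[physi * numa_dist_cnt + physj];
        
                numa_set_distance(i, j, dist);
            }
        }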
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
    • x86-64, NUMA: Unify emulated apicid -> node mapping transformation · 6b78cb54
      Tejun Heo committed
      NUMA emulation changes node mappings, and thus the apicid -> node
      mapping needs to be updated accordingly.  srat_64 and amdtopology_64
      did this separately; however, all the necessary information is the
      mapping from emulated nodes to physical nodes, which is available in
      emu_nid_to_phys[].
      
      Implement the common __apicid_to_node[] transformation in
      numa_emulation() and drop the duplicate implementations.
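      
      A sketch of the common transformation, assuming a reverse lookup
      through emu_nid_to_phys[]:
      
        /* transform the __apicid_to_node table to use emulated nids */
        for (i = 0; i < MAX_LOCAL_APIC; i++) {
            if (__apicid_to_node[i] == NUMA_NO_NODE)
                continue;
            for (j = 0; j < ARRAY_SIZE(emu_nid_to_phys); j++)
                if (__apicid_to_node[i] == emu_nid_to_phys[j])
                    break;
            __apicid_to_node[i] = j < ARRAY_SIZE(emu_nid_to_phys) ? j : 0;
        }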
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
    • x86-64, NUMA: Emulate directly from numa_meminfo · 1cca5340
      Tejun Heo committed
      NUMA emulation built a physnodes[] array, which could only represent
      configurations from the physical meminfo, and emulated nodes using
      that information.  There's no reason to take this extra level of
      indirection.  Update the emulation functions so that they operate
      directly on numa_meminfo.  This simplifies the code and makes the
      emulation layout behave better with interleaved physical nodes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
    • x86-64, NUMA: Wrap node ID during emulation · 775ee85d
      Tejun Heo committed
      Both emulation layout functions - split_nodes[_size]_interleave() -
      didn't wrap the emulated nid while laying out the fake nodes and tried
      to avoid iterating over the specified number of nodes, which is
      fragile.
      
      Now that the emulation code generates numa_meminfo, the node memblks
      don't need to be consecutive and emulated node IDs can simply wrap.
      This makes the code more robust and is necessary for updates to better
      handle the cases where the physical nodes are interleaved.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
    • x86-64, NUMA: Make emulation code build numa_meminfo and share the registration path · c88aea7a
      Tejun Heo committed
      NUMA emulation code built a nodes[] array and had its own registration
      path to set up the emulated nodes.  Update it such that it generates
      an emulated numa_meminfo and returns control to initmem_init(),
      sharing the same registration path with the non-emulated cases.
      
      Because {acpi|amd}_fake_nodes() expect a nodes[] parameter,
      fake_physnodes() now generates nodes[] from numa_meminfo.  This will
      go away with further updates.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
    • x86-64, NUMA: Build and use direct emulated nid -> phys nid mapping · 9d073cae
      Tejun Heo committed
      NUMA emulation copied the physical NUMA configuration into physnodes[]
      and used it to reverse-map emulated nodes to physical nodes, which is
      unnecessarily convoluted.  Build an emu_nid_to_phys[] array to map
      emulated nids directly to the matching physical nids and use it in
      numa_add_cpu().
      
      physnodes[] will be removed with further patches.
      
      - v2: Fixed a build failure with CONFIG_DEBUG_PER_CPU_MAPS due to a
        missing local variable definition.  Reported by Ingo.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
    • x86-64, NUMA: Trivial changes to prepare for emulation updates · d9c515ea
      Tejun Heo committed
      * Separate out numa_add_memblk_to() from numa_add_memblk() so that
        a different numa_meminfo can be used.
      
      * Rename cmdline to emu_cmdline.
      
      * Drop @start/last_pfn from numa_emulation() and use max_pfn directly.
      
      This patch doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
    • x86-64, NUMA: Implement generic node distance handling · ac7136b6
      Tejun Heo committed
      Node distance was determined either by direct node comparison, by ACPI
      PXM comparison or by ACPI SLIT table lookup.  This patch implements
      generic node distance handling: NUMA init methods can call
      numa_set_distance() to set the distance between nodes, and the common
      __node_distance() implementation will report the set distance.
      
      Due to the way NUMA emulation is implemented, the generic node
      distance handling is used only when emulation is not used.  Later
      patches will update NUMA emulation to use the generic distance
      mechanism.
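      
      A sketch of the generic pair, reconstructed from the description
      (numa_alloc_distance() is the assumed lazy-allocation helper):
      
        void __init numa_set_distance(int from, int to, int distance)
        {
            if (!numa_distance && numa_alloc_distance() < 0)
                return;
        
            if (from >= numa_distance_cnt || to >= numa_distance_cnt)
                return;    /* out of bounds, ignore */
        
            numa_distance[from * numa_distance_cnt + to] = distance;
        }
        
        int __node_distance(int from, int to)
        {
            if (from >= numa_distance_cnt || to >= numa_distance_cnt)
                return from == to ? LOCAL_DISTANCE : REMOTE_DISTANCE;
            return numa_distance[from * numa_distance_cnt + to];
        }
        EXPORT_SYMBOL(__node_distance);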
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
    • x86-64, NUMA: Kill mem_nodes_parsed · 4697bdcc
      Tejun Heo committed
      With all memory configuration information now carried in numa_meminfo,
      there's no need to keep mem_nodes_parsed separate.  Drop it and use
      numa_nodes_parsed for CPU / memory-less nodes.
      
      A new helper, numa_nodemask_from_meminfo(), is added to calculate the
      memnode mask on the fly; it is currently used to set
      node_possible_map.
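      
      A sketch of the new helper:
      
        /* set the nodes covered by the given numa_meminfo in *nodemask */
        static void __init numa_nodemask_from_meminfo(nodemask_t *nodemask,
                                                      const struct numa_meminfo *mi)
        {
            int i;
        
            for (i = 0; i < ARRAY_SIZE(mi->blk); i++)
                if (mi->blk[i].start != mi->blk[i].end &&
                    mi->blk[i].nid != NUMA_NO_NODE)
                    node_set(mi->blk[i].nid, *nodemask);
        }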
      
      This simplifies NUMA init methods a bit and removes a source of
      possible inconsistencies.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
    • x86-64, NUMA: Rename cpu_nodes_parsed to numa_nodes_parsed · 92d4a437
      Tejun Heo committed
      It's no longer necessary to keep both cpu_nodes_parsed and
      mem_nodes_parsed.  In preparation for merge, rename cpu_nodes_parsed
      to numa_nodes_parsed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Shaohui Zheng <shaohui.zheng@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@linux.intel.com>