1. 30 March 2018, 3 commits
  2. 27 March 2018, 3 commits
    • powerpc/mm: Fix typo in comments · b574df94
      Authored by Alexey Kardashevskiy
      Fixes: 912cc87a "powerpc/mm/radix: Add LPID based tlb flush helpers"
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      b574df94
    • powerpc/mm: Fix section mismatch warning in stop_machine_change_mapping() · bde709a7
      Authored by Mauricio Faria de Oliveira
      Fix the warning messages for stop_machine_change_mapping(), and a number
      of other affected functions in its call chain.
      
      All modified functions are under CONFIG_MEMORY_HOTPLUG, so __meminit
      is okay (keeps them / does not discard them).
      
      Boot-tested on powernv/power9/radix-mmu and pseries/power8/hash-mmu.
      
          $ make -j$(nproc) CONFIG_DEBUG_SECTION_MISMATCH=y vmlinux
          ...
            MODPOST vmlinux.o
          WARNING: vmlinux.o(.text+0x6b130): Section mismatch in reference from the function stop_machine_change_mapping() to the function .meminit.text:create_physical_mapping()
          The function stop_machine_change_mapping() references
          the function __meminit create_physical_mapping().
          This is often because stop_machine_change_mapping lacks a __meminit
          annotation or the annotation of create_physical_mapping is wrong.
      
          WARNING: vmlinux.o(.text+0x6b13c): Section mismatch in reference from the function stop_machine_change_mapping() to the function .meminit.text:create_physical_mapping()
          The function stop_machine_change_mapping() references
          the function __meminit create_physical_mapping().
          This is often because stop_machine_change_mapping lacks a __meminit
          annotation or the annotation of create_physical_mapping is wrong.
          ...
      Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bde709a7
    • powerpc/64: Call H_REGISTER_PROC_TBL when running as a HPT guest on POWER9 · dbfcf3cb
      Authored by Paul Mackerras
      On POWER9, since commit cc3d2940 ("powerpc/64: Enable use of radix
      MMU under hypervisor on POWER9", 2017-01-30), we set both the radix and
      HPT bits in the client-architecture-support (CAS) vector, which tells
      the hypervisor that we can do either radix or HPT.  According to PAPR,
      if we use this combination we are promising to do a H_REGISTER_PROC_TBL
      hcall later on to let the hypervisor know whether we are doing radix
      or HPT.  We currently do this call if we are doing radix but not if
      we are doing HPT.  If the hypervisor is able to support both radix
      and HPT guests, it would be entitled to defer allocation of the HPT
      until the H_REGISTER_PROC_TBL call, and to fail any attempts to create
      HPTEs until the H_REGISTER_PROC_TBL call.  Thus we need to do a
      H_REGISTER_PROC_TBL call when we are doing HPT; otherwise we may
      crash at boot time.
      
      This adds the code to call H_REGISTER_PROC_TBL in this case, before
      we attempt to create any HPT entries using H_ENTER.
      
      Fixes: cc3d2940 ("powerpc/64: Enable use of radix MMU under hypervisor on POWER9")
      Cc: stable@vger.kernel.org # v4.11+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      dbfcf3cb
  3. 23 March 2018, 5 commits
  4. 13 March 2018, 11 commits
  5. 06 March 2018, 4 commits
    • powerpc/mm/slice: Allow up to 64 low slices · 15472423
      Authored by Christophe Leroy
      While the implementation of the "slices" address space allows a
      significant number of high slices, it limits the number of low
      slices to 16, due to the use of a single u64 low_slices_psize
      element in struct mm_context_t.
      
      On the 8xx, the minimum slice size is the size of the area
      covered by a single PMD entry, ie 4M in 4K pages mode and 64M in
      16K pages mode. This means we could have at least 64 slices.
      
      In order to overcome this limitation, this patch switches the
      handling of low_slices_psize to a char array, as is already done
      for high_slices_psize.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      15472423
    • powerpc/mm/slice: Fix hugepage allocation at hint address on 8xx · aa0ab02b
      Authored by Christophe Leroy
      On the 8xx, the page size is set in the PMD entry and applies to
      all pages of the page table pointed to by that PMD entry.
      
      When an app has some regular pages allocated (e.g. see below) and
      tries to mmap() a huge page at a hint address covered by the same
      PMD entry, the kernel accepts the hint although the 8xx cannot
      handle different page sizes in the same PMD entry.
      
      10000000-10001000 r-xp 00000000 00:0f 2597 /root/malloc
      10010000-10011000 rwxp 00000000 00:0f 2597 /root/malloc
      
      mmap(0x10080000, 524288, PROT_READ|PROT_WRITE,
           MAP_PRIVATE|MAP_ANONYMOUS|0x40000, -1, 0) = 0x10080000
      
      This results in the app remaining forever in
      do_page_fault()/hugetlb_fault(), and when interrupting that app we
      get the following warning:
      
      [162980.035629] WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 hugetlb_free_pgd_range+0xc8/0x1e4
      [162980.035699] CPU: 0 PID: 2777 Comm: malloc Tainted: G W       4.14.6 #85
      [162980.035744] task: c67e2c00 task.stack: c668e000
      [162980.035783] NIP:  c000fe18 LR: c00e1eec CTR: c00f90c0
      [162980.035830] REGS: c668fc20 TRAP: 0700   Tainted: G W        (4.14.6)
      [162980.035854] MSR:  00029032 <EE,ME,IR,DR,RI>  CR: 24044224 XER: 20000000
      [162980.036003]
      [162980.036003] GPR00: c00e1eec c668fcd0 c67e2c00 00000010 c6869410 10080000 00000000 77fb4000
      [162980.036003] GPR08: ffff0001 0683c001 00000000 ffffff80 44028228 10018a34 00004008 418004fc
      [162980.036003] GPR16: c668e000 00040100 c668e000 c06c0000 c668fe78 c668e000 c6835ba0 c668fd48
      [162980.036003] GPR24: 00000000 73ffffff 74000000 00000001 77fb4000 100fffff 10100000 10100000
      [162980.036743] NIP [c000fe18] hugetlb_free_pgd_range+0xc8/0x1e4
      [162980.036839] LR [c00e1eec] free_pgtables+0x12c/0x150
      [162980.036861] Call Trace:
      [162980.036939] [c668fcd0] [c00f0774] unlink_anon_vmas+0x1c4/0x214 (unreliable)
      [162980.037040] [c668fd10] [c00e1eec] free_pgtables+0x12c/0x150
      [162980.037118] [c668fd40] [c00eabac] exit_mmap+0xe8/0x1b4
      [162980.037210] [c668fda0] [c0019710] mmput.part.9+0x20/0xd8
      [162980.037301] [c668fdb0] [c001ecb0] do_exit+0x1f0/0x93c
      [162980.037386] [c668fe00] [c001f478] do_group_exit+0x40/0xcc
      [162980.037479] [c668fe10] [c002a76c] get_signal+0x47c/0x614
      [162980.037570] [c668fe70] [c0007840] do_signal+0x54/0x244
      [162980.037654] [c668ff30] [c0007ae8] do_notify_resume+0x34/0x88
      [162980.037744] [c668ff40] [c000dae8] do_user_signal+0x74/0xc4
      [162980.037781] Instruction dump:
      [162980.037821] 7fdff378 81370000 54a3463a 80890020 7d24182e 7c841a14 712a0004 4082ff94
      [162980.038014] 2f890000 419e0010 712a0ff0 408200e0 <0fe00000> 54a9000a 7f984840 419d0094
      [162980.038216] ---[ end trace c0ceeca8e7a5800a ]---
      [162980.038754] BUG: non-zero nr_ptes on freeing mm: 1
      [162985.363322] BUG: non-zero nr_ptes on freeing mm: -1
      
      In order to fix this, this patch uses the address space "slices"
      implemented for BOOK3S/64 and enhanced to support PPC32 by the
      preceding patch.
      
      This patch modifies the context.id on the 8xx to be in the range
      [1:16] instead of [0:15], in order to identify context.id == 0 as
      an uninitialised context, as done on BOOK3S.

      This patch activates CONFIG_PPC_MM_SLICES when CONFIG_HUGETLB_PAGE
      is selected for the 8xx.
      
      Although we could in theory have as many slices as PMD entries,
      the current slices implementation limits the number of low slices
      to 16. This limitation does not prevent us from fixing the initial
      issue, although it is suboptimal. It will be addressed in a
      subsequent patch.
      
      Fixes: 4b914286 ("powerpc/8xx: Implement support of hugepages")
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      aa0ab02b
    • powerpc/mm/slice: Enhance for supporting PPC32 · db3a528d
      Authored by Christophe Leroy
      In preparation for the following patch, which will fix an issue on
      the 8xx by re-using the 'slices', this patch enhances the 'slices'
      implementation to support 32-bit CPUs.
      
      On PPC32, the address space is limited to 4Gbytes, hence only the low
      slices will be used.
      
      The high slices use bitmaps. As the bitmap functions are not
      prepared to handle bitmaps of size 0, this patch ensures that they
      are called only when SLICE_NUM_HIGH is non-zero.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      db3a528d
    • powerpc/mm/slice: Remove intermediate bitmap copy · 326691ad
      Authored by Christophe Leroy
      bitmap_or() and bitmap_andnot() work properly with dst identical
      to src1 or src2; there is no need for an intermediate result
      bitmap that is copied back to dst in a second step.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      326691ad
  6. 23 February 2018, 1 commit
    • powerpc/mm/drmem: Fix unexpected flag value in ibm,dynamic-memory-v2 · 2f7d03e0
      Authored by Bharata B Rao
      Memory addition and removal by the count and indexed-count methods
      temporarily mark the LMBs being added/removed with a special flag
      value, DRMEM_LMB_RESERVED. Accessing the flags value directly in a
      few places, without the proper accessor method, causes two
      unexpected side-effects:
      
      - DRMEM_LMB_RESERVED bit is becoming part of the flags word of
        drconf_cell_v2 entries in ibm,dynamic-memory-v2 DT property.
      - This results in extra drconf_cell entries in ibm,dynamic-memory-v2.
        For example, if 1G of memory is added, it leads to one entry for
        3 LMBs and a separate entry for the last LMB. All 4 LMBs should
        be covered by one entry here.
      
      Fix this by always accessing the flags via the accessor method
      drmem_lmb_flags().

      Fixes: 2b31e3ae ("powerpc/drmem: Add support for ibm, dynamic-memory-v2 property")
      Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      2f7d03e0
  7. 16 February 2018, 1 commit
    • powerpc/pseries: Check for zero filled ibm,dynamic-memory property · 2c10636a
      Authored by Nathan Fontenot
      Some versions of QEMU produce an ibm,dynamic-reconfiguration-memory
      node with an ibm,dynamic-memory property that is zero-filled. This
      causes the drmem code to oops while trying to parse the property.

      The fix is to validate that the property actually contains LMB
      entries before trying to parse it, and to bail out if the count is
      zero.
      
        Oops: Kernel access of bad area, sig: 11 [#1]
        DAR: 0000000000000010
        NIP read_drconf_v1_cell+0x54/0x9c
        LR  read_drconf_v1_cell+0x48/0x9c
        Call Trace:
          __param_initcall_debug+0x0/0x28 (unreliable)
          drmem_init+0x144/0x2f8
          do_one_initcall+0x64/0x1d0
          kernel_init_freeable+0x298/0x38c
          kernel_init+0x24/0x160
          ret_from_kernel_thread+0x5c/0xb4
      
      The ibm,dynamic-reconfiguration-memory device tree property generated
      that causes this:
      
        ibm,dynamic-reconfiguration-memory {
                ibm,lmb-size = <0x0 0x10000000>;
                ibm,memory-flags-mask = <0xff>;
                ibm,dynamic-memory = <0x0 0x0 0x0 0x0 0x0 0x0>;
                linux,phandle = <0x7e57eed8>;
                ibm,associativity-lookup-arrays = <0x1 0x4 0x0 0x0 0x0 0x0>;
                ibm,memory-preservation-time = <0x0>;
        };
      Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Reviewed-by: Cyril Bur <cyrilbur@gmail.com>
      Tested-by: Daniel Black <daniel@linux.vnet.ibm.com>
      [mpe: Trim oops report]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      2c10636a
  8. 13 February 2018, 3 commits
  9. 08 February 2018, 3 commits
    • powerpc/mm/radix: Split linear mapping on hot-unplug · 4dd5f8a9
      Authored by Balbir Singh
      This patch splits the linear mapping if the hot-unplug range is
      smaller than the mapping size. The code detects if the mapping needs
      to be split into a smaller size and if so, uses the stop machine
      infrastructure to clear the existing mapping and then remap the
      remaining range using a smaller page size.
      
      The code will skip any region of the mapping that overlaps with kernel
      text and warn about it once. We don't want to remove a mapping where
      the kernel text and the LMB we intend to remove overlap in the same
      TLB mapping as it may affect the currently executing code.
      
      I've tested these changes in a KVM guest with 2 vcpus from a
      split-mapping point of view; some of the caveats mentioned above
      applied to the testing I did.

      Fixes: 4b5d62ca ("powerpc/mm: add radix__remove_section_mapping()")
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      [mpe: Tweak change log to match updated behaviour]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4dd5f8a9
    • powerpc/64s/radix: Boot-time NULL pointer protection using a guard-PID · eeb715c3
      Authored by Nicholas Piggin
      This change restores and formalises the behaviour that access to NULL
      or other user addresses by the kernel during boot should fault rather
      than succeed and modify memory. This was inadvertently broken when
      fixing another bug, because it was previously not well defined and
      only worked by chance.
      
      powerpc/64s/radix uses high address bits to select an address space
      "quadrant", which determines which PID and LPID are used to translate
      the rest of the address (effective PID, effective LPID). The kernel
      mapping at 0xC... selects quadrant 3, which uses PID=0 and LPID=0. So
      the kernel page tables are installed in the PID 0 process table entry.
      
      An address at 0x0... selects quadrant 0, which uses PID=PIDR for
      translating the rest of the address (that is, it uses the value of the
      PIDR register as the effective PID). If PIDR=0, then the translation
      is performed with the PID 0 process table entry page tables. This is
      the kernel mapping, so we effectively get another copy of the kernel
      address space at 0. A NULL pointer access will access physical memory
      address 0.
      
      To prevent duplicating the kernel address space in quadrant 0, this
      patch allocates a guard PID containing no translations, and
      initializes PIDR with this during boot, before the MMU is switched on.
      Any kernel access to quadrant 0 will use this guard PID for
      translation and find no valid mappings, and therefore fault.
      
      After boot, this PID will be switched away to user context PIDs,
      but those contain user mappings (and usually NULL pointer
      protection) rather than kernel mappings, which is much safer (and
      by design). This may be tightened further in future, which the
      guard PID could be used for.
      
      Commit 371b8044 ("powerpc/64s: Initialize ISAv3 MMU registers
      before setting partition table") introduced this problem because
      it zeroes PIDR at boot. Previously the value was inherited from
      firmware or kexec, which is not robust and can be zero (e.g., on
      mambo).
      
      Fixes: 371b8044 ("powerpc/64s: Initialize ISAv3 MMU registers before setting partition table")
      Cc: stable@vger.kernel.org # v4.15+
      Reported-by: Florian Weimer <fweimer@redhat.com>
      Tested-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      eeb715c3
    • powerpc/numa: Invalidate numa_cpu_lookup_table on cpu remove · 1d9a0907
      Authored by Nathan Fontenot
      When DLPAR removing a CPU, the unmapping of the CPU from a node in
      unmap_cpu_from_node() should also invalidate the CPU's entry in
      the numa_cpu_lookup_table. There is no guarantee that on a
      subsequent DLPAR add of the CPU the associativity will be the
      same, and the CPU could thus end up in a different node.
      Invalidating the entry in the numa_cpu_lookup_table causes the
      associativity to be read from the device tree at the time of the
      add.

      The current behaviour of not invalidating the CPU's entry in the
      numa_cpu_lookup_table can result in scenarios where the topology
      layout of CPUs in the partition does not match the device tree or
      the topology reported by the HMC.
      
      This bug looks like it was introduced in 2004 in the commit titled
      "ppc64: cpu hotplug notifier for numa", which is 6b15e4e87e32 in the
      linux-fullhist tree. Hence tag it for all stable releases.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      1d9a0907
  10. 06 February 2018, 1 commit
    • powerpc, membarrier: Skip memory barrier in switch_mm() · 3ccfebed
      Authored by Mathieu Desnoyers
      Allow PowerPC to skip the full memory barrier in switch_mm(), and
      only issue the barrier when scheduling into a task belonging to a
      process that has registered to use private expedited membarrier
      commands.

      Threads targeting the same VM but belonging to different thread
      groups are a tricky case. This has a few consequences:
      
      It turns out that we cannot rely on get_nr_threads(p) to count the
      number of threads using a VM. We can use
      (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
      instead to skip the synchronize_sched() for cases where the VM only has
      a single user, and that user only has a single thread.
      
      It also turns out that we cannot use for_each_thread() to set
      thread flags in all threads using a VM, as it only iterates on the
      thread group.
      
      Therefore, test the membarrier state variable directly rather than
      relying on thread flags. This means
      membarrier_register_private_expedited() needs to set the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
      only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
      private expedited membarrier commands to succeed.
      membarrier_arch_switch_mm() now tests for the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-3-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3ccfebed
  11. 01 February 2018, 2 commits
  12. 27 January 2018, 3 commits
    • powerpc/pseries: Fix cpu hotplug crash with memoryless nodes · e67e02a5
      Authored by Michael Bringmann
      On powerpc systems with shared configurations of CPUs and memory,
      and memoryless nodes at boot, an event ordering problem was
      observed on SLES12 build platforms when hot-adding CPUs to the
      memoryless nodes.
      
      * The most common error occurred when the memory SLAB driver attempted
        to reference the memoryless node to which a CPU was being added
        before the kernel had finished initializing all of the data
        structures for the CPU and exited 'device_online' under
        DLPAR/hot-add.
      
        Normally the memoryless node would be initialized through the call
        path device_online ... arch_update_cpu_topology ... find_cpu_nid ...
        try_online_node. This patch ensures that the powerpc node will be
        initialized as early as possible, even if it was memoryless and
        CPU-less at the point when we are trying to hot-add a new CPU to it.
      Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
      Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e67e02a5
    • powerpc/numa: Ensure nodes initialized for hotplug · ea05ba7c
      Authored by Michael Bringmann
      This patch fixes some problems encountered at runtime with
      configurations that support memory-less nodes, or that hot-add CPUs
      into nodes that are memoryless during system execution after boot. The
      problems of interest include:
      
      * Nodes known to powerpc to be memoryless at boot, but to have
        CPUs in them, are allowed to be 'possible' and 'online'. Memory
        allocations for those nodes are taken from another node that
        does have memory until and if memory is hot-added to the node.
      
      * Nodes which have no resources assigned at boot, but which may still
        be referenced subsequently by affinity or associativity attributes,
        are kept in the list of 'possible' nodes for powerpc. Hot-add of
        memory or CPUs to the system can reference these nodes and bring
        them online instead of redirecting the references to one of the set
        of nodes known to have memory at boot.
      
      Note that this software operates under the context of CPU hotplug. We
      are not doing memory hotplug in this code, but rather updating the
      kernel's CPU topology (i.e. arch_update_cpu_topology /
      numa_update_cpu_topology). We are initializing a node that may be used
      by CPUs or memory before it can be referenced as invalid by a CPU
      hotplug operation. CPU hotplug operations are protected by a range of
      APIs including cpu_maps_update_begin/cpu_maps_update_done,
      cpus_read/write_lock / cpus_read/write_unlock, device locks, and more.
      Memory hotplug operations, including try_online_node, are protected by
      mem_hotplug_begin/mem_hotplug_done, device locks, and more. In the
      case of CPUs being hot-added to a previously memoryless node, the
      try_online_node operation occurs wholly within the CPU locks with no
      overlap. Using HMC hot-add/hot-remove operations, we have been able to
      add and remove CPUs to any possible node without failures. HMC
      operations involve a degree of self-serialization, though.
      Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
      Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ea05ba7c
    • powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes · a346137e
      Authored by Michael Bringmann
      On powerpc systems which allow 'hot-add' of CPU or memory resources,
      it may occur that the new resources are to be inserted into nodes that
      were not used for these resources at bootup. In the kernel, any node
      that is used must be defined and initialized. These empty nodes may
      occur when:
      
      * Dedicated vs. shared resources. Shared resources require information
        such as the VPHN hcall for CPU assignment to nodes. Associativity
        decisions made based on dedicated resource rules, such as
        associativity properties in the device tree, may vary from decisions
        made using the values returned by the VPHN hcall.
      
      * memoryless nodes at boot. Nodes need to be defined as 'possible' at
        boot for operation with other code modules. Previously, the powerpc
        code would limit the set of possible nodes to those which have
        memory assigned at boot, and were thus online. Subsequent add/remove
        of CPUs or memory would only work with this subset of possible
        nodes.
      
      * memoryless nodes with CPUs at boot. Due to the previous restriction
        on nodes, nodes that had CPUs but no memory were being collapsed
        into other nodes that did have memory at boot. In practice this
        meant that the node assignment presented by the runtime kernel
        differed from the affinity and associativity attributes presented by
        the device tree or VPHN hcalls. Nodes that might be known to the
        pHyp were not 'possible' in the runtime kernel because they did not
        have memory at boot.
      
      This patch ensures that sufficient nodes are defined to support
      configuration requirements after boot, as well as at boot. This patch
      set fixes a couple of problems.
      
      * Nodes known to powerpc to be memoryless at boot, but to have
        CPUs in them, are allowed to be 'possible' and 'online'. Memory
        allocations for those nodes are taken from another node that
        does have memory until and if memory is hot-added to the node.

      * Nodes which have no resources assigned at boot, but which may
        still be referenced subsequently by affinity or associativity
        attributes, are kept in the list of 'possible' nodes for
        powerpc. Hot-add of memory or CPUs to the system can reference
        these nodes and bring them online instead of redirecting to one
        of the set of nodes that were known to have memory at boot.
      
      This patch extracts the value of the lowest domain level (number
      of allocable resources) from the device tree property
      "ibm,max-associativity-domains" to use as the maximum number of
      nodes to set up as possibly available in the system. This new
      setting overrides the instruction:
      
          nodes_and(node_possible_map, node_possible_map, node_online_map);
      
      presently seen in the function arch/powerpc/mm/numa.c:initmem_init().
      
      If the "ibm,max-associativity-domains" property is not present at
      boot, no operation will be performed to define or enable additional
      nodes, or enable the above 'nodes_and()'.
      Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
      Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a346137e