1. 13 Feb 2018, 1 commit
  2. 08 Feb 2018, 3 commits
    • powerpc/mm/radix: Split linear mapping on hot-unplug · 4dd5f8a9
      Authored by Balbir Singh
      This patch splits the linear mapping if the hot-unplug range is
      smaller than the mapping size. The code detects if the mapping needs
      to be split into a smaller size and if so, uses the stop machine
      infrastructure to clear the existing mapping and then remap the
      remaining range using a smaller page size.
      
      The code will skip any region of the mapping that overlaps with kernel
      text and warn about it once. We don't want to remove a mapping where
      the kernel text and the LMB we intend to remove overlap in the same
      TLB mapping as it may affect the currently executing code.
      
      I've tested these changes in a KVM guest with 2 vcpus from a
      split-mapping point of view; some of the caveats mentioned above
      applied to the testing I did.
      
      Fixes: 4b5d62ca ("powerpc/mm: add radix__remove_section_mapping()")
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      [mpe: Tweak change log to match updated behaviour]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
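The split decision described above can be sketched in plain C (a model only; `needs_split` and `split_page_size` are hypothetical names, not the kernel's identifiers): if the hot-unplug range is smaller than, or misaligned to, the backing mapping's page size, the mapping must be cleared and remapped at a smaller size.

```c
#include <stdbool.h>

/* Page sizes used by the radix linear mapping (bytes). */
#define SZ_4K   (4UL * 1024)
#define SZ_64K  (64UL * 1024)
#define SZ_2M   (2UL * 1024 * 1024)
#define SZ_1G   (1024UL * 1024 * 1024)

/*
 * Return true when a hot-unplug range cannot simply be unmapped
 * because it covers only part of a larger mapping: the mapping
 * must then be torn down and remapped at a smaller page size.
 */
static bool needs_split(unsigned long start, unsigned long size,
                        unsigned long mapping_size)
{
    return size < mapping_size || (start & (mapping_size - 1)) != 0;
}

/* Pick the largest page size that still tiles the remaining range. */
static unsigned long split_page_size(unsigned long size)
{
    if (size % SZ_2M == 0)
        return SZ_2M;
    if (size % SZ_64K == 0)
        return SZ_64K;
    return SZ_4K;
}
```

In the real patch the clearing and remapping run under stop-machine, since the linear map is live while it is rewritten.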
    • powerpc/64s/radix: Boot-time NULL pointer protection using a guard-PID · eeb715c3
      Authored by Nicholas Piggin
      This change restores and formalises the behaviour that access to NULL
      or other user addresses by the kernel during boot should fault rather
      than succeed and modify memory. This was inadvertently broken when
      fixing another bug, because it was previously not well defined and
      only worked by chance.
      
      powerpc/64s/radix uses high address bits to select an address space
      "quadrant", which determines which PID and LPID are used to translate
      the rest of the address (effective PID, effective LPID). The kernel
      mapping at 0xC... selects quadrant 3, which uses PID=0 and LPID=0. So
      the kernel page tables are installed in the PID 0 process table entry.
      
      An address at 0x0... selects quadrant 0, which uses PID=PIDR for
      translating the rest of the address (that is, it uses the value of the
      PIDR register as the effective PID). If PIDR=0, then the translation
      is performed with the PID 0 process table entry page tables. This is
      the kernel mapping, so we effectively get another copy of the kernel
      address space at 0. A NULL pointer access will access physical memory
      address 0.
      
      To prevent duplicating the kernel address space in quadrant 0, this
      patch allocates a guard PID containing no translations, and
      initializes PIDR with this during boot, before the MMU is switched on.
      Any kernel access to quadrant 0 will use this guard PID for
      translation and find no valid mappings, and therefore fault.
      
      After boot, this PID will be switched away to user context PIDs, but
      those contain user mappings (and usually NULL pointer protection)
      rather than kernel mappings, which is much safer (and by design). In
      future this may be tightened further, and the guard PID could be
      used for that.
      
      Commit 371b8044 ("powerpc/64s: Initialize ISAv3 MMU registers before
      setting partition table") introduced this problem because it zeroes
      PIDR at boot. However, previously the value was inherited from
      firmware or kexec, which is not robust and can also be zero (e.g.,
      mambo).
      
      Fixes: 371b8044 ("powerpc/64s: Initialize ISAv3 MMU registers before setting partition table")
      Cc: stable@vger.kernel.org # v4.15+
      Reported-by: Florian Weimer <fweimer@redhat.com>
      Tested-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
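The quadrant selection above is just bit arithmetic on the effective address, and can be modeled as follows (a sketch of the ISA behaviour as described in the message, not kernel code; quadrants 1 and 2 are HV-only cases, simplified here to PID 0):

```c
#include <stdint.h>

/* Quadrant = top two bits of the 64-bit effective address. */
static unsigned int quadrant(uint64_t ea)
{
    return (unsigned int)(ea >> 62);
}

/*
 * Effective PID used for translation, per the description above:
 * quadrant 0 (0x0...) uses PIDR, quadrant 3 (0xC...) uses PID 0,
 * i.e. the kernel page tables.
 */
static unsigned int effective_pid(uint64_t ea, unsigned int pidr)
{
    return quadrant(ea) == 0 ? pidr : 0;
}
```

The model makes the bug visible: with PIDR=0 a NULL (quadrant 0) access translates with PID 0, the kernel tables, so it silently succeeds; with a guard PID installed in PIDR it translates with a PID that has no mappings and faults.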
    • powerpc/numa: Invalidate numa_cpu_lookup_table on cpu remove · 1d9a0907
      Authored by Nathan Fontenot
      When DLPAR removing a CPU, the unmapping of the cpu from a node in
      unmap_cpu_from_node() should also invalidate the CPU's entry in the
      numa_cpu_lookup_table. There is no guarantee that on a subsequent
      DLPAR add of the CPU the associativity will be the same, and thus
      it could end up in a different node. Invalidating the entry in the
      numa_cpu_lookup_table causes the associativity to be read from the
      device tree at the time of the add.
      
      The current behavior of not invalidating the CPU's entry in the
      numa_cpu_lookup_table can result in scenarios where the topology
      layout of CPUs in the partition does not match the device tree
      or the topology reported by the HMC.
      
      This bug looks like it was introduced in 2004 in the commit titled
      "ppc64: cpu hotplug notifier for numa", which is 6b15e4e87e32 in the
      linux-fullhist tree. Hence tag it for all stable releases.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Reviewed-by: Tyrel Datwyler <tyreld@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
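The fix amounts to clearing the lookup entry on removal so that a later add re-reads the device tree. A toy model of that lifecycle (array and helper names hypothetical, mirroring the kernel names for readability):

```c
#define NR_CPUS      8
#define NUMA_NO_NODE (-1)

static int numa_cpu_lookup_table[NR_CPUS];

/* On DLPAR add: record the node read from the device tree. */
static void map_cpu_to_node(int cpu, int node)
{
    numa_cpu_lookup_table[cpu] = node;
}

/* On DLPAR remove: invalidate, rather than leaving the stale node. */
static void unmap_cpu_from_node(int cpu)
{
    numa_cpu_lookup_table[cpu] = NUMA_NO_NODE;
}

/* A later add must consult firmware when the entry is invalid. */
static int cpu_needs_dt_lookup(int cpu)
{
    return numa_cpu_lookup_table[cpu] == NUMA_NO_NODE;
}
```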
  3. 06 Feb 2018, 1 commit
    • powerpc, membarrier: Skip memory barrier in switch_mm() · 3ccfebed
      Authored by Mathieu Desnoyers
      Allow PowerPC to skip the full memory barrier in switch_mm(), and
      only issue the barrier when scheduling into a task belonging to a
      process that has registered to use expedited private.
      
      Threads targeting the same VM but belonging to different thread
      groups are a tricky case. It has a few consequences:
      
      It turns out that we cannot rely on get_nr_threads(p) to count the
      number of threads using a VM. We can use
      (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
      instead to skip the synchronize_sched() for cases where the VM only has
      a single user, and that user only has a single thread.
      
      It also turns out that we cannot use for_each_thread() to set
      thread flags in all threads using a VM, as it only iterates on the
      thread group.
      
      Therefore, test the membarrier state variable directly rather than
      relying on thread flags. This means
      membarrier_register_private_expedited() needs to set the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
      only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
      private expedited membarrier commands to succeed.
      membarrier_arch_switch_mm() now tests for the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-3-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
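The registration ordering described above can be modeled with two state bits (a sketch, not the kernel's implementation; the `_model` names are stand-ins): the EXPEDITED bit is published first, a grace period runs, and only then is READY set so that private expedited commands can succeed. switch_mm() tests the EXPEDITED bit rather than per-thread flags.

```c
#include <stdbool.h>

#define MEMBARRIER_STATE_PRIVATE_EXPEDITED        (1 << 0)
#define MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY  (1 << 1)

struct mm_model { int membarrier_state; int mm_users; int nr_threads; };

/* Stand-in for synchronize_sched(); a no-op in this model. */
static void synchronize_sched_model(void) { }

static void membarrier_register_private_expedited(struct mm_model *mm)
{
    /* Single-user, single-thread mm: the grace period can be skipped. */
    bool single = mm->mm_users == 1 && mm->nr_threads == 1;

    mm->membarrier_state |= MEMBARRIER_STATE_PRIVATE_EXPEDITED;
    if (!single)
        synchronize_sched_model();
    mm->membarrier_state |= MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY;
}

/* switch_mm() issues the full barrier only for registered processes. */
static bool needs_switch_mm_barrier(const struct mm_model *mm)
{
    return mm->membarrier_state & MEMBARRIER_STATE_PRIVATE_EXPEDITED;
}
```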
  4. 01 Feb 2018, 2 commits
  5. 27 Jan 2018, 4 commits
    • powerpc/pseries: Fix cpu hotplug crash with memoryless nodes · e67e02a5
      Authored by Michael Bringmann
      On powerpc systems with shared configurations of CPUs and memory and
      memoryless nodes at boot, an event ordering problem was observed on
      SLES12-based platforms when hot-adding CPUs to the memoryless
      nodes.
      
      * The most common error occurred when the memory SLAB driver attempted
        to reference the memoryless node to which a CPU was being added
        before the kernel had finished initializing all of the data
        structures for the CPU and exited 'device_online' under
        DLPAR/hot-add.
      
        Normally the memoryless node would be initialized through the call
        path device_online ... arch_update_cpu_topology ... find_cpu_nid ...
        try_online_node. This patch ensures that the powerpc node will be
        initialized as early as possible, even if it was memoryless and
        CPU-less at the point when we are trying to hot-add a new CPU to it.
      Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
      Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/numa: Ensure nodes initialized for hotplug · ea05ba7c
      Authored by Michael Bringmann
      This patch fixes some problems encountered at runtime with
      configurations that support memory-less nodes, or that hot-add CPUs
      into nodes that are memoryless during system execution after boot. The
      problems of interest include:
      
      * Nodes known to powerpc to be memoryless at boot, but to have CPUs in
        them are allowed to be 'possible' and 'online'. Memory allocations
        for those nodes are taken from another node that does have memory
        until and if memory is hot-added to the node.
      
      * Nodes which have no resources assigned at boot, but which may still
        be referenced subsequently by affinity or associativity attributes,
        are kept in the list of 'possible' nodes for powerpc. Hot-add of
        memory or CPUs to the system can reference these nodes and bring
        them online instead of redirecting the references to one of the set
        of nodes known to have memory at boot.
      
      Note that this software operates under the context of CPU hotplug. We
      are not doing memory hotplug in this code, but rather updating the
      kernel's CPU topology (i.e. arch_update_cpu_topology /
      numa_update_cpu_topology). We are initializing a node that may be used
      by CPUs or memory before it can be referenced as invalid by a CPU
      hotplug operation. CPU hotplug operations are protected by a range of
      APIs including cpu_maps_update_begin/cpu_maps_update_done,
      cpus_read/write_lock / cpus_read/write_unlock, device locks, and more.
      Memory hotplug operations, including try_online_node, are protected by
      mem_hotplug_begin/mem_hotplug_done, device locks, and more. In the
      case of CPUs being hot-added to a previously memoryless node, the
      try_online_node operation occurs wholly within the CPU locks with no
      overlap. Using HMC hot-add/hot-remove operations, we have been able
      to add and remove CPUs to any possible node without failures. HMC
      operations involve a degree of self-serialization, though.
      Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
      Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/numa: Use ibm,max-associativity-domains to discover possible nodes · a346137e
      Authored by Michael Bringmann
      On powerpc systems which allow 'hot-add' of CPU or memory resources,
      it may occur that the new resources are to be inserted into nodes that
      were not used for these resources at bootup. In the kernel, any node
      that is used must be defined and initialized. These empty nodes may
      occur when:
      
      * Dedicated vs. shared resources. Shared resources require information
        such as the VPHN hcall for CPU assignment to nodes. Associativity
        decisions made based on dedicated resource rules, such as
        associativity properties in the device tree, may vary from decisions
        made using the values returned by the VPHN hcall.
      
      * memoryless nodes at boot. Nodes need to be defined as 'possible' at
        boot for operation with other code modules. Previously, the powerpc
        code would limit the set of possible nodes to those which have
        memory assigned at boot, and were thus online. Subsequent add/remove
        of CPUs or memory would only work with this subset of possible
        nodes.
      
      * memoryless nodes with CPUs at boot. Due to the previous restriction
        on nodes, nodes that had CPUs but no memory were being collapsed
        into other nodes that did have memory at boot. In practice this
        meant that the node assignment presented by the runtime kernel
        differed from the affinity and associativity attributes presented by
        the device tree or VPHN hcalls. Nodes that might be known to the
        pHyp were not 'possible' in the runtime kernel because they did not
        have memory at boot.
      
      This patch ensures that sufficient nodes are defined to support
      configuration requirements after boot, as well as at boot. This patch
      set fixes a couple of problems.
      
      * Nodes known to powerpc to be memoryless at boot, but to have CPUs
        in them, are allowed to be 'possible' and 'online'. Memory
        allocations for those nodes are taken from another node that does
        have memory until and if memory is hot-added to the node.

      * Nodes which have no resources assigned at boot, but which may
        still be referenced subsequently by affinity or associativity
        attributes, are kept in the list of 'possible' nodes for powerpc.
        Hot-add of memory or CPUs to the system can reference these nodes
        and bring them online instead of redirecting to one of the set of
        nodes that were known to have memory at boot.
      
      This patch extracts the value of the lowest domain level (number of
      allocable resources) from the device tree property
      "ibm,max-associativity-domains" to use as the maximum number of nodes
      to setup as possibly available in the system. This new setting will
      override the instruction:
      
          nodes_and(node_possible_map, node_possible_map, node_online_map);
      
      presently seen in the function arch/powerpc/mm/numa.c:initmem_init().
      
      If the "ibm,max-associativity-domains" property is not present at
      boot, no operation will be performed to define or enable additional
      nodes, or enable the above 'nodes_and()'.
      Signed-off-by: Michael Bringmann <mwb@linux.vnet.ibm.com>
      Reviewed-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
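The effect on the possible-node map can be sketched with bitmasks (helper names hypothetical): previously the possible map was ANDed down to the online map, as in the `nodes_and()` quoted above; with the property present, the possible map instead covers all node IDs up to the advertised maximum so hot-add can bring any of them online later.

```c
#include <stdint.h>

/* One bit per node: node i = bit i (a 32-node model). */
typedef uint32_t nodemask_t;

/* Old behaviour: possible nodes collapse to those online at boot. */
static nodemask_t possible_nodes_old(nodemask_t possible, nodemask_t online)
{
    return possible & online;   /* the nodes_and() quoted above */
}

/*
 * New behaviour: "ibm,max-associativity-domains" supplies the number
 * of allocable domains at the lowest level; mark that many nodes
 * possible regardless of which ones had memory at boot.
 */
static nodemask_t possible_nodes_new(unsigned int max_domains)
{
    return (max_domains >= 32) ? ~(nodemask_t)0
                               : (((nodemask_t)1 << max_domains) - 1);
}
```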
    • powerpc/mm/nohash: do not flush the entire mm when range is a single page · 5c8136fa
      Authored by Christophe Leroy
      Most of the time, flush_tlb_range() is called on single pages.
      At present, flush_tlb_range() unconditionally calls
      flush_tlb_mm(), which flushes at least the entire PID's pages, and
      on older CPUs like 4xx or 8xx it flushes the entire TLB table.
      
      This patch calls flush_tlb_page() instead of flush_tlb_mm() when
      the range is a single page.
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
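The change can be sketched as a simple dispatch (a model with stub flush functions and a bookkeeping flag, not the real mm API surface):

```c
#define PAGE_SIZE_MODEL 4096UL

static int flushed_whole_mm;   /* records which path ran */

static void flush_tlb_page_model(unsigned long addr)
{
    (void)addr;
    flushed_whole_mm = 0;      /* only one page was flushed */
}

static void flush_tlb_mm_model(void)
{
    flushed_whole_mm = 1;      /* the entire mm was flushed */
}

/* Flush a single page when the range covers one page; else the mm. */
static void flush_tlb_range_model(unsigned long start, unsigned long end)
{
    if (end - start <= PAGE_SIZE_MODEL)
        flush_tlb_page_model(start);
    else
        flush_tlb_mm_model();
}
```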
  6. 22 Jan 2018, 2 commits
  7. 21 Jan 2018, 1 commit
    • powerpc/hash: Skip non initialized page size in init_hpte_page_sizes · 10527e80
      Authored by Aneesh Kumar K.V
      One of the easiest ways to test a config with 4K HPTE is to disable
      the 64K hardware page size, like below.
      
      int __init htab_dt_scan_page_sizes(unsigned long node,
      
       		size -= 3; prop += 3;
       		base_idx = get_idx_from_shift(base_shift);
      -		if (base_idx < 0) {
      +		if (base_idx < 0 || base_idx == MMU_PAGE_64K) {
       			/* skip the pte encoding also */
       			prop += lpnum * 2; size -= lpnum * 2;
      
      But then this results in error in other part of the code such as MPSS parsing
      where we look at 4K base page size and 64K actual page size support.
      
      This patch fixes MPSS parsing by ignoring the actual page sizes
      marked unsupported. In reality this can happen only with a corrupt
      device tree, but it is good to tighten the error check.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
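The tightened check can be modeled as: when walking base-size/actual-size pairs, skip any combination whose page size was never initialized. A sketch with hypothetical table names (`shift == 0` marking an uninitialized size, as an assumption of this model):

```c
#define MMU_PAGE_COUNT 4
#define PENC_INVALID   (-1)

/* A simplified page-size definition: shift==0 means uninitialized. */
struct mmu_psize_def_model { int shift; int penc[MMU_PAGE_COUNT]; };

static struct mmu_psize_def_model psize_defs[MMU_PAGE_COUNT];

/* Count only (base, actual) pairs where both sizes are initialized. */
static int count_valid_mpss(void)
{
    int base, actual, count = 0;

    for (base = 0; base < MMU_PAGE_COUNT; base++) {
        if (psize_defs[base].shift == 0)
            continue;                      /* base size not initialized */
        for (actual = 0; actual < MMU_PAGE_COUNT; actual++) {
            if (psize_defs[actual].shift == 0)
                continue;                  /* skip uninitialized actual */
            if (psize_defs[base].penc[actual] != PENC_INVALID)
                count++;
        }
    }
    return count;
}
```

With the 64K entry left uninitialized (as in the test recipe above), the walk no longer trips over 4K-base/64K-actual combinations.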
  8. 20 Jan 2018, 18 commits
  9. 19 Jan 2018, 2 commits
  10. 17 Jan 2018, 5 commits
    • powerpc/pseries: lift RTAS limit for hash · c610d65c
      Authored by Nicholas Piggin
      With the previous patch to switch to 64-bit mode after returning from
      RTAS and before doing any memory accesses, the RMA limit need not be
      clamped to 1GB to avoid RTAS bugs.
      
      Keep the 1GB limit for older firmware (although this is more of a kernel
      concern than RTAS), and remove it starting with POWER9.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/pseries: lift RTAS limit for radix · 5eae82ca
      Authored by Nicholas Piggin
      With the previous patch to switch to 64-bit mode after returning from
      RTAS and before doing any memory accesses, the RMA limit need not be
      clamped to 1GB to avoid RTAS bugs.
      
      Keep the 1GB limit for older firmware (although this is more of a kernel
      concern than RTAS), and remove it starting with POWER9.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
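The limit logic in these two commits can be sketched as a conditional clamp (names hypothetical; `old_firmware` stands in for the pre-POWER9 check):

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_1G_BYTES (1024ULL * 1024 * 1024)

/*
 * Clamp the real mode area to 1GB only on older firmware, where
 * the limit guards against RTAS bugs; from POWER9 onward the full
 * RMA size is kept.
 */
static uint64_t rma_size(uint64_t rma, bool old_firmware)
{
    if (old_firmware && rma > SZ_1G_BYTES)
        return SZ_1G_BYTES;
    return rma;
}
```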
    • powerpc/pseries: radix is not subject to RMA limit, remove it · 98ae0069
      Authored by Nicholas Piggin
      The radix guest is not subject to the paravirtualized HPT VRMA limit,
      so remove that from ppc64_rma_size calculation for that platform.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Remove real mode access limit for early allocations · 1513c33d
      Authored by Nicholas Piggin
      This removes the RMA limit on powernv platform, which constrains
      early allocations such as PACAs and stacks. There are still other
      restrictions that must be followed, such as bolted SLB limits, but
      real mode addressing has no constraints.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: Improve local TLB flush for boot and MCE on POWER9 · d4748276
      Authored by Nicholas Piggin
      There are several cases outside the normal address space management
      where a CPU's entire local TLB is to be flushed:
      
        1. Booting the kernel, in case something has left stale entries in
           the TLB (e.g., kexec).
      
        2. Machine check, to clean corrupted TLB entries.
      
      One other place where the TLB is flushed is waking from deep idle
      states. The flush is a side-effect of calling ->cpu_restore with the
      intention of re-setting various SPRs. The flush itself is unnecessary
      because, in this case, the TLB should not acquire new corrupted
      entries as part of sleep/wake (though existing entries may be lost).
      
      This type of TLB flush is coded inflexibly, several times for each CPU
      type, and they have a number of problems with ISA v3.0B:
      
      - The current radix mode of the MMU is not taken into account; it is
        always done as a hash flush. For IS=2 (LPID-matching flush from
        host) and IS=3 with HV=0 (guest kernel flush), tlbie(l) is
        undefined if the R field does not match the current radix mode.
      
      - ISA v3.0B hash must flush the partition and process table caches as
        well.
      
      - ISA v3.0B radix must flush partition and process scoped translations,
        partition and process table caches, and also the page walk cache.
      
      So consolidate the flushing code and implement it in C and inline asm
      under the mm/ directory with the rest of the flush code. Add ISA v3.0B
      cases for radix and hash, and use the radix flush in radix environment.
      
      Provide a way for IS=2 (LPID flush) to specify the radix mode of the
      partition. Have KVM pass in the radix mode of the guest.
      
      Take out the flushes from early cputable/dt_cpu_ftrs detection hooks,
      and move it later in the boot process, after the MMU registers are
      set up and before relocation is first turned on.
      
      The TLB flush is no longer called when restoring from deep idle
      states. This could not be done as a separate step because booting
      secondaries uses the same cpu_restore as idle restore, which needs
      the TLB flush.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
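The consolidation above can be modeled as one entry point that picks the flush flavour from the current MMU mode, instead of inflexible per-CPU-type copies (enum and `_model` function names are illustrative, with counters standing in for the actual tlbiel sequences):

```c
enum mmu_mode_model { MMU_HASH, MMU_RADIX };

static int hash_flushes, radix_flushes;

/* ISA v3.0B hash flush: TLB plus partition/process table caches. */
static void tlbiel_all_hash_model(void)  { hash_flushes++; }

/* ISA v3.0B radix flush: partition- and process-scoped translations,
 * table caches, and the page walk cache. */
static void tlbiel_all_radix_model(void) { radix_flushes++; }

/* Single entry point that respects the current radix mode. */
static void tlbiel_all_model(enum mmu_mode_model mode)
{
    if (mode == MMU_RADIX)
        tlbiel_all_radix_model();
    else
        tlbiel_all_hash_model();
}
```

Boot and machine-check paths then call the one entry point; for IS=2 (LPID) flushes the caller, e.g. KVM, supplies the radix mode of the partition being flushed.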
  11. 16 Jan 2018, 1 commit