1. 30 Mar, 2018 2 commits
  2. 17 Jan, 2018 4 commits
    • powerpc/pseries: lift RTAS limit for radix · 5eae82ca
      Nicholas Piggin committed
      With the previous patch to switch to 64-bit mode after returning from
      RTAS and before doing any memory accesses, the RMA limit need not be
      clamped to 1GB to avoid RTAS bugs.
      
      Keep the 1GB limit for older firmware (although this is more of a kernel
      concern than RTAS), and remove it starting with POWER9.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/pseries: radix is not subject to RMA limit, remove it · 98ae0069
      Nicholas Piggin committed
      The radix guest is not subject to the paravirtualized HPT VRMA limit,
      so remove that from ppc64_rma_size calculation for that platform.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/powernv: Remove real mode access limit for early allocations · 1513c33d
      Nicholas Piggin committed
      This removes the RMA limit on powernv platform, which constrains
      early allocations such as PACAs and stacks. There are still other
      restrictions that must be followed, such as bolted SLB limits, but
      real mode addressing has no constraints.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/64s: Improve local TLB flush for boot and MCE on POWER9 · d4748276
      Nicholas Piggin committed
      There are several cases outside the normal address space management
      where a CPU's entire local TLB is to be flushed:
      
        1. Booting the kernel, in case something has left stale entries in
           the TLB (e.g., kexec).
      
        2. Machine check, to clean corrupted TLB entries.
      
      One other place where the TLB is flushed, is waking from deep idle
      states. The flush is a side-effect of calling ->cpu_restore with the
      intention of re-setting various SPRs. The flush itself is unnecessary
      because in the first case, the TLB should not acquire new corrupted
      TLB entries as part of sleep/wake (though they may be lost).
      
      This type of TLB flush is coded inflexibly, several times for each CPU
      type, and they have a number of problems with ISA v3.0B:
      
      - The current radix mode of the MMU is not taken into account, it is
        always done as a hash flush. For IS=2 (LPID-matching flush from host)
        and IS=3 with HV=0 (guest kernel flush), tlbie(l) is undefined if
        the R field does not match the current radix mode.
      
      - ISA v3.0B hash must flush the partition and process table caches as
        well.
      
      - ISA v3.0B radix must flush partition and process scoped translations,
        partition and process table caches, and also the page walk cache.
      
      So consolidate the flushing code and implement it in C and inline asm
      under the mm/ directory with the rest of the flush code. Add ISA v3.0B
      cases for radix and hash, and use the radix flush in radix environment.
      
      Provide a way for IS=2 (LPID flush) to specify the radix mode of the
      partition. Have KVM pass in the radix mode of the guest.
      
      Take out the flushes from early cputable/dt_cpu_ftrs detection hooks,
      and move it later in the boot process, after the MMU registers are set
      up and before relocation is first turned on.
      
      The TLB flush is no longer called when restoring from deep idle states.
      This was not done as a separate step because booting secondaries
      uses the same cpu_restore as idle restore, which needs the TLB flush.
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
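The three cases listed in this commit message (pre-v3.0 hash, ISA v3.0B hash, ISA v3.0B radix) can be summarised as a small decision function. This is only a sketch of the selection logic: the `flush_scope` flags and `local_flush_scope()` are hypothetical names, not kernel symbols, and the real code issues `tlbie(l)` instructions rather than returning flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags naming what a full local flush has to cover. */
enum flush_scope {
    FLUSH_TLB             = 1 << 0, /* TLB entries (partition and process scoped) */
    FLUSH_TABLE_CACHES    = 1 << 1, /* partition and process table caches */
    FLUSH_PAGE_WALK_CACHE = 1 << 2  /* radix page walk cache */
};

/* Pick the flush scope from the ISA level and the current MMU mode. */
static unsigned int local_flush_scope(bool isa_v3, bool radix)
{
    /* Radix mode only exists from ISA v3.0 onwards. */
    assert(isa_v3 || !radix);

    if (!isa_v3)
        return FLUSH_TLB;                       /* older hash CPUs */
    if (!radix)
        return FLUSH_TLB | FLUSH_TABLE_CACHES;  /* ISA v3.0B hash */
    return FLUSH_TLB | FLUSH_TABLE_CACHES | FLUSH_PAGE_WALK_CACHE; /* radix */
}
```

A caller would pass the radix mode explicitly, which mirrors the commit's point that the flush must match the current (or, for KVM, the guest's) radix mode.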
  3. 12 Nov, 2017 1 commit
    • powerpc/mm/radix: Fix crashes on Power9 DD1 with radix MMU and STRICT_RWX · f79ad50e
      Balbir Singh committed
      When using the radix MMU on Power9 DD1, to work around a hardware
      problem, radix__pte_update() is required to do a two stage update of
      the PTE. First we write a zero value into the PTE, then we flush the
      TLB, and then we write the new PTE value.
      
      In the normal case that works OK, but it does not work if we're
      updating the PTE that maps the code we're executing, because the
      mapping is removed by the TLB flush and we can no longer execute from
      it. Unfortunately the STRICT_RWX code needs to do exactly that.
      
      The exact symptoms when we hit this case vary, sometimes we print an
      oops and then get stuck after that, but I've also seen a machine just
      get stuck continually page faulting with no oops printed. The variance
      is presumably due to the exact layout of the text and the page size
      used for the mappings. In all cases we are unable to boot to a shell.
      
      There are possible solutions such as creating a second mapping of the
      TLB flush code, executing from that, and then jumping back to the
      original. However we don't want to add that level of complexity for a
      DD1 work around.
      
      So just detect that we're running on Power9 DD1 and refrain from
      changing the permissions, effectively disabling STRICT_RWX on Power9
      DD1.
      
      Fixes: 7614ff32 ("powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix")
      Cc: stable@vger.kernel.org # v4.13+
      Reported-by: Andrew Jeffery <andrew@aj.id.au>
      [Changelog as suggested by Michael Ellerman <mpe@ellerman.id.au>]
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
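The shape of the fix described above (detect Power9 DD1 and refrain from changing permissions) might look roughly like the following. Everything here is a stand-in for illustration: `cpu_sketch`, `region_sketch` and `mark_rodata_ro_sketch()` are hypothetical, and the real kernel uses its own CPU feature test and page table update routines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for CPU feature state and a mapped region. */
struct cpu_sketch    { bool power9_dd1; };
struct region_sketch { bool writable; };

/*
 * Returns true if the region was made read-only, false if the
 * permission change was skipped because of the DD1 workaround.
 */
static bool mark_rodata_ro_sketch(const struct cpu_sketch *cpu,
                                  struct region_sketch *text)
{
    assert(cpu != NULL && text != NULL);

    if (cpu->power9_dd1)
        return false;       /* leave the mapping untouched on DD1 */

    text->writable = false; /* tighten permissions everywhere else */
    return true;
}
```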
  4. 31 Aug, 2017 2 commits
  5. 17 Aug, 2017 1 commit
    • powerpc/mm: Don't send IPI to all cpus on THP updates · fa4531f7
      Aneesh Kumar K.V committed
      Now that we made sure that lockless walk of linux page table is mostly
      limited to current task (current->mm->pgdir), we can update the THP
      update sequence to only send IPI to CPUs on which this task has run.
      This helps in reducing the IPI overload on systems with large number
      of CPUs.
      
      WRT kvm, even though kvm is walking page table with vcpu->arch.pgdir,
      it is done only on secondary CPUs and in that case we have primary CPU
      added to task's mm cpumask. Sending an IPI to primary will force the
      secondary to do a vm exit and hence this mm cpumask usage is safe
      here.
      
      WRT CAPI, we still end up walking linux page table with capi context
      MM. For now the pte lookup serialization sends an IPI to all CPUs if
      CAPI is in use. We can further improve this by adding the CAPI
      interrupt handling CPU to task mm cpumask. That will be done in a
      later patch.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  6. 10 Aug, 2017 1 commit
    • powerpc/mm: Properly invalidate when setting process table base · 7cd2a869
      Suraj Jitindar Singh committed
      The host process table base is stored in the partition table by calling
      the function native_register_process_table(). Currently this just sets
      the entry in memory and is missing a subsequent cache invalidation
      instruction. Any update to the partition table should be followed by a
      cache invalidation instruction specifying invalidation of the caching of
      any partition table entries (RIC = 2, PRS = 0).
      
      We already have a function to update the partition table with the
      required cache invalidation instructions - mmu_partition_table_set_entry().
      Update the native_register_process_table() function to call
      mmu_partition_table_set_entry(), this ensures all appropriate
      invalidation will be performed.
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [mpe: Use a local for patb0 to clean it up slightly]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  7. 08 Aug, 2017 1 commit
  8. 02 Aug, 2017 1 commit
  9. 26 Jul, 2017 1 commit
    • powerpc/mm/radix: Workaround prefetch issue with KVM · a25bd72b
      Benjamin Herrenschmidt committed
      There's a somewhat architectural issue with Radix MMU and KVM.
      
      When coming out of a guest with AIL (Alternate Interrupt Location, ie,
      MMU enabled), we start executing hypervisor code with the PID register
      still containing whatever the guest has been using.
      
      The problem is that the CPU can (and will) then start prefetching or
      speculatively load from whatever host context has that same PID (if
      any), thus bringing translations for that context into the TLB, which
      Linux doesn't know about.
      
      This can cause stale translations and subsequent crashes.
      
      Fixing this in a way that is neither racy nor a huge performance
      impact is difficult. We could just make the host invalidations always
      use broadcast forms but that would hurt single threaded programs for
      example.
      
      We chose to fix it instead by partitioning the PID space between guest
      and host. This is possible because today Linux only uses 19 out of the
      20 bits of PID space, so existing guests will work if we make the host
      use the top half of the 20-bit space.
      
      We additionally add support for a property to indicate to Linux the
      size of the PID register which will be useful if we eventually have
      processors with a larger PID space available.
      
      There is still an issue with malicious guests purposefully setting the
      PID register to a value in the host's PID range. Hopefully future HW
      can prevent that, but in the meantime, we handle it with a pair of
      kludges:
      
       - On the way out of a guest, before we clear the current VCPU in the
         PACA, we check the PID and if it's outside of the permitted range
         we flush the TLB for that PID.
      
       - When context switching, if the mm is "new" on that CPU (the
         corresponding bit was set for the first time in the mm cpumask), we
         check if any sibling thread is in KVM (has a non-NULL VCPU pointer
         in the PACA). If that is the case, we also flush the PID for that
         CPU (core).
      
      This second part is needed to handle the case where a process is
      migrated (or starts a new pthread) on a sibling thread of the CPU
      coming out of KVM, as there's a window where stale translations can
      exist before we detect it and flush them out.
      
      A future optimization could be added by keeping track of whether the
      PID has ever been used and avoid doing that for completely fresh PIDs.
      We could similarly mark PIDs that have been the subject of a global
      invalidation as "fresh". But for now this will do.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop
            unneeded include of kvm_book3s_asm.h]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
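The PID space split described above is simple arithmetic: with a 20-bit PID register, guests keep the bottom half and the host takes the top half. The sketch below works that out and adds the guest-exit check from the first kludge. The type and function names are hypothetical, and the exact boundary handling in the kernel may differ from this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inclusive range of PID values. */
struct pid_range {
    uint32_t first;
    uint32_t last;
};

/* Guests keep the bottom half of the PID space. */
static struct pid_range guest_pid_range(unsigned int pid_bits)
{
    assert(pid_bits >= 2 && pid_bits <= 31);
    struct pid_range r = { 0, (UINT32_C(1) << (pid_bits - 1)) - 1 };
    return r;
}

/* The host uses the top half of the PID space. */
static struct pid_range host_pid_range(unsigned int pid_bits)
{
    assert(pid_bits >= 2 && pid_bits <= 31);
    struct pid_range r = { UINT32_C(1) << (pid_bits - 1),
                           (UINT32_C(1) << pid_bits) - 1 };
    return r;
}

/* On guest exit: a PID outside the guest range means the TLB must be flushed. */
static bool guest_pid_needs_flush(uint32_t pid, unsigned int pid_bits)
{
    return pid > guest_pid_range(pid_bits).last;
}
```

With `pid_bits = 20`, the guest range is 0x00000 to 0x7FFFF and the host range is 0x80000 to 0xFFFFF, so a guest that only ever used 19 bits of PID never collides with the host.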
  10. 18 Jul, 2017 2 commits
  11. 04 Jul, 2017 1 commit
    • powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix · 7614ff32
      Balbir Singh committed
      The Radix linear mapping code (create_physical_mapping()) tries to use
      the largest page size it can at each step. Currently the only reason
      it steps down to a smaller page size is if the start addr is
      unaligned (never happens in practice), or the end of memory is not
      aligned to a huge page boundary.
      
      To support STRICT_RWX we need to break the mapping at __init_begin,
      so that the text and rodata prior to that can be marked R_X and the
      regular pages after can be marked RW.
      
      Having done that we can now implement mark_rodata_ro() for Radix,
      knowing that we won't need to split any mappings.
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      [mpe: Split down to PAGE_SIZE, not 2MB, rewrite change log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
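The "largest page size at each step, with a forced break at a boundary" idea can be sketched as below. This is not the kernel's create_physical_mapping(); the function names are hypothetical, and the page sizes (64K base pages, 2M and 1G huge pages) are assumptions chosen for illustration. The `split` argument plays the role of __init_begin.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_64K (UINT64_C(1) << 16)  /* assumed base page size */
#define SZ_2M  (UINT64_C(1) << 21)
#define SZ_1G  (UINT64_C(1) << 30)

/* Largest page size usable at addr without crossing limit. */
static uint64_t pick_page_size(uint64_t addr, uint64_t limit)
{
    assert(addr < limit);
    uint64_t gap = limit - addr;

    if (addr % SZ_1G == 0 && gap >= SZ_1G)
        return SZ_1G;
    if (addr % SZ_2M == 0 && gap >= SZ_2M)
        return SZ_2M;
    return SZ_64K;
}

/*
 * Walk [start, end) and count the mappings created, never letting a
 * single mapping straddle split, so permissions can differ on each side.
 */
static unsigned int map_range(uint64_t start, uint64_t end, uint64_t split)
{
    assert(start <= end);
    assert(start % SZ_64K == 0 && end % SZ_64K == 0 && split % SZ_64K == 0);

    unsigned int count = 0;
    uint64_t addr = start;

    while (addr < end) {
        uint64_t limit = (addr < split && split < end) ? split : end;
        addr += pick_page_size(addr, limit);
        count++;
    }
    return count;
}
```

Because no mapping crosses the split point, a later permission change on one side never needs to split an existing mapping, which is the property the commit message relies on.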
  12. 03 Jul, 2017 1 commit
    • powerpc/mm/radix: Fix execute permissions for interrupt_vectors · 7f6d498e
      Balbir Singh committed
      Commit 9abcc981 ("powerpc/mm/radix: Only add X for pages
      overlapping kernel text") changed the linear mapping on Radix to only
      mark the kernel text executable.
      
      However if the kernel is run relocated, for example as a kdump kernel,
      then the exception vectors are split from the kernel text, ie. they
      remain at real address 0.
      
      We tend to get away with it, because the kernel itself will usually be
      below 1G, which means the 1G page at 0-1G is marked executable and
      everything works OK. However if the kernel is loaded above 1G, or the
      system has less than 1G in total (meaning we can't use a 1G page),
      then the exception vectors will not be marked executable and the
      kernel will fail to boot.
      
      Fix it by also checking if the address range overlaps the exception
      vectors when deciding if we should add PAGE_KERNEL_X.
      
      Fixes: 9abcc981 ("powerpc/mm/radix: Only add X for pages overlapping kernel text")
      Cc: stable@vger.kernel.org # v4.7+
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      [mpe: Combine with the existing check, rewrite change log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
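The combined check described above (executable if the range overlaps the kernel text or the exception vectors) reduces to two interval overlap tests. The sketch below uses half-open ranges and hypothetical function names; the actual addresses of the text and vectors are passed in rather than assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if half-open ranges [a_start, a_end) and [b_start, b_end) intersect. */
static bool ranges_overlap(uint64_t a_start, uint64_t a_end,
                           uint64_t b_start, uint64_t b_end)
{
    assert(a_start <= a_end && b_start <= b_end);
    return a_start < b_end && b_start < a_end;
}

/* A mapping needs execute permission if it covers kernel text or the vectors. */
static bool mapping_needs_exec(uint64_t start, uint64_t end,
                               uint64_t text_start, uint64_t text_end,
                               uint64_t vec_start, uint64_t vec_end)
{
    return ranges_overlap(start, end, text_start, text_end) ||
           ranges_overlap(start, end, vec_start, vec_end);
}
```

For a relocated kernel whose text sits above 1G while the vectors stay at real address 0, the first 1G mapping only qualifies through the second test, which is the case the original check missed.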
  13. 02 Jul, 2017 1 commit
  14. 23 Jun, 2017 1 commit
    • powerpc/mm: Trace tlbie(l) instructions · 0428491c
      Balbir Singh committed
      Add a trace point for tlbie(l) (Translation Lookaside Buffer Invalidate
      Entry (Local)) instructions.
      
      The tlbie instruction has changed over the years, so not all versions
      accept the same operands. Use the ISA v3 field operands because they are
      the most verbose; we may change them in future.
      
      Example output:
      
        qemu-system-ppc-5371  [016]  1412.369519: tlbie:
        	tlbie with lpid 0, local 1, rb=67bd8900174c11c1, rs=0, ric=0 prs=0 r=0
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      [mpe: Add some missing trace_tlbie()s, reword change log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  15. 15 Jun, 2017 1 commit
    • powerpc/mm/radix: Only add X for pages overlapping kernel text · 9abcc981
      Michael Ellerman committed
      Currently we map the whole linear mapping with PAGE_KERNEL_X. Instead we
      should check if the page overlaps the kernel text and only then add
      PAGE_KERNEL_X.
      
      Note that we still use 1G pages if they're available, so this will
      typically still result in a 1G executable page at KERNELBASE. So this fix is
      primarily useful for catching stray branches to high linear mapping addresses.
      
      Without this patch, we can execute at 1G in xmon using:
      
        0:mon> m c000000040000000
        c000000040000000  00 l
        c000000040000000  00000000 01006038
        c000000040000004  00000000 2000804e
        c000000040000008  00000000 x
        0:mon> di c000000040000000
        c000000040000000  38600001      li      r3,1
        c000000040000004  4e800020      blr
        0:mon> p c000000040000000
        return value is 0x1
      
      After the patch, we get a 400 exception as expected:
      
        0:mon> p c000000040000000
        *** 400 exception occurred
      
      Fixes: 2bfd65e4 ("powerpc/mm/radix: Add radix callbacks for early init routines")
      Cc: stable@vger.kernel.org # v4.7+
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
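In the xmon session above, the bytes written with `m` (01 00 60 38 and 20 00 80 4e) show up in `di` as the words 38600001 and 4e800020. That is consistent with a little-endian kernel reading each instruction as a 32-bit word, which is an inference from the transcript rather than something the commit states. A small sketch of that decoding, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Assemble a 32-bit word from four bytes stored in little-endian order. */
static uint32_t le32_from_bytes(uint8_t b0, uint8_t b1, uint8_t b2, uint8_t b3)
{
    return (uint32_t)b0 |
           ((uint32_t)b1 << 8) |
           ((uint32_t)b2 << 16) |
           ((uint32_t)b3 << 24);
}

/* The primary opcode of a Power instruction is its top 6 bits. */
static unsigned int primary_opcode(uint32_t insn)
{
    unsigned int op = insn >> 26;
    assert(op < 64);
    return op;
}
```

Decoding the first word gives primary opcode 14 (addi), which xmon prints as its simplified form `li r3,1`.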
  16. 03 Mar, 2017 1 commit
  17. 02 Mar, 2017 1 commit
  18. 31 Jan, 2017 5 commits
  19. 30 Jan, 2017 2 commits
  20. 26 Nov, 2016 1 commit
    • powerpc/mm/radix: Prevent kernel execution of user space · 3b10d009
      Balbir Singh committed
      ISA 3 defines new encoded access authority that allows instruction
      access prevention in privileged mode and allows normal access
      to problem state. This patch just enables IAMR (Instruction Authority
      Mask Register); enabling AMR would require more work.
      
      I've tested this with a buggy driver and a simple payload. The payload
      is specific to the build I've tested.
      
      mpe: Also tested with LKDTM:
      
        # echo EXEC_USERSPACE > /sys/kernel/debug/provoke-crash/DIRECT
        lkdtm: Performing direct entry EXEC_USERSPACE
        lkdtm: attempting ok execution at c0000000005bf560
        lkdtm: attempting bad execution at 00003fff8d940000
        Unable to handle kernel paging request for instruction fetch
        Faulting instruction address: 0x3fff8d940000
        Oops: Kernel access of bad area, sig: 11 [#1]
        NIP: 00003fff8d940000 LR: c0000000005bfa58 CTR: 00003fff8d940000
        REGS: c0000000f1fcf900 TRAP: 0400   Not tainted  (4.9.0-rc5-compiler_gcc-6.2.0-00109-g956dbc06232a)
        MSR: 9000000010009033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 48002222  XER: 00000000
        ...
        Call Trace:
          lkdtm_EXEC_USERSPACE+0x104/0x120 (unreliable)
          lkdtm_do_action+0x3c/0x80
          direct_entry+0x100/0x1b0
          full_proxy_write+0x94/0x100
          __vfs_write+0x3c/0x1b0
          vfs_write+0xcc/0x230
          SyS_write+0x60/0x110
          system_call+0x38/0xfc
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  21. 25 Nov, 2016 1 commit
  22. 23 Nov, 2016 1 commit
    • powerpc/64: Provide functions for accessing POWER9 partition table · 9d661958
      Paul Mackerras committed
      POWER9 requires the host to set up a partition table, which is a
      table in memory indexed by logical partition ID (LPID) which
      contains the pointers to page tables and process tables for the
      host and each guest.
      
      This factors out the initialization of the partition table into
      a single function.  This code was previously duplicated between
      hash_utils_64.c and pgtable-radix.c.
      
      This provides a function for setting a partition table entry,
      which is used in early MMU initialization, and will be used by
      KVM whenever a guest is created.  This function includes a tlbie
      instruction which will flush all TLB entries for the LPID and
      all caches of the partition table entry for the LPID, across the
      system.
      
      This also moves a call to memblock_set_current_limit(), which was
      in radix_init_partition_table(), but has nothing to do with the
      partition table.  By analogy with the similar code for hash, the
      call gets moved to near the end of radix__early_init_mmu().  It
      now gets called when running as a guest, whereas previously it
      would only be called if the kernel is running as the host.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
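The pattern this commit describes, a table indexed by LPID whose setter always follows the memory update with an invalidation, can be modelled as below. This is a user-space sketch only: the names, the table size and the two-doubleword entry layout are assumptions, the flush is a counting stub standing in for the tlbie instruction, and the real code also has to handle endianness and memory barriers.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NUM_LPIDS 16 /* hypothetical table size */

/* One entry per LPID: two doublewords (page table and process table pointers). */
struct patb_entry_sketch {
    uint64_t dw0;
    uint64_t dw1;
};

static struct patb_entry_sketch partition_tb_sketch[SKETCH_NUM_LPIDS];
static unsigned int flush_count;
static unsigned int last_flushed_lpid;

/* Stub: the real function issues a tlbie covering the LPID and its cached entry. */
static void flush_partition_sketch(unsigned int lpid)
{
    last_flushed_lpid = lpid;
    flush_count++;
}

/* Update the entry, then always invalidate, so callers cannot forget the flush. */
static void partition_table_set_entry_sketch(unsigned int lpid,
                                             uint64_t dw0, uint64_t dw1)
{
    assert(lpid < SKETCH_NUM_LPIDS);
    partition_tb_sketch[lpid].dw0 = dw0;
    partition_tb_sketch[lpid].dw1 = dw1;
    flush_partition_sketch(lpid);
}
```

Keeping the update and the invalidation in one function is the design point: callers that write the table directly are the ones that can miss the flush.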
  23. 18 Nov, 2016 1 commit
  24. 17 Nov, 2016 2 commits
  25. 23 Sep, 2016 1 commit
  26. 13 Sep, 2016 1 commit
  27. 04 Aug, 2016 1 commit
    • powerpc/mm: Move register_process_table() out of ppc_md · eea8148c
      Michael Ellerman committed
      We want to initialise register_process_table() before ppc_md is setup,
      so that it can be called as part of MMU init (at least on Radix ATM).
      
      That no longer works because probe_machine() requires that ppc_md be
      empty before it's called, and we now do probe_machine() much later.
      
      So make register_process_table a global for now. It will probably move
      into a mmu_radix_ops struct at some point in the future.
      
      This was broken by me when applying commit 7025776e "powerpc/mm:
      Move hash table ops to a separate structure" due to conflicts with other
      patches.
      
      Fixes: 7025776e ("powerpc/mm: Move hash table ops to a separate structure")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  28. 01 Aug, 2016 1 commit