1. 13 2月, 2018 1 次提交
    • A
      powerpc/mm: Fix crashes with 16G huge pages · fae22116
      Aneesh Kumar K.V 提交于
      To support memory keys, we moved the hash pte slot information to the
      second half of the page table. This was ok with PTE entries at level
      4 (PTE page) and level 3 (PMD). We already allocate larger page table
      pages at those levels to accomodate extra details. For level 4 we
      already have the extra space which was used to track 4k hash page
      table entry details and at level 3 the extra space was allocated to
      track the THP details.
      
      With hugetlbfs PTE, we used this extra space at the PMD level to store
      the slot details. But we also support hugetlbfs PTE at PUD level for
      16GB pages and PUD level page didn't allocate extra space. This
      resulted in memory corruption.
      
      Fix this by allocating extra space at PUD level when HUGETLB is
      enabled.
      
      Fixes: bf9a95f9 ("powerpc: Free up four 64K PTE bits in 64K backed HPTE pages")
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: NRam Pai <linuxram@us.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      fae22116
  2. 22 1月, 2018 1 次提交
  3. 21 1月, 2018 1 次提交
    • A
      powerpc/hash: Skip non initialized page size in init_hpte_page_sizes · 10527e80
      Aneesh Kumar K.V 提交于
      One of the easiest way to test config with 4K HPTE is to disable 64K hardware
      page size like below.
      
      int __init htab_dt_scan_page_sizes(unsigned long node,
      
       		size -= 3; prop += 3;
       		base_idx = get_idx_from_shift(base_shift);
      -		if (base_idx < 0) {
      +		if (base_idx < 0 || base_idx == MMU_PAGE_64K) {
       			/* skip the pte encoding also */
       			prop += lpnum * 2; size -= lpnum * 2;
      
      But then this results in error in other part of the code such as MPSS parsing
      where we look at 4K base page size and 64K actual page size support.
      
      This patch fix MPSS parsing by ignoring the actual page sizes marked
      unsupported. In reality this can happen only with a corrupt device tree. But it
      is good to tighten the error check.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      10527e80
  4. 20 1月, 2018 3 次提交
  5. 17 1月, 2018 3 次提交
    • N
      powerpc/pseries: lift RTAS limit for hash · c610d65c
      Nicholas Piggin 提交于
      With the previous patch to switch to 64-bit mode after returning from
      RTAS and before doing any memory accesses, the RMA limit need not be
      clamped to 1GB to avoid RTAS bugs.
      
      Keep the 1GB limit for older firmware (although this is more of a kernel
      concern than RTAS), and remove it starting with POWER9.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c610d65c
    • N
      powerpc/powernv: Remove real mode access limit for early allocations · 1513c33d
      Nicholas Piggin 提交于
      This removes the RMA limit on powernv platform, which constrains
      early allocations such as PACAs and stacks. There are still other
      restrictions that must be followed, such as bolted SLB limits, but
      real mode addressing has no constraints.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      1513c33d
    • N
      powerpc/64s: Improve local TLB flush for boot and MCE on POWER9 · d4748276
      Nicholas Piggin 提交于
      There are several cases outside the normal address space management
      where a CPU's entire local TLB is to be flushed:
      
        1. Booting the kernel, in case something has left stale entries in
           the TLB (e.g., kexec).
      
        2. Machine check, to clean corrupted TLB entries.
      
      One other place where the TLB is flushed, is waking from deep idle
      states. The flush is a side-effect of calling ->cpu_restore with the
      intention of re-setting various SPRs. The flush itself is unnecessary
      because in the first case, the TLB should not acquire new corrupted
      TLB entries as part of sleep/wake (though they may be lost).
      
      This type of TLB flush is coded inflexibly, several times for each CPU
      type, and they have a number of problems with ISA v3.0B:
      
      - The current radix mode of the MMU is not taken into account, it is
        always done as a hash flushn For IS=2 (LPID-matching flush from host)
        and IS=3 with HV=0 (guest kernel flush), tlbie(l) is undefined if
        the R field does not match the current radix mode.
      
      - ISA v3.0B hash must flush the partition and process table caches as
        well.
      
      - ISA v3.0B radix must flush partition and process scoped translations,
        partition and process table caches, and also the page walk cache.
      
      So consolidate the flushing code and implement it in C and inline asm
      under the mm/ directory with the rest of the flush code. Add ISA v3.0B
      cases for radix and hash, and use the radix flush in radix environment.
      
      Provide a way for IS=2 (LPID flush) to specify the radix mode of the
      partition. Have KVM pass in the radix mode of the guest.
      
      Take out the flushes from early cputable/dt_cpu_ftrs detection hooks,
      and move it later in the boot process after, the MMU registers are set
      up and before relocation is first turned on.
      
      The TLB flush is no longer called when restoring from deep idle states.
      This was not be done as a separate step because booting secondaries
      uses the same cpu_restore as idle restore, which needs the TLB flush.
      Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      d4748276
  6. 20 12月, 2017 3 次提交
    • R
      powerpc: use helper functions to get and set hash slots · a8548686
      Ram Pai 提交于
      replace redundant code in __hash_page_4K() and flush_hash_page()
      with helper functions pte_get_hash_gslot() and pte_set_hidx()
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NRam Pai <linuxram@us.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      a8548686
    • R
      powerpc: Free up four 64K PTE bits in 4K backed HPTE pages · 9d2edb18
      Ram Pai 提交于
      Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6,
      in the 4K backed HPTE pages.These bits continue to be used
      for 64K backed HPTE pages in this patch, but will be freed
      up in the next patch. The bit numbers are big-endian as
      defined in the ISA3.0
      
      The patch does the following change to the 4k HTPE backed
      64K PTE's format.
      
      H_PAGE_BUSY moves from bit 3 to bit 9 (B bit in the figure
      		below)
      V0 which occupied bit 4 is not used anymore.
      V1 which occupied bit 5 is not used anymore.
      V2 which occupied bit 6 is not used anymore.
      V3 which occupied bit 7 is not used anymore.
      
      Before the patch, the 4k backed 64k PTE format was as follows
      
       0 1 2 3 4  5  6  7  8 9 10...........................63
       : : : : :  :  :  :  : : :                            :
       v v v v v  v  v  v  v v v                            v
      
      ,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
      |x|x|x|B|V0|V1|V2|V3|x| | |x|x|................|x|x|x|x| <- primary pte
      '_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
      |S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte
      '_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'
      
      After the patch, the 4k backed 64k PTE format is as follows
      
       0 1 2 3 4  5  6  7  8 9 10...........................63
       : : : : :  :  :  :  : : :                            :
       v v v v v  v  v  v  v v v                            v
      
      ,-,-,-,-,--,--,--,--,-,-,-,-,-,------------------,-,-,-,
      |x|x|x| |  |  |  |  |x|B| |x|x|................|.|.|.|.| <- primary pte
      '_'_'_'_'__'__'__'__'_'_'_'_'_'________________'_'_'_'_'
      |S|G|I|X|S |G |I |X |S|G|I|X|..................|S|G|I|X| <- secondary pte
      '_'_'_'_'__'__'__'__'_'_'_'_'__________________'_'_'_'_'
      
      the four bits S,G,I,X (one quadruplet per 4k HPTE) that
      cache the hash-bucket slot value, is initialized to
      1,1,1,1 indicating -- an invalid slot. If a HPTE gets
      cached in a 1111 slot(i.e 7th slot of secondary hash
      bucket), it is released immediately. In other words,
      even though 1111 is a valid slot value in the hash
      bucket, we consider it invalid and release the slot and
      the HPTE. This gives us the opportunity to determine
      the validity of S,G,I,X bits based on its contents and
      not on any of the bits V0,V1,V2 or V3 in the primary PTE
      
      When we release a HPTE cached in the 1111 slot
      we also release a legitimate slot in the primary
      hash bucket and unmap its corresponding HPTE. This
      is to ensure that we do get a HPTE cached in a slot
      of the primary hash bucket, the next time we retry.
      
      Though treating 1111 slot as invalid, reduces the
      number of available slots in the hash bucket and may
      have an effect on the performance, the probabilty of
      hitting a 1111 slot is extermely low.
      
      Compared to the current scheme, the above scheme
      reduces the number of false hash table updates
      significantly and has the added advantage of releasing
      four valuable PTE bits for other purpose.
      
      NOTE:even though bits 3, 4, 5, 6, 7 are not used when
      the 64K PTE is backed by 4k HPTE, they continue to be
      used if the PTE gets backed by 64k HPTE. The next
      patch will decouple that aswell, and truely release the
      bits.
      
      This idea was jointly developed by Paul Mackerras,
      Aneesh, Michael Ellermen and myself.
      
      4K PTE format remains unchanged currently.
      
      The patch does the following code changes
      a) PTE flags are split between 64k and 4k header files.
      b) __hash_page_4K() is reimplemented to reflect the
       above logic.
      Acked-by: NBalbir Singh <bsingharora@gmail.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NRam Pai <linuxram@us.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      9d2edb18
    • R
      powerpc: introduce pte_get_hash_gslot() helper · 318995b4
      Ram Pai 提交于
      Introduce pte_get_hash_gslot()() which returns the global slot number of
      the HPTE in the global hash table.
      
      This function will come in handy as we work towards re-arranging the PTE
      bits in the later patches.
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NRam Pai <linuxram@us.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      318995b4
  7. 06 11月, 2017 1 次提交
    • A
      powerpc/mm/hash: Add pr_fmt() to hash_utils64.c · 7f142661
      Aneesh Kumar K.V 提交于
      Make the printks look a bit nicer by adding a prefix.
      
      Radix config now do
       radix-mmu: Page sizes from device-tree:
       radix-mmu: Page size shift = 12 AP=0x0
       radix-mmu: Page size shift = 16 AP=0x5
       radix-mmu: Page size shift = 21 AP=0x1
       radix-mmu: Page size shift = 30 AP=0x2
      
      This patch update hash config to do similar dmesg output. With the patch we have
      
       hash-mmu: Page sizes from device-tree:
       hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
       hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
       hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
       hash-mmu: base_shift=16: shift=16, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=1
       hash-mmu: base_shift=16: shift=24, sllp=0x0110, avpnm=0x00000000, tlbiel=1, penc=8
       hash-mmu: base_shift=20: shift=20, sllp=0x0111, avpnm=0x00000000, tlbiel=0, penc=2
       hash-mmu: base_shift=24: shift=24, sllp=0x0100, avpnm=0x00000001, tlbiel=0, penc=0
       hash-mmu: base_shift=34: shift=34, sllp=0x0120, avpnm=0x000007ff, tlbiel=0, penc=3
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      7f142661
  8. 23 8月, 2017 1 次提交
  9. 17 8月, 2017 1 次提交
    • A
      powerpc/mm: Rename find_linux_pte_or_hugepte() · 94171b19
      Aneesh Kumar K.V 提交于
      Add newer helpers to make the function usage simpler. It is always
      recommended to use find_current_mm_pte() for walking the page table.
      If we cannot use find_current_mm_pte(), it should be documented why
      the said usage of __find_linux_pte() is safe against a parallel THP
      split.
      
      For now we have KVM code using __find_linux_pte(). This is because kvm
      code ends up calling __find_linux_pte() in real mode with MSR_EE=0 but
      with PACA soft_enabled = 1. We may want to fix that later and make
      sure we keep the MSR_EE and PACA soft_enabled in sync. When we do that
      we can switch kvm to use find_linux_pte().
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      94171b19
  10. 16 8月, 2017 1 次提交
    • A
      powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel command line · 79cc38de
      Aneesh Kumar K.V 提交于
      With commit aa888a74 ("hugetlb: support larger than MAX_ORDER") we added
      support for allocating gigantic hugepages via kernel command line. Switch
      ppc64 arch specific code to use that.
      
      W.r.t FSL support, we now limit our allocation range using BOOTMEM_ALLOC_ACCESSIBLE.
      
      We use the kernel command line to do reservation of hugetlb pages on powernv
      platforms. On pseries hash mmu mode the supported gigantic huge page size is
      16GB and that can only be allocated with hypervisor assist. For pseries the
      command line option doesn't do the allocation. Instead pseries does gigantic
      hugepage allocation based on hypervisor hint that is specified via
      "ibm,expected#pages" property of the memory node.
      
      Cc: Scott Wood <oss@buserror.net>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      79cc38de
  11. 08 8月, 2017 1 次提交
  12. 31 7月, 2017 1 次提交
  13. 23 6月, 2017 1 次提交
    • B
      powerpc/mm: Trace tlbie(l) instructions · 0428491c
      Balbir Singh 提交于
      Add a trace point for tlbie(l) (Translation Lookaside Buffer Invalidate
      Entry (Local)) instructions.
      
      The tlbie instruction has changed over the years, so not all versions
      accept the same operands. Use the ISA v3 field operands because they are
      the most verbose, we may change them in future.
      
      Example output:
      
        qemu-system-ppc-5371  [016]  1412.369519: tlbie:
        	tlbie with lpid 0, local 1, rb=67bd8900174c11c1, rs=0, ric=0 prs=0 r=0
      Signed-off-by: NBalbir Singh <bsingharora@gmail.com>
      [mpe: Add some missing trace_tlbie()s, reword change log]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      0428491c
  14. 11 4月, 2017 1 次提交
  15. 03 4月, 2017 1 次提交
  16. 01 4月, 2017 1 次提交
    • A
      powerpc/pseries: Skip using reserved virtual address range · 82228e36
      Aneesh Kumar K.V 提交于
      Now that we use all the available virtual address range, we need to make
      sure we don't generate VSID such that it overlaps with the reserved vsid
      range. Reserved vsid range include the virtual address range used by the
      adjunct partition and also the VRMA virtual segment. We find the context
      value that can result in generating such a VSID and reserve it early in
      boot.
      
      We don't look at the adjunct range, because for now we disable the
      adjunct usage in a Linux LPAR via CAS interface.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [mpe: Rewrite hash__reserve_context_id(), move the rest into pseries]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      82228e36
  17. 31 3月, 2017 2 次提交
  18. 02 3月, 2017 1 次提交
  19. 10 2月, 2017 2 次提交
  20. 17 1月, 2017 1 次提交
  21. 25 12月, 2016 1 次提交
  22. 25 11月, 2016 1 次提交
  23. 23 11月, 2016 1 次提交
    • P
      powerpc/64: Provide functions for accessing POWER9 partition table · 9d661958
      Paul Mackerras 提交于
      POWER9 requires the host to set up a partition table, which is a
      table in memory indexed by logical partition ID (LPID) which
      contains the pointers to page tables and process tables for the
      host and each guest.
      
      This factors out the initialization of the partition table into
      a single function.  This code was previously duplicated between
      hash_utils_64.c and pgtable-radix.c.
      
      This provides a function for setting a partition table entry,
      which is used in early MMU initialization, and will be used by
      KVM whenever a guest is created.  This function includes a tlbie
      instruction which will flush all TLB entries for the LPID and
      all caches of the partition table entry for the LPID, across the
      system.
      
      This also moves a call to memblock_set_current_limit(), which was
      in radix_init_partition_table(), but has nothing to do with the
      partition table.  By analogy with the similar code for hash, the
      call gets moved to near the end of radix__early_init_mmu().  It
      now gets called when running as a guest, whereas previously it
      would only be called if the kernel is running as the host.
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      9d661958
  24. 18 11月, 2016 1 次提交
  25. 12 10月, 2016 1 次提交
    • M
      powerpc/mm/hash64: Fix might_have_hea() check · 08bf75ba
      Michael Ellerman 提交于
      In commit 2b4e3ad8 ("powerpc/mm/hash64: Don't test for machine type
      to detect HEA special case") we changed the logic in might_have_hea()
      to check FW_FEATURE_SPLPAR rather than machine_is(pseries).
      
      However the check was incorrectly negated, leading to crashes on
      machines with HEA adapters, such as:
      
        mm: Hashing failure ! EA=0xd000080080004040 access=0x800000000000000e current=NetworkManager
            trap=0x300 vsid=0x13d349c ssize=1 base psize=2 psize 2 pte=0xc0003cc033e701ae
        Unable to handle kernel paging request for data at address 0xd000080080004040
        Call Trace:
          .ehea_create_cq+0x148/0x340 [ehea] (unreliable)
          .ehea_up+0x258/0x1200 [ehea]
          .ehea_open+0x44/0x1a0 [ehea]
          ...
      
      Fix it by removing the negation.
      
      Fixes: 2b4e3ad8 ("powerpc/mm/hash64: Don't test for machine type to detect HEA special case")
      Cc: stable@vger.kernel.org # v4.8+
      Reported-by: NDenis Kirjanov <kda@linux-powerpc.org>
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      08bf75ba
  26. 23 9月, 2016 1 次提交
  27. 13 9月, 2016 1 次提交
  28. 09 9月, 2016 1 次提交
    • P
      powerpc/mm: Speed up computation of base and actual page size for a HPTE · 0eeede0c
      Paul Mackerras 提交于
      This replaces a 2-D search through an array with a simple 8-bit table
      lookup for determining the actual and/or base page size for a HPT entry.
      
      The encoding in the second doubleword of the HPTE is designed to encode
      the actual and base page sizes without using any more bits than would be
      needed for a 4k page number, by using between 1 and 8 low-order bits of
      the RPN (real page number) field to encode the page sizes.  A single
      "large page" bit in the first doubleword indicates that these low-order
      bits are to be interpreted like this.
      
      We can determine the page sizes by using the low-order 8 bits of the RPN
      to look up a 256-entry table.  For actual page sizes less than 1MB, some
      of the upper bits of these 8 bits are going to be real address bits, but
      we can cope with that by replicating the entries for those smaller page
      sizes.
      
      While we're at it, let's move the hpte_page_size() and hpte_base_page_size()
      functions from a KVM-specific header to a header for 64-bit HPT systems,
      since this computation doesn't have anything specifically to do with KVM.
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
      0eeede0c
  29. 01 8月, 2016 2 次提交
  30. 28 7月, 2016 1 次提交
  31. 26 7月, 2016 1 次提交