1. 08 10月, 2014 3 次提交
  2. 25 9月, 2014 3 次提交
  3. 05 8月, 2014 1 次提交
  4. 28 7月, 2014 2 次提交
  5. 06 6月, 2014 1 次提交
    • M
      powerpc/mm: Check paca psize is up to date for huge mappings · 09567e7f
      Michael Ellerman 提交于
      We have a bug in our hugepage handling which exhibits as an infinite
      loop of hash faults. If the fault is being taken in the kernel it will
      typically trigger the softlockup detector, or the RCU stall detector.
      
      The bug is as follows:
      
       1. mmap(0xa0000000, ..., MAP_FIXED | MAP_HUGE_TLB | MAP_ANONYMOUS ..)
       2. Slice code converts the slice psize to 16M.
       3. The code on lines 539-540 of slice.c in slice_get_unmapped_area()
          synchronises the mm->context with the paca->context. So the paca slice
          mask is updated to include the 16M slice.
       3. Either:
          * mmap() fails because there are no huge pages available.
          * mmap() succeeds and the mapping is then munmapped.
          In both cases the slice psize remains at 16M in both the paca & mm.
       4. mmap(0xa0000000, ..., MAP_FIXED | MAP_ANONYMOUS ..)
       5. The slice psize is converted back to 64K. Because of the check on line 539
          of slice.c we DO NOT update the paca->context. The paca slice mask is now
          out of sync with the mm slice mask.
       6. User/kernel accesses 0xa0000000.
       7. The SLB miss handler slb_allocate_realmode() **uses the paca slice mask**
          to create an SLB entry and inserts it in the SLB.
      18. With the 16M SLB entry in place the hardware does a hash lookup, no entry
          is found so a data access exception is generated.
      19. The data access handler calls do_page_fault() -> handle_mm_fault().
      10. __handle_mm_fault() creates a THP mapping with do_huge_pmd_anonymous_page().
      11. The hardware retries the access, there is still nothing in the hash table
          so once again a data access exception is generated.
      12. hash_page() calls into __hash_page_thp() and inserts a mapping in the
          hash. Although the THP mapping maps 16M the hashing is done using 64K
          as the segment page size.
      13. hash_page() returns immediately after calling __hash_page_thp(), skipping
          over the code at line 1125. Resulting in the mismatch between the
          paca->context and mm->context not being detected.
      14. The hardware retries the access, the hash it generates using the 16M
          SLB entry does NOT match the hash we inserted.
      15. We take another data access and go into __hash_page_thp().
      16. We see a valid entry in the hpte_slot_array and so we call updatepp()
          which succeeds.
      17. Goto 14.
      
      We could fix this in two ways. The first would be to remove or modify
      the check on line 539 of slice.c.
      
      The second option is to cause the check of paca psize in hash_page() on
      line 1125 to also be done for THP pages.
      
      We prefer the latter, because the check & update of the paca psize is
      not done until we know it's necessary. It's also done only on the
      current cpu, so we don't need to IPI all other cpus.
      
      Without further rearranging the code, the simplest fix is to pull out
      the code that checks paca psize and call it in two places. Firstly for
      THP/hugetlb, and secondly for other mappings as before.
      
      Thanks to Dave Jones for trinity, which originally found this bug.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: stable@vger.kernel.org [v3.11+]
      09567e7f
  6. 01 5月, 2014 1 次提交
  7. 30 4月, 2014 1 次提交
  8. 29 4月, 2014 1 次提交
    • A
      KVM guest: Make pv trampoline code executable · b18db0b8
      Alexander Graf 提交于
      Our PV guest patching code assembles chunks of instructions on the fly when it
      encounters more complicated instructions to hijack. These instructions need
      to live in a section that we don't mark as non-executable, as otherwise we
      fault when jumping there.
      
      Right now we put it into the .bss section where it automatically gets marked
      as non-executable. Add a check to the NX setting function to ensure that we
      leave these particular pages executable.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b18db0b8
  9. 23 4月, 2014 2 次提交
  10. 11 2月, 2014 1 次提交
    • M
      powerpc: Fix kdump hang issue on p8 with relocation on exception enabled. · 429d2e83
      Mahesh Salgaonkar 提交于
      On p8 systems, with relocation on exception feature enabled we are seeing
      kdump kernel hang at interrupt vector 0xc*4400. The reason is, with this
      feature enabled, exception are raised with MMU (IR=DR=1) ON with the
      default offset of 0xc*4000. Since exception is raised in virtual mode it
      requires the vector region to be executable without which it fails to
      fetch and execute instruction at 0xc*4xxx. For default kernel since kernel
      is loaded at real 0, the htab mappings sets the entire kernel text region
      executable. But for relocatable kernel (e.g. kdump case) we only copy
      interrupt vectors down to real 0 and never marked that region as
      executable because in p7 and below we always get exception in real mode.
      
      This patch fixes this issue by marking htab mapping range as executable
      that overlaps with the interrupt vector region for relocatable kernel.
      
      Thanks to Ben who helped me to debug this issue and find the root cause.
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      429d2e83
  11. 09 12月, 2013 1 次提交
  12. 11 10月, 2013 1 次提交
  13. 14 8月, 2013 1 次提交
  14. 01 7月, 2013 1 次提交
    • P
      powerpc: Delete __cpuinit usage from all users · 061d19f2
      Paul Gortmaker 提交于
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the powerpc uses of the __cpuinit macros.  There
      are no __CPUINIT users in assembly files in powerpc.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Josh Boyer <jwboyer@gmail.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Kumar Gala <galak@kernel.crashing.org>
      Cc: linuxppc-dev@lists.ozlabs.org
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      061d19f2
  15. 21 6月, 2013 4 次提交
    • A
      powerpc: Make linux pagetable walk safe with THP enabled · 0ac52dd7
      Aneesh Kumar K.V 提交于
      We need to have irqs disabled to handle all the possible parallel update for
      linux page table without holding locks.
      
      Events that we are intersted in while walking page tables are
      1) Page fault
      2) umap
      3) THP split
      4) THP collapse
      
      A) local_irq_disabled:
      ------------------------
      1) page fault:
      A none to valid transition via page fault is not an issue because we
      would either see a none or valid. If it is none, we would error out
      the page table walk. We may need to use on stack values when checking for
      type of page table elements, because if we do
      
      if (!is_hugepd()) {
          if (!pmd_none() {
             if (pmd_bad() {
      
      We could take that bad condition because the pmd got converted to a hugepd
      after the !is_hugepd check via a hugetlb fault.
      
      The right way would be to check for pmd_none higher up or use on stack value.
      
      2) A valid to none conversion via unmap:
      We can safely walk the upper level table, because we don't remove the the
      page table entries until rcu grace period. So even if we followed a
      wrong pointer we still have the pointer valid till the grace period.
      
      A PTE pointer returned need to be atomically checked for _PAGE_PRESENT and
       _PAGE_BUSY. A valid pointer returned could becoming none later. To prevent
      pte_clear we take _PAGE_BUSY.
      
      3) THP split:
      A valid transparent hugepage is converted to nomal page. Before we split we
      do pmd_splitting_flush, which sets the hugepage PTE to _PAGE_SPLITTING
      So when walking page table we need to check for pmd_trans_splitting and
      handle that. The pte returned should also need to be checked for
      _PAGE_SPLITTING before setting _PAGE_BUSY similar to _PAGE_PRESENT. We save
      the value of PTE on stack and check for the flag in the local pte value.
      If we don't have the value set we can safely operate on the local pte value
      and we atomicaly set _PAGE_BUSY.
      
      4) THP collapse:
      A normal page gets converted to hugepage. In the collapse path, we
      mark the pmd none early (pmdp_clear_flush). With irq disabled, if we
      are aleady walking page table we would see the pmd_none and won't continue.
      If we see a valid PMD, we should still check for _PAGE_PRESENT before
      setting _PAGE_BUSY, to make sure we didn't collapse the PTE to a Huge PTE.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      0ac52dd7
    • A
      powerpc/THP: Add code to handle HPTE faults for hugepages · 6d492ecc
      Aneesh Kumar K.V 提交于
      The deposted PTE page in the second half of the PMD table is used to
      track the state on hash PTEs. After updating the HPTE, we mark the
      coresponding slot in the deposted PTE page valid.
      Reviewed-by: NDavid Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      6d492ecc
    • A
      powerpc: Replace find_linux_pte with find_linux_pte_or_hugepte · 12bc9f6f
      Aneesh Kumar K.V 提交于
      Replace find_linux_pte with find_linux_pte_or_hugepte and explicitly
      document why we don't need to handle transparent hugepages at callsites.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      12bc9f6f
    • A
      powerpc/mm: handle hugepage size correctly when invalidating hpte entries · db3d8534
      Aneesh Kumar K.V 提交于
      If a hash bucket gets full, we "evict" a more/less random entry from it.
      When we do that we don't invalidate the TLB (hpte_remove) because we assume
      the old translation is still technically "valid". This implies that when
      we are invalidating or updating pte, even if HPTE entry is not valid
      we should do a tlb invalidate. With hugepages, we need to pass the correct
      actual page size value for tlb invalidation.
      
      This change update the patch 0608d692
      "powerpc/mm: Always invalidate tlb on hpte invalidate and update" to handle
      transparent hugepages correctly.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      db3d8534
  16. 14 5月, 2013 1 次提交
    • L
      powerpc: Exception hooks for context tracking subsystem · ba12eede
      Li Zhong 提交于
      This is the exception hooks for context tracking subsystem, including
      data access, program check, single step, instruction breakpoint, machine check,
      alignment, fp unavailable, altivec assist, unknown exception, whose handlers
      might use RCU.
      
      This patch corresponds to
      [PATCH] x86: Exception hooks for userspace RCU extended QS
        commit 6ba3c97a
      
      But after the exception handling moved to generic code, and some changes in
      following two commits:
      56dd9470
        context_tracking: Move exception handling to generic code
      6c1e0256
        context_tracking: Restore correct previous context state on exception exit
      
      it is able for exception hooks to use the generic code above instead of a
      redundant arch implementation.
      Signed-off-by: NLi Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      ba12eede
  17. 06 5月, 2013 1 次提交
  18. 30 4月, 2013 3 次提交
  19. 18 4月, 2013 2 次提交
    • L
      powerpc: Try to insert the hptes repeatedly in kernel_map_linear_page() · 016af59f
      Li Zhong 提交于
      This patch fixes the following oops, which could be trigged by build the kernel
      with many concurrent threads, under CONFIG_DEBUG_PAGEALLOC.
      
      hpte_insert() might return -1, indicating that the bucket (primary here)
      is full. We are not necessarily reporting a BUG in this case. Instead, we could
      try repeatedly (try secondary, remove and try again) until we find a slot.
      
      [  543.075675] ------------[ cut here ]------------
      [  543.075701] kernel BUG at arch/powerpc/mm/hash_utils_64.c:1239!
      [  543.075714] Oops: Exception in kernel mode, sig: 5 [#1]
      [  543.075722] PREEMPT SMP NR_CPUS=16 DEBUG_PAGEALLOC NUMA pSeries
      [  543.075741] Modules linked in: binfmt_misc ehea
      [  543.075759] NIP: c000000000036eb0 LR: c000000000036ea4 CTR: c00000000005a594
      [  543.075771] REGS: c0000000a90832c0 TRAP: 0700   Not tainted  (3.8.0-next-20130222)
      [  543.075781] MSR: 8000000000029032 <SF,EE,ME,IR,DR,RI>  CR: 22224482  XER: 00000000
      [  543.075816] SOFTE: 0
      [  543.075823] CFAR: c00000000004c200
      [  543.075830] TASK = c0000000e506b750[23934] 'cc1' THREAD: c0000000a9080000 CPU: 1
      GPR00: 0000000000000001 c0000000a9083540 c000000000c600a8 ffffffffffffffff
      GPR04: 0000000000000050 fffffffffffffffa c0000000a90834e0 00000000004ff594
      GPR08: 0000000000000001 0000000000000000 000000009592d4d8 c000000000c86854
      GPR12: 0000000000000002 c000000006ead300 0000000000a51000 0000000000000001
      GPR16: f000000003354380 ffffffffffffffff ffffffffffffff80 0000000000000000
      GPR20: 0000000000000001 c000000000c600a8 0000000000000001 0000000000000001
      GPR24: 0000000003354380 c000000000000000 0000000000000000 c000000000b65950
      GPR28: 0000002000000000 00000000000cd50e 0000000000bf50d9 c000000000c7c230
      [  543.076005] NIP [c000000000036eb0] .kernel_map_pages+0x1e0/0x3f8
      [  543.076016] LR [c000000000036ea4] .kernel_map_pages+0x1d4/0x3f8
      [  543.076025] Call Trace:
      [  543.076033] [c0000000a9083540] [c000000000036ea4] .kernel_map_pages+0x1d4/0x3f8 (unreliable)
      [  543.076053] [c0000000a9083640] [c000000000167638] .get_page_from_freelist+0x6cc/0x8dc
      [  543.076067] [c0000000a9083800] [c000000000167a48] .__alloc_pages_nodemask+0x200/0x96c
      [  543.076082] [c0000000a90839c0] [c0000000001ade44] .alloc_pages_vma+0x160/0x1e4
      [  543.076098] [c0000000a9083a80] [c00000000018ce04] .handle_pte_fault+0x1b0/0x7e8
      [  543.076113] [c0000000a9083b50] [c00000000018d5a8] .handle_mm_fault+0x16c/0x1a0
      [  543.076129] [c0000000a9083c00] [c0000000007bf1dc] .do_page_fault+0x4d0/0x7a4
      [  543.076144] [c0000000a9083e30] [c0000000000090e8] handle_page_fault+0x10/0x30
      [  543.076155] Instruction dump:
      [  543.076163] 7c630038 78631d88 e80a0000 f8410028 7c0903a6 e91f01de e96a0010 e84a0008
      [  543.076192] 4e800421 e8410028 7c7107b4 7a200fe0 <0b000000> 7f63db78 48785781 60000000
      [  543.076224] ---[ end trace bd5807e8d6ae186b ]---
      Signed-off-by: NLi Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      016af59f
    • L
      powerpc: Split the code trying to insert hpte repeatedly as an helper function · b170bd3d
      Li Zhong 提交于
      Move the logic trying to insert hpte in __hash_page_huge() to an helper
      function, so it could also be used by others.
      Signed-off-by: NLi Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      b170bd3d
  20. 17 3月, 2013 1 次提交
    • A
      powerpc: Update kernel VSID range · c60ac569
      Aneesh Kumar K.V 提交于
      This patch change the kernel VSID range so that we limit VSID_BITS to 37.
      This enables us to support 64TB with 65 bit VA (37+28). Without this patch
      we have boot hangs on platforms that only support 65 bit VA.
      
      With this patch we now have proto vsid generated as below:
      
      We first generate a 37-bit "proto-VSID". Proto-VSIDs are generated
      from mmu context id and effective segment id of the address.
      
      For user processes max context id is limited to ((1ul << 19) - 5)
      for kernel space, we use the top 4 context ids to map address as below
      0x7fffc -  [ 0xc000000000000000 - 0xc0003fffffffffff ]
      0x7fffd -  [ 0xd000000000000000 - 0xd0003fffffffffff ]
      0x7fffe -  [ 0xe000000000000000 - 0xe0003fffffffffff ]
      0x7ffff -  [ 0xf000000000000000 - 0xf0003fffffffffff ]
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: NGeoff Levand <geoff@infradead.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: <stable@vger.kernel.org> [v3.8]
      c60ac569
  21. 13 3月, 2013 1 次提交
  22. 15 2月, 2013 1 次提交
  23. 17 9月, 2012 2 次提交
  24. 05 9月, 2012 1 次提交
  25. 29 3月, 2012 1 次提交
  26. 21 3月, 2012 1 次提交
  27. 23 2月, 2012 1 次提交
    • M
      fadump: Register for firmware assisted dump. · 3ccc00a7
      Mahesh Salgaonkar 提交于
      On 2012-02-20 11:02:51 Mon, Paul Mackerras wrote:
      > On Thu, Feb 16, 2012 at 04:44:30PM +0530, Mahesh J Salgaonkar wrote:
      >
      > If I have read the code correctly, we are going to get this printk on
      > non-pSeries machines or on older pSeries machines, even if the user
      > has not put the fadump=on option on the kernel command line.  The
      > printk will be annoying since there is no actual error condition.  It
      > seems to me that the condition for the printk should include
      > fw_dump.fadump_enabled.  In other words you should probably add
      >
      > 	if (!fw_dump.fadump_enabled)
      > 		return 0;
      >
      > at the beginning of the function.
      
      Hi Paul,
      
      Thanks for pointing it out. Please find the updated patch below.
      
      The existing patches above this (4/10 through 10/10) cleanly applies
      on this update.
      
      Thanks,
      -Mahesh.
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      3ccc00a7