1. 01 March 2016, 4 commits
    • powerpc/mm: Clean up memory hotplug failure paths · 1dace6c6
      David Gibson authored
      This makes a number of cleanups to handling of mapping failures during
      memory hotplug on Power:
      
      For errors creating the linear mapping for the hot-added region:
        * This is now reported with EFAULT, which is more appropriate than the
          previous EINVAL (the failure is unlikely to be related to the
          function's parameters).
        * An error in this path now prints a warning message, rather than just
          silently failing to add the extra memory.
        * Previously a failure here could result in the region being partially
          mapped.  We now clean up any partial mapping before failing.
      
      For errors creating the vmemmap for the hot-added region:
         * This is now reported with EFAULT instead of causing a BUG() - this
           could happen for external reasons (e.g. a full hash table), so it's
           better to handle it non-fatally.
         * An error message is also printed, so the failure won't be silent.
         * As above, a failure could leave a partially mapped region; we now
           clean this up (sketched below). [mpe: move htab_remove_mapping() out
           of #ifdef CONFIG_MEMORY_HOTPLUG to enable this]
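
      A rough sketch of the resulting error paths, paraphrased from the points
      above rather than quoted from the patch (create_section_mapping(),
      htab_bolt_mapping() and htab_remove_mapping() are the usual powerpc
      helpers; the exact arguments and message wording here are approximate):

          /* Linear mapping for the hot-added range: warn and return
           * -EFAULT rather than -EINVAL, and never fail silently. */
          if (create_section_mapping(start, start + size)) {
                  pr_warn("Unable to create mapping for hot added memory 0x%llx..0x%llx\n",
                          start, start + size);
                  return -EFAULT;
          }

          /* vmemmap mapping for the hot-added range: clean up any partial
           * mapping and report -EFAULT instead of hitting a BUG(). */
          rc = htab_bolt_mapping(start, start + page_size, phys,
                                 pgprot_val(PAGE_KERNEL),
                                 mmu_vmemmap_psize, mmu_kernel_ssize);
          if (rc < 0) {
                  htab_remove_mapping(start, start + page_size,
                                      mmu_vmemmap_psize, mmu_kernel_ssize);
                  pr_warn("Unable to map vmemmap for hot added memory\n");
                  return -EFAULT;
          }
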
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Paul Mackerras <paulus@samba.org>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm: Handle removing maybe-present bolted HPTEs · 27828f98
      David Gibson authored
      At the moment the hpte_removebolted callback in ppc_md returns void and
      will BUG_ON() if the hpte it's asked to remove doesn't exist in the first
      place.  This is awkward for the case of cleaning up a mapping which was
      partially made before failing.
      
      So, we add a return value to hpte_removebolted, and have it return ENOENT
      in the case that the HPTE to remove didn't exist in the first place.
      
      In the (sole) caller, we propagate errors in hpte_removebolted to its
      caller to handle.  However, we handle ENOENT specially, continuing to
      complete the unmapping over the specified range before returning the error
      to the caller.
      
      This means that htab_remove_mapping() will work sanely on a partially
      present mapping, removing any HPTEs which are present, while also returning
      ENOENT to its caller in case it's important there.
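
      A minimal sketch of what this means inside htab_remove_mapping() (this
      paraphrases the behaviour described above; the loop variables and the
      exact ppc_md.hpte_removebolted() arguments are illustrative):

          int rc = 0, ret;

          for (vaddr = vstart; vaddr < vend; vaddr += step) {
                  ret = ppc_md.hpte_removebolted(vaddr, psize, ssize);
                  if (ret == -ENOENT) {
                          rc = -ENOENT;   /* remember it, but keep unmapping */
                          continue;
                  }
                  if (ret < 0)
                          return ret;     /* any other error is fatal here */
          }
          return rc;  /* 0, or -ENOENT if some HPTEs were already absent */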
      
      There are two callers of htab_remove_mapping():
         - In remove_section_mapping() we already WARN_ON() any error return,
           which is reasonable - in this case the mapping should be fully
           present
         - In vmemmap_remove_mapping() we BUG_ON() any error.  We change that to
           just a WARN_ON() in the case of ENOENT, since failing to remove a
           mapping that wasn't there in the first place probably shouldn't be
           fatal.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm: Clean up error handling for htab_remove_mapping · abd0a0e7
      David Gibson authored
      Currently, the only error that htab_remove_mapping() can report is -EINVAL,
      if removal of bolted HPTEs isn't implemented for this platform.  We make
      a few cleanups to the handling of this:
      
       * EINVAL isn't really the right code - there's nothing wrong with the
         function's arguments - use ENODEV instead
       * We were also printing a warning message, but that's a decision better
         left up to the callers, so remove it
       * One caller is vmemmap_remove_mapping(), which will just BUG_ON() on
         error, making the warning message redundant, so no change is needed
         there.
       * The other caller is remove_section_mapping().  This is called in the
         memory hot remove path at a point after vmemmap_remove_mapping(), so
         if hpte_removebolted isn't implemented, we'd expect to have already
         BUG()ed anyway.  Put a WARN_ON() here, in lieu of a printk(), since
         this really shouldn't be happening (sketched below).
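
      A sketch of the intent rather than the literal diff (argument lists are
      approximate):

          /* htab_remove_mapping(): report "not supported" quietly and let
           * each caller decide how loudly to react. */
          if (!ppc_md.hpte_removebolted)
                  return -ENODEV;

          /* remove_section_mapping(): this shouldn't happen this late in
           * the hot-remove path, so a WARN_ON() is enough. */
          rc = htab_remove_mapping(start, end, mmu_linear_psize,
                                   mmu_kernel_ssize);
          WARN_ON(rc < 0);
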
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • 446957ba
  2. 29 February 2016, 3 commits
  3. 27 February 2016, 2 commits
  4. 22 February 2016, 1 commit
    • powerpc/mm/hash: Clear the invalid slot information correctly · 9ab3ac23
      Aneesh Kumar K.V authored
      We can get a hash pte fault with 4K base page size and find the pte
      already inserted with 64K base page size.  In that case we need to clear
      the existing slot information from the old pte; do that correctly.

      With THP, we also clear the slot information for all the 64K hash ptes
      mapping that 16MB page, since they are all invalid now.  This makes sure
      we don't find the slot valid when we fault with 4K base page size.
      Finding the slot valid should not result in any wrong behavior, because
      we check again in the hash page table for validity, but this way we can
      avoid that check completely.
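
      A hedged sketch of the idea in the 4K-base-page hash fault path (the pte
      flag names below are the powerpc pte bits of that era, quoted from memory
      rather than from the patch, so treat them as illustrative):

          /*
           * The old pte was hashed with a 64K base page (_PAGE_COMBO clear),
           * so any slot details it carries are stale for this 4K fault:
           * flush the old 64K hash entry and forget the slot before
           * inserting a new 4K one.
           */
          if (!(old_pte & _PAGE_COMBO)) {
                  flush_hash_page(vpn, rpte, MMU_PAGE_64K, ssize, flags);
                  old_pte &= ~(_PAGE_HASHPTE | _PAGE_F_GIX | _PAGE_F_SECOND);
                  new_pte &= ~(_PAGE_HASHPTE | _PAGE_F_GIX | _PAGE_F_SECOND);
          }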
      
      Fixes: a43c0eb8 ("powerpc/mm: Convert 4k hash insert to C")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  5. 15 February 2016, 1 commit
    • powerpc/mm: Fix Multi hit ERAT caused by recent THP update · c777e2a8
      Aneesh Kumar K.V authored
      With ppc64 we use the deposited pgtable_t to store the hash pte slot
      information.  We should not withdraw the deposited pgtable_t without
      marking the pmd none.  This ensures that the low level hash fault
      handling will skip this huge pte and we will handle it at upper levels.

      A recent change to pmd splitting changed the above in order to handle
      the race between pmd split and exit_mmap.  The race is explained below.
      
      Consider the following race:
      
      		CPU0				CPU1
      shrink_page_list()
        add_to_swap()
          split_huge_page_to_list()
            __split_huge_pmd_locked()
              pmdp_huge_clear_flush_notify()
      	// pmd_none() == true
      					exit_mmap()
      					  unmap_vmas()
      					    zap_pmd_range()
      					      // no action on pmd since pmd_none() == true
      	pmd_populate()
      
      As a result the THP will not be freed.  The leak is detected by check_mm():
      
      	BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
      
      The above required us to not mark the pmd none during a pmd split.
      
      The fix for ppc is to clear _PAGE_USER in the huge pte, so that the low
      level fault handling code skips this pte.  At the higher level we do take
      the ptl lock, which should serialize us against the pmd split.  Once the
      lock is acquired we check the pmd again using pmd_same.  That should
      always return false for us, and hence we retry the access.  We do the
      pmd_same check in all cases after taking the ptl with
      THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
      huge_pmd_set_accessed).
      
      Also make sure we wait for the irq-disabled sections on other cpus to
      finish before replacing a huge pte entry with a regular pmd entry.  Code
      paths like find_linux_pte_or_hugepte depend on irq disable to get
      a stable pte_t pointer.  A parallel thp split needs to make sure we
      don't convert a pmd pte to a regular pmd entry without waiting for the
      irq-disabled sections to finish.
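
      As a hedged sketch of the two parts of the fix (pmd_hugepage_update() and
      kick_all_cpus_sync() are the generic powerpc/kernel helpers; the exact
      hook the patch uses may differ):

          /*
           * Hide the huge pte from the low level hash fault path by
           * clearing _PAGE_USER before the split repopulates the pmd.
           */
          pmd_hugepage_update(mm, addr, pmdp, _PAGE_USER, 0);

          /*
           * Wait for walkers that rely on disabled irqs for a stable
           * pte_t pointer (e.g. find_linux_pte_or_hugepte) to leave
           * their critical sections before a huge pte can be replaced
           * by a regular pmd entry.
           */
          kick_all_cpus_sync();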
      
      Fixes: eef1b3ba ("thp: implement split_huge_pmd()")
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  6. 25 January 2016, 1 commit
  7. 16 January 2016, 3 commits
  8. 09 January 2016, 1 commit
  9. 27 December 2015, 1 commit
  10. 23 December 2015, 1 commit
  11. 19 December 2015, 1 commit
  12. 14 December 2015, 15 commits
  13. 06 November 2015, 1 commit
    • arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes · c118baf8
      Raghavendra K T authored
      With setup_nr_nodes() we have already initialized node_possible_map,
      so it is safe to use for_each_node here.

      There are many places in the kernel that use a hardcoded 'for' loop with
      nr_node_ids, because all other architectures have numa nodes populated
      serially.  That is presumably why we had maintained the same for
      powerpc.
      
      But, since sparse numa node ids are possible on powerpc, we unnecessarily
      allocate memory for non-existent numa nodes.

      For example, on a system with 0,1,16,17 as numa nodes, nr_node_ids=18 and
      we allocate memory for nodes 2-14.  With this patch we allocate memory
      only for existing numa nodes.
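
      The shape of the change, as a sketch (setup_node_data() stands in here
      for whichever per-node allocation the loop drives; it is illustrative,
      not a quote of the hunk):

          /* before: walks every id below nr_node_ids, even absent ones */
          for (nid = 0; nid < nr_node_ids; nid++)
                  setup_node_data(nid, start_pfn, end_pfn);

          /* after: only nodes actually set in node_possible_map */
          for_each_node(nid)
                  setup_node_data(nid, start_pfn, end_pfn);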
      
      The patch is boot tested on a 4-node Tuleta, confirming with printks
      that it works as expected.
      Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Greg Kurz <gkurz@linux.vnet.ibm.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 28 October 2015, 2 commits
    • powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry · e1f580e8
      Kevin Hao authored
      In order to work around Erratum A-008139, we have to invalidate the
      tlb entry with tlbilx before overwriting it.  For performance reasons
      we don't add any memory barrier when acquiring/releasing the tcd lock.
      This means the two load instructions for esel_next may return different
      values, which is definitely not acceptable due to Erratum A-008139.
      We have two options to fix this issue:
        a) Add a memory barrier when acquiring/releasing the tcd lock to order
           the load/store to esel_next.
        b) Just make sure to invalidate and write to the same tlb entry and
           tolerate the race that we may get the wrong value and overwrite
           the tlb entry just updated by the other thread.
      
      We observe better performance using option b, so reserve an additional
      register to save the value of esel_next.
      Signed-off-by: Kevin Hao <haokexin@gmail.com>
      Signed-off-by: Scott Wood <scottwood@freescale.com>
    • powerpc/fsl-booke-64: Don't limit ppc64_rma_size to one TLB entry · eba5de8d
      Scott Wood authored
      This is required for kdump to work when loaded at an address that
      does not fall within the first TLB entry -- which can easily happen
      because while the lower limit is enforced via reserved memory, which
      doesn't affect how much is mapped, the upper limit is enforced via a
      different mechanism that does.  Thus, more TLB entries are needed than
      would normally be used, as the total memory to be mapped might not be a
      power of two.
      Signed-off-by: Scott Wood <scottwood@freescale.com>
  15. 23 October 2015, 1 commit
    • powerpc/85xx: Load all early TLB entries at once · d9e1831a
      Scott Wood authored
      Use an AS=1 trampoline TLB entry to allow all normal TLB1 entries to
      be loaded at once.  This avoids the need to keep the translation that
      code is executing from in the same TLB entry in the final TLB
      configuration as during early boot, which in turn is helpful for
      relocatable kernels (e.g. kdump) where the kernel is not running from
      what would be the first TLB entry.
      
      On e6500, we limit map_mem_in_cams() to the primary hwthread of a
      core (the boot cpu is always considered primary, as a kdump kernel
      can be entered on any cpu).  Each TLB only needs to be set up once,
      and when we do, we don't want another thread to be running when we
      create a temporary trampoline TLB1 entry.
      Signed-off-by: Scott Wood <scottwood@freescale.com>
  16. 15 October 2015, 1 commit
  17. 12 October 2015, 1 commit
    • powerpc/mm: Differentiate between hugetlb and THP during page walk · 891121e6
      Aneesh Kumar K.V authored
      We need to properly identify whether a hugepage is an explicit or
      a transparent hugepage in follow_huge_addr().  We used to depend
      on the hugepage shift argument to do that, but in some cases that
      can give wrong results.  For example:

      On finding a transparent hugepage we set the hugepage shift to PMD_SHIFT.
      But we can end up clearing the thp pte via pmdp_huge_get_and_clear.
      We do prevent reusing the pfn page via the use of
      kick_all_cpus_sync(), but that happens after we have updated the pte
      to 0.  Hence in follow_huge_addr() we can find the hugepage shift set,
      but the transparent huge page check fails for a thp pte.
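
      A sketch of the resulting distinction during the walk (the extra is_thp
      flag is the point of the change; treat the exact signature and return
      values as approximate):

          bool is_thp;
          unsigned int shift;
          pte_t *ptep;

          ptep = __find_linux_pte_or_hugepte(mm->pgd, address, &is_thp, &shift);
          if (!ptep)
                  return ERR_PTR(-EINVAL);
          if (is_thp)
                  /* transparent hugepage: don't treat it as hugetlb here */
                  return ERR_PTR(-EINVAL);
          /* an explicit (hugetlbfs) mapping can now be handled via shift */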
      
      NOTE: We fixed a variant of this race against thp split in commit
      691e95fd
      ("powerpc/mm/thp: Make page table walk safe against thp split/collapse")
      
      Without this patch, we may hit the BUG_ON(flags & FOLL_GET) in
      follow_page_mask occasionally.
      
      In the long term, we may want to switch the ppc64 64K page size config
      to enable CONFIG_ARCH_WANT_GENERAL_HUGETLB.
      Reported-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>