1. 15 Feb, 2016, 1 commit
    • powerpc/mm: Fix Multi hit ERAT cause by recent THP update · c777e2a8
      Authored by Aneesh Kumar K.V
      With ppc64 we use the deposited pgtable_t to store the hash pte slot
      information. We should not withdraw the deposited pgtable_t without
      marking the pmd none. This ensures that low level hash fault handling
      will skip this huge pte and we will handle it at upper levels.
      
      A recent change to pmd splitting changed the above in order to handle the
      race between pmd split and exit_mmap. The race is explained below.
      
      Consider following race:
      
      		CPU0				CPU1
      shrink_page_list()
        add_to_swap()
          split_huge_page_to_list()
            __split_huge_pmd_locked()
              pmdp_huge_clear_flush_notify()
      	// pmd_none() == true
      					exit_mmap()
      					  unmap_vmas()
      					    zap_pmd_range()
      					      // no action on pmd since pmd_none() == true
      	pmd_populate()
      
      As a result the THP will not be freed. The leak is detected by check_mm():
      
      	BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
      
      The above change required us not to mark the pmd none during a pmd split.
      
      The fix for ppc is to clear the huge pte of _PAGE_USER, so that the low
      level fault handling code skips this pte. At a higher level we do take
      the ptl lock. That should serialize us against the pmd split. Once the
      lock is acquired we check the pmd again using pmd_same. That should
      always return false for us and hence we should retry the access. We do
      the pmd_same check in all cases after taking the ptl with
      THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
      huge_pmd_set_accessed).
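
      The recheck-and-retry pattern looks roughly like this (a sketch only,
      with simplified surroundings taken from the fault context; the real
      code lives in the THP fault handlers in mm/huge_memory.c):

          spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);
          if (unlikely(!pmd_same(*pmd, orig_pmd))) {
                  /* The pmd changed under us, e.g. a parallel split
                   * completed. Drop the lock and let the fault be retried. */
                  spin_unlock(ptl);
                  return 0;
          }
          /* pmd is stable here: safe to service the huge-pmd fault. */
          spin_unlock(ptl);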
      
      Also make sure we wait for the irq-disabled sections on other cpus to
      finish before replacing a huge pte entry with a regular pmd entry. Code
      paths like find_linux_pte_or_hugepte depend on irq disable to get
      a stable pte_t pointer. A parallel thp split needs to make sure we
      don't convert a huge pmd entry to a regular pmd entry without waiting
      for the irq-disabled sections to finish.
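
      Roughly, the lockless walkers and the wait look like this (a sketch;
      kick_all_cpus_sync() is one way to wait out irq-disabled sections and
      is not necessarily the exact mechanism used by this commit):

          unsigned long flags;

          /* Reader side (e.g. callers of find_linux_pte_or_hugepte):
           * the returned pte_t * is only stable while irqs stay disabled. */
          local_irq_save(flags);
          /* ... walk the page tables and use the pte ... */
          local_irq_restore(flags);

          /* Writer side: after downgrading the huge pte to a regular pmd,
           * broadcast an IPI and wait, so every irq-disabled walker that
           * could have seen the old entry has finished. */
          kick_all_cpus_sync();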
      
      Fixes: eef1b3ba ("thp: implement split_huge_pmd()")
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  2. 25 Jan, 2016, 1 commit
  3. 16 Jan, 2016, 3 commits
  4. 09 Jan, 2016, 1 commit
  5. 27 Dec, 2015, 1 commit
  6. 23 Dec, 2015, 1 commit
  7. 19 Dec, 2015, 1 commit
  8. 14 Dec, 2015, 15 commits
  9. 06 Nov, 2015, 1 commit
    • arch/powerpc/mm/numa.c: do not allocate bootmem memory for non existing nodes · c118baf8
      Authored by Raghavendra K T
      With setup_nr_nodes(), we have already initialized node_possible_map,
      so it is safe to use for_each_node here.
      
      There are many places in the kernel that use a hardcoded 'for' loop with
      nr_node_ids, because all other architectures have numa nodes populated
      serially. That is presumably why we maintained the same approach for
      powerpc.
      
      But since sparse numa node ids are possible on powerpc, we unnecessarily
      allocate memory for non existent numa nodes.
      
      For example, on a system with 0,1,16,17 as numa nodes, nr_node_ids=18 and
      we allocate memory for nodes 2-14. With this patch we allocate memory
      only for existing numa nodes.
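
      The change amounts to iterating node_possible_map rather than every
      index below nr_node_ids; a sketch (setup_node(), the per-node
      allocation, is an illustrative placeholder):

          int nid;

          /* Before: walks ids 0..nr_node_ids-1, including nodes that do
           * not exist on a sparse layout such as 0,1,16,17. */
          for (nid = 0; nid < nr_node_ids; nid++)
                  setup_node(nid);

          /* After: only the nodes actually set in node_possible_map. */
          for_each_node(nid)
                  setup_node(nid);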
      
      The patch is boot-tested on a 4-node tuleta, confirming with printks
      that it works as expected.
      Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Greg Kurz <gkurz@linux.vnet.ibm.com>
      Cc: Grant Likely <grant.likely@linaro.org>
      Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 28 Oct, 2015, 2 commits
    • powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry · e1f580e8
      Authored by Kevin Hao
      In order to work around Erratum A-008139, we have to invalidate the
      tlb entry with tlbilx before overwriting it. For performance reasons,
      we don't add any memory barrier when acquiring/releasing the tcd lock.
      This means the two load instructions for esel_next may return different
      values. That is not acceptable given Erratum A-008139. We have two
      options to fix this issue:
        a) Add a memory barrier when acquiring/releasing the tcd lock to
           order the load/store to esel_next.
        b) Just make sure we invalidate and write to the same tlb entry and
           tolerate the race that we may get the wrong value and overwrite
           the tlb entry just updated by the other thread.

      We observe better performance using option b, so reserve an additional
      register to save the value of esel_next.
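
      Expressed in C rather than the actual assembly, option b) amounts to
      reading esel_next once and reusing that value for both the invalidate
      and the write (a sketch; the structure and helper names are
      illustrative):

          unsigned int esel = tcd->esel_next;  /* single read, kept in a register */

          tlb_invalidate_entry(esel);  /* tlbilx on the entry about to be used */
          tlb_write_entry(esel);       /* overwrite exactly the same entry */

          /* The unsynchronized esel_next race is tolerated: at worst we
           * clobber an entry the other thread just wrote, which is correct
           * but slightly wasteful. */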
      Signed-off-by: Kevin Hao <haokexin@gmail.com>
      Signed-off-by: Scott Wood <scottwood@freescale.com>
    • powerpc/fsl-booke-64: Don't limit ppc64_rma_size to one TLB entry · eba5de8d
      Authored by Scott Wood
      This is required for kdump to work when loaded at an address that
      does not fall within the first TLB entry -- which can easily happen
      because, while the lower limit is enforced via reserved memory (which
      doesn't affect how much is mapped), the upper limit is enforced via a
      different mechanism that does.  Thus, more TLB entries are needed than
      would normally be used, as the total memory to be mapped might not be a
      power of two.
      Signed-off-by: Scott Wood <scottwood@freescale.com>
  11. 23 Oct, 2015, 1 commit
    • powerpc/85xx: Load all early TLB entries at once · d9e1831a
      Authored by Scott Wood
      Use an AS=1 trampoline TLB entry to allow all normal TLB1 entries to
      be loaded at once.  This avoids the need to keep the translation that
      code is executing from in the same TLB entry in the final TLB
      configuration as during early boot, which in turn is helpful for
      relocatable kernels (e.g. kdump) where the kernel is not running from
      what would be the first TLB entry.
      
      On e6500, we limit map_mem_in_cams() to the primary hwthread of a
      core (the boot cpu is always considered primary, as a kdump kernel
      can be entered on any cpu).  Each TLB only needs to be set up once,
      and when we do, we don't want another thread to be running when we
      create a temporary trampoline TLB1 entry.
      Signed-off-by: Scott Wood <scottwood@freescale.com>
  12. 15 Oct, 2015, 1 commit
  13. 12 Oct, 2015, 2 commits
  14. 09 Oct, 2015, 1 commit
    • powerpc: Fix checkstop in native_hpte_clear() with lockdep · fdf880a6
      Authored by Cyril Bur
      native_hpte_clear() is called in real mode from two places:
      - Early in boot during htab initialisation if firmware assisted dump is
        active.
      - Late in the kexec path.
      
      In both contexts there is no need to disable interrupts, as they are
      already disabled. Furthermore, locking around the tlbie() is only
      required for pre-POWER5 hardware.
      
      On POWER5 or newer hardware concurrent tlbie()s work as expected, while
      on pre-POWER5 hardware they could result in deadlock. That code path is
      only executed at crashdump time, during which all bets are off:
      concurrent tlbie()s are unlikely and taking locks is unsafe, so the best
      course of action is to simply do nothing. Concurrent tlbie()s are not
      possible in the first case as secondary CPUs have not come up yet.
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
  15. 01 Oct, 2015, 2 commits
  16. 16 Sep, 2015, 1 commit
  17. 28 Aug, 2015, 1 commit
    • mm: ZONE_DEVICE for "device memory" · 033fbae9
      Authored by Dan Williams
      While pmem is usable as a block device or via DAX mappings to userspace,
      there are several usage scenarios that cannot target pmem due to its
      lack of struct page coverage. In preparation for "hot plugging" pmem
      into the vmemmap, add ZONE_DEVICE as a new zone to tag these pages
      separately from the ones that are subject to standard page allocations.
      Importantly, "device memory" can be removed at will by userspace
      unbinding the driver of the device.

      Having a separate zone prevents allocation and otherwise marks these
      pages as distinct from typical uniform memory.  Device memory has
      different lifetime and performance characteristics than RAM.  However,
      since we have run out of ZONES_SHIFT bits, this functionality currently
      depends on sacrificing ZONE_DMA.
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Jerome Glisse <j.glisse@gmail.com>
      [hch: various simplifications in the arch interface]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  18. 18 Aug, 2015, 4 commits
    • powerpc/numa: initialize distance lookup table from drconf path · 1d805440
      Authored by Nikunj A Dadhania
      In some situations, a NUMA guest that supports the
      ibm,dynamic-memory-reconfiguration node will end up having flat NUMA
      distances between nodes. This is because of two problems in the
      current code.
      
      1) Different representations of associativity lists.
      
         There is an assumption about the associativity list in
         initialize_distance_lookup_table(). The associativity list has two
         forms:

         a) [cpu,memory]@x/ibm,associativity has the following
            format:
                 <N> <N integers>
      
         b) ibm,dynamic-reconfiguration-memory/ibm,associativity-lookup-arrays
      
                 <M> <N> <M associativity lists each having N integers>
                 M = the number of associativity lists
                 N = the number of entries per associativity list
      
         Fix initialize_distance_lookup_table() so that it does not assume
         case a), and update the caller to skip the length field before
         passing the associativity list.
      
      2) Distance table not getting updated from drconf path.
      
         The node distance table will not get initialized in certain cases
         as the ibm,dynamic-reconfiguration-memory path does not initialize
         the lookup table.

         Call initialize_distance_lookup_table() from the drconf path with
         the appropriate associativity list.
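
      As an illustration of the two layouts above (the values are made up;
      only the shape matters):

          /* Form a): ibm,associativity = <N> <N entries> */
          u32 assoc[]        = { 4,            /* N */
                                 1, 2, 3, 0 }; /* one list of N entries */

          /* Form b): ibm,associativity-lookup-arrays =
           *          <M> <N> <M lists of N entries> */
          u32 lookup_array[] = { 2, 4,         /* M, N */
                                 1, 2, 3, 0,   /* list 0 */
                                 1, 2, 3, 1 }; /* list 1 */

          /* initialize_distance_lookup_table() wants a single list of N
           * entries, so the drconf caller must skip the length fields first. */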
      Reported-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
      Acked-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    • powerpc/mm: Drop CONFIG_PPC_HAS_HASH_64K · 73b341ef
      Authored by Michael Ellerman
      The relation between CONFIG_PPC_HAS_HASH_64K and CONFIG_PPC_64K_PAGES is
      painfully complicated.
      
      But if we rearrange it enough we can see that PPC_HAS_HASH_64K
      essentially depends on PPC_STD_MMU_64 && PPC_64K_PAGES.
      
      We can then notice that PPC_HAS_HASH_64K is used in files that are only
      built for PPC_STD_MMU_64, meaning it's equivalent to PPC_64K_PAGES.
      
      So replace all uses and drop it.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
    • powerpc/cell: Drop support for 64K local store on 4K kernels · f444f1f8
      Authored by Michael Ellerman
      Back in the olden days we added support for using 64K pages to map the
      SPU (Synergistic Processing Unit) local store on Cell, when the main
      kernel was using 4K pages.
      
      This was useful at the time because distros were using 4K pages, but
      using 64K pages on the SPUs could reduce TLB pressure there.
      
      However these days the number of Cell users is approaching zero, and
      supporting this option adds unpleasant complexity to the memory
      management code.
      
      So drop the option, CONFIG_SPU_FS_64K_LS, and all related code.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Jeremy Kerr <jk@ozlabs.org>
    • powerpc/e6500: hw tablewalk: optimize a bit for tcd lock acquiring codes · 69399ee9
      Authored by Kevin Hao
      It makes no sense to put the instructions for calculating the lock
      value (cpu number + 1) and for clearing the eq bit of cr1 inside the
      lbarx/stbcx loop. Also, when the lock is held by the other thread, the
      current lock value can never equal the lock value used by the current
      cpu, so we can skip comparing these two lock values in the lbz/bne
      loop.
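
      In C terms (the real code is the lbarx/stbcx. sequence in the e6500 TLB
      miss handler), the optimization hoists the constant work out of the
      retry loop; a sketch with illustrative names:

          unsigned long token = smp_processor_id() + 1;  /* cpu number + 1, computed once */

          for (;;) {
                  /* Spin while held: the holder's token can never equal
                   * ours, so a simple non-zero test replaces the old
                   * compare-against-token in the spin loop. */
                  while (READ_ONCE(tcd->lock))
                          cpu_relax();
                  if (cmpxchg(&tcd->lock, 0, token) == 0)
                          break;  /* acquired */
          }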
      Signed-off-by: Kevin Hao <haokexin@gmail.com>
      Signed-off-by: Scott Wood <scottwood@freescale.com>