1. 01 May, 2016 - 4 commits
    • powerpc/mm: Replace _PAGE_USER with _PAGE_PRIVILEGED · ac29c640
      Committed by Aneesh Kumar K.V
      _PAGE_PRIVILEGED means the page can be accessed only by the kernel. This
      is done to keep pte bits similar to PowerISA 3.0 Radix PTE format. User
      pages are now marked by clearing _PAGE_PRIVILEGED bit.
      
      Previously we allowed the kernel to have a privileged page in the lower
      address range (USER_REGION). With this patch such access is denied.
      
      We also prevent kernel access to a non-privileged page in the higher
      address range (i.e. REGION_ID != 0).
      
      Neither of the above access scenarios should ever happen.
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Jeremy Kerr <jk@ozlabs.org>
      Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
      Acked-by: Ian Munsie <imunsie@au1.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ac29c640
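
      To illustrate the rule described above, here is a minimal, self-contained
      C sketch; the bit position, types and *_demo helper names are invented
      stand-ins, not the kernel's actual book3s-64 definitions.

          #include <stdbool.h>
          #include <stdint.h>

          /* Illustrative bit position only, not the real book3s-64 layout. */
          #define PAGE_PRIVILEGED_DEMO (1UL << 3)   /* kernel-only page */

          typedef uint64_t pte_demo_t;

          /* User pages are marked by *clearing* the privileged bit. */
          static inline bool pte_user_demo(pte_demo_t pte)
          {
              return (pte & PAGE_PRIVILEGED_DEMO) == 0;
          }

          /*
           * The two rules from the change log: no privileged PTE in the
           * user region, and no non-privileged PTE in a kernel region
           * (REGION_ID != 0).
           */
          static inline bool pte_region_ok_demo(pte_demo_t pte, unsigned int region_id)
          {
              if (region_id == 0)               /* user address range */
                  return pte_user_demo(pte);
              return !pte_user_demo(pte);       /* kernel ranges expect privileged PTEs */
          }
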
    • powerpc/mm: Use _PAGE_READ to indicate Read access · c7d54842
      Committed by Aneesh Kumar K.V
      This splits the _PAGE_RW bit into _PAGE_READ and _PAGE_WRITE. It also
      removes the dependency on _PAGE_USER for implying read only. One thing
      to note is that read permission is implied by write and execute
      permission, hence we should always find _PAGE_READ set on a hash pte
      fault.
      
      We still can't switch PROT_NONE to !(_PAGE_RWX). Automatic NUMA
      balancing depends on marking a prot-none pte with _PAGE_WRITE. (For
      more details see commit b191f9b1 "mm: numa: preserve PTE write
      permissions across a NUMA hinting fault".)
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Jeremy Kerr <jk@ozlabs.org>
      Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
      Acked-by: Ian Munsie <imunsie@au1.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c7d54842
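
      The read/write split can be pictured with the following hedged C
      sketch; the bit values and *_demo helpers are invented for illustration
      and do not match the real kernel layout.

          #include <stdbool.h>
          #include <stdint.h>

          /* Illustrative bit values; the real kernel layout differs. */
          #define PAGE_READ_DEMO  (1UL << 0)
          #define PAGE_WRITE_DEMO (1UL << 1)
          #define PAGE_EXEC_DEMO  (1UL << 2)

          typedef uint64_t pte_demo_t;

          /* Write (and execute) permission implies read permission... */
          static inline pte_demo_t pte_mkwrite_demo(pte_demo_t pte)
          {
              return pte | PAGE_WRITE_DEMO | PAGE_READ_DEMO;
          }

          /*
           * ...so any PTE that is valid for access has the read bit set,
           * and a hash fault path can key off that bit for read access.
           */
          static inline bool access_ok_demo(pte_demo_t pte, bool write)
          {
              if (!(pte & PAGE_READ_DEMO))
                  return false;
              return !write || (pte & PAGE_WRITE_DEMO);
          }
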
    • powerpc/mm: Use big endian Linux page tables for book3s 64 · 5dc1ef85
      Committed by Aneesh Kumar K.V
      Traditionally Power server machines have used the Hashed Page Table MMU
      mode. In this mode Linux manages its own tree of nested page tables,
      aka. "the Linux page tables", which are not used by the hardware
      directly, and software loads translations into the hash page table for
      use by the hardware.
      
      Power ISA 3.0 defines a new MMU mode, known as Radix Tree Translation,
      where the hardware can directly operate on the Linux page tables.
      However the hardware requires that the page tables be in big endian
      format.
      
      To accommodate this, switch the pgtable types to __be64 and add
      appropriate endian conversions.
      
      Because we will be supporting a single kernel binary that boots using
      either radix or hash mode, we always store the Linux page tables big
      endian, even in hash mode where they are not actually used by the
      hardware.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [mpe: Fix sparse errors, flesh out change log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      5dc1ef85
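
      A rough sketch of "store big endian, convert at the accessors"; the
      *_demo types and helpers are stand-ins for the kernel's __be64
      handling, and the byte swap assumes a little-endian host, so this is
      illustrative only.

          #include <stdint.h>

          typedef uint64_t be64_demo;

          /* Simplified conversions; assume a little-endian host for the demo. */
          static inline uint64_t be64_to_cpu_demo(be64_demo v)  { return __builtin_bswap64(v); }
          static inline be64_demo cpu_to_be64_demo(uint64_t v)  { return __builtin_bswap64(v); }

          /* The stored Linux PTE is big endian; accessors convert at the boundary. */
          typedef struct { be64_demo pte; } pte_demo_t;

          static inline uint64_t pte_val_demo(pte_demo_t p)
          {
              return be64_to_cpu_demo(p.pte);
          }

          static inline pte_demo_t mk_pte_demo(uint64_t v)
          {
              return (pte_demo_t){ .pte = cpu_to_be64_demo(v) };
          }
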
    • powerpc/mm: Drop PTE_ATOMIC_UPDATES from pmd_hugepage_update() · 4bece39b
      Committed by Aneesh Kumar K.V
      pmd_hugepage_update() is inside #ifdef CONFIG_TRANSPARENT_HUGEPAGE. THP
      can only be enabled if PPC_BOOK3S_64=y && PPC_64K_PAGES=y, aka. hash64.
      
      On hash64 we always define PTE_ATOMIC_UPDATES to 1, meaning the #ifdef
      in pmd_hugepage_update() is unnecessary, so drop it.
      
      That is also the only use of PTE_ATOMIC_UPDATES in any of the hash code,
      meaning we no longer need to #define it at all in the hash headers.
      
      Note it's still #defined and used in the nohash code.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4bece39b
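
      Assuming PTE_ATOMIC_UPDATES is always 1 amounts to doing an
      unconditionally atomic read-modify-write of the pmd. The kernel does
      this with an ldarx/stdcx. loop; the sketch below models the same
      pattern with a C11 compare-exchange loop and invented names.

          #include <stdatomic.h>
          #include <stdint.h>

          /* Atomically clear 'clr' bits and set 'set' bits in the pmd word. */
          static uint64_t pmd_update_demo(_Atomic uint64_t *pmdp,
                                          uint64_t clr, uint64_t set)
          {
              uint64_t old = atomic_load(pmdp);

              /* 'old' is refreshed on failure; retry until the swap lands. */
              while (!atomic_compare_exchange_weak(pmdp, &old,
                                                   (old & ~clr) | set))
                  ;

              return old;   /* callers usually want the previous value */
          }
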
  2. 18 Mar, 2016 - 1 commit
    • mm: introduce page reference manipulation functions · fe896d18
      Committed by Joonsoo Kim
      The success of a CMA allocation largely depends on the success of
      migration, and a key factor in that is the page reference count. Until
      now, the page reference count has been manipulated by calling atomic
      functions directly, so we cannot track who manipulates it and where.
      That makes it hard to find the actual reason for a CMA allocation
      failure. CMA allocation should be guaranteed to succeed, so finding
      the offending place is really important.
      
      In this patch, call sites where the page reference count is
      manipulated are converted to the newly introduced wrapper functions.
      This is a preparation step for adding a tracepoint to each page
      reference manipulation function. With that facility, we can easily
      find the reason for a CMA allocation failure. There is no functional
      change in this patch.
      
      In addition, this patch also converts the reference read sites. That
      will help a second step that renames page._count to something else and
      prevents later attempts to access it directly (suggested by Andrew).
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fe896d18
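
      The wrapper idea can be sketched as below; the struct and *_demo
      helper names are stand-ins modelled on the style of the introduced
      functions, not their exact kernel definitions.

          #include <stdatomic.h>
          #include <stdbool.h>

          /* Simplified stand-in for struct page; only the refcount matters here. */
          struct page_demo {
              _Atomic int _count;
          };

          /*
           * Every manipulation of the reference count goes through a named
           * wrapper, which is where a tracepoint can later be hooked.
           */
          static inline int page_ref_count_demo(struct page_demo *page)
          {
              return atomic_load(&page->_count);
          }

          static inline void page_ref_inc_demo(struct page_demo *page)
          {
              atomic_fetch_add(&page->_count, 1);
              /* a trace hook for the increment would go here */
          }

          static inline bool page_ref_dec_and_test_demo(struct page_demo *page)
          {
              /* true when this call dropped the count to zero */
              return atomic_fetch_sub(&page->_count, 1) == 1;
          }
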
  3. 03 Mar, 2016 - 1 commit
  4. 29 Feb, 2016 - 1 commit
  5. 27 Feb, 2016 - 1 commit
    • powerpc/mm/book3s-64: Free up 7 high-order bits in the Linux PTE · f1a9ae03
      Committed by Paul Mackerras
      This frees up bits 57-63 in the Linux PTE on 64-bit Book 3S machines.
      In the 4k page case, this is done just by reducing the size of the
      RPN field to 39 bits, giving 51-bit real addresses.  In the 64k page
      case, we had 10 unused bits in the middle of the PTE, so this moves
      the RPN field down 10 bits to make use of those unused bits.  This
      means the RPN field is now 3 bits larger at 37 bits, giving 53-bit
      real addresses in the normal case, or 49-bit real addresses for the
      special 4k PFN case.
      
      We are doing this in order to be able to move some other PTE bits
      into the positions where PowerISA V3.0 processors will expect to
      find them in radix-tree mode.  Ultimately we will be able to move
      the RPN field to lower bit positions and make it larger.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f1a9ae03
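
      The address widths quoted above can be checked with a few lines of C;
      the macro names are hypothetical, and only the arithmetic (RPN bits
      plus page-offset bits) comes from the change log.

          /* Hypothetical macro names; the numbers are from the change log. */
          #define PAGE_SHIFT_4K_DEMO   12
          #define PAGE_SHIFT_64K_DEMO  16
          #define RPN_BITS_4K_DEMO     39   /* 4k case: RPN reduced to 39 bits  */
          #define RPN_BITS_64K_DEMO    37   /* 64k case: RPN grown by 3 to 37   */

          /* real address bits = RPN bits + page-offset bits */
          _Static_assert(RPN_BITS_4K_DEMO  + PAGE_SHIFT_4K_DEMO  == 51, "4k pages: 51-bit RA");
          _Static_assert(RPN_BITS_64K_DEMO + PAGE_SHIFT_64K_DEMO == 53, "64k pages: 53-bit RA");
          _Static_assert(RPN_BITS_64K_DEMO + PAGE_SHIFT_4K_DEMO  == 49, "4k PFN special case: 49-bit RA");
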
  6. 15 Feb, 2016 - 1 commit
    • powerpc/mm: Fix Multi hit ERAT cause by recent THP update · c777e2a8
      Committed by Aneesh Kumar K.V
      With ppc64 we use the deposited pgtable_t to store the hash pte slot
      information. We should not withdraw the deposited pgtable_t without
      marking the pmd none. This ensures that the low level hash fault
      handling will skip this huge pte and we will handle it at the upper
      levels.
      
      A recent change to pmd splitting changed the above in order to handle
      the race between pmd split and exit_mmap. The race is explained below.
      
      Consider the following race:
      
      		CPU0				CPU1
      shrink_page_list()
        add_to_swap()
          split_huge_page_to_list()
            __split_huge_pmd_locked()
              pmdp_huge_clear_flush_notify()
      	// pmd_none() == true
      					exit_mmap()
      					  unmap_vmas()
      					    zap_pmd_range()
      					      // no action on pmd since pmd_none() == true
      	pmd_populate()
      
      As a result the THP will not be freed. The leak is detected by
      check_mm():
      
      	BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
      
      The above required us to not mark the pmd none during a pmd split.
      
      The fix for ppc is to clear _PAGE_USER in the huge pte, so that the
      low level fault handling code skips this pte. At a higher level we do
      take the ptl lock, which should serialize us against the pmd split.
      Once the lock is acquired we check the pmd again using pmd_same. That
      should always return false for us, and hence we retry the access. We
      do the pmd_same check in all cases after taking the ptl with THP
      (do_huge_pmd_wp_page, do_huge_pmd_numa_page and huge_pmd_set_accessed).
      
      Also make sure we wait for the irq-disabled sections on other cpus to
      finish before replacing a huge pte entry with a regular pmd entry.
      Code paths like find_linux_pte_or_hugepte depend on irq disable to get
      a stable pte_t pointer. A parallel thp split needs to make sure we
      don't convert a huge pmd entry to a regular pmd entry without waiting
      for the irq-disabled sections to finish.
      
      Fixes: eef1b3ba ("thp: implement split_huge_pmd()")
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c777e2a8
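
      The "recheck pmd_same under the ptl and retry" pattern described above
      can be sketched as follows; the *_demo names are invented for
      illustration and the locking is reduced to comments.

          #include <stdbool.h>
          #include <stdint.h>

          typedef uint64_t pmd_demo_t;   /* simplified stand-in */

          static inline bool pmd_same_demo(pmd_demo_t a, pmd_demo_t b)
          {
              return a == b;
          }

          /*
           * After taking the page table lock, compare the pmd against the
           * value sampled before the lock was taken. If a parallel split
           * changed it (for example by clearing _PAGE_USER in the huge pte),
           * the values differ and the caller retries the whole access
           * instead of operating on a stale huge pmd.
           */
          static int handle_huge_fault_demo(pmd_demo_t *pmdp, pmd_demo_t orig_pmd)
          {
              /* spin_lock(ptl); */
              if (!pmd_same_demo(*pmdp, orig_pmd)) {
                  /* spin_unlock(ptl); */
                  return -1;   /* caller retries the fault from the top */
              }
              /* ... operate on the huge pmd while holding the lock ... */
              /* spin_unlock(ptl); */
              return 0;
          }
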
  7. 16 Jan, 2016 - 1 commit
  8. 14 Dec, 2015 - 4 commits
  9. 08 Aug, 2015 - 1 commit
  10. 25 Jun, 2015 - 3 commits
  11. 12 May, 2015 - 1 commit
    • powerpc/thp: Serialize pmd clear against a linux page table walk. · 13bd817b
      Committed by Aneesh Kumar K.V
      Serialize against find_linux_pte_or_hugepte(), which does a lock-less
      lookup in the page tables with local interrupts disabled. For huge
      pages it casts pmd_t to pte_t. Since the format of pte_t is different
      from pmd_t, we want to prevent a transition from a pmd pointing to a
      page table to a pmd pointing to a huge page (and back) while
      interrupts are disabled. We clear the pmd to possibly replace it with
      a page table pointer in different code paths, so make sure we wait for
      the parallel find_linux_pte_or_hugepte() to finish.
      
      Without this patch, a find_linux_pte_or_hugepte() running in parallel
      to __split_huge_zero_page_pmd() or do_huge_pmd_wp_page_fallback() or
      zap_huge_pmd() can run into the above issue. With
      __split_huge_zero_page_pmd() and do_huge_pmd_wp_page_fallback() we
      clear the hugepage pte before inserting the pmd entry with a regular
      pgtable address. Such a clear needs to wait for the parallel
      find_linux_pte_or_hugepte() to finish.
      
      With zap_huge_pmd(), we can run into issues with a hugepage pte
      getting zapped due to MADV_DONTNEED while another cpu faults it in as
      small pages.
      Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      13bd817b
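
      The ordering this enforces can be sketched as below; the names are
      invented, and the wait is a placeholder for the kernel's IPI-based
      synchronization (something like kick_all_cpus_sync()), not the actual
      implementation.

          #include <stdint.h>

          typedef uint64_t pmd_demo_t;

          /*
           * Placeholder: in the kernel this returns only after every other
           * CPU has passed through an interrupt-enabled point, i.e. after
           * any irq-disabled find_linux_pte_or_hugepte() walk currently in
           * flight has finished.
           */
          static void wait_for_lockless_walkers_demo(void) { }

          static pmd_demo_t pmd_clear_serialized_demo(pmd_demo_t *pmdp)
          {
              pmd_demo_t old = *pmdp;

              *pmdp = 0;                          /* clear the huge pmd first     */
              wait_for_lockless_walkers_demo();   /* wait before reusing the slot */
              return old;
          }
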
  12. 10 Apr, 2015 - 2 commits
  13. 17 Feb, 2015 - 1 commit
  14. 13 Feb, 2015 - 1 commit
  15. 05 Dec, 2014 - 1 commit
    • powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault · aefa5688
      Committed by Aneesh Kumar K.V
      updatepp can get called for a no-HPTE fault when we find from the
      linux page table that the translation was hashed before. In that case
      we are sure that there is no existing translation, hence we could
      avoid doing tlbie.
      
      We could possibly race with a parallel fault filling the TLB. But
      that should be ok because updatepp is only ever relaxing permissions.
      We also look at the linux pte permission bits when filling hash pte
      permission bits, and we hold the linux pte busy bits while
      inserting/updating a hashpte entry, hence a parallel update of the
      linux pte is not possible. On the other hand mprotect involves
      ptep_modify_prot_start, which causes an hpte invalidate and not
      updatepp.
      
      Performance numbers, using random_access_bench written by Anton.
      
      Kernel with THP disabled and a smaller hash page table size:
      
          86.60%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_updatepp
           2.10%  random_access_b  random_access_bench              [.] doit
           1.99%  random_access_b  [kernel.kallsyms]                [k] .do_raw_spin_lock
           1.85%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           1.26%  random_access_b  [kernel.kallsyms]                [k] .native_flush_hash_range
           1.18%  random_access_b  [kernel.kallsyms]                [k] .__delay
           0.69%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           0.37%  random_access_b  [kernel.kallsyms]                [k] .clear_user_page
           0.34%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           0.32%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           0.30%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
      
      With Fix:
      
          27.54%  random_access_b  random_access_bench              [.] doit
          22.90%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           5.76%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           5.20%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           5.12%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           4.80%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
           3.31%  random_access_b  [kernel.kallsyms]                [k] data_access_common
           1.84%  random_access_b  [kernel.kallsyms]                [k] .trace_hardirqs_on_caller
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      aefa5688
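
      A hypothetical sketch of the idea: the fault path knows from the Linux
      PTE whether a hash PTE could already exist and passes that down so the
      permission update can skip the tlbie. The flag name, struct and
      signature below are invented for illustration and do not match the
      actual powerpc code.

          #include <stdbool.h>

          struct update_flags_demo {
              bool no_hpte_fault;   /* Linux PTE says the page was never hashed in */
          };

          static void hpte_updatepp_demo(unsigned long slot, unsigned long newpp,
                                         struct update_flags_demo flags)
          {
              (void)slot;
              (void)newpp;
              /* ... update the permission bits in the hash PTE ... */

              if (!flags.no_hpte_fault) {
                  /* issue tlbie only when a stale translation may exist */
              }
              /* Racing TLB fills are fine: we are only ever relaxing permissions. */
          }
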
  16. 02 Dec, 2014 - 2 commits
  17. 14 Nov, 2014 - 1 commit
  18. 10 Nov, 2014 - 3 commits
  19. 13 Aug, 2014 - 3 commits
  20. 05 Aug, 2014 - 1 commit
  21. 24 Mar, 2014 - 1 commit
  22. 17 Feb, 2014 - 1 commit
  23. 15 Jan, 2014 - 1 commit
  24. 10 Jan, 2014 - 1 commit
    • powerpc: add barrier after writing kernel PTE · 47ce8af4
      Committed by Scott Wood
      There is no barrier between something like ioremap() writing to
      a PTE, and returning the value to a caller that may then store the
      pointer in a place that is visible to other CPUs.  Such callers
      generally don't perform barriers of their own.
      
      Even if callers of ioremap() and similar things did use barriers,
      the most logical choice would be smp_wmb(), which is not
      architecturally sufficient when BookE hardware tablewalk is used.  A
      full sync is specified by the architecture.
      
      For userspace mappings, OTOH, we generally already have an lwsync due
      to locking, and if we occasionally take a spurious fault due to not
      having a full sync with hardware tablewalk, it will not be fatal
      because we will retry rather than oops.
      Signed-off-by: Scott Wood <scottwood@freescale.com>
      47ce8af4
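
      A sketch of the ordering requirement with invented names; the C11
      fence stands in for the full sync the commit says the architecture
      requires, and this is not the actual patch.

          #include <stdatomic.h>
          #include <stdint.h>

          typedef uint64_t pte_demo_t;

          /*
           * Write the kernel PTE, then order it before any later store that
           * publishes the mapped pointer to other CPUs (or to hardware
           * tablewalk). On powerpc a full sync is required.
           */
          static void set_kernel_pte_demo(pte_demo_t *ptep, pte_demo_t pte)
          {
              *ptep = pte;
              atomic_thread_fence(memory_order_seq_cst);   /* stands in for "sync" */
          }
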
  25. 09 Dec, 2013 - 1 commit
  26. 15 Nov, 2013 - 1 commit