1. 01 May, 2016 - 4 commits
    • powerpc/mm: Replace _PAGE_USER with _PAGE_PRIVILEGED · ac29c640
      Committed by Aneesh Kumar K.V
      _PAGE_PRIVILEGED means the page can be accessed only by the kernel. This
      is done to keep pte bits similar to PowerISA 3.0 Radix PTE format. User
      pages are now marked by clearing _PAGE_PRIVILEGED bit.
      
      Previously we allowed the kernel to have a privileged page in the lower
      address range (USER_REGION). With this patch such access is denied.
      
      We also prevent kernel access to a non-privileged page in the higher
      address range (i.e. REGION_ID != 0).
      
      Neither of the above access scenarios should ever happen.
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Jeremy Kerr <jk@ozlabs.org>
      Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
      Acked-by: Ian Munsie <imunsie@au1.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ac29c640
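
      To illustrate the rule described above, here is a minimal, self-contained
      C sketch; the bit position, types and *_demo helper names are invented
      stand-ins, not the kernel's actual book3s-64 definitions.

          #include <stdbool.h>
          #include <stdint.h>

          /* Illustrative bit position only, not the real book3s-64 layout. */
          #define PAGE_PRIVILEGED_DEMO (1UL << 3)   /* kernel-only page */

          typedef uint64_t pte_demo_t;

          /* User pages are marked by *clearing* the privileged bit. */
          static inline bool pte_user_demo(pte_demo_t pte)
          {
              return (pte & PAGE_PRIVILEGED_DEMO) == 0;
          }

          /*
           * The two rules from the change log: no privileged PTE in the
           * user region, and no non-privileged PTE in a kernel region
           * (REGION_ID != 0).
           */
          static inline bool pte_region_ok_demo(pte_demo_t pte, unsigned int region_id)
          {
              if (region_id == 0)               /* user address range */
                  return pte_user_demo(pte);
              return !pte_user_demo(pte);       /* kernel ranges expect privileged PTEs */
          }
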
    • powerpc/mm: Use _PAGE_READ to indicate Read access · c7d54842
      Committed by Aneesh Kumar K.V
      This splits the _PAGE_RW bit into _PAGE_READ and _PAGE_WRITE. It also
      removes the dependency on _PAGE_USER for implying read only. One thing
      to note is that read permission is implied by write and execute
      permission, hence we should always find _PAGE_READ set on a hash pte
      fault.
      
      We still can't switch PROT_NONE to !(_PAGE_RWX). Automatic NUMA
      balancing depends on marking a prot-none pte with _PAGE_WRITE. (For
      more details see commit b191f9b1 "mm: numa: preserve PTE write
      permissions across a NUMA hinting fault".)
      
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Jeremy Kerr <jk@ozlabs.org>
      Cc: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
      Acked-by: Ian Munsie <imunsie@au1.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c7d54842
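
      The read/write split can be pictured with the following hedged C
      sketch; the bit values and *_demo helpers are invented for illustration
      and do not match the real kernel layout.

          #include <stdbool.h>
          #include <stdint.h>

          /* Illustrative bit values; the real kernel layout differs. */
          #define PAGE_READ_DEMO  (1UL << 0)
          #define PAGE_WRITE_DEMO (1UL << 1)
          #define PAGE_EXEC_DEMO  (1UL << 2)

          typedef uint64_t pte_demo_t;

          /* Write (and execute) permission implies read permission... */
          static inline pte_demo_t pte_mkwrite_demo(pte_demo_t pte)
          {
              return pte | PAGE_WRITE_DEMO | PAGE_READ_DEMO;
          }

          /*
           * ...so any PTE that is valid for access has the read bit set,
           * and a hash fault path can key off that bit for read access.
           */
          static inline bool access_ok_demo(pte_demo_t pte, bool write)
          {
              if (!(pte & PAGE_READ_DEMO))
                  return false;
              return !write || (pte & PAGE_WRITE_DEMO);
          }
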
    • powerpc/mm: Use big endian Linux page tables for book3s 64 · 5dc1ef85
      Committed by Aneesh Kumar K.V
      Traditionally Power server machines have used the Hashed Page Table MMU
      mode. In this mode Linux manages its own tree of nested page tables,
      aka. "the Linux page tables", which are not used by the hardware
      directly, and software loads translations into the hash page table for
      use by the hardware.
      
      Power ISA 3.0 defines a new MMU mode, known as Radix Tree Translation,
      where the hardware can directly operate on the Linux page tables.
      However the hardware requires that the page tables be in big endian
      format.
      
      To accommodate this, switch the pgtable types to __be64 and add
      appropriate endian conversions.
      
      Because we will be supporting a single kernel binary that boots using
      either radix or hash mode, we always store the Linux page tables big
      endian, even in hash mode where they are not actually used by the
      hardware.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [mpe: Fix sparse errors, flesh out change log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      5dc1ef85
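
      A rough sketch of "store big endian, convert at the accessors"; the
      *_demo types and helpers are stand-ins for the kernel's __be64
      handling, and the byte swap assumes a little-endian host, so this is
      illustrative only.

          #include <stdint.h>

          typedef uint64_t be64_demo;

          /* Simplified conversions; assume a little-endian host for the demo. */
          static inline uint64_t be64_to_cpu_demo(be64_demo v)  { return __builtin_bswap64(v); }
          static inline be64_demo cpu_to_be64_demo(uint64_t v)  { return __builtin_bswap64(v); }

          /* The stored Linux PTE is big endian; accessors convert at the boundary. */
          typedef struct { be64_demo pte; } pte_demo_t;

          static inline uint64_t pte_val_demo(pte_demo_t p)
          {
              return be64_to_cpu_demo(p.pte);
          }

          static inline pte_demo_t mk_pte_demo(uint64_t v)
          {
              return (pte_demo_t){ .pte = cpu_to_be64_demo(v) };
          }
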
    • powerpc/mm: Drop PTE_ATOMIC_UPDATES from pmd_hugepage_update() · 4bece39b
      Committed by Aneesh Kumar K.V
      pmd_hugepage_update() is inside #ifdef CONFIG_TRANSPARENT_HUGEPAGE. THP
      can only be enabled if PPC_BOOK3S_64=y && PPC_64K_PAGES=y, aka. hash64.
      
      On hash64 we always define PTE_ATOMIC_UPDATES to 1, meaning the #ifdef
      in pmd_hugepage_update() is unnecessary, so drop it.
      
      That is also the only use of PTE_ATOMIC_UPDATES in any of the hash code,
      meaning we no longer need to #define it at all in the hash headers.
      
      Note it's still #defined and used in the nohash code.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      4bece39b
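
      Assuming PTE_ATOMIC_UPDATES is always 1 amounts to doing an
      unconditionally atomic read-modify-write of the pmd. The kernel does
      this with an ldarx/stdcx. loop; the sketch below models the same
      pattern with a C11 compare-exchange loop and invented names.

          #include <stdatomic.h>
          #include <stdint.h>

          /* Atomically clear 'clr' bits and set 'set' bits in the pmd word. */
          static uint64_t pmd_update_demo(_Atomic uint64_t *pmdp,
                                          uint64_t clr, uint64_t set)
          {
              uint64_t old = atomic_load(pmdp);

              /* 'old' is refreshed on failure; retry until the swap lands. */
              while (!atomic_compare_exchange_weak(pmdp, &old,
                                                   (old & ~clr) | set))
                  ;

              return old;   /* callers usually want the previous value */
          }
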
  2. 18 Mar, 2016 - 1 commit
    • mm: introduce page reference manipulation functions · fe896d18
      Committed by Joonsoo Kim
      The success of a CMA allocation largely depends on the success of
      migration, and a key factor in that is the page reference count. Until
      now, the page reference count has been manipulated by calling atomic
      functions directly, so we cannot track who manipulates it and where.
      That makes it hard to find the actual reason for a CMA allocation
      failure. CMA allocation should be guaranteed to succeed, so finding
      the offending place is really important.
      
      In this patch, call sites where the page reference count is
      manipulated are converted to the newly introduced wrapper functions.
      This is a preparation step for adding a tracepoint to each page
      reference manipulation function. With that facility, we can easily
      find the reason for a CMA allocation failure. There is no functional
      change in this patch.
      
      In addition, this patch also converts the reference read sites. That
      will help a second step that renames page._count to something else and
      prevents later attempts to access it directly (suggested by Andrew).
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fe896d18
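
      The wrapper idea can be sketched as below; the struct and *_demo
      helper names are stand-ins modelled on the style of the introduced
      functions, not their exact kernel definitions.

          #include <stdatomic.h>
          #include <stdbool.h>

          /* Simplified stand-in for struct page; only the refcount matters here. */
          struct page_demo {
              _Atomic int _count;
          };

          /*
           * Every manipulation of the reference count goes through a named
           * wrapper, which is where a tracepoint can later be hooked.
           */
          static inline int page_ref_count_demo(struct page_demo *page)
          {
              return atomic_load(&page->_count);
          }

          static inline void page_ref_inc_demo(struct page_demo *page)
          {
              atomic_fetch_add(&page->_count, 1);
              /* a trace hook for the increment would go here */
          }

          static inline bool page_ref_dec_and_test_demo(struct page_demo *page)
          {
              /* true when this call dropped the count to zero */
              return atomic_fetch_sub(&page->_count, 1) == 1;
          }
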
  3. 03 Mar, 2016 - 1 commit
  4. 29 Feb, 2016 - 1 commit
  5. 27 Feb, 2016 - 1 commit
    • powerpc/mm/book3s-64: Free up 7 high-order bits in the Linux PTE · f1a9ae03
      Committed by Paul Mackerras
      This frees up bits 57-63 in the Linux PTE on 64-bit Book 3S machines.
      In the 4k page case, this is done just by reducing the size of the
      RPN field to 39 bits, giving 51-bit real addresses.  In the 64k page
      case, we had 10 unused bits in the middle of the PTE, so this moves
      the RPN field down 10 bits to make use of those unused bits.  This
      means the RPN field is now 3 bits larger at 37 bits, giving 53-bit
      real addresses in the normal case, or 49-bit real addresses for the
      special 4k PFN case.
      
      We are doing this in order to be able to move some other PTE bits
      into the positions where PowerISA V3.0 processors will expect to
      find them in radix-tree mode.  Ultimately we will be able to move
      the RPN field to lower bit positions and make it larger.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f1a9ae03
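
      The address widths quoted above can be checked with a few lines of C;
      the macro names are hypothetical, and only the arithmetic (RPN bits
      plus page-offset bits) comes from the change log.

          /* Hypothetical macro names; the numbers are from the change log. */
          #define PAGE_SHIFT_4K_DEMO   12
          #define PAGE_SHIFT_64K_DEMO  16
          #define RPN_BITS_4K_DEMO     39   /* 4k case: RPN reduced to 39 bits  */
          #define RPN_BITS_64K_DEMO    37   /* 64k case: RPN grown by 3 to 37   */

          /* real address bits = RPN bits + page-offset bits */
          _Static_assert(RPN_BITS_4K_DEMO  + PAGE_SHIFT_4K_DEMO  == 51, "4k pages: 51-bit RA");
          _Static_assert(RPN_BITS_64K_DEMO + PAGE_SHIFT_64K_DEMO == 53, "64k pages: 53-bit RA");
          _Static_assert(RPN_BITS_64K_DEMO + PAGE_SHIFT_4K_DEMO  == 49, "4k PFN special case: 49-bit RA");
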
  6. 15 Feb, 2016 - 1 commit
    • powerpc/mm: Fix Multi hit ERAT cause by recent THP update · c777e2a8
      Committed by Aneesh Kumar K.V
      With ppc64 we use the deposited pgtable_t to store the hash pte slot
      information. We should not withdraw the deposited pgtable_t without
      marking the pmd none. This ensures that the low level hash fault
      handling will skip this huge pte and we will handle it at the upper
      levels.
      
      A recent change to pmd splitting changed the above in order to handle
      the race between pmd split and exit_mmap. The race is explained below.
      
      Consider the following race:
      
      		CPU0				CPU1
      shrink_page_list()
        add_to_swap()
          split_huge_page_to_list()
            __split_huge_pmd_locked()
              pmdp_huge_clear_flush_notify()
      	// pmd_none() == true
      					exit_mmap()
      					  unmap_vmas()
      					    zap_pmd_range()
      					      // no action on pmd since pmd_none() == true
      	pmd_populate()
      
      As a result the THP will not be freed. The leak is detected by
      check_mm():
      
      	BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
      
      The above required us to not mark the pmd none during a pmd split.
      
      The fix for ppc is to clear _PAGE_USER in the huge pte, so that the
      low level fault handling code skips this pte. At a higher level we do
      take the ptl lock, which should serialize us against the pmd split.
      Once the lock is acquired we check the pmd again using pmd_same. That
      should always return false for us, and hence we retry the access. We
      do the pmd_same check in all cases after taking the ptl with THP
      (do_huge_pmd_wp_page, do_huge_pmd_numa_page and huge_pmd_set_accessed).
      
      Also make sure we wait for the irq-disabled sections on other cpus to
      finish before replacing a huge pte entry with a regular pmd entry.
      Code paths like find_linux_pte_or_hugepte depend on irq disable to get
      a stable pte_t pointer. A parallel thp split needs to make sure we
      don't convert a huge pmd entry to a regular pmd entry without waiting
      for the irq-disabled sections to finish.
      
      Fixes: eef1b3ba ("thp: implement split_huge_pmd()")
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c777e2a8
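
      The "recheck pmd_same under the ptl and retry" pattern described above
      can be sketched as follows; the *_demo names are invented for
      illustration and the locking is reduced to comments.

          #include <stdbool.h>
          #include <stdint.h>

          typedef uint64_t pmd_demo_t;   /* simplified stand-in */

          static inline bool pmd_same_demo(pmd_demo_t a, pmd_demo_t b)
          {
              return a == b;
          }

          /*
           * After taking the page table lock, compare the pmd against the
           * value sampled before the lock was taken. If a parallel split
           * changed it (for example by clearing _PAGE_USER in the huge pte),
           * the values differ and the caller retries the whole access
           * instead of operating on a stale huge pmd.
           */
          static int handle_huge_fault_demo(pmd_demo_t *pmdp, pmd_demo_t orig_pmd)
          {
              /* spin_lock(ptl); */
              if (!pmd_same_demo(*pmdp, orig_pmd)) {
                  /* spin_unlock(ptl); */
                  return -1;   /* caller retries the fault from the top */
              }
              /* ... operate on the huge pmd while holding the lock ... */
              /* spin_unlock(ptl); */
              return 0;
          }
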
  7. 16 Jan, 2016 - 1 commit
  8. 14 Dec, 2015 - 4 commits
  9. 08 Aug, 2015 - 1 commit
  10. 25 Jun, 2015 - 3 commits
  11. 12 May, 2015 - 1 commit
    • powerpc/thp: Serialize pmd clear against a linux page table walk. · 13bd817b
      Committed by Aneesh Kumar K.V
      Serialize against find_linux_pte_or_hugepte(), which does a lock-less
      lookup in the page tables with local interrupts disabled. For huge
      pages it casts pmd_t to pte_t. Since the format of pte_t is different
      from pmd_t, we want to prevent a transition from a pmd pointing to a
      page table to a pmd pointing to a huge page (and back) while
      interrupts are disabled. We clear the pmd to possibly replace it with
      a page table pointer in different code paths, so make sure we wait for
      the parallel find_linux_pte_or_hugepte() to finish.
      
      Without this patch, a find_linux_pte_or_hugepte() running in parallel
      to __split_huge_zero_page_pmd() or do_huge_pmd_wp_page_fallback() or
      zap_huge_pmd() can run into the above issue. With
      __split_huge_zero_page_pmd() and do_huge_pmd_wp_page_fallback() we
      clear the hugepage pte before inserting the pmd entry with a regular
      pgtable address. Such a clear needs to wait for the parallel
      find_linux_pte_or_hugepte() to finish.
      
      With zap_huge_pmd(), we can run into issues with a hugepage pte
      getting zapped due to MADV_DONTNEED while another cpu faults it in as
      small pages.
      Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      13bd817b
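
      The ordering this enforces can be sketched as below; the names are
      invented, and the wait is a placeholder for the kernel's IPI-based
      synchronization (something like kick_all_cpus_sync()), not the actual
      implementation.

          #include <stdint.h>

          typedef uint64_t pmd_demo_t;

          /*
           * Placeholder: in the kernel this returns only after every other
           * CPU has passed through an interrupt-enabled point, i.e. after
           * any irq-disabled find_linux_pte_or_hugepte() walk currently in
           * flight has finished.
           */
          static void wait_for_lockless_walkers_demo(void) { }

          static pmd_demo_t pmd_clear_serialized_demo(pmd_demo_t *pmdp)
          {
              pmd_demo_t old = *pmdp;

              *pmdp = 0;                          /* clear the huge pmd first     */
              wait_for_lockless_walkers_demo();   /* wait before reusing the slot */
              return old;
          }
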
  12. 10 Apr, 2015 - 2 commits
  13. 17 Feb, 2015 - 1 commit
  14. 13 Feb, 2015 - 1 commit
  15. 05 Dec, 2014 - 1 commit
    • powerpc/mm: don't do tlbie for updatepp request with NO HPTE fault · aefa5688
      Committed by Aneesh Kumar K.V
      updatepp can get called for a no-HPTE fault when we find from the
      linux page table that the translation was hashed before. In that case
      we are sure that there is no existing translation, hence we could
      avoid doing tlbie.
      
      We could possibly race with a parallel fault filling the TLB. But
      that should be ok because updatepp is only ever relaxing permissions.
      We also look at the linux pte permission bits when filling hash pte
      permission bits, and we hold the linux pte busy bits while
      inserting/updating a hashpte entry, hence a parallel update of the
      linux pte is not possible. On the other hand mprotect involves
      ptep_modify_prot_start, which causes an hpte invalidate and not
      updatepp.
      
      Performance numbers, using random_access_bench written by Anton.
      
      Kernel with THP disabled and a smaller hash page table size:
      
          86.60%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_updatepp
           2.10%  random_access_b  random_access_bench              [.] doit
           1.99%  random_access_b  [kernel.kallsyms]                [k] .do_raw_spin_lock
           1.85%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           1.26%  random_access_b  [kernel.kallsyms]                [k] .native_flush_hash_range
           1.18%  random_access_b  [kernel.kallsyms]                [k] .__delay
           0.69%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           0.37%  random_access_b  [kernel.kallsyms]                [k] .clear_user_page
           0.34%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           0.32%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           0.30%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
      
      With Fix:
      
          27.54%  random_access_b  random_access_bench              [.] doit
          22.90%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_insert
           5.76%  random_access_b  [kernel.kallsyms]                [k] .native_hpte_remove
           5.20%  random_access_b  [kernel.kallsyms]                [k] fast_exception_return
           5.12%  random_access_b  [kernel.kallsyms]                [k] .__hash_page_64K
           4.80%  random_access_b  [kernel.kallsyms]                [k] .hash_page_mm
           3.31%  random_access_b  [kernel.kallsyms]                [k] data_access_common
           1.84%  random_access_b  [kernel.kallsyms]                [k] .trace_hardirqs_on_caller
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      aefa5688
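
      A hypothetical sketch of the idea: the fault path knows from the Linux
      PTE whether a hash PTE could already exist and passes that down so the
      permission update can skip the tlbie. The flag name, struct and
      signature below are invented for illustration and do not match the
      actual powerpc code.

          #include <stdbool.h>

          struct update_flags_demo {
              bool no_hpte_fault;   /* Linux PTE says the page was never hashed in */
          };

          static void hpte_updatepp_demo(unsigned long slot, unsigned long newpp,
                                         struct update_flags_demo flags)
          {
              (void)slot;
              (void)newpp;
              /* ... update the permission bits in the hash PTE ... */

              if (!flags.no_hpte_fault) {
                  /* issue tlbie only when a stale translation may exist */
              }
              /* Racing TLB fills are fine: we are only ever relaxing permissions. */
          }
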
  16. 02 Dec, 2014 - 2 commits
  17. 14 Nov, 2014 - 1 commit
  18. 10 Nov, 2014 - 3 commits
  19. 13 Aug, 2014 - 3 commits
  20. 05 Aug, 2014 - 1 commit
  21. 24 Mar, 2014 - 1 commit
  22. 17 Feb, 2014 - 1 commit
  23. 15 Jan, 2014 - 1 commit
  24. 10 Jan, 2014 - 1 commit
    • powerpc: add barrier after writing kernel PTE · 47ce8af4
      Committed by Scott Wood
      There is no barrier between something like ioremap() writing to
      a PTE, and returning the value to a caller that may then store the
      pointer in a place that is visible to other CPUs.  Such callers
      generally don't perform barriers of their own.
      
      Even if callers of ioremap() and similar things did use barriers,
      the most logical choice would be smp_wmb(), which is not
      architecturally sufficient when BookE hardware tablewalk is used.  A
      full sync is specified by the architecture.
      
      For userspace mappings, OTOH, we generally already have an lwsync due
      to locking, and if we occasionally take a spurious fault due to not
      having a full sync with hardware tablewalk, it will not be fatal
      because we will retry rather than oops.
      Signed-off-by: Scott Wood <scottwood@freescale.com>
      47ce8af4
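
      A sketch of the ordering requirement with invented names; the C11
      fence stands in for the full sync the commit says the architecture
      requires, and this is not the actual patch.

          #include <stdatomic.h>
          #include <stdint.h>

          typedef uint64_t pte_demo_t;

          /*
           * Write the kernel PTE, then order it before any later store that
           * publishes the mapped pointer to other CPUs (or to hardware
           * tablewalk). On powerpc a full sync is required.
           */
          static void set_kernel_pte_demo(pte_demo_t *ptep, pte_demo_t pte)
          {
              *ptep = pte;
              atomic_thread_fence(memory_order_seq_cst);   /* stands in for "sync" */
          }
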
  25. 09 Dec, 2013 - 1 commit
  26. 15 Nov, 2013 - 1 commit