1. 31 August 2017 (1 commit)
  2. 28 July 2017 (1 commit)
  3. 26 July 2017 (1 commit)
    • powerpc/mm/radix: Workaround prefetch issue with KVM · a25bd72b
      Committed by Benjamin Herrenschmidt
      There's a somewhat architectural issue with Radix MMU and KVM.
      
      When coming out of a guest with AIL (Alternate Interrupt Location, ie,
      MMU enabled), we start executing hypervisor code with the PID register
      still containing whatever the guest has been using.
      
      The problem is that the CPU can (and will) then start prefetching or
      speculatively loading from whatever host context has that same PID (if
      any), thus bringing translations for that context into the TLB, which
      Linux doesn't know about.
      
      This can cause stale translations and subsequent crashes.
      
      Fixing this in a way that avoids both races and a huge performance
      impact is difficult. We could just make the host invalidations always
      use broadcast forms, but that would hurt single threaded programs, for
      example.
      
      We chose to fix it instead by partitioning the PID space between guest
      and host. This is possible because today Linux only uses 19 of the
      20 bits of PID space, so existing guests will keep working if we make
      the host use the top half of the 20-bit space.
      
      We additionally add support for a property to indicate to Linux the
      size of the PID register, which will be useful if we eventually have
      processors with a larger PID space available.
      
      There is still an issue with malicious guests purposefully setting the
      PID register to a value in the host's PID range. Hopefully future HW
      can prevent that, but in the meantime we handle it with a pair of
      kludges (sketched after this entry):
      
       - On the way out of a guest, before we clear the current VCPU in the
         PACA, we check the PID and if it's outside of the permitted range
         we flush the TLB for that PID.
      
       - When context switching, if the mm is "new" on that CPU (the
         corresponding bit was set for the first time in the mm cpumask), we
         check if any sibling thread is in KVM (has a non-NULL VCPU pointer
         in the PACA). If that is the case, we also flush the PID for that
         CPU (core).
      
      This second part is needed to handle the case where a process is
      migrated (or starts a new pthread) on a sibling thread of the CPU
      coming out of KVM, as there's a window where stale translations can
      exist before we detect it and flush them out.
      
      A future optimization could be added by keeping track of whether the
      PID has ever been used and avoiding the flush for completely fresh
      PIDs. We could similarly mark PIDs that have been the subject of a
      global invalidation as "fresh". But for now this will do.
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop
            unneeded include of kvm_book3s_asm.h]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a25bd72b
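      A minimal sketch of the two kludges in C, assuming hypothetical helper
      names (flush_tlb_pid(), sibling_thread_in_kvm()); the real code lives
      in the KVM guest-exit path and the powerpc context switch code:

        /* Linux uses 19 of the 20 PID bits; host PIDs take the top half. */
        #define HOST_PID_MIN    (1u << 19)

        /* 1. On the way out of the guest, before clearing the VCPU pointer
         *    in the PACA: flush any PID that strays into the host range. */
        static void kvm_exit_check_pid(unsigned int guest_pid)
        {
                if (guest_pid >= HOST_PID_MIN)
                        flush_tlb_pid(guest_pid);     /* hypothetical helper */
        }

        /* 2. On context switch, when this mm runs on this CPU for the first
         *    time (its bit in the mm cpumask was just set), flush the PID if
         *    a sibling thread is in KVM (non-NULL VCPU pointer in the PACA). */
        static void ctx_switch_check(struct mm_struct *mm, unsigned int pid)
        {
                if (!cpumask_test_and_set_cpu(smp_processor_id(),
                                              mm_cpumask(mm)) &&
                    sibling_thread_in_kvm())          /* hypothetical helper */
                        flush_tlb_pid(pid);
        }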
  4. 18 July 2017 (1 commit)
    • powerpc/mm: Mark __init memory no-execute when STRICT_KERNEL_RWX=y · 029d9252
      Committed by Michael Ellerman
      Currently even with STRICT_KERNEL_RWX we leave the __init text marked
      executable after init, which is bad.
      
      Add a hook to mark it NX (no-execute) before we free it, and implement
      it for radix and hash.
      
      Note that we use __init_end as the end address, not _einittext,
      because overlaps_kernel_text() uses __init_end, and because there are
      additional executable sections other than .init.text between
      __init_begin and __init_end.
      
      Tested on radix and hash with:
      
        0:mon> p $__init_begin
        *** 400 exception occurred
      
      Fixes: 1e0fc9d1 ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      029d9252
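      A sketch of the shape of such a hook; the hook name and the NX helper
      below are assumptions (the patch implements the marking separately for
      radix and hash):

        /* Mark [__init_begin, __init_end) no-execute before freeing it. */
        void mark_initmem_nx(void)
        {
                unsigned long start = (unsigned long)__init_begin;
                /* __init_end, not _einittext: overlaps_kernel_text() checks
                 * against __init_end, and other executable sections sit
                 * between __init_begin and __init_end. */
                unsigned long end = (unsigned long)__init_end;

                change_memory_nx(start, end);   /* hypothetical NX helper */
        }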
  5. 13 July 2017 (1 commit)
    • mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Committed by Michal Hocko
      __GFP_REPEAT was designed to allow a retry-but-eventually-fail
      semantic in the page allocator.  This has been true, but only for
      allocation requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has
      always been ignored for smaller sizes.  This is a bit unfortunate
      because there is no way to express the same semantic for those
      requests, and they are considered too important to fail, so they might
      end up looping in the page allocator forever, similarly to GFP_NOFAIL
      requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of the __GFP_REPEAT flag has been removed for !costly requests,
      we can give the original flag a better name and, more importantly, a
      more useful semantic.  Let's rename it to __GFP_RETRY_MAYFAIL, which
      tells the user that the allocator will try really hard but there is no
      promise of success.  This works independently of the order and
      overrides the default allocator behavior.  Page allocator users have
      several levels of guarantee vs. cost options (take GFP_KERNEL as an
      example):
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most lightweight mode, which
         doesn't even kick background reclaim. Should be used carefully
         because it might deplete the memory and the next user might hit the
         more aggressive reclaim.
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
         allocation without any attempt to free memory from the current
         context, but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non-sleeping allocation with an expensive fallback so it can access
         some portion of the memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because that was already their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL
      if there is no progress and we have already passed the OOM point.
      
      This means that all the reclaim opportunities have been exhausted
      except the most disruptive one (the OOM killer), and a user-defined
      fallback behavior is more sensible than retrying forever in the page
      allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
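      A short usage example of the new semantic (essentially the kvmalloc()
      pattern; all calls here are standard kernel API):

        #include <linux/gfp.h>
        #include <linux/slab.h>
        #include <linux/vmalloc.h>

        /* Try hard for physically contiguous memory, but accept failure
         * instead of triggering the OOM killer, then fall back to
         * vmalloc(). Free the result with kvfree() so either path works. */
        static void *alloc_big_buffer(size_t size)
        {
                void *buf = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);

                if (!buf)
                        buf = vmalloc(size); /* virtually contiguous fallback */
                return buf;
        }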
  6. 07 July 2017 (1 commit)
  7. 04 July 2017 (1 commit)
  8. 02 July 2017 (1 commit)
  9. 08 June 2017 (1 commit)
  10. 05 June 2017 (1 commit)
    • powerpc/mm/book(e)(3s)/64: Add page table accounting · de3b8761
      Committed by Balbir Singh
      Introduce a helper pgtable_gfp_flags() which just returns the current
      gfp flags plus __GFP_ACCOUNT, to account for page table allocation.
      The generic helper is added to include/asm/pgalloc.h and has two
      variants - WARNING ugly bits ahead:

      1. If the header is included from a module, no check for
      mm == &init_mm is done, since init_mm is not exported.
      2. For kernel includes, the check is done, and it is required; see
      commit 3e79ec7d ("arch: x86: charge page tables to kmemcg").

      The fundamental assumption is that no module should be doing
      pgd/pud/pmd and pte allocs on behalf of init_mm directly.

      NOTE: This adds an overhead to pmd/pud/pgd allocations similar to
      x86.  The other alternative was to implement pmd_alloc_kernel,
      pud_alloc_kernel and pgd_alloc_kernel with their offset variants.

      For 4k page size, pte_alloc_one no longer calls
      pte_alloc_one_kernel.
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      de3b8761
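      The helper boils down to a one-line decision; a sketch of its likely
      shape, combining the two variants described above into one listing:

        /* Charge page-table pages to kmemcg via __GFP_ACCOUNT, unless the
         * allocation is on behalf of init_mm, which is never accounted. */
        static inline gfp_t pgtable_gfp_flags(struct mm_struct *mm, gfp_t gfp)
        {
        #ifndef MODULE  /* init_mm isn't exported, so modules skip the check */
                if (unlikely(mm == &init_mm))
                        return gfp;
        #endif
                return gfp | __GFP_ACCOUNT;
        }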
  11. 09 May 2017 (1 commit)
    • powerpc/mm/book3s/64: Rework page table geometry for lower memory usage · ba95b5d0
      Committed by Michael Ellerman
      Recently in commit f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB")
      we increased the virtual address space for user processes to 128TB by default,
      and up to 512TB if user space opts in.
      
      This obviously required expanding the range of the Linux page tables. For Book3s
      64-bit using hash and with PAGE_SIZE=64K, we increased the PGD to 2^15 entries.
      This meant we could cover the full address range, while still being able to
      insert a 16G hugepage at the PGD level and a 16M hugepage in the PMD.
      
      The downside of that geometry is that it uses a lot of memory for the PGD, and
      in particular makes the PGD a 4-page allocation, which means it's much more
      likely to fail under memory pressure.
      
      Instead we can make the PMD larger, so that a single PUD entry maps
      16G, allowing the 16G hugepages to sit at that level in the tree.
      We're then able to split the remaining bits between the PUD and PGD.
      We make the PGD slightly larger, as that results in lower memory usage
      for typical programs.
      
      When THP is enabled the PMD actually doubles in size, to 2^11 entries, or 2^14
      bytes, which is large but still < PAGE_SIZE.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: Balbir Singh <bsingharora@gmail.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      ba95b5d0
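      The arithmetic behind the new geometry can be checked in a few lines
      of plain C; the index sizes below are assumptions chosen to be
      consistent with the numbers quoted above (64K pages, 16G at the PUD
      level, a PMD that doubles to 2^11 entries under THP, and a 49-bit /
      512TB address space):

        #include <stdio.h>

        int main(void)
        {
                int page_shift = 16;  /* 64K pages */
                int pte_index  = 8,  pmd_index = 10;  /* 2^11 with THP */
                int pud_index  = 7,  pgd_index = 8;

                /* One PUD entry covers page + PTE + PMD bits: 34 -> 16G. */
                int pud_entry_bits = page_shift + pte_index + pmd_index;
                printf("PUD entry maps 2^%d bytes\n", pud_entry_bits);

                /* Full virtual address budget: 49 bits -> 512TB. */
                printf("VA space: 2^%d bytes\n",
                       pud_entry_bits + pud_index + pgd_index);

                /* THP PMD: 2^11 entries x 8 bytes = 16K, still < 64K. */
                printf("THP PMD size: %d bytes\n", (1 << (pmd_index + 1)) * 8);
                return 0;
        }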
  12. 28 April 2017 (1 commit)
  13. 27 April 2017 (1 commit)
  14. 12 April 2017 (1 commit)
    • powerpc/mm: Fix swapper_pg_dir size on 64-bit hash w/64K pages · 03dfee6d
      Committed by Michael Ellerman
      Recently in commit f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB"),
      we increased H_PGD_INDEX_SIZE to 15 when we're building with 64K pages. This
      makes it larger than RADIX_PGD_INDEX_SIZE (13), which means the logic to
      calculate MAX_PGD_INDEX_SIZE in book3s/64/pgtable.h is wrong.
      
      The end result is that the PGD (Page Global Directory, ie top level page table)
      of the kernel (aka. swapper_pg_dir), is too small.
      
      This generally doesn't lead to a crash, as we don't use the full range
      in normal operation. However, if we try to dump the kernel pagetables
      we can trigger a crash because we walk off the end of the pgd into
      other memory and eventually try to dereference something bogus:
      
        $ cat /sys/kernel/debug/kernel_pagetables
        Unable to handle kernel paging request for data at address 0xe8fece0000000000
        Faulting instruction address: 0xc000000000072314
        cpu 0xc: Vector: 380 (Data SLB Access) at [c0000000daa13890]
            pc: c000000000072314: ptdump_show+0x164/0x430
            lr: c000000000072550: ptdump_show+0x3a0/0x430
           dar: e802cf0000000000
        seq_read+0xf8/0x560
        full_proxy_read+0x84/0xc0
        __vfs_read+0x6c/0x1d0
        vfs_read+0xbc/0x1b0
        SyS_read+0x6c/0x110
        system_call+0x38/0xfc
      
      The root cause is that MAX_PGD_INDEX_SIZE isn't actually computed to
      be the max of H_PGD_INDEX_SIZE or RADIX_PGD_INDEX_SIZE. To fix that,
      move the calculation into asm-offsets.c, where we can do it easily
      using max(), as sketched after this entry.
      
      Fixes: f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB")
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      03dfee6d
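      A sketch of the fix; DEFINE() is the standard asm-offsets macro, and
      the exact expression here is an assumption based on the description
      above:

        /* arch/powerpc/kernel/asm-offsets.c: compute the constant with C's
         * max() at build time instead of preprocessor logic in pgtable.h,
         * so swapper_pg_dir is sized for the larger of the two geometries. */
        DEFINE(MAX_PGD_TABLE_SIZE,
               sizeof(pgd_t) << max(RADIX_PGD_INDEX_SIZE, H_PGD_INDEX_SIZE));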
  15. 04 April 2017 (1 commit)
    • powerpc/powernv: Introduce address translation services for Nvlink2 · 1ab66d1f
      Committed by Alistair Popple
      Nvlink2 supports address translation services (ATS), allowing devices
      to request address translations from an MMU known as the nest MMU,
      which is set up to walk the CPU page tables.
      
      To access this functionality, certain firmware calls are required to
      set up and manage hardware context tables in the nvlink processing
      unit (NPU). The NPU also manages forwarding of TLB invalidates (known
      as address translation shootdowns/ATSDs) to attached devices.
      
      This patch exports several methods to allow device drivers to register
      a process id (PASID/PID) in the hardware tables and to receive
      notification of when a device should stop issuing address translation
      requests (ATRs). It also adds a fault handler to allow device drivers
      to fault pages in on demand.
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      [mpe: Fix up comment formatting, use flush_tlb_mm()]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      1ab66d1f
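      A sketch of driver-side usage of the exported interface; the function
      names and signatures here are assumptions based on the description
      above, and my_device_quiesce() is a hypothetical driver hook:

        /* Callback: the NPU says this context is going away, so the device
         * must stop issuing address translation requests (ATRs). */
        static struct npu_context *stop_atr_cb(struct npu_context *ctx,
                                               void *priv)
        {
                my_device_quiesce(priv);
                return ctx;
        }

        /* Register the current process's PASID/PID in the NPU hardware
         * tables so the nest MMU can translate on the device's behalf. */
        static int bind_current_process(struct pci_dev *gpdev, void *priv)
        {
                struct npu_context *ctx =
                        pnv_npu2_init_context(gpdev, 0, stop_atr_cb, priv);

                return IS_ERR(ctx) ? PTR_ERR(ctx) : 0;
        }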
  16. 01 April 2017 (2 commits)
  17. 31 March 2017 (12 commits)
  18. 10 March 2017 (3 commits)
  19. 01 March 2017 (1 commit)
    • KVM: PPC: Book3S HV: Fix software walk of guest process page tables · 70cd4c10
      Committed by Paul Mackerras
      This fixes some bugs in the code that walks the guest's page tables.
      These bugs cause MMIO emulation to fail whenever the guest is in
      virtual mode (MMU on), leading to the guest hanging if it tries to
      access a virtio device.
      
      The first bug was that when reading the guest's process table, we were
      using the whole of arch->process_table, not just the field that contains
      the process table base address.  The second bug was that the mask used
      when reading the process table entry to get the radix tree base address,
      RPDB_MASK, had the wrong value.
      
      Fixes: 9e04ba69 ("KVM: PPC: Book3S HV: Add basic infrastructure for radix guests")
      Fixes: e9983344 ("powerpc/mm/radix: Add partition table format & callback")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      70cd4c10
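      Schematically, the corrected walk masks out just the base-address
      fields; PRTB_MASK below is hypothetical, while RPDB_MASK is the mask
      named above (its corrected bit value is not reproduced here):

        /* Bug 1: use only the base-address field of arch->process_table,
         * not the whole register value. */
        prtb_base = vcpu->kvm->arch.process_table & PRTB_MASK;

        /* Bug 2: RPDB_MASK must cover exactly the radix tree (Root Page
         * Directory Base) field of the process table entry. */
        radix_root = be64_to_cpu(prte.prtb0) & RPDB_MASK;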
  20. 28 February 2017 (1 commit)
  21. 25 February 2017 (1 commit)
  22. 15 February 2017 (3 commits)
  23. 14 February 2017 (1 commit)
  24. 10 February 2017 (1 commit)
    • powerpc/pseries: Add support for hash table resizing · dbcf929c
      Committed by David Gibson
      This adds support for using two hypercalls to change the size of the
      main hash page table while running as a PAPR guest. For now these
      hypercalls are only available in experimental qemu versions.
      
      The interface has two parts: first, H_RESIZE_HPT_PREPARE is used to
      allocate and prepare the new hash table. This may be slow, but can be
      done asynchronously. Then H_RESIZE_HPT_COMMIT is used to switch to the
      new hash table. This requires that no CPUs be concurrently updating
      the HPT, and so must be run under stop_machine() (the call pattern is
      sketched after this entry).
      
      This also adds a debugfs file which can be used to manually control
      HPT resizing, for testing purposes.
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Paul Mackerras <paulus@samba.org>
      [mpe: Rename the debugfs file to "hpt_order"]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      dbcf929c
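      A sketch of the two-phase pattern from the guest side, assuming
      plpar_resize_hpt_prepare()/plpar_resize_hpt_commit() as the hypercall
      wrapper names:

        /* Runs under stop_machine(): no other CPU can touch the HPT. */
        static int hpt_commit_fn(void *data)
        {
                return plpar_resize_hpt_commit(0, *(unsigned long *)data);
        }

        /* Resize the hash page table to 2^shift bytes. */
        static int resize_hpt(unsigned long shift)
        {
                long rc;

                /* Phase 1: the hypervisor allocates and prepares the new
                 * HPT. May be slow, so poll while it reports busy. */
                do {
                        rc = plpar_resize_hpt_prepare(0, shift);
                        if (rc == H_BUSY)
                                cond_resched();
                } while (rc == H_BUSY);
                if (rc != H_SUCCESS)
                        return -EIO;

                /* Phase 2: atomically switch to the new HPT. */
                return stop_machine(hpt_commit_fn, &shift, NULL);
        }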