- 31 8月, 2017 1 次提交
-
-
由 Ram Pai 提交于
In handling a H_ENTER hypercall, the code in kvmppc_do_h_enter clobbers the high-order two bits of the storage key, which is stored in a split field in the second doubleword of the HPTE. Any storage key number above 7 hence fails to operate correctly. This makes sure we preserve all the bits of the storage key. Acked-by: NBalbir Singh <bsingharora@gmail.com> Signed-off-by: NRam Pai <linuxram@us.ibm.com> Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
- 28 7月, 2017 1 次提交
-
-
由 Oliver O'Halloran 提交于
The Radix MMU translation tree as defined in ISA v3.0 contains two different types of entry, directories and leaves. Leaves are identified by _PAGE_PTE being set. The formats of the two entries are different, with the directory entries containing no spare bits for use by software. In particular the bit we use for _PAGE_DEVMAP is not reserved for software, and is part of the NLB (Next Level Base) field, essentially the address of the next level in the tree. Note that the Linux pte_t is not == _PAGE_PTE. A huge page pmd entry (or devmap!) is also a leaf and so has _PAGE_PTE set, even though we use a pmd_t for it in Linux. The fix is to ensure that the pmd/pte_devmap() confirm they are looking at a leaf entry (_PAGE_PTE) as well as checking _PAGE_DEVMAP. Fixes: ebd31197 ("powerpc/mm: Add devmap support for ppc64") Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Tested-by: NLaurent Vivier <lvivier@redhat.com> Tested-by: NJose Ricardo Ziviani <joserz@linux.vnet.ibm.com> Reviewed-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Add a comment in the code and flesh out change log] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 26 7月, 2017 1 次提交
-
-
由 Benjamin Herrenschmidt 提交于
There's a somewhat architectural issue with Radix MMU and KVM. When coming out of a guest with AIL (Alternate Interrupt Location, ie, MMU enabled), we start executing hypervisor code with the PID register still containing whatever the guest has been using. The problem is that the CPU can (and will) then start prefetching or speculatively load from whatever host context has that same PID (if any), thus bringing translations for that context into the TLB, which Linux doesn't know about. This can cause stale translations and subsequent crashes. Fixing this in a way that is neither racy nor a huge performance impact is difficult. We could just make the host invalidations always use broadcast forms but that would hurt single threaded programs for example. We chose to fix it instead by partitioning the PID space between guest and host. This is possible because today Linux only use 19 out of the 20 bits of PID space, so existing guests will work if we make the host use the top half of the 20 bits space. We additionally add support for a property to indicate to Linux the size of the PID register which will be useful if we eventually have processors with a larger PID space available. There is still an issue with malicious guests purposefully setting the PID register to a value in the hosts PID range. Hopefully future HW can prevent that, but in the meantime, we handle it with a pair of kludges: - On the way out of a guest, before we clear the current VCPU in the PACA, we check the PID and if it's outside of the permitted range we flush the TLB for that PID. - When context switching, if the mm is "new" on that CPU (the corresponding bit was set for the first time in the mm cpumask), we check if any sibling thread is in KVM (has a non-NULL VCPU pointer in the PACA). If that is the case, we also flush the PID for that CPU (core). This second part is needed to handle the case where a process is migrated (or starts a new pthread) on a sibling thread of the CPU coming out of KVM, as there's a window where stale translations can exist before we detect it and flush them out. A future optimization could be added by keeping track of whether the PID has ever been used and avoid doing that for completely fresh PIDs. We could similarily mark PIDs that have been the subject of a global invalidation as "fresh". But for now this will do. Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop unneeded include of kvm_book3s_asm.h] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 18 7月, 2017 1 次提交
-
-
由 Michael Ellerman 提交于
Currently even with STRICT_KERNEL_RWX we leave the __init text marked executable after init, which is bad. Add a hook to mark it NX (no-execute) before we free it, and implement it for radix and hash. Note that we use __init_end as the end address, not _einittext, because overlaps_kernel_text() uses __init_end, because there are additional executable sections other than .init.text between __init_begin and __init_end. Tested on radix and hash with: 0:mon> p $__init_begin *** 400 exception occurred Fixes: 1e0fc9d1 ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs") Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 13 7月, 2017 1 次提交
-
-
由 Michal Hocko 提交于
__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to the page allocator. This has been true but only for allocations requests larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for smaller sizes. This is a bit unfortunate because there is no way to express the same semantic for those requests and they are considered too important to fail so they might end up looping in the page allocator for ever, similarly to GFP_NOFAIL requests. Now that the whole tree has been cleaned up and accidental or misled usage of __GFP_REPEAT flag has been removed for !costly requests we can give the original flag a better name and more importantly a more useful semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that the allocator would try really hard but there is no promise of a success. This will work independent of the order and overrides the default allocator behavior. Page allocator users have several levels of guarantee vs. cost options (take GFP_KERNEL as an example) - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_ attempt to free memory at all. The most light weight mode which even doesn't kick the background reclaim. Should be used carefully because it might deplete the memory and the next user might hit the more aggressive reclaim - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic allocation without any attempt to free memory from the current context but can wake kswapd to reclaim memory if the zone is below the low watermark. Can be used from either atomic contexts or when the request is a performance optimization and there is another fallback for a slow path. - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non sleeping allocation with an expensive fallback so it can access some portion of memory reserves. Usually used from interrupt/bh context with an expensive slow path fallback. - GFP_KERNEL - both background and direct reclaim are allowed and the _default_ page allocator behavior is used. That means that !costly allocation requests are basically nofail but there is no guarantee of that behavior so failures have to be checked properly by callers (e.g. OOM killer victim is allowed to fail currently). - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and all allocation requests fail early rather than cause disruptive reclaim (one round of reclaim in this implementation). The OOM killer is not invoked. - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior and all allocation requests try really hard. The request will fail if the reclaim cannot make any progress. The OOM killer won't be triggered. - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior and all allocation requests will loop endlessly until they succeed. This might be really dangerous especially for larger orders. Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because they already had their semantic. No new users are added. __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if there is no progress and we have already passed the OOM point. This means that all the reclaim opportunities have been exhausted except the most disruptive one (the OOM killer) and a user defined fallback behavior is more sensible than keep retrying in the page allocator. [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c] [mhocko@suse.com: semantic fix] Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz [mhocko@kernel.org: address other thing spotted by Vlastimil] Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 7月, 2017 1 次提交
-
-
由 Aneesh Kumar K.V 提交于
POWER9 supports hugepages of size 2M and 1G in radix MMU mode. This patch enables the usage of 1G page size for hugetlbfs. This also update the helper such we can do 1G page allocation at runtime. We still don't enable 1G page size on DD1 version. This is to avoid doing workaround mentioned in commit 6d3a0379 ("powerpc/mm: Add radix__tlb_flush_pte_p9_dd1()"). Link: http://lkml.kernel.org/r/1494995292-4443-2-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 7月, 2017 1 次提交
-
-
由 Balbir Singh 提交于
With hash we update the bolted pte to mark it read-only. We rely on the MMU_FTR_KERNEL_RO to generate the correct permissions for read-only text. The radix implementation just prints a warning in this implementation Signed-off-by: NBalbir Singh <bsingharora@gmail.com> [mpe: Make the warning louder when we don't have MMU_FTR_KERNEL_RO] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 02 7月, 2017 1 次提交
-
-
由 Oliver O'Halloran 提交于
Add support for the devmap bit on PTEs and PMDs for PPC64 Book3S. This is used to differentiate device backed memory from transparent huge pages since they are handled in more or less the same manner by the core mm code. Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 08 6月, 2017 1 次提交
-
-
由 Aneesh Kumar K.V 提交于
Supporting 512TB requires us to do a order 3 allocation for level 1 page table (pgd). This results in page allocation failures with certain workloads. For now limit 4k linux page size config to 64TB. Fixes: f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB") Reported-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 05 6月, 2017 1 次提交
-
-
由 Balbir Singh 提交于
Introduce a helper pgtable_gfp_flags() which just returns the current gfp flags and adds __GFP_ACCOUNT to account for page table allocation. The generic helper is added to include/asm/pgalloc.h and has two variants - WARNING ugly bits ahead 1. If the header is included from a module, no check for mm == &init_mm is done, since init_mm is not exported 2. For kernel includes, the check is done and required see (3e79ec7d arch: x86: charge page tables to kmemcg) The fundamental assumption is that no module should be doing pgd/pud/pmd and pte alloc's on behalf of init_mm directly. NOTE: This adds an overhead to pmd/pud/pgd allocations similar to x86. The other alternative was to implement pmd_alloc_kernel/pud_alloc_kernel and pgd_alloc_kernel with their offset variants. For 4k page size, pte_alloc_one no longer calls pte_alloc_one_kernel. Signed-off-by: NBalbir Singh <bsingharora@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 09 5月, 2017 1 次提交
-
-
由 Michael Ellerman 提交于
Recently in commit f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB") we increased the virtual address space for user processes to 128TB by default, and up to 512TB if user space opts in. This obviously required expanding the range of the Linux page tables. For Book3s 64-bit using hash and with PAGE_SIZE=64K, we increased the PGD to 2^15 entries. This meant we could cover the full address range, while still being able to insert a 16G hugepage at the PGD level and a 16M hugepage in the PMD. The downside of that geometry is that it uses a lot of memory for the PGD, and in particular makes the PGD a 4-page allocation, which means it's much more likely to fail under memory pressure. Instead we can make the PMD larger, so that a single PUD entry maps 16G, allowing the 16G hugepages to sit at that level in the tree. We're then able to split the remaining bits between the PUG and PGD. We make the PGD slightly larger as that results in lower memory usage for typical programs. When THP is enabled the PMD actually doubles in size, to 2^11 entries, or 2^14 bytes, which is large but still < PAGE_SIZE. Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Reviewed-by: NBalbir Singh <bsingharora@gmail.com> Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
-
- 28 4月, 2017 1 次提交
-
-
由 Michael Ellerman 提交于
Michal Suchánek noticed a comment in book3s/64/mmu-hash.h about the context ids we use for the kernel was inconsistent with the code and other comments in the same file. It should read 1-4 not 1-5. While we're touching it, update "address" to "addresses" which makes more sense as it's referring to more than one address below. Reported-by: NMichal Suchánek <msuchanek@suse.de> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 27 4月, 2017 1 次提交
-
-
由 Christophe Leroy 提交于
On some targets, _PAGE_RW is 0 and this is _PAGE_RO which is used. There is also _PAGE_SHARED that is missing. Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 12 4月, 2017 1 次提交
-
-
由 Michael Ellerman 提交于
Recently in commit f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB"), we increased H_PGD_INDEX_SIZE to 15 when we're building with 64K pages. This makes it larger than RADIX_PGD_INDEX_SIZE (13), which means the logic to calculate MAX_PGD_INDEX_SIZE in book3s/64/pgtable.h is wrong. The end result is that the PGD (Page Global Directory, ie top level page table) of the kernel (aka. swapper_pg_dir), is too small. This generally doesn't lead to a crash, as we don't use the full range in normal operation. However if we try to dump the kernel pagetables we can trigger a crash because we walk off the end of the pgd into other memory and eventually try to dereference something bogus: $ cat /sys/kernel/debug/kernel_pagetables Unable to handle kernel paging request for data at address 0xe8fece0000000000 Faulting instruction address: 0xc000000000072314 cpu 0xc: Vector: 380 (Data SLB Access) at [c0000000daa13890] pc: c000000000072314: ptdump_show+0x164/0x430 lr: c000000000072550: ptdump_show+0x3a0/0x430 dar: e802cf0000000000 seq_read+0xf8/0x560 full_proxy_read+0x84/0xc0 __vfs_read+0x6c/0x1d0 vfs_read+0xbc/0x1b0 SyS_read+0x6c/0x110 system_call+0x38/0xfc The root cause is that MAX_PGD_INDEX_SIZE isn't actually computed to be the max of H_PGD_INDEX_SIZE or RADIX_PGD_INDEX_SIZE. To fix that move the calculation into asm-offsets.c where we can do it easily using max(). Fixes: f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB") Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 04 4月, 2017 1 次提交
-
-
由 Alistair Popple 提交于
Nvlink2 supports address translation services (ATS) allowing devices to request address translations from an mmu known as the nest MMU which is setup to walk the CPU page tables. To access this functionality certain firmware calls are required to setup and manage hardware context tables in the nvlink processing unit (NPU). The NPU also manages forwarding of TLB invalidates (known as address translation shootdowns/ATSDs) to attached devices. This patch exports several methods to allow device drivers to register a process id (PASID/PID) in the hardware tables and to receive notification of when a device should stop issuing address translation requests (ATRs). It also adds a fault handler to allow device drivers to demand fault pages in. Signed-off-by: NAlistair Popple <alistair@popple.id.au> [mpe: Fix up comment formatting, use flush_tlb_mm()] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 01 4月, 2017 2 次提交
-
-
由 Aneesh Kumar K.V 提交于
Now that we use all the available virtual address range, we need to make sure we don't generate VSID such that it overlaps with the reserved vsid range. Reserved vsid range include the virtual address range used by the adjunct partition and also the VRMA virtual segment. We find the context value that can result in generating such a VSID and reserve it early in boot. We don't look at the adjunct range, because for now we disable the adjunct usage in a Linux LPAR via CAS interface. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Rewrite hash__reserve_context_id(), move the rest into pseries] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
In the followup patch, we will increase the slice array size to handle 512TB range, but will limit the max addr to 128TB. Avoid doing unnecessary computation and avoid doing slice mask related operation above address limit. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 31 3月, 2017 12 次提交
-
-
由 Aneesh Kumar K.V 提交于
We update the hash linux page table layout such that we can support 512TB. But we limit the TASK_SIZE to 128TB. We can switch to 128TB by default without conditional because that is the max virtual address supported by other architectures. We will later add a mechanism to on-demand increase the application's effective address range to 512TB. Having the page table layout changed to accommodate 512TB makes testing large memory configuration easier with less code changes to kernel Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
This doesn't have any functional change. But helps in avoiding mistakes in case the shift bit changes Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
Inorder to support large effective address range (512TB), we want to increase the virtual address bits to 68. But we do have platforms like p4 and p5 that can only do 65 bit VA. We support those platforms by limiting context bits on them to 16. The protovsid -> vsid conversion is verified to work with both 65 and 68 bit va values. I also documented the restrictions in a table format as part of code comments. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Michael Ellerman 提交于
get_kernel_vsid() has a very stern comment saying that it's only valid for kernel addresses, but there's nothing in the code to enforce that. Rather than hoping our callers are well behaved, add a check and return a VSID of 0 (invalid). Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
Currently we use the top 4 context ids (0x7fffc-0x7ffff) for the kernel. Kernel VSIDs are built using these top context values and effective the segement ID. In subsequent patches we want to increase the max effective address to 512TB. We will achieve that by increasing the effective segment IDs there by increasing virtual address range. We will be switching to a 68bit virtual address in the following patch. But platforms like Power4 and Power5 only support a 65 bit virtual address. We will handle that by limiting the context bits to 16 instead of 19 on those platforms. That means the max context id will have a different value on different platforms. So that we don't have to deal with the kernel context ids changing between different platforms, move the kernel context ids down to use context ids 1-4. We can't use segment 0 of context-id 0, because that maps to VSID 0, which we want to keep as invalid, so we avoid context-id 0 entirely. Similarly we can't use the last segment of the maximum context, so we avoid it too. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Switch from 0-3 to 1-4 so VSID=0 remains invalid] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Michael Ellerman 提交于
Complete the split of the radix vs hash mm context initialisation. This is mostly code movement, with the exception that we now limit the context allocation to PRTB_ENTRIES - 1 on radix. Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
We don't support the full 57 bits of physical address and hence can overload the top bits of RPN as hash specific pte bits. Add a BUILD_BUG_ON() to enforce the relationship between H_PAGE_F_SECOND and H_PAGE_F_GIX. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: NPaul Mackerras <paulus@ozlabs.org> [mpe: Move the BUILD_BUG_ON() into hash_utils_64.c and comment it] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
Max value supported by hardware is 51 bits address. Radix page table define a slot of 57 bits for future expansion. We restrict the value supported in linux kernel 53 bits, so that we can use the bits between 57-53 for storing hash linux page table bits. This is done in the next patch. This will free up the software page table bits to be used for features that are needed for both hash and radix. The current hash linux page table format doesn't have any free software bits. Moving hash linux page table specific bits to top of RPN field free up the software bits for other purpose. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: NPaul Mackerras <paulus@ozlabs.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
Reviewed-by: NPaul Mackerras <paulus@ozlabs.org> Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
Conditional PTE bit definition is confusing and results in coding error. Reviewed-by: NPaul Mackerras <paulus@ozlabs.org> Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
This bit is only used by radix and it is nice to follow the naming style of having bit name start with H_/R_ depending on which translation mode they are used. No functional change in this patch. Reviewed-by: NPaul Mackerras <paulus@ozlabs.org> Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
Define everything based on bits present in pgtable.h. This will help in easily identifying overlapping bits between hash/radix. No functional change with this patch. Reviewed-by: NPaul Mackerras <paulus@ozlabs.org> Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 10 3月, 2017 3 次提交
-
-
由 Aneesh Kumar K.V 提交于
We use pte_write() to check whethwer the pte entry is writable. This is mostly used to later mark the pte read only if it is writable. The other use of pte_write() is to check whether the pte_entry is writable so that hardware page table entry can be marked accordingly. This is used in kvm where we look at qemu page table entry and update hardware hash page table for the guest with correct write enable bit. With the above, for the first usage we should also check the savedwrite bit so that we can correctly clear the savedwite bit. For the later, we add a new variant __pte_write(). With this we can revert write_protect_page part of 595cd8f2 ("mm/ksm: handle protnone saved writes when making page write protect"). But I left it as it is as an example code for savedwrite check. Fixes: c137a275 ("powerpc/mm/autonuma: switch ppc64 to its own implementation of saved write") Link: http://lkml.kernel.org/r/1488203787-17849-2-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Aneesh Kumar K.V 提交于
We need to mark pages of parent process read only on fork. Numa fault pte needs a protnone ptes variant with saved write flag set. On fork we need to make sure we remove the saved write bit. Instead of adding the protnone check in the caller update ptep_set_wrprotect variants to clear savedwrite bit. Without this we see random segfaults in application on fork. Fixes: c137a275 ("powerpc/mm/autonuma: switch ppc64 to its own implementation of saved write") Link: http://lkml.kernel.org/r/1488203787-17849-1-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
If an architecture uses 4level-fixup.h we don't need to do anything as it includes 5level-fixup.h. If an architecture uses pgtable-nop*d.h, define __ARCH_USE_5LEVEL_HACK before inclusion of the header. It makes asm-generic code to use 5level-fixup.h. If an architecture has 4-level paging or folds levels on its own, include 5level-fixup.h directly. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 3月, 2017 1 次提交
-
-
由 Paul Mackerras 提交于
This fixes some bugs in the code that walks the guest's page tables. These bugs cause MMIO emulation to fail whenever the guest is in virtial mode (MMU on), leading to the guest hanging if it tried to access a virtio device. The first bug was that when reading the guest's process table, we were using the whole of arch->process_table, not just the field that contains the process table base address. The second bug was that the mask used when reading the process table entry to get the radix tree base address, RPDB_MASK, had the wrong value. Fixes: 9e04ba69 ("KVM: PPC: Book3S HV: Add basic infrastructure for radix guests") Fixes: e9983344 ("powerpc/mm/radix: Add partition table format & callback") Signed-off-by: NPaul Mackerras <paulus@ozlabs.org>
-
- 28 2月, 2017 1 次提交
-
-
由 Masahiro Yamada 提交于
Fix typos and add the following to the scripts/spelling.txt: partiton||partition Link: http://lkml.kernel.org/r/1481573103-11329-7-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 2月, 2017 1 次提交
-
-
由 Aneesh Kumar K.V 提交于
With this our protnone becomes a present pte with READ/WRITE/EXEC bit cleared. By default we also set _PAGE_PRIVILEGED on such pte. This is now used to help us identify a protnone pte that as saved write bit. For such pte, we will clear the _PAGE_PRIVILEGED bit. The pte still remain non-accessible from both user and kernel. [aneesh.kumar@linux.vnet.ibm.com: v3] Link: http://lkml.kernel.org/r/1487498625-10891-4-git-send-email-aneesh.kumar@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1487050314-3892-3-git-send-email-aneesh.kumar@linux.vnet.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NMichael Neuling <mikey@neuling.org> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Michael Ellerman <michaele@au1.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 2月, 2017 3 次提交
-
-
由 Aneesh Kumar K.V 提交于
We do them at the start of tlb flush, and we are sure a pte update will be followed by a tlbflush. Hence we can skip the ptesync in pte update helpers. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Tested-by: NMichael Neuling <mikey@neuling.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
This helps us to do some optimization for application exit case, where we can skip the DD1 style pte update sequence. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Tested-by: NMichael Neuling <mikey@neuling.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
In the kernel we do follow the below sequence in different code paths. pte = ptep_get_clear(ptep) .... set_pte_at(ptep, pte) We do that for mremap, autonuma protection update and softdirty clearing. This implies our optimization to skip a tlb flush when clearing a pte update is not valid, because for DD1 system that followup set_pte_at will be done witout doing the required tlbflush. Fix that by always doing the dd1 style pte update irrespective of new_pte value. In a later patch we will optimize the application exit case. Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Tested-by: NMichael Neuling <mikey@neuling.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 14 2月, 2017 1 次提交
-
-
由 Michael Ellerman 提交于
If we enable RADIX but disable HUGETLBFS, the build breaks with: arch/powerpc/mm/pgtable-radix.c:557:7: error: implicit declaration of function 'pmd_huge' arch/powerpc/mm/pgtable-radix.c:588:7: error: implicit declaration of function 'pud_huge' Fix it by stubbing those functions when HUGETLBFS=n. Fixes: 4b5d62ca ("powerpc/mm: add radix__remove_section_mapping()") Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 10 2月, 2017 1 次提交
-
-
由 David Gibson 提交于
This adds support for using two hypercalls to change the size of the main hash page table while running as a PAPR guest. For now these hypercalls are only in experimental qemu versions. The interface is two part: first H_RESIZE_HPT_PREPARE is used to allocate and prepare the new hash table. This may be slow, but can be done asynchronously. Then, H_RESIZE_HPT_COMMIT is used to switch to the new hash table. This requires that no CPUs be concurrently updating the HPT, and so must be run under stop_machine(). This also adds a debugfs file which can be used to manually control HPT resizing or testing purposes. Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au> Reviewed-by: NPaul Mackerras <paulus@samba.org> [mpe: Rename the debugfs file to "hpt_order"] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-