- 21 9月, 2019 1 次提交
-
-
由 Aneesh Kumar K.V 提交于
commit 89a3496e0664577043666791ec07fb731d57c950 upstream. We use mmu_vmemmap_psize to find the page size for mapping the vmmemap area. With radix translation, we are suboptimally setting this value to PAGE_SIZE. We do check for 2M page size support and update mmu_vmemap_psize to use hugepage size but we suboptimally reset the value to PAGE_SIZE in radix__early_init_mmu(). This resulted in always mapping vmemmap area with 64K page size. Fixes: 2bfd65e4 ("powerpc/mm/radix: Add radix callbacks for early init routines") Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 23 8月, 2018 1 次提交
-
-
由 Aneesh Kumar K.V 提交于
The Nest MMU workaround is only needed for RW upgrades. Avoid doing that for other PTE updates. We also avoid clearing the PTE while marking it invalid. This is because other page table walkers will find this PTE none and can result in unexpected behaviour due to that. Instead we clear _PAGE_PRESENT and set the software PTE bit _PAGE_INVALID. pte_present() is already updated to check for both bits. This makes sure page table walkers will find the PTE present and things like pte_pfn(pte) returns the right value. Based on an original patch from Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 13 8月, 2018 1 次提交
-
-
由 Aneesh Kumar K.V 提交于
Add statistics that show how memory is mapped within the kernel linear mapping. This is similar to commit 37cd944c ("s390/pgtable: add mapping statistics") We don't do this with Hash translation mode. Hash uses one size (mmu_linear_psize) to map the kernel linear mapping and we print the linear psize during boot as below. "Page orders: linear mapping = 24, virtual = 16, io = 16, vmemmap = 24" A sample output looks like: DirectMap4k: 0 kB DirectMap64k: 18432 kB DirectMap2M: 1030144 kB DirectMap1G: 11534336 kB Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 16 7月, 2018 1 次提交
-
-
由 Nicholas Piggin 提交于
POWER9 DD1 was never a product. It is no longer supported by upstream firmware, and it is not effectively supported in Linux due to lack of testing. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Reviewed-by: NMichael Ellerman <mpe@ellerman.id.au> [mpe: Remove arch_make_huge_pte() entirely] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 03 6月, 2018 5 次提交
-
-
由 Nicholas Piggin 提交于
The ISA suggests ptesync after setting a pte, to prevent a table walk initiated by a subsequent access from missing that store and causing a spurious fault. This is an architectual allowance that allows an implementation's page table walker to be incoherent with the store queue. However there is no correctness problem in taking a spurious fault in userspace -- the kernel copes with these at any time, so the updated pte will be found eventually. Spurious kernel faults on vmap memory must be avoided, so a ptesync is put into flush_cache_vmap. On POWER9 so far I have not found a measurable window where this can result in more minor faults, so as an optimisation, remove the costly ptesync from pte updates. If an implementation benefits from ptesync, it would be better to add it back in update_mmu_cache, so it's not done for things like fork(2). fork --fork --exec benchmark improved 5.2% (12400->13100). Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Nicholas Piggin 提交于
Radix flushes the TLB when updating ptes to increase permissiveness of protection (increase access authority). Book3S does not require TLB flushing in this case, and it is not done on hash. This patch avoids the flush for radix. >From Power ISA v3.0B, p.1090: Setting a Reference or Change Bit or Upgrading Access Authority (PTE Subject to Atomic Hardware Updates) If the only change being made to a valid PTE that is subject to atomic hardware updates is to set the Reference or Change bit to 1 or to add access authorities, a simpler sequence suffices because the translation hardware will refetch the PTE if an access is attempted for which the only problems were reference and/or change bits needing to be set or insufficient access authority. The nest MMU on POWER9 does not re-fetch the PTE after such an access attempt before faulting, so address spaces with a coprocessor attached will continue to flush in these cases. This reduces tlbies for a kernel compile workload from 1.28M to 0.95M, tlbiels from 20.17M 19.68M. fork --fork --exec benchmark improved 2.77% (12000->12300). Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
When relaxing access (read -> read_write update), pte needs to be marked invalid to handle a nest MMU bug. We also need to do a tlb flush after the pte is marked invalid before updating the pte with new access bits. We also move tlb flush to platform specific __ptep_set_access_flags. This will help us to gerid of unnecessary tlb flush on BOOK3S 64 later. We don't do that in this patch. This also helps in avoiding multiple tlbies with coprocessor attached. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
In later patch, we use the vma and psize to do tlb flush. Do the prototype update in separate patch to make the review easy. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
In later patch we will update them which require them to be moved to pgtable-radix.c. Keeping the function in radix.h results in compile warning as below. ./arch/powerpc/include/asm/book3s/64/radix.h: In function ‘radix__ptep_set_access_flags’: ./arch/powerpc/include/asm/book3s/64/radix.h:196:28: error: dereferencing pointer to incomplete type ‘struct vm_area_struct’ struct mm_struct *mm = vma->vm_mm; ^~ ./arch/powerpc/include/asm/book3s/64/radix.h:204:6: error: implicit declaration of function ‘atomic_read’; did you mean ‘__atomic_load’? [-Werror=implicit-function-declaration] atomic_read(&mm->context.copros) > 0) { ^~~~~~~~~~~ __atomic_load ./arch/powerpc/include/asm/book3s/64/radix.h:204:21: error: dereferencing pointer to incomplete type ‘struct mm_struct’ atomic_read(&mm->context.copros) > 0) { Instead of fixing header dependencies, we move the function to pgtable-radix.c Also the function is now large to be a static inline . Doing the move in separate patch helps in review. No functional change in this patch. Only code movement. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 15 5月, 2018 3 次提交
-
-
由 Aneesh Kumar K.V 提交于
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Aneesh Kumar K.V 提交于
In later patch we switch pmd_lock from mm->page_table_lock to split pmd ptlock. It avoid compilations issues, use pmd_lockptr helper. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 04 4月, 2018 1 次提交
-
-
由 Aneesh Kumar K.V 提交于
With split PTL (page table lock) config, we allocate the level 4 (leaf) page table using pte fragment framework instead of slab cache like other levels. This was done to enable us to have split page table lock at the level 4 of the page table. We use page->plt backing the all the level 4 pte fragment for the lock. Currently with Radix, we use only 16 fragments out of the allocated page. In radix each fragment is 256 bytes which means we use only 4k out of the allocated 64K page wasting 60k of the allocated memory. This was done earlier to keep it closer to hash. This patch update the pte fragment count to 256, thereby using the full 64K page and reducing the memory usage. Performance tests shows really low impact even with THP disabled. With THP disabled we will be contenting further less on level 4 ptl and hence the impact should be further low. 256 threads: without patch (10 runs of ./ebizzy -m -n 1000 -s 131072 -S 100) median = 15678.5 stdev = 42.1209 with patch: median = 15354 stdev = 194.743 This is with THP disabled. With THP enabled the impact of the patch will be less. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 30 3月, 2018 3 次提交
-
-
由 Nicholas Piggin 提交于
Signed-off-by: NNicholas Piggin <npiggin@gmail.com> [mpe: Move __map_kernel_page_nid() inside #ifdef SPARSEMEM_VMEMMAP] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Nicholas Piggin 提交于
Try to allocate kernel page tables for direct mapping and vmemmap according to the node of the memory they will map. The node is not available for the linear map in early boot, so use range allocation to allocate the page tables from the region they map, which is effectively node-local. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> [mpe: Fix build error in radix__create_section_mapping()] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Nicholas Piggin 提交于
Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 27 3月, 2018 1 次提交
-
-
Fix the warning messages for stop_machine_change_mapping(), and a number of other affected functions in its call chain. All modified functions are under CONFIG_MEMORY_HOTPLUG, so __meminit is okay (keeps them / does not discard them). Boot-tested on powernv/power9/radix-mmu and pseries/power8/hash-mmu. $ make -j$(nproc) CONFIG_DEBUG_SECTION_MISMATCH=y vmlinux ... MODPOST vmlinux.o WARNING: vmlinux.o(.text+0x6b130): Section mismatch in reference from the function stop_machine_change_mapping() to the function .meminit.text:create_physical_mapping() The function stop_machine_change_mapping() references the function __meminit create_physical_mapping(). This is often because stop_machine_change_mapping lacks a __meminit annotation or the annotation of create_physical_mapping is wrong. WARNING: vmlinux.o(.text+0x6b13c): Section mismatch in reference from the function stop_machine_change_mapping() to the function .meminit.text:create_physical_mapping() The function stop_machine_change_mapping() references the function __meminit create_physical_mapping(). This is often because stop_machine_change_mapping lacks a __meminit annotation or the annotation of create_physical_mapping is wrong. ... Signed-off-by: NMauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> Acked-by: NBalbir Singh <bsingharora@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 13 2月, 2018 1 次提交
-
-
由 Aneesh Kumar K.V 提交于
To support memory keys, we moved the hash pte slot information to the second half of the page table. This was ok with PTE entries at level 4 (PTE page) and level 3 (PMD). We already allocate larger page table pages at those levels to accomodate extra details. For level 4 we already have the extra space which was used to track 4k hash page table entry details and at level 3 the extra space was allocated to track the THP details. With hugetlbfs PTE, we used this extra space at the PMD level to store the slot details. But we also support hugetlbfs PTE at PUD level for 16GB pages and PUD level page didn't allocate extra space. This resulted in memory corruption. Fix this by allocating extra space at PUD level when HUGETLB is enabled. Fixes: bf9a95f9 ("powerpc: Free up four 64K PTE bits in 64K backed HPTE pages") Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: NRam Pai <linuxram@us.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 08 2月, 2018 2 次提交
-
-
由 Balbir Singh 提交于
This patch splits the linear mapping if the hot-unplug range is smaller than the mapping size. The code detects if the mapping needs to be split into a smaller size and if so, uses the stop machine infrastructure to clear the existing mapping and then remap the remaining range using a smaller page size. The code will skip any region of the mapping that overlaps with kernel text and warn about it once. We don't want to remove a mapping where the kernel text and the LMB we intend to remove overlap in the same TLB mapping as it may affect the currently executing code. I've tested these changes under a kvm guest with 2 vcpus, from a split mapping point of view, some of the caveats mentioned above applied to the testing I did. Fixes: 4b5d62ca ("powerpc/mm: add radix__remove_section_mapping()") Signed-off-by: NBalbir Singh <bsingharora@gmail.com> [mpe: Tweak change log to match updated behaviour] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Nicholas Piggin 提交于
This change restores and formalises the behaviour that access to NULL or other user addresses by the kernel during boot should fault rather than succeed and modify memory. This was inadvertently broken when fixing another bug, because it was previously not well defined and only worked by chance. powerpc/64s/radix uses high address bits to select an address space "quadrant", which determines which PID and LPID are used to translate the rest of the address (effective PID, effective LPID). The kernel mapping at 0xC... selects quadrant 3, which uses PID=0 and LPID=0. So the kernel page tables are installed in the PID 0 process table entry. An address at 0x0... selects quadrant 0, which uses PID=PIDR for translating the rest of the address (that is, it uses the value of the PIDR register as the effective PID). If PIDR=0, then the translation is performed with the PID 0 process table entry page tables. This is the kernel mapping, so we effectively get another copy of the kernel address space at 0. A NULL pointer access will access physical memory address 0. To prevent duplicating the kernel address space in quadrant 0, this patch allocates a guard PID containing no translations, and initializes PIDR with this during boot, before the MMU is switched on. Any kernel access to quadrant 0 will use this guard PID for translation and find no valid mappings, and therefore fault. After boot, this PID will be switchd away to user context PIDs, but those contain user mappings (and usually NULL pointer protection) rather than kernel mapping, which is much safer (and by design). It may be in future this is tightened further, which the guard PID could be used for. Commit 371b8044 ("powerpc/64s: Initialize ISAv3 MMU registers before setting partition table"), introduced this problem because it zeroes PIDR at boot. However previously the value was inherited from firmware or kexec, which is not robust and can be zero (e.g., mambo). Fixes: 371b8044 ("powerpc/64s: Initialize ISAv3 MMU registers before setting partition table") Cc: stable@vger.kernel.org # v4.15+ Reported-by: NFlorian Weimer <fweimer@redhat.com> Tested-by: NMauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com> Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 17 1月, 2018 4 次提交
-
-
由 Nicholas Piggin 提交于
With the previous patch to switch to 64-bit mode after returning from RTAS and before doing any memory accesses, the RMA limit need not be clamped to 1GB to avoid RTAS bugs. Keep the 1GB limit for older firmware (although this is more of a kernel concern than RTAS), and remove it starting with POWER9. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Nicholas Piggin 提交于
The radix guest is not subject to the paravirtualized HPT VRMA limit, so remove that from ppc64_rma_size calculation for that platform. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Nicholas Piggin 提交于
This removes the RMA limit on powernv platform, which constrains early allocations such as PACAs and stacks. There are still other restrictions that must be followed, such as bolted SLB limits, but real mode addressing has no constraints. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Nicholas Piggin 提交于
There are several cases outside the normal address space management where a CPU's entire local TLB is to be flushed: 1. Booting the kernel, in case something has left stale entries in the TLB (e.g., kexec). 2. Machine check, to clean corrupted TLB entries. One other place where the TLB is flushed, is waking from deep idle states. The flush is a side-effect of calling ->cpu_restore with the intention of re-setting various SPRs. The flush itself is unnecessary because in the first case, the TLB should not acquire new corrupted TLB entries as part of sleep/wake (though they may be lost). This type of TLB flush is coded inflexibly, several times for each CPU type, and they have a number of problems with ISA v3.0B: - The current radix mode of the MMU is not taken into account, it is always done as a hash flushn For IS=2 (LPID-matching flush from host) and IS=3 with HV=0 (guest kernel flush), tlbie(l) is undefined if the R field does not match the current radix mode. - ISA v3.0B hash must flush the partition and process table caches as well. - ISA v3.0B radix must flush partition and process scoped translations, partition and process table caches, and also the page walk cache. So consolidate the flushing code and implement it in C and inline asm under the mm/ directory with the rest of the flush code. Add ISA v3.0B cases for radix and hash, and use the radix flush in radix environment. Provide a way for IS=2 (LPID flush) to specify the radix mode of the partition. Have KVM pass in the radix mode of the guest. Take out the flushes from early cputable/dt_cpu_ftrs detection hooks, and move it later in the boot process after, the MMU registers are set up and before relocation is first turned on. The TLB flush is no longer called when restoring from deep idle states. This was not be done as a separate step because booting secondaries uses the same cpu_restore as idle restore, which needs the TLB flush. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 12 11月, 2017 1 次提交
-
-
由 Balbir Singh 提交于
When using the radix MMU on Power9 DD1, to work around a hardware problem, radix__pte_update() is required to do a two stage update of the PTE. First we write a zero value into the PTE, then we flush the TLB, and then we write the new PTE value. In the normal case that works OK, but it does not work if we're updating the PTE that maps the code we're executing, because the mapping is removed by the TLB flush and we can no longer execute from it. Unfortunately the STRICT_RWX code needs to do exactly that. The exact symptoms when we hit this case vary, sometimes we print an oops and then get stuck after that, but I've also seen a machine just get stuck continually page faulting with no oops printed. The variance is presumably due to the exact layout of the text and the page size used for the mappings. In all cases we are unable to boot to a shell. There are possible solutions such as creating a second mapping of the TLB flush code, executing from that, and then jumping back to the original. However we don't want to add that level of complexity for a DD1 work around. So just detect that we're running on Power9 DD1 and refrain from changing the permissions, effectively disabling STRICT_RWX on Power9 DD1. Fixes: 7614ff32 ("powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix") Cc: stable@vger.kernel.org # v4.13+ Reported-by: NAndrew Jeffery <andrew@aj.id.au> [Changelog as suggested by Michael Ellerman <mpe@ellerman.id.au>] Signed-off-by: NBalbir Singh <bsingharora@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 31 8月, 2017 2 次提交
-
-
由 Michael Ellerman 提交于
When we map memory at boot we print out the ranges of real addresses that we mapped and the page size that was used. Currently it's a bit ugly: Mapped range 0x0 - 0x2000000000 with 0x40000000 Mapped range 0x200000000000 - 0x202000000000 with 0x40000000 Pad the addresses so they line up, and print the page size using actual units, eg: Mapped 0x0000000000000000-0x0000000001200000 with 64.0 KiB pages Mapped 0x0000000001200000-0x0000000040000000 with 2.00 MiB pages Mapped 0x0000000040000000-0x0000000100000000 with 1.00 GiB pages Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Michael Ellerman 提交于
Make the printks look a bit nicer by adding a prefix. Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 17 8月, 2017 1 次提交
-
-
由 Aneesh Kumar K.V 提交于
Now that we made sure that lockless walk of linux page table is mostly limitted to current task(current->mm->pgdir) we can update the THP update sequence to only send IPI to CPUs on which this task has run. This helps in reducing the IPI overload on systems with large number of CPUs. WRT kvm even though kvm is walking page table with vpc->arch.pgdir, it is done only on secondary CPUs and in that case we have primary CPU added to task's mm cpumask. Sending an IPI to primary will force the secondary to do a vm exit and hence this mm cpumask usage is safe here. WRT CAPI, we still end up walking linux page table with capi context MM. For now the pte lookup serialization sends an IPI to all CPUs in CPI is in use. We can further improve this by adding the CAPI interrupt handling CPU to task mm cpumask. That will be done in a later patch. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 10 8月, 2017 1 次提交
-
-
由 Suraj Jitindar Singh 提交于
The host process table base is stored in the partition table by calling the function native_register_process_table(). Currently this just sets the entry in memory and is missing a subsequent cache invalidation instruction. Any update to the partition table should be followed by a cache invalidation instruction specifying invalidation of the caching of any partition table entries (RIC = 2, PRS = 0). We already have a function to update the partition table with the required cache invalidation instructions - mmu_partition_table_set_entry(). Update the native_register_process_table() function to call mmu_partition_table_set_entry(), this ensures all appropriate invalidation will be performed. Signed-off-by: NSuraj Jitindar Singh <sjitindarsingh@gmail.com> Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [mpe: Use a local for patb0 to clean it up slightly] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 08 8月, 2017 1 次提交
-
-
由 Michael Ellerman 提交于
Currently KERN_IO_START is defined as: #define KERN_IO_START (KERN_VIRT_START + (KERN_VIRT_SIZE >> 1)) Although it looks like a constant, both the components are actually variables, to allow us to have a different value between Radix and Hash with a single kernel. However that still requires both Radix and Hash to place the kernel IO region at the same location relative to the start and end of the kernel virtual region (namely 1/2 way through it), and we'd like to change that. So split KERN_IO_START out into its own variable, and initialise it for Radix and Hash. In the medium term we should be able to reconsolidate this, by doing a more involved rearrangement of the location of the regions. Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NBalbir Singh <bsingharora@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 02 8月, 2017 1 次提交
-
-
由 Benjamin Herrenschmidt 提交于
We do that because it's used by THP pmd collapsing, so use instead a dedicated flush function. Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 26 7月, 2017 1 次提交
-
-
由 Benjamin Herrenschmidt 提交于
There's a somewhat architectural issue with Radix MMU and KVM. When coming out of a guest with AIL (Alternate Interrupt Location, ie, MMU enabled), we start executing hypervisor code with the PID register still containing whatever the guest has been using. The problem is that the CPU can (and will) then start prefetching or speculatively load from whatever host context has that same PID (if any), thus bringing translations for that context into the TLB, which Linux doesn't know about. This can cause stale translations and subsequent crashes. Fixing this in a way that is neither racy nor a huge performance impact is difficult. We could just make the host invalidations always use broadcast forms but that would hurt single threaded programs for example. We chose to fix it instead by partitioning the PID space between guest and host. This is possible because today Linux only use 19 out of the 20 bits of PID space, so existing guests will work if we make the host use the top half of the 20 bits space. We additionally add support for a property to indicate to Linux the size of the PID register which will be useful if we eventually have processors with a larger PID space available. There is still an issue with malicious guests purposefully setting the PID register to a value in the hosts PID range. Hopefully future HW can prevent that, but in the meantime, we handle it with a pair of kludges: - On the way out of a guest, before we clear the current VCPU in the PACA, we check the PID and if it's outside of the permitted range we flush the TLB for that PID. - When context switching, if the mm is "new" on that CPU (the corresponding bit was set for the first time in the mm cpumask), we check if any sibling thread is in KVM (has a non-NULL VCPU pointer in the PACA). If that is the case, we also flush the PID for that CPU (core). This second part is needed to handle the case where a process is migrated (or starts a new pthread) on a sibling thread of the CPU coming out of KVM, as there's a window where stale translations can exist before we detect it and flush them out. A future optimization could be added by keeping track of whether the PID has ever been used and avoid doing that for completely fresh PIDs. We could similarily mark PIDs that have been the subject of a global invalidation as "fresh". But for now this will do. Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> [mpe: Rework the asm to build with CONFIG_PPC_RADIX_MMU=n, drop unneeded include of kvm_book3s_asm.h] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 18 7月, 2017 2 次提交
-
-
由 Michael Ellerman 提交于
Currently even with STRICT_KERNEL_RWX we leave the __init text marked executable after init, which is bad. Add a hook to mark it NX (no-execute) before we free it, and implement it for radix and hash. Note that we use __init_end as the end address, not _einittext, because overlaps_kernel_text() uses __init_end, because there are additional executable sections other than .init.text between __init_begin and __init_end. Tested on radix and hash with: 0:mon> p $__init_begin *** 400 exception occurred Fixes: 1e0fc9d1 ("powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs") Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
由 Michael Ellerman 提交于
Move the core logic into a helper, so we can use it for changing permissions other than _PAGE_WRITE. Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Reviewed-by: NBalbir Singh <bsingharora@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 04 7月, 2017 1 次提交
-
-
由 Balbir Singh 提交于
The Radix linear mapping code (create_physical_mapping()) tries to use the largest page size it can at each step. Currently the only reason it steps down to a smaller page size is if the start addr is unaligned (never happens in practice), or the end of memory is not aligned to a huge page boundary. To support STRICT_RWX we need to break the mapping at __init_begin, so that the text and rodata prior to that can be marked R_X and the regular pages after can be marked RW. Having done that we can now implement mark_rodata_ro() for Radix, knowing that we won't need to split any mappings. Signed-off-by: NBalbir Singh <bsingharora@gmail.com> [mpe: Split down to PAGE_SIZE, not 2MB, rewrite change log] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 03 7月, 2017 1 次提交
-
-
由 Balbir Singh 提交于
Commit 9abcc981 ("powerpc/mm/radix: Only add X for pages overlapping kernel text") changed the linear mapping on Radix to only mark the kernel text executable. However if the kernel is run relocated, for example as a kdump kernel, then the exception vectors are split from the kernel text, ie. they remain at real address 0. We tend to get away with it, because the kernel itself will usually be below 1G, which means the 1G page at 0-1G is marked executable and everything works OK. However if the kernel is loaded above 1G, or the system has less than 1G in total (meaning we can't use a 1G page), then the exception vectors will not be marked executable and the kernel will fail to boot. Fix it by also checking if the address range overlaps the exception vectors when deciding if we should add PAGE_KERNEL_X. Fixes: 9abcc981 ("powerpc/mm/radix: Only add X for pages overlapping kernel text") Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: NBalbir Singh <bsingharora@gmail.com> [mpe: Combine with the existing check, rewrite change log] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 02 7月, 2017 1 次提交
-
-
由 Oliver O'Halloran 提交于
Add support for the devmap bit on PTEs and PMDs for PPC64 Book3S. This is used to differentiate device backed memory from transparent huge pages since they are handled in more or less the same manner by the core mm code. Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NOliver O'Halloran <oohall@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 23 6月, 2017 1 次提交
-
-
由 Balbir Singh 提交于
Add a trace point for tlbie(l) (Translation Lookaside Buffer Invalidate Entry (Local)) instructions. The tlbie instruction has changed over the years, so not all versions accept the same operands. Use the ISA v3 field operands because they are the most verbose, we may change them in future. Example output: qemu-system-ppc-5371 [016] 1412.369519: tlbie: tlbie with lpid 0, local 1, rb=67bd8900174c11c1, rs=0, ric=0 prs=0 r=0 Signed-off-by: NBalbir Singh <bsingharora@gmail.com> [mpe: Add some missing trace_tlbie()s, reword change log] Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-
- 15 6月, 2017 1 次提交
-
-
由 Michael Ellerman 提交于
Currently we map the whole linear mapping with PAGE_KERNEL_X. Instead we should check if the page overlaps the kernel text and only then add PAGE_KERNEL_X. Note that we still use 1G pages if they're available, so this will typically still result in a 1G executable page at KERNELBASE. So this fix is primarily useful for catching stray branches to high linear mapping addresses. Without this patch, we can execute at 1G in xmon using: 0:mon> m c000000040000000 c000000040000000 00 l c000000040000000 00000000 01006038 c000000040000004 00000000 2000804e c000000040000008 00000000 x 0:mon> di c000000040000000 c000000040000000 38600001 li r3,1 c000000040000004 4e800020 blr 0:mon> p c000000040000000 return value is 0x1 After we get a 400 as expected: 0:mon> p c000000040000000 *** 400 exception occurred Fixes: 2bfd65e4 ("powerpc/mm/radix: Add radix callbacks for early init routines") Cc: stable@vger.kernel.org # v4.7+ Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NBalbir Singh <bsingharora@gmail.com>
-
- 03 3月, 2017 1 次提交
-
-
由 Paul Mackerras 提交于
The POWER9 MMU reads and caches entries from the process table. When we kexec from one kernel to another, the second kernel sets its process table pointer but doesn't currently do anything to make the CPU invalidate any cached entries from the old process table. This adds a tlbie (TLB invalidate entry) instruction with parameters to invalidate caching of the process table after the new process table is installed. Signed-off-by: NPaul Mackerras <paulus@ozlabs.org> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
-