- 21 4月, 2015 2 次提交
-
-
由 Aneesh Kumar K.V 提交于
This adds helper routines for locking and unlocking HPTEs, and uses them in the rest of the code. We don't change any locking rules in this patch. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Aneesh Kumar K.V 提交于
We don't support real-mode areas now that 970 support is removed. Remove the remaining details of rma from the code. Also rename rma_setup_done to hpte_setup_done to better reflect the changes. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 17 12月, 2014 2 次提交
-
-
由 Paul Mackerras 提交于
This removes the code that was added to enable HV KVM to work on PPC970 processors. The PPC970 is an old CPU that doesn't support virtualizing guest memory. Removing PPC970 support also lets us remove the code for allocating and managing contiguous real-mode areas, the code for the !kvm->arch.using_mmu_notifiers case, the code for pinning pages of guest memory when first accessed and keeping track of which pages have been pinned, and the code for handling H_ENTER hypercalls in virtual mode. Book3S HV KVM is now supported only on POWER7 and POWER8 processors. The KVM_CAP_PPC_RMA capability now always returns 0. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Suresh E. Warrier 提交于
This patch adds trace points in the guest entry and exit code and also for exceptions handled by the host in kernel mode - hypercalls and page faults. The new events are added to /sys/kernel/debug/tracing/events under a new subsystem called kvm_hv. Acked-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NSuresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 15 12月, 2014 2 次提交
-
-
由 Cédric Le Goater 提交于
When being restored from qemu, the kvm_get_htab_header are in native endian, but the ptes are big endian. This patch fixes restore on a KVM LE host. Qemu also needs a fix for this : http://lists.nongnu.org/archive/html/qemu-ppc/2014-11/msg00008.htmlSigned-off-by: NCédric Le Goater <clg@fr.ibm.com> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Aneesh Kumar K.V 提交于
In kvm_test_clear_dirty(), if we find an invalid HPTE we move on to the next HPTE without unlocking the invalid one. In fact we should never find an invalid and unlocked HPTE in the rmap chain, but for robustness we should unlock it. This adds the missing unlock. Reported-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 24 9月, 2014 1 次提交
-
-
由 Andres Lagar-Cavilla 提交于
1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young. 2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched. 3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does. Signed-off-by: NAndres Lagar-Cavilla <andreslc@google.com> Acked-by: NRik van Riel <riel@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 03 9月, 2014 1 次提交
-
-
由 Laurent Dufour 提交于
fc95ca72 introduces a memset in kvmppc_alloc_hpt since the general CMA doesn't clear the memory it allocates. However, the size argument passed to memset is computed from a signed value and its signed bit is extended by the cast the compiler is doing. This lead to extremely large size value when dealing with order value >= 31, and almost all the memory following the allocated space is cleaned. As a consequence, the system is panicing and may even fail spawning the kdump kernel. This fix makes use of an unsigned value for the memset's size argument to avoid sign extension. Among this fix, another shift operation which may lead to signed extended value too is also fixed. Cc: Alexey Kardashevskiy <aik@ozlabs.ru> Cc: Paul Mackerras <paulus@samba.org> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 07 8月, 2014 1 次提交
-
-
由 Joonsoo Kim 提交于
Now, we have general CMA reserved area management framework, so use it for future maintainabilty. There is no functional change. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NMichal Nazarewicz <mina86@mina86.com> Acked-by: NPaolo Bonzini <pbonzini@redhat.com> Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Alexander Graf <agraf@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Gleb Natapov <gleb@kernel.org> Acked-by: NMarek Szyprowski <m.szyprowski@samsung.com> Tested-by: NMarek Szyprowski <m.szyprowski@samsung.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 28 7月, 2014 2 次提交
-
-
由 Mihai Caraman 提交于
On book3e, guest last instruction is read on the exit path using load external pid (lwepx) dedicated instruction. This load operation may fail due to TLB eviction and execute-but-not-read entries. This patch lay down the path for an alternative solution to read the guest last instruction, by allowing kvmppc_get_lat_inst() function to fail. Architecture specific implmentations of kvmppc_load_last_inst() may read last guest instruction and instruct the emulation layer to re-execute the guest in case of failure. Make kvmppc_get_last_inst() definition common between architectures. Signed-off-by: NMihai Caraman <mihai.caraman@freescale.com> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Alexander Graf 提交于
When running on an LE host all data structures are kept in little endian byte order. However, the HTAB still needs to be maintained in big endian. So every time we access any HTAB we need to make sure we do so in the right byte order. Fix up all accesses to manually byte swap. Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 25 6月, 2014 1 次提交
-
-
由 Aneesh Kumar K.V 提交于
With guests supporting Multiple page size per segment (MPSS), hpte_page_size returns the actual page size used. Add a new function to return base page size and use that to compare against the the page size calculated from SLB. Without this patch a hpte lookup can fail since we are comparing wrong page size in kvmppc_hv_find_lock_hpte. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 30 5月, 2014 4 次提交
-
-
由 Paul Mackerras 提交于
Current, when testing whether a page is dirty (when constructing the bitmap for the KVM_GET_DIRTY_LOG ioctl), we test the C (changed) bit in the HPT entries mapping the page, and if it is 0, we consider the page to be clean. However, the Power ISA doesn't require processors to set the C bit to 1 immediately when writing to a page, and in fact allows them to delay the writeback of the C bit until they receive a TLB invalidation for the page. Thus it is possible that the page could be dirty and we miss it. Now, if there are vcpus running, this is not serious since the collection of the dirty log is racy already - some vcpu could dirty the page just after we check it. But if there are no vcpus running we should return definitive results, in case we are in the final phase of migrating the guest. Also, if the permission bits in the HPTE don't allow writing, then we know that no CPU can set C. If the HPTE was previously writable and the page was modified, any C bit writeback would have been flushed out by the tlbie that we did when changing the HPTE to read-only. Otherwise we need to do a TLB invalidation even if the C bit is 0, and then check the C bit. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Alexey Kardashevskiy 提交于
The dirty map that we construct for the KVM_GET_DIRTY_LOG ioctl has one bit per system page (4K/64K). Currently, we only set one bit in the map for each HPT entry with the Change bit set, even if the HPT is for a large page (e.g., 16MB). Userspace then considers only the first system page dirty, though in fact the guest may have modified anywhere in the large page. To fix this, we make kvm_test_clear_dirty() return the actual number of pages that are dirty (and rename it to kvm_test_clear_dirty_npages() to emphasize that that's what it returns). In kvmppc_hv_get_dirty_log() we then set that many bits in the dirty map. Signed-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
Currently, when a huge page is faulted in for a guest, we select the rmap chain to insert the HPTE into based on the guest physical address that the guest tried to access. Since there is an rmap chain for each system page, there are many rmap chains for the area covered by a huge page (e.g. 256 for 16MB pages when PAGE_SIZE = 64kB), and the huge-page HPTE could end up in any one of them. For consistency, and to make the huge-page HPTEs easier to find, we now put huge-page HPTEs in the rmap chain corresponding to the base address of the huge page. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Aneesh Kumar K.V 提交于
Today when KVM tries to reserve memory for the hash page table it allocates from the normal page allocator first. If that fails it falls back to CMA's reserved region. One of the side effects of this is that we could end up exhausting the page allocator and get linux into OOM conditions while we still have plenty of space available in CMA. This patch addresses this issue by first trying hash page table allocation from CMA's reserved region before falling back to the normal page allocator. So if we run out of memory, we really are out of memory. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 29 3月, 2014 1 次提交
-
-
由 Michael Neuling 提交于
This adds saving of the transactional memory (TM) checkpointed state on guest entry and exit. We only do this if we see that the guest has an active transaction. It also adds emulation of the TM state changes when delivering IRQs into the guest. According to the architecture, if we are transactional when an IRQ occurs, the TM state is changed to suspended, otherwise it's left unchanged. Signed-off-by: NMichael Neuling <mikey@neuling.org> Signed-off-by: NPaul Mackerras <paulus@samba.org> Acked-by: NScott Wood <scottwood@freescale.com>
-
- 27 1月, 2014 2 次提交
-
-
由 Anton Blanchard 提交于
We create a guest MSR from scratch when delivering exceptions in a few places. Instead of extracting LPCR[ILE] and inserting it into MSR_LE each time, we simply create a new variable intr_msr which contains the entire MSR to use. For a little-endian guest, userspace needs to set the ILE (interrupt little-endian) bit in the LPCR for each vcpu (or at least one vcpu in each virtual core). [paulus@samba.org - removed H_SET_MODE implementation from original version of the patch, and made kvmppc_set_lpcr update vcpu->arch.intr_msr.] Signed-off-by: NAnton Blanchard <anton@samba.org> Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Cédric Le Goater 提交于
MMIO emulation reads the last instruction executed by the guest and then emulates. If the guest is running in Little Endian order, or more generally in a different endian order of the host, the instruction needs to be byte-swapped before being emulated. This patch adds a helper routine which tests the endian order of the host and the guest in order to decide whether a byteswap is needed or not. It is then used to byteswap the last instruction of the guest in the endian order of the host before MMIO emulation is performed. Finally, kvmppc_handle_load() of kvmppc_handle_store() are modified to reverse the endianness of the MMIO if required. Signed-off-by: NCédric Le Goater <clg@fr.ibm.com> [agraf: add booke handling] Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 19 11月, 2013 2 次提交
-
-
由 pingfan liu 提交于
Since kvmppc_hv_find_lock_hpte() is called from both virtmode and realmode, so it can trigger the deadlock. Suppose the following scene: Two physical cpuM, cpuN, two VM instances A, B, each VM has a group of vcpus. If on cpuM, vcpu_A_1 holds bitlock X (HPTE_V_HVLOCK), then is switched out, and on cpuN, vcpu_A_2 try to lock X in realmode, then cpuN will be caught in realmode for a long time. What makes things even worse if the following happens, On cpuM, bitlockX is hold, on cpuN, Y is hold. vcpu_B_2 try to lock Y on cpuM in realmode vcpu_A_2 try to lock X on cpuN in realmode Oops! deadlock happens Signed-off-by: NLiu Ping Fan <pingfank@linux.vnet.ibm.com> Reviewed-by: NPaul Mackerras <paulus@samba.org> CC: stable@vger.kernel.org Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
This fixes a bug in kvmppc_do_h_enter() where the physical address for a page can be calculated incorrectly if transparent huge pages (THP) are active. Until THP came along, it was true that if we encountered a large (16M) page in kvmppc_do_h_enter(), then the associated memslot must be 16M aligned for both its guest physical address and the userspace address, and the physical address calculations in kvmppc_do_h_enter() assumed that. With THP, that is no longer true. In the case where we are using MMU notifiers and the page size that we get from the Linux page tables is larger than the page being mapped by the guest, we need to fill in some low-order bits of the physical address. Without THP, these bits would be the same in the guest physical address (gpa) and the host virtual address (hva). With THP, they can be different, and we need to use the bits from hva rather than gpa. In the case where we are not using MMU notifiers, the host physical address we get from the memslot->arch.slot_phys[] array already includes the low-order bits down to the PAGE_SIZE level, even if we are using large pages. Thus we can simplify the calculation in this case to just add in the remaining bits in the case where PAGE_SIZE is 64k and the guest is mapping a 4k page. The same bug exists in kvmppc_book3s_hv_page_fault(). The basic fix is to use psize (the page size from the HPTE) rather than pte_size (the page size from the Linux PTE) when updating the HPTE low word in r. That means that pfn needs to be computed to PAGE_SIZE granularity even if the Linux PTE is a huge page PTE. That can be arranged simply by doing the page_to_pfn() before setting page to the head of the compound page. If psize is less than PAGE_SIZE, then we need to make sure we only update the bits from PAGE_SIZE upwards, in order not to lose any sub-page offset bits in r. On the other hand, if psize is greater than PAGE_SIZE, we need to make sure we don't bring in non-zero low order bits in pfn, hence we mask (pfn << PAGE_SHIFT) with ~(psize - 1). Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 17 10月, 2013 3 次提交
-
-
由 Aneesh Kumar K.V 提交于
This patch add a new callback kvmppc_ops. This will help us in enabling both HV and PR KVM together in the same kernel. The actual change to enable them together is done in the later patch in the series. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [agraf: squash in booke changes] Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
Currently we request write access to all pages that get mapped into the guest, even if the guest is only loading from the page. This reduces the effectiveness of KSM because it means that we unshare every page we access. Also, we always set the changed (C) bit in the guest HPTE if it allows writing, even for a guest load. This fixes both these problems. We pass an 'iswrite' flag to the mmu.xlate() functions and to kvmppc_mmu_map_page() to indicate whether the access is a load or a store. The mmu.xlate() functions now only set C for stores. kvmppc_gfn_to_pfn() now calls gfn_to_pfn_prot() instead of gfn_to_pfn() so that it can indicate whether we need write access to the page, and get back a 'writable' flag to indicate whether the page is writable or not. If that 'writable' flag is clear, we then make the host HPTE read-only even if the guest HPTE allowed writing. This means that we can get a protection fault when the guest writes to a page that it has mapped read-write but which is read-only on the host side (perhaps due to KSM having merged the page). Thus we now call kvmppc_handle_pagefault() for protection faults as well as HPTE not found faults. In kvmppc_handle_pagefault(), if the access was allowed by the guest HPTE and we thus need to install a new host HPTE, we then need to remove the old host HPTE if there is one. This is done with a new function, kvmppc_mmu_unmap_page(), which uses kvmppc_mmu_pte_vflush() to find and remove the old host HPTE. Since the memslot-related functions require the KVM SRCU read lock to be held, this adds srcu_read_lock/unlock pairs around the calls to kvmppc_handle_pagefault(). Finally, this changes kvmppc_mmu_book3s_32_xlate_pte() to not ignore guest HPTEs that don't permit access, and to return -EPERM for accesses that are not permitted by the page protections. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
This adds the ability to have a separate LPCR (Logical Partitioning Control Register) value relating to a guest for each virtual core, rather than only having a single value for the whole VM. This corresponds to what real POWER hardware does, where there is a LPCR per CPU thread but most of the fields are required to have the same value on all active threads in a core. The per-virtual-core LPCR can be read and written using the GET/SET_ONE_REG interface. Userspace can can only modify the following fields of the LPCR value: DPFD Default prefetch depth ILE Interrupt little-endian TC Translation control (secondary HPT hash group search disable) We still maintain a per-VM default LPCR value in kvm->arch.lpcr, which contains bits relating to memory management, i.e. the Virtualized Partition Memory (VPM) bits and the bits relating to guest real mode. When this default value is updated, the update needs to be propagated to the per-vcore values, so we add a kvmppc_update_lpcr() helper to do that. Signed-off-by: NPaul Mackerras <paulus@samba.org> [agraf: fix whitespace] Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 26 8月, 2013 1 次提交
-
-
由 Yann Droneaud 提交于
KVM uses anon_inode_get() to allocate file descriptors as part of some of its ioctls. But those ioctls are lacking a flag argument allowing userspace to choose options for the newly opened file descriptor. In such case it's advised to use O_CLOEXEC by default so that userspace is allowed to choose, without race, if the file descriptor is going to be inherited across exec(). This patch set O_CLOEXEC flag on all file descriptors created with anon_inode_getfd() to not leak file descriptors across exec(). Signed-off-by: NYann Droneaud <ydroneaud@opteya.com> Link: http://lkml.kernel.org/r/cover.1377372576.git.ydroneaud@opteya.comReviewed-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NGleb Natapov <gleb@redhat.com>
-
- 08 7月, 2013 2 次提交
-
-
由 Aneesh Kumar K.V 提交于
Both RMA and hash page table request will be a multiple of 256K. We can use a chunk size of 256K to track the free/used 256K chunk in the bitmap. This should help to reduce the bitmap size. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Aneesh Kumar K.V 提交于
Powerpc architecture uses a hash based page table mechanism for mapping virtual addresses to physical address. The architecture require this hash page table to be physically contiguous. With KVM on Powerpc currently we use early reservation mechanism for allocating guest hash page table. This implies that we need to reserve a big memory region to ensure we can create large number of guest simultaneously with KVM on Power. Another disadvantage is that the reserved memory is not available to rest of the subsystems and and that implies we limit the total available memory in the host. This patch series switch the guest hash page table allocation to use contiguous memory allocator. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 21 6月, 2013 1 次提交
-
-
由 Aneesh Kumar K.V 提交于
We can find pte that are splitting while walking page tables. Return None pte in that case. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
-
- 27 4月, 2013 2 次提交
-
-
由 Paul Mackerras 提交于
At present, the KVM_GET_DIRTY_LOG ioctl doesn't report modifications done by the host to the virtual processor areas (VPAs) and dispatch trace logs (DTLs) registered by the guest. This is because those modifications are done either in real mode or in the host kernel context, and in neither case does the access go through the guest's HPT, and thus no change (C) bit gets set in the guest's HPT. However, the changes done by the host do need to be tracked so that the modified pages get transferred when doing live migration. In order to track these modifications, this adds a dirty flag to the struct representing the VPA/DTL areas, and arranges to set the flag when the VPA/DTL gets modified by the host. Then, when we are collecting the dirty log, we also check the dirty flags for the VPA and DTL for each vcpu and set the relevant bit in the dirty log if necessary. Doing this also means we now need to keep track of the guest physical address of the VPA/DTL areas. So as not to lose track of modifications to a VPA/DTL area when it gets unregistered, or when a new area gets registered in its place, we need to transfer the dirty state to the rmap chain. This adds code to kvmppc_unpin_guest_page() to do that if the area was dirty. To simplify that code, we now require that all VPA, DTL and SLB shadow buffer areas fit within a single host page. Guests already comply with this requirement because pHyp requires that these areas not cross a 4k boundary. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
At present, the code that determines whether a HPT entry has changed, and thus needs to be sent to userspace when it is copying the HPT, doesn't consider a hardware update to the reference and change bits (R and C) in the HPT entries to constitute a change that needs to be sent to userspace. This adds code to check for changes in R and C when we are scanning the HPT to find changed entries, and adds code to set the changed flag for the HPTE when we update the R and C bits in the guest view of the HPTE. Since we now need to set the HPTE changed flag in book3s_64_mmu_hv.c as well as book3s_hv_rm_mmu.c, we move the note_hpte_modification() function into kvm_book3s_64.h. Current Linux guest kernels don't use the hardware updates of R and C in the HPT, so this change won't affect them. Linux (or other) kernels might in future want to use the R and C bits and have them correctly transferred across when a guest is migrated, so it is better to correct this deficiency. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 10 4月, 2013 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 06 12月, 2012 5 次提交
-
-
由 Paul Mackerras 提交于
When we change or remove a HPT (hashed page table) entry, we can do either a global TLB invalidation (tlbie) that works across the whole machine, or a local invalidation (tlbiel) that only affects this core. Currently we do local invalidations if the VM has only one vcpu or if the guest requests it with the H_LOCAL flag, though the guest Linux kernel currently doesn't ever use H_LOCAL. Then, to cope with the possibility that vcpus moving around to different physical cores might expose stale TLB entries, there is some code in kvmppc_hv_entry to flush the whole TLB of entries for this VM if either this vcpu is now running on a different physical core from where it last ran, or if this physical core last ran a different vcpu. There are a number of problems on POWER7 with this as it stands: - The TLB invalidation is done per thread, whereas it only needs to be done per core, since the TLB is shared between the threads. - With the possibility of the host paging out guest pages, the use of H_LOCAL by an SMP guest is dangerous since the guest could possibly retain and use a stale TLB entry pointing to a page that had been removed from the guest. - The TLB invalidations that we do when a vcpu moves from one physical core to another are unnecessary in the case of an SMP guest that isn't using H_LOCAL. - The optimization of using local invalidations rather than global should apply to guests with one virtual core, not just one vcpu. (None of this applies on PPC970, since there we always have to invalidate the whole TLB when entering and leaving the guest, and we can't support paging out guest memory.) To fix these problems and simplify the code, we now maintain a simple cpumask of which cpus need to flush the TLB on entry to the guest. (This is indexed by cpu, though we only ever use the bits for thread 0 of each core.) Whenever we do a local TLB invalidation, we set the bits for every cpu except the bit for thread 0 of the core that we're currently running on. Whenever we enter a guest, we test and clear the bit for our core, and flush the TLB if it was set. On initial startup of the VM, and when resetting the HPT, we set all the bits in the need_tlb_flush cpumask, since any core could potentially have stale TLB entries from the previous VM to use the same LPID, or the previous contents of the HPT. Then, we maintain a count of the number of online virtual cores, and use that when deciding whether to use a local invalidation rather than the number of online vcpus. The code to make that decision is extracted out into a new function, global_invalidates(). For multi-core guests on POWER7 (i.e. when we are using mmu notifiers), we now never do local invalidations regardless of the H_LOCAL flag. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
This fixes a bug in the code which allows userspace to read out the contents of the guest's hashed page table (HPT). On the second and subsequent passes through the HPT, when we are reporting only those entries that have changed, we were incorrectly initializing the index field of the header with the index of the first entry we skipped rather than the first changed entry. This fixes it. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
With HV-style KVM, we maintain reverse-mapping lists that enable us to find all the HPT (hashed page table) entries that reference each guest physical page, with the heads of the lists in the memslot->arch.rmap arrays. When we reset the HPT (i.e. when we reboot the VM), we clear out all the HPT entries but we were not clearing out the reverse mapping lists. The result is that as we create new HPT entries, the lists get corrupted, which can easily lead to loops, resulting in the host kernel hanging when it tries to traverse those lists. This fixes the problem by zeroing out all the reverse mapping lists when we zero out the HPT. This incidentally means that we are also zeroing our record of the referenced and changed bits (not the bits in the Linux PTEs, used by the Linux MM subsystem, but the bits used by the KVM_GET_DIRTY_LOG ioctl, and those used by kvm_age_hva() and kvm_test_age_hva()). Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor. Reads on this fd return the contents of the HPT (hashed page table), writes create and/or remove entries in the HPT. There is a new capability, KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl. The ioctl takes an argument structure with the index of the first HPT entry to read out and a set of flags. The flags indicate whether the user is intending to read or write the HPT, and whether to return all entries or only the "bolted" entries (those with the bolted bit, 0x10, set in the first doubleword). This is intended for use in implementing qemu's savevm/loadvm and for live migration. Therefore, on reads, the first pass returns information about all HPTEs (or all bolted HPTEs). When the first pass reaches the end of the HPT, it returns from the read. Subsequent reads only return information about HPTEs that have changed since they were last read. A read that finds no changed HPTEs in the HPT following where the last read finished will return 0 bytes. The format of the data provides a simple run-length compression of the invalid entries. Each block of data starts with a header that indicates the index (position in the HPT, which is just an array), the number of valid entries starting at that index (may be zero), and the number of invalid entries following those valid entries. The valid entries, 16 bytes each, follow the header. The invalid entries are not explicitly represented. Signed-off-by: NPaul Mackerras <paulus@samba.org> [agraf: fix documentation] Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
This restructures the code that creates HPT (hashed page table) entries so that it can be called in situations where we don't have a struct vcpu pointer, only a struct kvm pointer. It also fixes a bug where kvmppc_map_vrma() would corrupt the guest R4 value. Most of the work of kvmppc_virtmode_h_enter is now done by a new function, kvmppc_virtmode_do_h_enter, which itself calls another new function, kvmppc_do_h_enter, which contains most of the old kvmppc_h_enter. The new kvmppc_do_h_enter takes explicit arguments for the place to return the HPTE index, the Linux page tables to use, and whether it is being called in real mode, thus removing the need for it to have the vcpu as an argument. Currently kvmppc_map_vrma creates the VRMA (virtual real mode area) HPTEs by calling kvmppc_virtmode_h_enter, which is designed primarily to handle H_ENTER hcalls from the guest that need to pin a page of memory. Since H_ENTER returns the index of the created HPTE in R4, kvmppc_virtmode_h_enter updates the guest R4, corrupting the guest R4 in the case when it gets called from kvmppc_map_vrma on the first VCPU_RUN ioctl. With this, kvmppc_map_vrma instead calls kvmppc_virtmode_do_h_enter with the address of a dummy word as the place to store the HPTE index, thus avoiding corrupting the guest R4. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 23 10月, 2012 1 次提交
-
-
由 Christoffer Dall 提交于
The mmu_notifier_retry is not specific to any vcpu (and never will be) so only take struct kvm as a parameter. The motivation is the ARM mmu code that needs to call this from somewhere where we long let go of the vcpu pointer. Signed-off-by: NChristoffer Dall <c.dall@virtualopensystems.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
- 06 10月, 2012 3 次提交
-
-
由 Paul Mackerras 提交于
In the case where the host kernel is using a 64kB base page size and the guest uses a 4k HPTE (hashed page table entry) to map an emulated MMIO device, we were calculating the guest physical address wrongly. We were calculating a gfn as the guest physical address shifted right 16 bits (PAGE_SHIFT) but then only adding back in 12 bits from the effective address, since the HPTE had a 4k page size. Thus the gpa reported to userspace was missing 4 bits. Instead, we now compute the guest physical address from the HPTE without reference to the host page size, and then compute the gfn by shifting the gpa right PAGE_SHIFT bits. Reported-by: NAlexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
This adds an implementation of kvm_arch_flush_shadow_memslot for Book3S HV, and arranges for kvmppc_core_commit_memory_region to flush the dirty log when modifying an existing slot. With this, we can handle deletion and modification of memory slots. kvm_arch_flush_shadow_memslot calls kvmppc_core_flush_memslot, which on Book3S HV now traverses the reverse map chains to remove any HPT (hashed page table) entries referring to pages in the memslot. This gets called by generic code whenever deleting a memslot or changing the guest physical address for a memslot. We flush the dirty log in kvmppc_core_commit_memory_region for consistency with what x86 does. We only need to flush when an existing memslot is being modified, because for a new memslot the rmap array (which stores the dirty bits) is all zero, meaning that every page is considered clean already, and when deleting a memslot we obviously don't care about the dirty bits any more. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
Now that we have an architecture-specific field in the kvm_memory_slot structure, we can use it to store the array of page physical addresses that we need for Book3S HV KVM on PPC970 processors. This reduces the size of struct kvm_arch for Book3S HV, and also reduces the size of struct kvm_arch_memory_slot for other PPC KVM variants since the fields in it are now only compiled in for Book3S HV. This necessitates making the kvm_arch_create_memslot and kvm_arch_free_memslot operations specific to each PPC KVM variant. That in turn means that we now don't allocate the rmap arrays on Book3S PR and Book E. Since we now unpin pages and free the slot_phys array in kvmppc_core_free_memslot, we no longer need to do it in kvmppc_core_destroy_vm, since the generic code takes care to free all the memslots when destroying a VM. We now need the new memslot to be passed in to kvmppc_core_prepare_memory_region, since we need to initialize its arch.slot_phys member on Book3S HV. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-