- 08 1月, 2013 1 次提交
-
-
由 Cornelia Huck 提交于
Add support for handling I/O interrupts (standard, subchannel-related ones and rudimentary adapter interrupts). The subchannel-identifying parameters are encoded into the interrupt type. I/O interrupts are floating, so they can't be injected on a specific vcpu. Reviewed-by: NAlexander Graf <agraf@suse.de> Reviewed-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NCornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
- 06 12月, 2012 2 次提交
-
-
由 Mihai Caraman 提交于
Implement ONE_REG interface for EPCR register adding KVM_REG_PPC_EPCR to the list of ONE_REG PPC supported registers. Signed-off-by: NMihai Caraman <mihai.caraman@freescale.com> [agraf: remove HV dependency, use get/put_user] Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
A new ioctl, KVM_PPC_GET_HTAB_FD, returns a file descriptor. Reads on this fd return the contents of the HPT (hashed page table), writes create and/or remove entries in the HPT. There is a new capability, KVM_CAP_PPC_HTAB_FD, to indicate the presence of the ioctl. The ioctl takes an argument structure with the index of the first HPT entry to read out and a set of flags. The flags indicate whether the user is intending to read or write the HPT, and whether to return all entries or only the "bolted" entries (those with the bolted bit, 0x10, set in the first doubleword). This is intended for use in implementing qemu's savevm/loadvm and for live migration. Therefore, on reads, the first pass returns information about all HPTEs (or all bolted HPTEs). When the first pass reaches the end of the HPT, it returns from the read. Subsequent reads only return information about HPTEs that have changed since they were last read. A read that finds no changed HPTEs in the HPT following where the last read finished will return 0 bytes. The format of the data provides a simple run-length compression of the invalid entries. Each block of data starts with a header that indicates the index (position in the HPT, which is just an array), the number of valid entries starting at that index (may be zero), and the number of invalid entries following those valid entries. The valid entries, 16 bytes each, follow the header. The invalid entries are not explicitly represented. Signed-off-by: NPaul Mackerras <paulus@samba.org> [agraf: fix documentation] Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 30 10月, 2012 1 次提交
-
-
由 Alexander Graf 提交于
All user space offloaded instruction emulation needs to reenter kvm to produce consistent state again. Fix the section in the documentation to mention all of them. Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 11 10月, 2012 1 次提交
-
-
由 Cornelia Huck 提交于
Signed-off-by: NCornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
- 06 10月, 2012 5 次提交
-
-
由 Paul Mackerras 提交于
The PAPR paravirtualization interface lets guests register three different types of per-vCPU buffer areas in its memory for communication with the hypervisor. These are called virtual processor areas (VPAs). Currently the hypercalls to register and unregister VPAs are handled by KVM in the kernel, and userspace has no way to know about or save and restore these registrations across a migration. This adds "register" codes for these three areas that userspace can use with the KVM_GET/SET_ONE_REG ioctls to see what addresses have been registered, and to register or unregister them. This will be needed for guest hibernation and migration, and is also needed so that userspace can unregister them on reset (otherwise we corrupt guest memory after reboot by writing to the VPAs registered by the previous kernel). The "register" for the VPA is a 64-bit value containing the address, since the length of the VPA is fixed. The "registers" for the SLB shadow buffer and dispatch trace log (DTL) are 128 bits long, consisting of the guest physical address in the high (first) 64 bits and the length in the low 64 bits. This also fixes a bug where we were calling init_vpa unconditionally, leading to an oops when unregistering the VPA. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
This enables userspace to get and set all the guest floating-point state using the KVM_[GS]ET_ONE_REG ioctls. The floating-point state includes all of the traditional floating-point registers and the FPSCR (floating point status/control register), all the VMX/Altivec vector registers and the VSCR (vector status/control register), and on POWER7, the vector-scalar registers (note that each FP register is the high-order half of the corresponding VSR). Most of these are implemented in common Book 3S code, except for VSX on POWER7. Because HV and PR differ in how they store the FP and VSX registers on POWER7, the code for these cases is not common. On POWER7, the FP registers are the upper halves of the VSX registers vsr0 - vsr31. PR KVM stores vsr0 - vsr31 in two halves, with the upper halves in the arch.fpr[] array and the lower halves in the arch.vsr[] array, whereas HV KVM on POWER7 stores the whole VSX register in arch.vsr[]. Signed-off-by: NPaul Mackerras <paulus@samba.org> [agraf: fix whitespace, vsx compilation] Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
This enables userspace to get and set various SPRs (special-purpose registers) using the KVM_[GS]ET_ONE_REG ioctls. With this, userspace can get and set all the SPRs that are part of the guest state, either through the KVM_[GS]ET_REGS ioctls, the KVM_[GS]ET_SREGS ioctls, or the KVM_[GS]ET_ONE_REG ioctls. The SPRs that are added here are: - DABR: Data address breakpoint register - DSCR: Data stream control register - PURR: Processor utilization of resources register - SPURR: Scaled PURR - DAR: Data address register - DSISR: Data storage interrupt status register - AMR: Authority mask register - UAMOR: User authority mask override register - MMCR0, MMCR1, MMCRA: Performance monitor unit control registers - PMC1..PMC8: Performance monitor unit counter registers In order to reduce code duplication between PR and HV KVM code, this moves the kvm_vcpu_ioctl_[gs]et_one_reg functions into book3s.c and centralizes the copying between user and kernel space there. The registers that are handled differently between PR and HV, and those that exist only in one flavor, are handled in kvmppc_[gs]et_one_reg() functions that are specific to each flavor. Signed-off-by: NPaul Mackerras <paulus@samba.org> [agraf: minimal style fixes] Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Bharat Bhushan 提交于
Patch to access the debug registers (IACx/DACx) using ONE_REG api was sent earlier. But that missed the respective documentation. Also corrected the index number referencing in section 4.69 Signed-off-by: NBharat Bhushan <bharat.bhushan@freescale.com> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Liu Yu-B13201 提交于
And add a new flag definition in kvm_ppc_pvinfo to indicate whether the host supports the EV_IDLE hcall. Signed-off-by: NLiu Yu <yu.liu@freescale.com> [stuart.yoder@freescale.com: cleanup,fixes for conditions allowing idle] Signed-off-by: NStuart Yoder <stuart.yoder@freescale.com> [agraf: fix typo] Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 23 9月, 2012 1 次提交
-
-
由 Alex Williamson 提交于
To emulate level triggered interrupts, add a resample option to KVM_IRQFD. When specified, a new resamplefd is provided that notifies the user when the irqchip has been resampled by the VM. This may, for instance, indicate an EOI. Also in this mode, posting of an interrupt through an irqfd only asserts the interrupt. On resampling, the interrupt is automatically de-asserted prior to user notification. This enables level triggered interrupts to be posted and re-enabled from vfio with no userspace intervention. All resampling irqfds can make use of a single irq source ID, so we reserve a new one for this interface. Signed-off-by: NAlex Williamson <alex.williamson@redhat.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
- 09 9月, 2012 1 次提交
-
-
由 Jan Kiszka 提交于
Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
- 22 8月, 2012 1 次提交
-
-
由 Xiao Guangrong 提交于
In current code, if we map a readonly memory space from host to guest and the page is not currently mapped in the host, we will get a fault pfn and async is not allowed, then the vm will crash We introduce readonly memory region to map ROM/ROMD to the guest, read access is happy for readonly memslot, write access on readonly memslot will cause KVM_EXIT_MMIO exit Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
- 03 7月, 2012 1 次提交
-
-
由 Alex Williamson 提交于
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com> Acked-by: NMichael S. Tsirkin <mst@redhat.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
- 30 5月, 2012 1 次提交
-
-
由 Paul Mackerras 提交于
This adds a new ioctl to enable userspace to control the size of the guest hashed page table (HPT) and to clear it out when resetting the guest. The KVM_PPC_ALLOCATE_HTAB ioctl is a VM ioctl and takes as its parameter a pointer to a u32 containing the desired order of the HPT (log base 2 of the size in bytes), which is updated on successful return to the actual order of the HPT which was allocated. There must be no vcpus running at the time of this ioctl. To enforce this, we now keep a count of the number of vcpus running in kvm->arch.vcpus_running. If the ioctl is called when a HPT has already been allocated, we don't reallocate the HPT but just clear it out. We first clear the kvm->arch.rma_setup_done flag, which has two effects: (a) since we hold the kvm->lock mutex, it will prevent any vcpus from starting to run until we're done, and (b) it means that the first vcpu to run after we're done will re-establish the VRMA if necessary. If userspace doesn't call this ioctl before running the first vcpu, the kernel will allocate a default-sized HPT at that point. We do it then rather than when creating the VM, as the code did previously, so that userspace has a chance to do the ioctl if it wants. When allocating the HPT, we can allocate either from the kernel page allocator, or from the preallocated pool. If userspace is asking for a different size from the preallocated HPTs, we first try to allocate using the kernel page allocator. Then we try to allocate from the preallocated pool, and then if that fails, we try allocating decreasing sizes from the kernel page allocator, down to the minimum size allowed (256kB). Note that the kernel page allocator limits allocations to 1 << CONFIG_FORCE_MAX_ZONEORDER pages, which by default corresponds to 16MB (on 64-bit powerpc, at least). Signed-off-by: NPaul Mackerras <paulus@samba.org> [agraf: fix module compilation] Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 06 5月, 2012 1 次提交
-
-
由 Benjamin Herrenschmidt 提交于
This is necessary for qemu to be able to pass the right information to the guest, such as the supported page sizes and corresponding encodings in the SLB and hash table, which can vary depending on the processor type, the type of KVM used (PR vs HV) and the version of KVM Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> [agraf: fix compilation on hv, adjust for newer ioctl numbers] Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 28 4月, 2012 3 次提交
-
-
由 Jan Kiszka 提交于
We can't run PIT IRQ injection work in the interrupt context of the host timer. This would allow the user to influence the handler complexity by asking for a broadcast to a large number of VCPUs. Therefore, this work was pushed into workqueue context in 9d244caf2e. However, this prevents prioritizing the PIT injection over other task as workqueues share kernel threads. This replaces the workqueue with a kthread worker and gives that thread a name in the format "kvm-pit/<owner-process-pid>". That allows to identify and adjust the kthread priority according to the VM process parameters. Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
由 Jan Kiszka 提交于
Add descriptions for KVM_CREATE_PIT2 and KVM_GET/SET_PIT2. Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
由 Jan Kiszka 提交于
This helps to identify sections and it also fixes the numbering from 4.54 to 4.61. Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
- 24 4月, 2012 1 次提交
-
-
由 Jan Kiszka 提交于
Currently, MSI messages can only be injected to in-kernel irqchips by defining a corresponding IRQ route for each message. This is not only unhandy if the MSI messages are generated "on the fly" by user space, IRQ routes are a limited resource that user space has to manage carefully. By providing a direct injection path, we can both avoid using up limited resources and simplify the necessary steps for user land. Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
- 08 4月, 2012 1 次提交
-
-
由 Eric B Munson 提交于
Now that we have a flag that will tell the guest it was suspended, create an interface for that communication using a KVM ioctl. Signed-off-by: NEric B Munson <emunson@mgebm.net> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
- 08 3月, 2012 1 次提交
-
-
由 Jan Kiszka 提交于
PCI 2.3 allows to generically disable IRQ sources at device level. This enables us to share legacy IRQs of such devices with other host devices when passing them to a guest. The new IRQ sharing feature introduced here is optional, user space has to request it explicitly. Moreover, user space can inform us about its view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the interrupt and signaling it if the guest masked it via the virtualized PCI config space. Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com> Acked-by: NAlex Williamson <alex.williamson@redhat.com> Acked-by: NMichael S. Tsirkin <mst@redhat.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
- 05 3月, 2012 9 次提交
-
-
由 Alexander Graf 提交于
Until now, we always set HIOR based on the PVR, but this is just wrong. Instead, we should be setting HIOR explicitly, so user space can decide what the initial HIOR value is - just like on real hardware. We keep the old PVR based way around for backwards compatibility, but once user space uses the SET_ONE_REG based method, we drop the PVR logic. Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Alexander Graf 提交于
Right now we transfer a static struct every time we want to get or set registers. Unfortunately, over time we realize that there are more of these than we thought of before and the extensibility and flexibility of transferring a full struct every time is limited. So this is a new approach to the problem. With these new ioctls, we can get and set a single register that is identified by an ID. This allows for very precise and limited transmittal of data. When we later realize that it's a better idea to shove over multiple registers at once, we can reuse most of the infrastructure and simply implement a GET_MANY_REGS / SET_MANY_REGS interface. Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Scott Wood 提交于
This implements a shared-memory API for giving host userspace access to the guest's TLB. Signed-off-by: NScott Wood <scottwood@freescale.com> Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Christian Borntraeger 提交于
On some cpus the overhead for virtualization instructions is in the same range as a system call. Having to call multiple ioctls to get set registers will make certain userspace handled exits more expensive than necessary. Lets provide a section in kvm_run that works as a shared save area for guest registers. We also provide two 64bit flags fields (architecture specific), that will specify 1. which parts of these fields are valid. 2. which registers were modified by userspace Each bit for these flag fields will define a group of registers (like general purpose) or a single register. Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Carsten Otte 提交于
This patch allows the user to fault in pages on a virtual cpus address space for user controlled virtual machines. Typically this is superfluous because userspace can just create a mapping and let the kernel's page fault logic take are of it. There is one exception: SIE won't start if the lowcore is not present. Normally the kernel takes care of this [handle_validity() in arch/s390/kvm/intercept.c] but since the kernel does not handle intercepts for user controlled virtual machines, userspace needs to be able to handle this condition. Signed-off-by: NCarsten Otte <cotte@de.ibm.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Carsten Otte 提交于
This patch exports the s390 SIE hardware control block to userspace via the mapping of the vcpu file descriptor. In order to do so, a new arch callback named kvm_arch_vcpu_fault is introduced for all architectures. It allows to map architecture specific pages. Signed-off-by: NCarsten Otte <cotte@de.ibm.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Carsten Otte 提交于
This patch introduces a new exit reason in the kvm_run structure named KVM_EXIT_S390_UCONTROL. This exit indicates, that a virtual cpu has regognized a fault on the host page table. The idea is that userspace can handle this fault by mapping memory at the fault location into the cpu's address space and then continue to run the virtual cpu. Signed-off-by: NCarsten Otte <cotte@de.ibm.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Carsten Otte 提交于
This patch introduces two ioctls for virtual cpus, that are only valid for kernel virtual machines that are controlled by userspace. Each virtual cpu has its individual address space in this mode of operation, and each address space is backed by the gmap implementation just like the address space for regular KVM guests. KVM_S390_UCAS_MAP allows to map a part of the user's virtual address space to the vcpu. Starting offset and length in both the user and the vcpu address space need to be aligned to 1M. KVM_S390_UCAS_UNMAP can be used to unmap a range of memory from a virtual cpu in a similar way. Signed-off-by: NCarsten Otte <cotte@de.ibm.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Carsten Otte 提交于
This patch introduces a new config option for user controlled kernel virtual machines. It introduces a parameter to KVM_CREATE_VM that allows to set bits that alter the capabilities of the newly created virtual machine. The parameter is passed to kvm_arch_init_vm for all architectures. The only valid modifier bit for now is KVM_VM_S390_UCONTROL. This requires CAP_SYS_ADMIN privileges and creates a user controlled virtual machine on s390 architectures. Signed-off-by: NCarsten Otte <cotte@de.ibm.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
- 27 12月, 2011 1 次提交
-
-
由 Avi Kivity 提交于
Signed-off-by: NAvi Kivity <avi@redhat.com>
-
- 26 12月, 2011 2 次提交
-
-
由 Jan Kiszka 提交于
Unlike all of the other cpuid bits, the TSC deadline timer bit is set unconditionally, regardless of what userspace wants. This is broken in several ways: - if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC deadline timer feature, a guest that uses the feature will break - live migration to older host kernels that don't support the TSC deadline timer will cause the feature to be pulled from under the guest's feet; breaking it - guests that are broken wrt the feature will fail. Fix by not enabling the feature automatically; instead report it to userspace. Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not KVM_GET_SUPPORTED_CPUID. Fixes the Illumos guest kernel, which uses the TSC deadline timer feature. [avi: add the KVM_CAP + documentation] Reported-by: NAlexey Zaytsev <alexey.zaytsev@gmail.com> Tested-by: NAlexey Zaytsev <alexey.zaytsev@gmail.com> Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Alex Williamson 提交于
Only allow KVM device assignment to attach to devices which: - Are not bridges - Have BAR resources (assume others are special devices) - The user has permissions to use Assigning a bridge is a configuration error, it's not supported, and typically doesn't result in the behavior the user is expecting anyway. Devices without BAR resources are typically chipset components that also don't have host drivers. We don't want users to hold such devices captive or cause system problems by fencing them off into an iommu domain. We determine "permission to use" by testing whether the user has access to the PCI sysfs resource files. By default a normal user will not have access to these files, so it provides a good indication that an administration agent has granted the user access to the device. [Yang Bai: add missing #include] [avi: fix comment style] Signed-off-by: NAlex Williamson <alex.williamson@redhat.com> Signed-off-by: NYang Bai <hamo.by@gmail.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
- 25 12月, 2011 1 次提交
-
-
由 Alex Williamson 提交于
This option has no users and it exposes a security hole that we can allow devices to be assigned without iommu protection. Make KVM_DEV_ASSIGN_ENABLE_IOMMU a mandatory option. Signed-off-by: NAlex Williamson <alex.williamson@redhat.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
- 26 9月, 2011 3 次提交
-
-
由 Alexander Graf 提交于
We have an ioctl that enables capabilities individually, but no description on what exactly happens when we enable a capability using this ioctl. This patch adds documentation for capability enabling in a new section of the API documentation. Signed-off-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Avi Kivity 提交于
Commit 371fefd6 lost a doc hunk somehow, restore it. Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Sasha Levin 提交于
The patch raises the hard limit of VCPU count to 254. This will allow developers to easily work on scalability and will allow users to test high VCPU setups easily without patching the kernel. To prevent possible issues with current setups, KVM_CAP_NR_VCPUS now returns the recommended VCPU limit (which is still 64) - this should be a safe value for everybody, while a new KVM_CAP_MAX_VCPUS returns the hard limit which is now 254. Cc: Avi Kivity <avi@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Pekka Enberg <penberg@kernel.org> Suggested-by: NPekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: NSasha Levin <levinsasha928@gmail.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
- 12 7月, 2011 2 次提交
-
-
由 Paul Mackerras 提交于
This adds infrastructure which will be needed to allow book3s_hv KVM to run on older POWER processors, including PPC970, which don't support the Virtual Real Mode Area (VRMA) facility, but only the Real Mode Offset (RMO) facility. These processors require a physically contiguous, aligned area of memory for each guest. When the guest does an access in real mode (MMU off), the address is compared against a limit value, and if it is lower, the address is ORed with an offset value (from the Real Mode Offset Register (RMOR)) and the result becomes the real address for the access. The size of the RMA has to be one of a set of supported values, which usually includes 64MB, 128MB, 256MB and some larger powers of 2. Since we are unlikely to be able to allocate 64MB or more of physically contiguous memory after the kernel has been running for a while, we allocate a pool of RMAs at boot time using the bootmem allocator. The size and number of the RMAs can be set using the kvm_rma_size=xx and kvm_rma_count=xx kernel command line options. KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability of the pool of preallocated RMAs. The capability value is 1 if the processor can use an RMA but doesn't require one (because it supports the VRMA facility), or 2 if the processor requires an RMA for each guest. This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the pool and returns a file descriptor which can be used to map the RMA. It also returns the size of the RMA in the argument structure. Having an RMA means we will get multiple KMV_SET_USER_MEMORY_REGION ioctl calls from userspace. To cope with this, we now preallocate the kvm->arch.ram_pginfo array when the VM is created with a size sufficient for up to 64GB of guest memory. Subsequently we will get rid of this array and use memory associated with each memslot instead. This moves most of the code that translates the user addresses into host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level to kvmppc_core_prepare_memory_region. Also, instead of having to look up the VMA for each page in order to check the page size, we now check that the pages we get are compound pages of 16MB. However, if we are adding memory that is mapped to an RMA, we don't bother with calling get_user_pages_fast and instead just offset from the base pfn for the RMA. Typically the RMA gets added after vcpus are created, which makes it inconvenient to have the LPCR (logical partition control register) value in the vcpu->arch struct, since the LPCR controls whether the processor uses RMA or VRMA for the guest. This moves the LPCR value into the kvm->arch struct and arranges for the MER (mediated external request) bit, which is the only bit that varies between vcpus, to be set in assembly code when going into the guest if there is a pending external interrupt request. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
由 Paul Mackerras 提交于
This lifts the restriction that book3s_hv guests can only run one hardware thread per core, and allows them to use up to 4 threads per core on POWER7. The host still has to run single-threaded. This capability is advertised to qemu through a new KVM_CAP_PPC_SMT capability. The return value of the ioctl querying this capability is the number of vcpus per virtual CPU core (vcore), currently 4. To use this, the host kernel should be booted with all threads active, and then all the secondary threads should be offlined. This will put the secondary threads into nap mode. KVM will then wake them from nap mode and use them for running guest code (while they are still offline). To wake the secondary threads, we send them an IPI using a new xics_wake_cpu() function, implemented in arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage we assume that the platform has a XICS interrupt controller and we are using icp-native.c to drive it. Since the woken thread will need to acknowledge and clear the IPI, we also export the base physical address of the XICS registers using kvmppc_set_xics_phys() for use in the low-level KVM book3s code. When a vcpu is created, it is assigned to a virtual CPU core. The vcore number is obtained by dividing the vcpu number by the number of threads per core in the host. This number is exported to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes to run the guest in single-threaded mode, it should make all vcpu numbers be multiples of the number of threads per core. We distinguish three states of a vcpu: runnable (i.e., ready to execute the guest), blocked (that is, idle), and busy in host. We currently implement a policy that the vcore can run only when all its threads are runnable or blocked. This way, if a vcpu needs to execute elsewhere in the kernel or in qemu, it can do so without being starved of CPU by the other vcpus. When a vcore starts to run, it executes in the context of one of the vcpu threads. The other vcpu threads all go to sleep and stay asleep until something happens requiring the vcpu thread to return to qemu, or to wake up to run the vcore (this can happen when another vcpu thread goes from busy in host state to blocked). It can happen that a vcpu goes from blocked to runnable state (e.g. because of an interrupt), and the vcore it belongs to is already running. In that case it can start to run immediately as long as the none of the vcpus in the vcore have started to exit the guest. We send the next free thread in the vcore an IPI to get it to start to execute the guest. It synchronizes with the other threads via the vcore->entry_exit_count field to make sure that it doesn't go into the guest if the other vcpus are exiting by the time that it is ready to actually enter the guest. Note that there is no fixed relationship between the hardware thread number and the vcpu number. Hardware threads are assigned to vcpus as they become runnable, so we will always use the lower-numbered hardware threads in preference to higher-numbered threads if not all the vcpus in the vcore are runnable, regardless of which vcpus are runnable. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-