- 28 May 2015, 1 commit
-
-
Submitted by Paolo Bonzini
This lets the function access the new memory slot without going through kvm_memslots and id_to_memslot. It will simplify the code when more than one address space is supported. Unfortunately, the "const"ness of the new argument must be cast away in two places. Fixing KVM to accept const struct kvm_memory_slot pointers would require modifications in pretty much all architectures, and is left for later. Reviewed-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
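As a rough sketch of the resulting interface (parameter names are illustrative; check the tree for the exact prototype), the arch commit hook now receives the new slot directly:

```c
/* Sketch of the reworked hook: the new slot is passed in, so the arch
 * code no longer needs kvm_memslots()/id_to_memslot() to find it. */
void kvm_arch_commit_memory_region(struct kvm *kvm,
                const struct kvm_userspace_memory_region *mem,
                const struct kvm_memory_slot *old,
                const struct kvm_memory_slot *new,
                enum kvm_mr_change change);
```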
-
- 26 May 2015, 2 commits
-
-
Submitted by Paolo Bonzini
Architecture-specific helpers are not supposed to muck with struct kvm_userspace_memory_region contents. Add const to enforce this. In order to eliminate the only write in __kvm_set_memory_region, the cleaning of deleted slots is pulled up from update_memslots to __kvm_set_memory_region. Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Reviewed-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Submitted by Paolo Bonzini
kvm_memslots provides lockdep checking. Use it consistently instead of explicit dereferencing of kvm->memslots. Reviewed-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
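A simplified sketch of the difference (the real accessor lives in include/linux/kvm_host.h; the exact check is abbreviated here):

```c
/* The accessor carries SRCU/lockdep annotations; a raw pointer
 * dereference does not. */
static inline struct kvm_memslots *kvm_memslots(struct kvm *kvm)
{
        return rcu_dereference_check(kvm->memslots,
                        srcu_read_lock_held(&kvm->srcu) ||
                        lockdep_is_held(&kvm->slots_lock));
}

/* Preferred:   struct kvm_memslots *slots = kvm_memslots(kvm);  */
/* Instead of:  struct kvm_memslots *slots = kvm->memslots;      */
```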
-
- 10 May 2015, 1 commit
-
-
Submitted by Paul Mackerras
This fixes a regression introduced in commit 25fedfca, "KVM: PPC: Book3S HV: Move vcore preemption point up into kvmppc_run_vcpu", which leads to a user-triggerable oops. In the case where we try to run a vcore on a physical core that is not in single-threaded mode, or the vcore has too many threads for the physical core, we iterate the list of runnable vcpus to make each one return an EBUSY error to userspace. Since this involves taking each vcpu off the runnable_threads list for the vcore, we need to use list_for_each_entry_safe rather than list_for_each_entry to traverse the list. Otherwise the kernel will crash with an oops message like this: Unable to handle kernel paging request for data at address 0x000fff88 Faulting instruction address: 0xd00000001e635dc8 Oops: Kernel access of bad area, sig: 11 [#2] SMP NR_CPUS=1024 NUMA PowerNV ... CPU: 48 PID: 91256 Comm: qemu-system-ppc Tainted: G D 3.18.0 #1 task: c00000274e507500 ti: c0000027d1924000 task.ti: c0000027d1924000 NIP: d00000001e635dc8 LR: d00000001e635df8 CTR: c00000000011ba50 REGS: c0000027d19275b0 TRAP: 0300 Tainted: G D (3.18.0) MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 22002824 XER: 00000000 CFAR: c000000000008468 DAR: 00000000000fff88 DSISR: 40000000 SOFTE: 1 GPR00: d00000001e635df8 c0000027d1927830 d00000001e64c850 0000000000000001 GPR04: 0000000000000001 0000000000000001 0000000000000000 0000000000000000 GPR08: 0000000000200200 0000000000000000 0000000000000000 d00000001e63e588 GPR12: 0000000000002200 c000000007dbc800 c000000fc7800000 000000000000000a GPR16: fffffffffffffffc c000000fd5439690 c000000fc7801c98 0000000000000001 GPR20: 0000000000000003 c0000027d1927aa8 c000000fd543b348 c000000fd543b350 GPR24: 0000000000000000 c000000fa57f0000 0000000000000030 0000000000000000 GPR28: fffffffffffffff0 c000000fd543b328 00000000000fe468 c000000fd543b300 NIP [d00000001e635dc8] kvmppc_run_core+0x198/0x17c0 [kvm_hv] LR [d00000001e635df8] kvmppc_run_core+0x1c8/0x17c0 [kvm_hv] Call Trace: [c0000027d1927830] [d00000001e635df8] kvmppc_run_core+0x1c8/0x17c0 [kvm_hv] (unreliable) [c0000027d1927a30] [d00000001e638350] kvmppc_vcpu_run_hv+0x5b0/0xdd0 [kvm_hv] [c0000027d1927b70] [d00000001e510504] kvmppc_vcpu_run+0x44/0x60 [kvm] [c0000027d1927ba0] [d00000001e50d4a4] kvm_arch_vcpu_ioctl_run+0x64/0x170 [kvm] [c0000027d1927be0] [d00000001e504be8] kvm_vcpu_ioctl+0x5e8/0x7a0 [kvm] [c0000027d1927d40] [c0000000002d6720] do_vfs_ioctl+0x490/0x780 [c0000027d1927de0] [c0000000002d6ae4] SyS_ioctl+0xd4/0xf0 [c0000027d1927e30] [c000000000009358] syscall_exit+0x0/0x98 Instruction dump: 60000000 60420000 387e1b30 38800003 38a00001 38c00000 480087d9 e8410018 ebde1c98 7fbdf040 3bdee368 419e0048 <813e1b20> 939e1b18 2f890001 409effcc ---[ end trace 8cdf50251cca6680 ]--- Fixes: 25fedfcaSigned-off-by: NPaul Mackerras <paulus@samba.org> Reviewed-by: NAlexander Graf <agraf@suse.de> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
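The essence of the fix is the standard pattern for deleting entries while walking a list; a minimal sketch (field and helper names approximate the KVM HV code and are not exact):

```c
/* list_for_each_entry_safe() caches the next element before the body
 * runs, so unlinking the current vcpu from runnable_threads cannot
 * derail the traversal. */
struct kvm_vcpu *vcpu, *vnext;

list_for_each_entry_safe(vcpu, vnext, &vc->runnable_threads,
                         arch.run_list) {
        vcpu->arch.ret = -EBUSY;            /* report EBUSY to userspace */
        kvmppc_remove_runnable(vc, vcpu);   /* unlinks arch.run_list */
        wake_up(&vcpu->arch.cpu_run);
}
```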
-
- 21 April 2015, 9 commits
-
-
Submitted by Paul Mackerras
This uses msgsnd where possible for signalling other threads within the same core on POWER8 systems, rather than IPIs through the XICS interrupt controller. This includes waking secondary threads to run the guest, the interrupts generated by the virtual XICS, and the interrupts to bring the other threads out of the guest when exiting. Aggregated statistics from debugfs across vcpus for a guest with 32 vcpus, 8 threads/vcore, running on a POWER8, show this before the change: rm_entry: 3387.6ns (228 - 86600, 1008969 samples) rm_exit: 4561.5ns (12 - 3477452, 1009402 samples) rm_intr: 1660.0ns (12 - 553050, 3600051 samples) and this after the change: rm_entry: 3060.1ns (212 - 65138, 953873 samples) rm_exit: 4244.1ns (12 - 9693408, 954331 samples) rm_intr: 1342.3ns (12 - 1104718, 3405326 samples) for a test of booting Fedora 20 big-endian to the login prompt. The time taken for a H_PROD hcall (which is handled in the host kernel) went down from about 35 microseconds to about 16 microseconds with this change. The noinline added to kvmppc_run_core turned out to be necessary for good performance, at least with gcc 4.9.2 as packaged with Fedora 21 and a little-endian POWER8 host. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
Submitted by Paul Mackerras
Currently, the entry_exit_count field in the kvmppc_vcore struct contains two 8-bit counts, one of the threads that have started entering the guest, and one of the threads that have started exiting the guest. This changes it to an entry_exit_map field which contains two bitmaps of 8 bits each. The advantage of doing this is that it gives us a bitmap of which threads need to be signalled when exiting the guest. That means that we no longer need to use the trick of setting the HDEC to 0 to pull the other threads out of the guest, which led in some cases to a spurious HDEC interrupt on the next guest entry. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
Submitted by Paul Mackerras
We can tell when a secondary thread has finished running a guest by the fact that it clears its kvm_hstate.kvm_vcpu pointer, so there is no real need for the nap_count field in the kvmppc_vcore struct. This changes kvmppc_wait_for_nap to poll the kvm_hstate.kvm_vcpu pointers of the secondary threads rather than polling vc->nap_count. Besides reducing the size of the kvmppc_vcore struct by 8 bytes, this also means that we can tell which secondary threads have got stuck and thus print a more informative error message. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
-
Submitted by Paul Mackerras
Rather than calling cond_resched() in kvmppc_run_core() before doing the post-processing for the vcpus that we have just run (that is, calling kvmppc_handle_exit_hv(), kvmppc_set_timer(), etc.), we now do that post-processing before calling cond_resched(), and that post-processing is moved out into its own function, post_guest_process(). The reschedule point is now in kvmppc_run_vcpu() and we define a new vcore state, VCORE_PREEMPT, to indicate that the vcore's runner task is runnable but not running. (Doing the reschedule with the vcore in VCORE_INACTIVE state would be bad because there are potentially other vcpus waiting for the runner in kvmppc_wait_for_exec() which then wouldn't get woken up.) Also, we make use of the handy cond_resched_lock() function, which unlocks and relocks vc->lock for us around the reschedule. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
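A minimal usage sketch of the helper mentioned above (state handling simplified; not the exact in-tree sequence):

```c
/* cond_resched_lock() drops the spinlock, reschedules if needed, and
 * re-acquires the lock before returning, so the caller's lock/unlock
 * pairing stays balanced across the preemption point. */
spin_lock(&vc->lock);
vc->vcore_state = VCORE_PREEMPT;   /* runner is runnable, not running */
cond_resched_lock(&vc->lock);
vc->vcore_state = VCORE_INACTIVE;
spin_unlock(&vc->lock);
```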
-
Submitted by Paul Mackerras
Previously, if kvmppc_run_core() was running a VCPU that needed a VPA update (i.e. one of its 3 virtual processor areas needed to be pinned in memory so the host real mode code can update it on guest entry and exit), we would drop the vcore lock and do the update there and then. Future changes will make it inconvenient to drop the lock, so instead we now remove it from the list of runnable VCPUs and wake up its VCPU task. This will have the effect that the VCPU task will exit kvmppc_run_vcpu(), go around the do loop in kvmppc_vcpu_run_hv(), and re-enter kvmppc_run_vcpu(), whereupon it will do the necessary call to kvmppc_update_vpas() and then rejoin the vcore. The one complication is that the runner VCPU (whose VCPU task is the current task) might be one of the ones that gets removed from the runnable list. In that case we just return from kvmppc_run_core() and let the code in kvmppc_run_vcpu() wake up another VCPU task to be the runner if necessary. This all means that the VCORE_STARTING state is no longer used, so we remove it. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
Submitted by Paul Mackerras
This reads the timebase at various points in the real-mode guest entry/exit code and uses that to accumulate total, minimum and maximum time spent in those parts of the code. Currently these times are accumulated per vcpu in 5 parts of the code: * rm_entry - time taken from the start of kvmppc_hv_entry() until just before entering the guest. * rm_intr - time from when we take a hypervisor interrupt in the guest until we either re-enter the guest or decide to exit to the host. This includes time spent handling hcalls in real mode. * rm_exit - time from when we decide to exit the guest until the return from kvmppc_hv_entry(). * guest - time spent in the guest * cede - time spent napping in real mode due to an H_CEDE hcall while other threads in the same vcore are active. These times are exposed in debugfs in a directory per vcpu that contains a file called "timings". This file contains one line for each of the 5 timings above, with the name followed by a colon and 4 numbers, which are the count (number of times the code has been executed), the total time, the minimum time, and the maximum time, all in nanoseconds. The overhead of the extra code amounts to about 30ns for an hcall that is handled in real mode (e.g. H_SET_DABR), which is about 25%. Since production environments may not wish to incur this overhead, the new code is conditional on a new config symbol, CONFIG_KVM_BOOK3S_HV_EXIT_TIMING. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
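The per-region accumulator described above can be pictured roughly as follows; this is a sketch in C for illustration (the real updates happen in real-mode assembly, and the helper shown here is hypothetical):

```c
/* One accumulator per timed region (rm_entry, rm_intr, rm_exit, ...). */
struct kvmhv_tb_accumulator {
        u64 seqcount;   /* odd while an update is in progress; also count * 2 */
        u64 tb_total;   /* sum of deltas, in timebase ticks */
        u64 tb_min;     /* smallest delta seen */
        u64 tb_max;     /* largest delta seen */
};

static void tb_accumulate(struct kvmhv_tb_accumulator *a, u64 delta)
{
        a->seqcount++;                  /* mark update in progress */
        a->tb_total += delta;
        if (!a->tb_min || delta < a->tb_min)
                a->tb_min = delta;
        if (delta > a->tb_max)
                a->tb_max = delta;
        a->seqcount++;                  /* mark update complete */
}
```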
-
Submitted by Paul Mackerras
This creates a debugfs directory for each HV guest (assuming debugfs is enabled in the kernel config), and within that directory, a file by which the contents of the guest's HPT (hashed page table) can be read. The directory is named vmnnnn, where nnnn is the PID of the process that created the guest. The file is named "htab". This is intended to help in debugging problems in the host's management of guest memory. The contents of the file consist of a series of lines like this: 3f48 4000d032bf003505 0000000bd7ff1196 00000003b5c71196 The first field is the index of the entry in the HPT, the second and third are the HPT entry, so the third entry contains the real page number that is mapped by the entry if the entry's valid bit is set. The fourth field is the guest's view of the second doubleword of the entry, so it contains the guest physical address. (The format of the second through fourth fields is described in the Power ISA and also in arch/powerpc/include/asm/mmu-hash64.h.) Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
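Creating the per-guest directory and file follows the usual debugfs pattern; a hedged sketch of the creation path (the fops name is hypothetical):

```c
/* Per-VM debugfs directory "vmNNNN" with a read-only "htab" dump file. */
char buf[16];

snprintf(buf, sizeof(buf), "vm%d", current->pid);
kvm->arch.debugfs_dir = debugfs_create_dir(buf, kvm_debugfs_dir);
debugfs_create_file("htab", 0400, kvm->arch.debugfs_dir,
                    kvm, &debug_htab_fops);   /* hypothetical fops */
```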
-
Submitted by Aneesh Kumar K.V
We don't support real-mode areas now that 970 support is removed. Remove the remaining details of rma from the code. Also rename rma_setup_done to hpte_setup_done to better reflect the changes. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
-
Submitted by David Gibson
On POWER, storage caching is usually configured via the MMU - attributes such as cache-inhibited are stored in the TLB and the hashed page table. This makes correctly performing cache inhibited IO accesses awkward when the MMU is turned off (real mode). Some CPU models provide special registers to control the cache attributes of real mode load and stores but this is not at all consistent. This is a problem in particular for SLOF, the firmware used on KVM guests, which runs entirely in real mode, but which needs to do IO to load the kernel. To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to a logical address (aka guest physical address). SLOF uses these for IO. However, because these are implemented within qemu, not the host kernel, these bypass any IO devices emulated within KVM itself. The simplest way to see this problem is to attempt to boot a KVM guest from a virtio-blk device with iothread / dataplane enabled. The iothread code relies on an in kernel implementation of the virtio queue notification, which is not triggered by the IO hcalls, and so the guest will stall in SLOF unable to load the guest OS. This patch addresses this by providing in-kernel implementations of the 2 hypercalls, which correctly scan the KVM IO bus. Any access to an address not handled by the KVM IO bus will cause a VM exit, hitting the qemu implementation as before. Note that a userspace change is also required, in order to enable these new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL. Signed-off-by: NDavid Gibson <david@gibson.dropbear.id.au> [agraf: fix compilation] Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 20 March 2015, 2 commits
-
-
Submitted by Paul Mackerras
The VPA (virtual processor area) is defined by PAPR and is therefore big-endian, so we need a be32_to_cpu when reading it in kvmppc_get_yield_count(). Without this, H_CONFER always fails on a little-endian host, causing SMP guests to waste time spinning on spinlocks. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
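A sketch of the corrected read, close to the in-tree helper but simplified here:

```c
/* The VPA (struct lppaca) is PAPR-defined and big-endian, so the yield
 * count must be converted before use on a little-endian host. */
static int kvmppc_get_yield_count(struct kvm_vcpu *vcpu)
{
        struct lppaca *lppaca;
        int yield_count = 0;

        spin_lock(&vcpu->arch.vpa_update_lock);
        lppaca = (struct lppaca *)vcpu->arch.vpa.pinned_addr;
        if (lppaca)
                yield_count = be32_to_cpu(lppaca->yield_count);
        spin_unlock(&vcpu->arch.vpa_update_lock);
        return yield_count;
}
```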
-
Submitted by Paul Mackerras
Currently, kvmppc_set_lpcr() has a spinlock around the whole function, and inside that does mutex_lock(&kvm->lock). It is not permitted to take a mutex while holding a spinlock, because the mutex_lock might call schedule(). In addition, this causes lockdep to warn about a lock ordering issue: ====================================================== [ INFO: possible circular locking dependency detected ] 3.18.0-kvm-04645-gdfea862-dirty #131 Not tainted ------------------------------------------------------- qemu-system-ppc/8179 is trying to acquire lock: (&kvm->lock){+.+.+.}, at: [<d00000000ecc1f54>] .kvmppc_set_lpcr+0xf4/0x1c0 [kvm_hv] but task is already holding lock: (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecc1ea0>] .kvmppc_set_lpcr+0x40/0x1c0 [kvm_hv] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&(&vcore->lock)->rlock){+.+...}: [<c000000000b3c120>] .mutex_lock_nested+0x80/0x570 [<d00000000ecc7a14>] .kvmppc_vcpu_run_hv+0xc4/0xe40 [kvm_hv] [<d00000000eb9f5cc>] .kvmppc_vcpu_run+0x2c/0x40 [kvm] [<d00000000eb9cb24>] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm] [<d00000000eb94478>] .kvm_vcpu_ioctl+0x4a8/0x7b0 [kvm] [<c00000000026cbb4>] .do_vfs_ioctl+0x444/0x770 [<c00000000026cfa4>] .SyS_ioctl+0xc4/0xe0 [<c000000000009264>] syscall_exit+0x0/0x98 -> #0 (&kvm->lock){+.+.+.}: [<c0000000000ff28c>] .lock_acquire+0xcc/0x1a0 [<c000000000b3c120>] .mutex_lock_nested+0x80/0x570 [<d00000000ecc1f54>] .kvmppc_set_lpcr+0xf4/0x1c0 [kvm_hv] [<d00000000ecc510c>] .kvmppc_set_one_reg_hv+0x4dc/0x990 [kvm_hv] [<d00000000eb9f234>] .kvmppc_set_one_reg+0x44/0x330 [kvm] [<d00000000eb9c9dc>] .kvm_vcpu_ioctl_set_one_reg+0x5c/0x150 [kvm] [<d00000000eb9ced4>] .kvm_arch_vcpu_ioctl+0x214/0x2c0 [kvm] [<d00000000eb940b0>] .kvm_vcpu_ioctl+0xe0/0x7b0 [kvm] [<c00000000026cbb4>] .do_vfs_ioctl+0x444/0x770 [<c00000000026cfa4>] .SyS_ioctl+0xc4/0xe0 [<c000000000009264>] syscall_exit+0x0/0x98 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&(&vcore->lock)->rlock); lock(&kvm->lock); lock(&(&vcore->lock)->rlock); lock(&kvm->lock); *** DEADLOCK *** 2 locks held by qemu-system-ppc/8179: #0: (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f18>] .vcpu_load+0x28/0x90 [kvm] #1: (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecc1ea0>] .kvmppc_set_lpcr+0x40/0x1c0 [kvm_hv] stack backtrace: CPU: 4 PID: 8179 Comm: qemu-system-ppc Not tainted 3.18.0-kvm-04645-gdfea862-dirty #131 Call Trace: [c000001a66c0f310] [c000000000b486ac] .dump_stack+0x88/0xb4 (unreliable) [c000001a66c0f390] [c0000000000f8bec] .print_circular_bug+0x27c/0x3d0 [c000001a66c0f440] [c0000000000fe9e8] .__lock_acquire+0x2028/0x2190 [c000001a66c0f5d0] [c0000000000ff28c] .lock_acquire+0xcc/0x1a0 [c000001a66c0f6a0] [c000000000b3c120] .mutex_lock_nested+0x80/0x570 [c000001a66c0f7c0] [d00000000ecc1f54] .kvmppc_set_lpcr+0xf4/0x1c0 [kvm_hv] [c000001a66c0f860] [d00000000ecc510c] .kvmppc_set_one_reg_hv+0x4dc/0x990 [kvm_hv] [c000001a66c0f8d0] [d00000000eb9f234] .kvmppc_set_one_reg+0x44/0x330 [kvm] [c000001a66c0f960] [d00000000eb9c9dc] .kvm_vcpu_ioctl_set_one_reg+0x5c/0x150 [kvm] [c000001a66c0f9f0] [d00000000eb9ced4] .kvm_arch_vcpu_ioctl+0x214/0x2c0 [kvm] [c000001a66c0faf0] [d00000000eb940b0] .kvm_vcpu_ioctl+0xe0/0x7b0 [kvm] [c000001a66c0fcb0] [c00000000026cbb4] .do_vfs_ioctl+0x444/0x770 [c000001a66c0fd90] [c00000000026cfa4] .SyS_ioctl+0xc4/0xe0 [c000001a66c0fe30] [c000000000009264] syscall_exit+0x0/0x98 This fixes it by moving the mutex_lock()/mutex_unlock() pair outside the 
spin-locked region. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
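The shape of the fix is simply to establish a consistent lock order, taking the sleeping lock before the spinlock; a minimal sketch, assuming the vcore lock protects the LPCR-related state as described above:

```c
/* Before: mutex_lock() was nested inside a spin-locked region, which can
 * sleep with a spinlock held and inverts the order used elsewhere.
 * After: take kvm->lock first, then the vcore spinlock. */
mutex_lock(&kvm->lock);
spin_lock(&vc->lock);
/* ... update the LPCR-related state ... */
spin_unlock(&vc->lock);
mutex_unlock(&kvm->lock);
```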
-
- 17 December 2014, 5 commits
-
-
Submitted by Sam Bobroff
Currently the H_CONFER hcall is implemented in kernel virtual mode, meaning that whenever a guest thread does an H_CONFER, all the threads in that virtual core have to exit the guest. This is bad for performance because it interrupts the other threads even if they are doing useful work. The H_CONFER hcall is called by a guest VCPU when it is spinning on a spinlock and it detects that the spinlock is held by a guest VCPU that is currently not running on a physical CPU. The idea is to give this VCPU's time slice to the holder VCPU so that it can make progress towards releasing the lock. To avoid having the other threads exit the guest unnecessarily, we add a real-mode implementation of H_CONFER that checks whether the other threads are doing anything. If all the other threads are idle (i.e. in H_CEDE) or trying to confer (i.e. in H_CONFER), it returns H_TOO_HARD which causes a guest exit and allows the H_CONFER to be handled in virtual mode. Otherwise it spins for a short time (up to 10 microseconds) to give other threads the chance to observe that this thread is trying to confer. The spin loop also terminates when any thread exits the guest or when all other threads are idle or trying to confer. If the timeout is reached, the H_CONFER returns H_SUCCESS. In this case the guest VCPU will recheck the spinlock word and most likely call H_CONFER again. This also improves the implementation of the H_CONFER virtual mode handler. If the VCPU is part of a virtual core (vcore) which is runnable, there will be a 'runner' VCPU which has taken responsibility for running the vcore. In this case we yield to the runner VCPU rather than the target VCPU. We also introduce a check on the target VCPU's yield count: if it differs from the yield count passed to H_CONFER, the target VCPU has run since H_CONFER was called and may have already released the lock. This check is required by PAPR. Signed-off-by: NSam Bobroff <sam.bobroff@au1.ibm.com> Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
Submitted by Paul Mackerras
There are two ways in which a guest instruction can be obtained from the guest in the guest exit code in book3s_hv_rmhandlers.S. If the exit was caused by a Hypervisor Emulation interrupt (i.e. an illegal instruction), the offending instruction is in the HEIR register (Hypervisor Emulation Instruction Register). If the exit was caused by a load or store to an emulated MMIO device, we load the instruction from the guest by turning data relocation on and loading the instruction with an lwz instruction. Unfortunately, in the case where the guest has opposite endianness to the host, these two methods give results of different endianness, but both get put into vcpu->arch.last_inst. The HEIR value has been loaded using guest endianness, whereas the lwz will load the instruction using host endianness. The rest of the code that uses vcpu->arch.last_inst assumes it was loaded using host endianness. To fix this, we define a new vcpu field to store the HEIR value. Then, in kvmppc_handle_exit_hv(), we transfer the value from this new field to vcpu->arch.last_inst, doing a byte-swap if the guest and host endianness differ. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
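Conceptually the transfer looks like the following sketch (field and helper names approximate the KVM code and are not exact):

```c
/* HEIR was captured with guest endianness; convert it to host order
 * before the rest of the exit path consumes last_inst. */
u32 inst = vcpu->arch.emul_inst;     /* new field holding the HEIR value */

if (kvmppc_need_byteswap(vcpu))      /* guest and host endianness differ */
        inst = swab32(inst);
vcpu->arch.last_inst = inst;
```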
-
Submitted by Paul Mackerras
This removes the code that was added to enable HV KVM to work on PPC970 processors. The PPC970 is an old CPU that doesn't support virtualizing guest memory. Removing PPC970 support also lets us remove the code for allocating and managing contiguous real-mode areas, the code for the !kvm->arch.using_mmu_notifiers case, the code for pinning pages of guest memory when first accessed and keeping track of which pages have been pinned, and the code for handling H_ENTER hypercalls in virtual mode. Book3S HV KVM is now supported only on POWER7 and POWER8 processors. The KVM_CAP_PPC_RMA capability now always returns 0. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
Submitted by Suresh E. Warrier
This patch adds trace points in the guest entry and exit code and also for exceptions handled by the host in kernel mode - hypercalls and page faults. The new events are added to /sys/kernel/debug/tracing/events under a new subsystem called kvm_hv. Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
-
Submitted by Paul Mackerras
Currently the calculations of stolen time for PPC Book3S HV guests uses fields in both the vcpu struct and the kvmppc_vcore struct. The fields in the kvmppc_vcore struct are protected by the vcpu->arch.tbacct_lock of the vcpu that has taken responsibility for running the virtual core. This works correctly but confuses lockdep, because it sees that the code takes the tbacct_lock for a vcpu in kvmppc_remove_runnable() and then takes another vcpu's tbacct_lock in vcore_stolen_time(), and it thinks there is a possibility of deadlock, causing it to print reports like this: ============================================= [ INFO: possible recursive locking detected ] 3.18.0-rc7-kvm-00016-g8db4bc6 #89 Not tainted --------------------------------------------- qemu-system-ppc/6188 is trying to acquire lock: (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb1fe8>] .vcore_stolen_time+0x48/0xd0 [kvm_hv] but task is already holding lock: (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv] other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&(&vcpu->arch.tbacct_lock)->rlock); lock(&(&vcpu->arch.tbacct_lock)->rlock); *** DEADLOCK *** May be due to missing lock nesting notation 3 locks held by qemu-system-ppc/6188: #0: (&vcpu->mutex){+.+.+.}, at: [<d00000000eb93f98>] .vcpu_load+0x28/0xe0 [kvm] #1: (&(&vcore->lock)->rlock){+.+...}, at: [<d00000000ecb41b0>] .kvmppc_vcpu_run_hv+0x530/0x1530 [kvm_hv] #2: (&(&vcpu->arch.tbacct_lock)->rlock){......}, at: [<d00000000ecb25a0>] .kvmppc_remove_runnable.part.3+0x30/0xd0 [kvm_hv] stack backtrace: CPU: 40 PID: 6188 Comm: qemu-system-ppc Not tainted 3.18.0-rc7-kvm-00016-g8db4bc6 #89 Call Trace: [c000000b2754f3f0] [c000000000b31b6c] .dump_stack+0x88/0xb4 (unreliable) [c000000b2754f470] [c0000000000faeb8] .__lock_acquire+0x1878/0x2190 [c000000b2754f600] [c0000000000fbf0c] .lock_acquire+0xcc/0x1a0 [c000000b2754f6d0] [c000000000b2954c] ._raw_spin_lock_irq+0x4c/0x70 [c000000b2754f760] [d00000000ecb1fe8] .vcore_stolen_time+0x48/0xd0 [kvm_hv] [c000000b2754f7f0] [d00000000ecb25b4] .kvmppc_remove_runnable.part.3+0x44/0xd0 [kvm_hv] [c000000b2754f880] [d00000000ecb43ec] .kvmppc_vcpu_run_hv+0x76c/0x1530 [kvm_hv] [c000000b2754f9f0] [d00000000eb9f46c] .kvmppc_vcpu_run+0x2c/0x40 [kvm] [c000000b2754fa60] [d00000000eb9c9a4] .kvm_arch_vcpu_ioctl_run+0x54/0x160 [kvm] [c000000b2754faf0] [d00000000eb94538] .kvm_vcpu_ioctl+0x498/0x760 [kvm] [c000000b2754fcb0] [c000000000267eb4] .do_vfs_ioctl+0x444/0x770 [c000000b2754fd90] [c0000000002682a4] .SyS_ioctl+0xc4/0xe0 [c000000b2754fe30] [c0000000000092e4] syscall_exit+0x0/0x98 In order to make the locking easier to analyse, we change the code to use a spinlock in the kvmppc_vcore struct to protect the stolen_tb and preempt_tb fields. This lock needs to be an irq-safe lock since it is used in the kvmppc_core_vcpu_load_hv() and kvmppc_core_vcpu_put_hv() functions, which are called with the scheduler rq lock held, which is an irq-safe lock. Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
- 15 December 2014, 2 commits
-
-
Submitted by Suresh E. Warrier
The kvmppc_vcore_blocked() code does not check for the wait condition after putting the process on the wait queue. This means that it is possible for an external interrupt to become pending, but the vcpu to remain asleep until the next decrementer interrupt. The fix is to make one last check for pending exceptions and ceded state before calling schedule(). Signed-off-by: Suresh Warrier <warrier@linux.vnet.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
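The change follows the standard sleep/wakeup discipline of re-checking the condition after queueing on the wait queue; roughly (the re-check helper below is hypothetical and stands in for the pending-exception/ceded test):

```c
/* Re-checking after prepare_to_wait() means a wakeup that races with the
 * enqueue is not lost: schedule() is skipped when work is pending. */
DEFINE_WAIT(wait);

prepare_to_wait(&vc->wq, &wait, TASK_INTERRUPTIBLE);
if (!pending_exceptions_or_ceded(vc))   /* hypothetical re-check helper */
        schedule();
finish_wait(&vc->wq, &wait);
```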
-
Submitted by Mahesh Salgaonkar
When we get an HMI (hypervisor maintenance interrupt) while in a guest, we see that guest enters into paused state. The reason is, in kvmppc_handle_exit_hv it falls through default path and returns to host instead of resuming guest. This causes guest to enter into paused state. HMI is a hypervisor only interrupt and it is safe to resume the guest since the host has handled it already. This patch adds a switch case to resume the guest. Without this patch we see guest entering into paused state with following console messages: [ 3003.329351] Severe Hypervisor Maintenance interrupt [Recovered] [ 3003.329356] Error detail: Timer facility experienced an error [ 3003.329359] HMER: 0840000000000000 [ 3003.329360] TFMR: 4a12000980a84000 [ 3003.329366] vcpu c0000007c35094c0 (40): [ 3003.329368] pc = c0000000000c2ba0 msr = 8000000000009032 trap = e60 [ 3003.329370] r 0 = c00000000021ddc0 r16 = 0000000000000046 [ 3003.329372] r 1 = c00000007a02bbd0 r17 = 00003ffff27d5d98 [ 3003.329375] r 2 = c0000000010980b8 r18 = 00001fffffc9a0b0 [ 3003.329377] r 3 = c00000000142d6b8 r19 = c00000000142d6b8 [ 3003.329379] r 4 = 0000000000000002 r20 = 0000000000000000 [ 3003.329381] r 5 = c00000000524a110 r21 = 0000000000000000 [ 3003.329383] r 6 = 0000000000000001 r22 = 0000000000000000 [ 3003.329386] r 7 = 0000000000000000 r23 = c00000000524a110 [ 3003.329388] r 8 = 0000000000000000 r24 = 0000000000000001 [ 3003.329391] r 9 = 0000000000000001 r25 = c00000007c31da38 [ 3003.329393] r10 = c0000000014280b8 r26 = 0000000000000002 [ 3003.329395] r11 = 746f6f6c2f68656c r27 = c00000000524a110 [ 3003.329397] r12 = 0000000028004484 r28 = c00000007c31da38 [ 3003.329399] r13 = c00000000fe01400 r29 = 0000000000000002 [ 3003.329401] r14 = 0000000000000046 r30 = c000000003011e00 [ 3003.329403] r15 = ffffffffffffffba r31 = 0000000000000002 [ 3003.329404] ctr = c00000000041a670 lr = c000000000272520 [ 3003.329405] srr0 = c00000000007e8d8 srr1 = 9000000000001002 [ 3003.329406] sprg0 = 0000000000000000 sprg1 = c00000000fe01400 [ 3003.329407] sprg2 = c00000000fe01400 sprg3 = 0000000000000005 [ 3003.329408] cr = 48004482 xer = 2000000000000000 dsisr = 42000000 [ 3003.329409] dar = 0000010015020048 [ 3003.329410] fault dar = 0000010015020048 dsisr = 42000000 [ 3003.329411] SLB (8 entries): [ 3003.329412] ESID = c000000008000000 VSID = 40016e7779000510 [ 3003.329413] ESID = d000000008000001 VSID = 400142add1000510 [ 3003.329414] ESID = f000000008000004 VSID = 4000eb1a81000510 [ 3003.329415] ESID = 00001f000800000b VSID = 40004fda0a000d90 [ 3003.329416] ESID = 00003f000800000c VSID = 400039f536000d90 [ 3003.329417] ESID = 000000001800000d VSID = 0001251b35150d90 [ 3003.329417] ESID = 000001000800000e VSID = 4001e46090000d90 [ 3003.329418] ESID = d000080008000019 VSID = 40013d349c000400 [ 3003.329419] lpcr = c048800001847001 sdr1 = 0000001b19000006 last_inst = ffffffff [ 3003.329421] trap=0xe60 | pc=0xc0000000000c2ba0 | msr=0x8000000000009032 [ 3003.329524] Severe Hypervisor Maintenance interrupt [Recovered] [ 3003.329526] Error detail: Timer facility experienced an error [ 3003.329527] HMER: 0840000000000000 [ 3003.329527] TFMR: 4a12000980a94000 [ 3006.359786] Severe Hypervisor Maintenance interrupt [Recovered] [ 3006.359792] Error detail: Timer facility experienced an error [ 3006.359795] HMER: 0840000000000000 [ 3006.359797] TFMR: 4a12000980a84000 Id Name State ---------------------------------------------------- 2 guest2 running 3 guest3 paused 4 guest4 running Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com> Signed-off-by: 
Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
-
- 22 September 2014, 3 commits
-
-
Submitted by Madhavan Srinivasan
This patch adds kernel-side support for software breakpoints. The design is that, by using an illegal instruction, we trap to the hypervisor via an Emulation Assistance interrupt, where we check for the illegal instruction and accordingly return to the host or the guest. The patch also adds support for software breakpoints in PR KVM. Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
-
Submitted by Paul Mackerras
Since the guest can read the machine's PVR (Processor Version Register) directly and see the real value, we should disallow userspace from setting any value for the guest's PVR other than the real host value. Therefore this makes kvm_arch_vcpu_set_sregs_hv() check the supplied PVR value and return an error if it is different from the host value, which has been put into vcpu->arch.pvr at vcpu creation time. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
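A minimal sketch of the check, with names taken from the description above and the rest of the sregs handling omitted:

```c
/* Reject any PVR other than the host value recorded at vcpu creation. */
if (sregs->pvr != vcpu->arch.pvr)
        return -EINVAL;
```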
-
Submitted by Paul Mackerras
Occasional failures have been seen with split-core mode and migration where the message "KVM: couldn't grab cpu" appears. This increases the length of time that we wait from 1ms to 10ms, which seems to work around the issue. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
-
- 28 July 2014, 10 commits
-
-
Submitted by Aneesh Kumar K.V
When calculating the lower bits of the AVA field, use the shift count based on the base page size. Also add the missing segment size and remove a stale comment. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
-
Submitted by Stewart Smith
The POWER8 processor has a Micro Partition Prefetch Engine, which is a fancy way of saying "has way to store and load contents of L2 or L2+MRU way of L3 cache". We initiate the storing of the log (list of addresses) using the logmpp instruction and start restore by writing to a SPR. The logmpp instruction takes parameters in a single 64bit register: - starting address of the table to store log of L2/L2+L3 cache contents - 32kb for L2 - 128kb for L2+L3 - Aligned relative to maximum size of the table (32kb or 128kb) - Log control (no-op, L2 only, L2 and L3, abort logout) We should abort any ongoing logging before initiating one. To initiate restore, we write to the MPPR SPR. The format of what to write to the SPR is similar to the logmpp instruction parameter: - starting address of the table to read from (same alignment requirements) - table size (no data, until end of table) - prefetch rate (from fastest possible to slower. about every 8, 16, 24 or 32 cycles) The idea behind loading and storing the contents of L2/L3 cache is to reduce memory latency in a system that is frequently swapping vcores on a physical CPU. The best case scenario for doing this is when some vcores are doing very cache heavy workloads. The worst case is when they have about 0 cache hits, so we just generate needless memory operations. This implementation just does L2 store/load. In my benchmarks this proves to be useful. Benchmark 1: - 16 core POWER8 - 3x Ubuntu 14.04LTS guests (LE) with 8 VCPUs each - No split core/SMT - two guests running sysbench memory test. sysbench --test=memory --num-threads=8 run - one guest running apache bench (of default HTML page) ab -n 490000 -c 400 http://localhost/ This benchmark aims to measure performance of real world application (apache) where other guests are cache hot with their own workloads. The sysbench memory benchmark does pointer sized writes to a (small) memory buffer in a loop. In this benchmark with this patch I can see an improvement both in requests per second (~5%) and in mean and median response times (again, about 5%). The spread of minimum and maximum response times were largely unchanged. benchmark 2: - Same VM config as benchmark 1 - all three guests running sysbench memory benchmark This benchmark aims to see if there is a positive or negative affect to this cache heavy benchmark. Although due to the nature of the benchmark (stores) we may not see a difference in performance, but rather hopefully an improvement in consistency of performance (when vcore switched in, don't have to wait many times for cachelines to be pulled in) The results of this benchmark are improvements in consistency of performance rather than performance itself. With this patch, the few outliers in duration go away and we get more consistent performance in each guest. benchmark 3: - same 3 guests and CPU configuration as benchmark 1 and 2. - two idle guests - 1 guest running STREAM benchmark This scenario also saw performance improvement with this patch. On Copy and Scale workloads from STREAM, I got 5-6% improvement with this patch. For Add and triad, it was around 10% (or more). benchmark 4: - same 3 guests as previous benchmarks - two guests running sysbench --memory, distinctly different cache heavy workload - one guest running STREAM benchmark. Similar improvements to benchmark 3. 
benchmark 5: - 1 guest, 8 VCPUs, Ubuntu 14.04 - Host configured with split core (SMT8, subcores-per-core=4) - STREAM benchmark In this benchmark, we see a 10-20% performance improvement across the board of STREAM benchmark results with this patch. Based on preliminary investigation and microbenchmarks by Prerna Saxena <prerna@linux.vnet.ibm.com> Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
-
Submitted by Stewart Smith
No code changes; this just splits the code out into a function so that the addition of micro partition prefetch buffer allocation (in a subsequent patch) looks neater and doesn't require excessive indentation. Signed-off-by: Stewart Smith <stewart@linux.vnet.ibm.com> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
-
Submitted by Alexey Kardashevskiy
The LPCR was defined as a 32-bit register in the one_reg interface. This is unfortunate because KVM allows userspace to control the DPFD (default prefetch depth) field, which is in the upper 32 bits. The result is that DPFD always gets set to 0, which reduces performance in the guest. We can't just change KVM_REG_PPC_LPCR to be a 64-bit register ID, since that would break existing userspace binaries. Instead we define a new KVM_REG_PPC_LPCR_64 id which is 64-bit. Userspace can still use the old KVM_REG_PPC_LPCR id, but it now only modifies those fields in the bottom 32 bits that userspace can modify (ILE, TC and AIL). If userspace uses the new KVM_REG_PPC_LPCR_64 id, it can modify DPFD as well. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@samba.org> Cc: stable@vger.kernel.org Signed-off-by: Alexander Graf <agraf@suse.de>
-
Submitted by Alexander Graf
There are a few shared data structures between the host and the guest. Most of them get registered through the VPA interface. These data structures are defined to always be in big endian byte order, so let's make sure we always access them in big endian. Signed-off-by: Alexander Graf <agraf@suse.de>
-
Submitted by Michael Neuling
This adds support for the H_SET_MODE hcall. This hcall is a multiplexer that has several functions, some of which are called rarely, and some which are potentially called very frequently. Here we add support for the functions that set the debug registers CIABR (Completed Instruction Address Breakpoint Register) and DAWR/DAWRX (Data Address Watchpoint Register and eXtension), since they could be updated by the guest as often as every context switch. This also adds a kvmppc_power8_compatible() function to test to see if a guest is compatible with POWER8 or not. The CIABR and DAWR/X only exist on POWER8. Signed-off-by: NMichael Neuling <mikey@neuling.org> Signed-off-by: NPaul Mackerras <paulus@samba.org> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
Submitted by Paul Mackerras
This adds code to check that when the KVM_CAP_PPC_ENABLE_HCALL capability is used to enable or disable in-kernel handling of an hcall, the hcall is actually implemented by the kernel. If not, an EINVAL error is returned. This also checks the default-enabled list of hcalls and prints a warning if any hcall there is not actually implemented. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
-
Submitted by Paul Mackerras
This provides a way for userspace to control which sPAPR hcalls get handled in the kernel. Each hcall can be individually enabled or disabled for in-kernel handling, except for H_RTAS. The exception for H_RTAS is because userspace can already control whether individual RTAS functions are handled in-kernel or not via the KVM_PPC_RTAS_DEFINE_TOKEN ioctl, and because the numeric value for H_RTAS is out of the normal sequence of hcall numbers. Hcalls are enabled or disabled using the KVM_ENABLE_CAP ioctl for the KVM_CAP_PPC_ENABLE_HCALL capability on the file descriptor for the VM. The args field of the struct kvm_enable_cap specifies the hcall number in args[0] and the enable/disable flag in args[1]; 0 means disable in-kernel handling (so that the hcall will always cause an exit to userspace) and 1 means enable. Enabling or disabling in-kernel handling of an hcall is effective across the whole VM. The ability for KVM_ENABLE_CAP to be used on a VM file descriptor on PowerPC is new, added by this commit. The KVM_CAP_ENABLE_CAP_VM capability advertises that this ability exists. When a VM is created, an initial set of hcalls is enabled for in-kernel handling. The set that is enabled is the set that has an in-kernel implementation at this point. Any new hcall implementations from this point onwards should not be added to the default set without a good reason. No distinction is made between real-mode and virtual-mode hcall implementations; the one setting controls them both. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
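From userspace, enabling in-kernel handling of a single hcall looks roughly like the sketch below. The hcall number shown (H_SET_MODE, defined locally here) and the lack of error handling are illustrative assumptions, not part of the commit:

```c
#include <sys/ioctl.h>
#include <stdio.h>
#include <linux/kvm.h>

#define H_SET_MODE 0x31C   /* PAPR hcall number, defined here for the example */

/* Enable in-kernel handling of one hcall on the VM file descriptor.
 * args[0] is the hcall number, args[1] is 1 to enable / 0 to disable. */
static int enable_hcall(int vmfd)
{
        struct kvm_enable_cap cap = {
                .cap  = KVM_CAP_PPC_ENABLE_HCALL,
                .args = { H_SET_MODE, 1 },
        };

        if (ioctl(vmfd, KVM_ENABLE_CAP, &cap) < 0) {
                perror("KVM_ENABLE_CAP");
                return -1;
        }
        return 0;
}
```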
-
Submitted by Aneesh Kumar K.V
Writing to IC is not allowed in privileged mode. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
-
Submitted by Aneesh Kumar K.V
The virtual time base (VTB) register is a per-VM, per-cpu register that needs to be saved and restored on VM exit and entry. Writing to VTB is not allowed in privileged mode. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [agraf: fix compile error] Signed-off-by: Alexander Graf <agraf@suse.de>
-
- 30 May 2014, 5 commits
-
-
Submitted by Aneesh Kumar K.V
On recent IBM Power CPUs, while the hashed page table is looked up using the page size from the segmentation hardware (i.e. the SLB), it is possible to have the HPT entry indicate a larger page size. Thus for example it is possible to put a 16MB page in a 64kB segment, but since the hash lookup is done using a 64kB page size, it may be necessary to put multiple entries in the HPT for a single 16MB page. This capability is called mixed page-size segment (MPSS). With MPSS, there are two relevant page sizes: the base page size, which is the size used in searching the HPT, and the actual page size, which is the size indicated in the HPT entry. [ Note that the actual page size is always >= base page size ]. We use "ibm,segment-page-sizes" device tree node to advertise the MPSS support to PAPR guest. The penc encoding indicates whether we support a specific combination of base page size and actual page size in the same segment. We also use the penc value in the LP encoding of HPTE entry. This patch exposes MPSS support to KVM guest by advertising the feature via "ibm,segment-page-sizes". It also adds the necessary changes to decode the base page size and the actual page size correctly from the HPTE entry. Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NAlexander Graf <agraf@suse.de>
-
Submitted by Alexander Graf
POWER8 introduces a new facility called the "Event Based Branch" (EBB) facility. It consists of a few registers that indicate where a guest should branch to when a defined event occurs while it is in PR mode. We don't really want to enable EBB, as it would create a big mess with !PR guest mode while the hardware is in PR, and we don't really emulate the PMU anyway. So instead, let's just leave it at emulating all of its registers. Signed-off-by: Alexander Graf <agraf@suse.de>
-
Submitted by Alexander Graf
POWER8 implements a new register called TAR. This register has to be enabled in FSCR and is then, from KVM's point of view, mere storage. This patch enables the guest to use TAR. Signed-off-by: Alexander Graf <agraf@suse.de>
-
Submitted by Alexander Graf
POWER8 introduced a new interrupt type called "Facility unavailable interrupt" which contains its status message in a new register called FSCR. Handle these exits and try to emulate instructions for unhandled facilities. Follow-on patches enable KVM to expose specific facilities into the guest. Signed-off-by: Alexander Graf <agraf@suse.de>
-
Submitted by Alexander Graf
The shared (magic) page is a data structure that contains often used supervisor privileged SPRs accessible via memory to the user to reduce the number of exits we have to take to read/write them. When we actually share this structure with the guest we have to maintain it in guest endianness, because some of the patch tricks only work with native endian load/store operations. Since we only share the structure with either host or guest in little endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv. For booke, the shared struct stays big endian. For book3s_64 hv we maintain the struct in host native endian, since it never gets shared with the guest. For book3s_64 pr we introduce a variable that tells us which endianness the shared struct is in and route every access to it through helper inline functions that evaluate this variable. Signed-off-by: Alexander Graf <agraf@suse.de>
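The helper pattern can be sketched like this; the kernel generates these accessors with macros, so the expanded form below is illustrative rather than the literal in-tree code:

```c
/* Route every access to the shared (magic) page through a helper that
 * honours whichever endianness the page is currently kept in. */
static inline u64 kvmppc_get_msr(struct kvm_vcpu *vcpu)
{
        if (vcpu->arch.shared_big_endian)
                return be64_to_cpu(vcpu->arch.shared->msr);
        else
                return le64_to_cpu(vcpu->arch.shared->msr);
}
```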
-