- 11 November 2019, 1 commit
-
-
Committed by Paolo Bonzini
Reported by syzkaller:
  kasan: CONFIG_KASAN_INLINE enabled
  kasan: GPF could be caused by NULL-ptr deref or user memory access
  general protection fault: 0000 [#1] PREEMPT SMP KASAN
  CPU: 0 PID: 14727 Comm: syz-executor.3 Not tainted 5.4.0-rc4+ #0
  RIP: 0010:kvm_coalesced_mmio_init+0x5d/0x110 arch/x86/kvm/../../../virt/kvm/coalesced_mmio.c:121
  Call Trace:
   kvm_dev_ioctl_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:3446 [inline]
   kvm_dev_ioctl+0x781/0x1490 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3494
   vfs_ioctl fs/ioctl.c:46 [inline]
   file_ioctl fs/ioctl.c:509 [inline]
   do_vfs_ioctl+0x196/0x1150 fs/ioctl.c:696
   ksys_ioctl+0x62/0x90 fs/ioctl.c:713
   __do_sys_ioctl fs/ioctl.c:720 [inline]
   __se_sys_ioctl fs/ioctl.c:718 [inline]
   __x64_sys_ioctl+0x6e/0xb0 fs/ioctl.c:718
   do_syscall_64+0xca/0x5d0 arch/x86/entry/common.c:290
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
Commit 9121923c ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm") moves the memslot and bus allocations around; however, if initialization of kvm->srcu/irq_srcu fails, NULL is returned instead of an error code. That NULL is not intercepted in kvm_dev_ioctl_create_vm() and ends up dereferenced by kvm_coalesced_mmio_init(). This patch fixes it. Moving the initialization is required anyway to avoid an incorrect synchronize_srcu that was also reported by syzkaller:
  wait_for_completion+0x29c/0x440 kernel/sched/completion.c:136
  __synchronize_srcu+0x197/0x250 kernel/rcu/srcutree.c:921
  synchronize_srcu_expedited kernel/rcu/srcutree.c:946 [inline]
  synchronize_srcu+0x239/0x3e8 kernel/rcu/srcutree.c:997
  kvm_page_track_unregister_notifier+0xe7/0x130 arch/x86/kvm/page_track.c:212
  kvm_mmu_uninit_vm+0x1e/0x30 arch/x86/kvm/mmu.c:5828
  kvm_arch_destroy_vm+0x4a2/0x5f0 arch/x86/kvm/x86.c:9579
  kvm_create_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:702 [inline]
so do it. Reported-by: syzbot+89a8060879fa0bd2db4f@syzkaller.appspotmail.com Reported-by: syzbot+e27e7027eb2b80e44225@syzkaller.appspotmail.com Fixes: 9121923c ("kvm: Allocate memslots and buses before calling kvm_arch_init_vm") Cc: Jim Mattson <jmattson@google.com> Cc: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
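The failure mode above boils down to a factory function signalling an internal failure with a bare NULL that the caller never checks for. Below is a minimal, self-contained C sketch of the pattern the fix restores (error pointers propagated all the way out); the demo_* names are invented for illustration and are not the actual kvm_main.c symbols.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Toy stand-ins for the kernel's ERR_PTR()/IS_ERR() helpers. */
#define DEMO_MAX_ERRNO 4095
static void *demo_err_ptr(long err) { return (void *)err; }
static int demo_is_err(const void *p)
{
    return (unsigned long)p >= (unsigned long)-DEMO_MAX_ERRNO;
}

struct demo_vm { int initialized; };

/* Create routine: an internal init failure must come back as an error
 * pointer, never as a bare NULL that the caller does not check for. */
static struct demo_vm *demo_vm_create(int srcu_init_ok)
{
    struct demo_vm *vm = calloc(1, sizeof(*vm));

    if (!vm)
        return demo_err_ptr(-ENOMEM);
    if (!srcu_init_ok) {
        free(vm);
        return demo_err_ptr(-ENOMEM);   /* the bug was returning NULL here */
    }
    vm->initialized = 1;
    return vm;
}

int main(void)
{
    struct demo_vm *vm = demo_vm_create(0);

    if (demo_is_err(vm)) {              /* the caller intercepts the failure... */
        fprintf(stderr, "create failed\n");
        return 1;
    }
    /* ...instead of handing a NULL on to code that dereferences it. */
    free(vm);
    return 0;
}
```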
-
- 31 October 2019, 1 commit
-
-
Committed by Jim Mattson
In kvm_create_vm(), if we've successfully called kvm_arch_init_vm(), but then fail later in the function, we need to call kvm_arch_destroy_vm() so that it can do any necessary cleanup (like freeing memory). Fixes: 44a95dae ("KVM: x86: Detect and Initialize AVIC support") Signed-off-by: John Sperbeck <jsperbeck@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Junaid Shahid <junaids@google.com> [Remove dependency on "kvm: Don't clear reference count on kvm_create_vm() error path" which was not committed. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
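The general shape of this kind of fix is the usual goto-unwind ladder: once the arch init step has succeeded, every later failure path must pass through a label that undoes it before freeing the rest. A hedged, standalone sketch of that control flow with made-up demo_* names, not the real kvm_create_vm() code:

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

struct demo_vm { int arch_ready; };

static int demo_arch_init(struct demo_vm *vm)     { vm->arch_ready = 1; return 0; }
static void demo_arch_destroy(struct demo_vm *vm) { vm->arch_ready = 0; }
static int demo_late_step(struct demo_vm *vm)     { (void)vm; return -ENOMEM; /* forced failure */ }

static struct demo_vm *demo_create_vm(void)
{
    struct demo_vm *vm = calloc(1, sizeof(*vm));
    int r;

    if (!vm)
        return NULL;

    r = demo_arch_init(vm);
    if (r)
        goto out_free;          /* arch init never completed: nothing to undo */

    r = demo_late_step(vm);
    if (r)
        goto out_arch_destroy;  /* arch init succeeded, so it must be undone */

    return vm;

out_arch_destroy:
    demo_arch_destroy(vm);      /* plays the role of kvm_arch_destroy_vm() */
out_free:
    free(vm);
    return NULL;
}

int main(void)
{
    struct demo_vm *vm = demo_create_vm();

    printf("%s\n", vm ? "created" : "creation failed, arch state torn down");
    free(vm);
    return 0;
}
```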
-
- 25 October 2019, 1 commit
-
-
Committed by Jim Mattson
This reorganization will allow us to call kvm_arch_destroy_vm in the event that kvm_create_vm fails after calling kvm_arch_init_vm. Suggested-by: Junaid Shahid <junaids@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Junaid Shahid <junaids@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 22 October 2019, 1 commit
-
-
Committed by Wanpeng Li
Don't waste cycles shrinking/growing a vCPU's halt_poll_ns if host-side polling is disabled. Acked-by: Marcelo Tosatti <mtosatti@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
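In other words, the adaptive polling window should only be recomputed when host-side polling is actually in use. A small illustrative sketch of that guard; the field names and the grow/shrink policy here are invented for the example and are not the real kvm_vcpu_block() logic:

```c
#include <stdbool.h>
#include <stdio.h>

struct demo_vcpu {
    unsigned int halt_poll_ns;
    bool host_polling_enabled;   /* stand-in for the "polling disabled" knob */
};

static void demo_update_halt_poll(struct demo_vcpu *v, bool woken_by_poll)
{
    if (!v->host_polling_enabled)
        return;                  /* don't waste cycles shrinking/growing */

    if (woken_by_poll)
        v->halt_poll_ns *= 2;    /* grow: polling paid off */
    else
        v->halt_poll_ns /= 2;    /* shrink: polling was wasted */
}

int main(void)
{
    struct demo_vcpu v = { .halt_poll_ns = 10000, .host_polling_enabled = false };

    demo_update_halt_poll(&v, true);
    printf("halt_poll_ns stays at %u while polling is disabled\n", v.halt_poll_ns);
    return 0;
}
```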
-
- 20 October 2019, 3 commits
-
-
Committed by Marc Zyngier
The PMU emulation code uses the perf event sample period to trigger the overflow detection. This works fine for the *first* overflow handling, but results in a huge number of interrupts on the host, unrelated to the number of interrupts handled in the guest (a x20 factor is pretty common for the cycle counter). On a slow system (such as a SW model), this can result in the guest only making forward progress at a glacial pace. It turns out that the clue is in the name. The sample period is exactly that: a period. And once an overflow has occurred, the following period should be the full width of the associated counter, instead of whatever the guest had initially programmed. Reset the sample period to the architected value in the overflow handler, which now results in a number of host interrupts that is much closer to the number of interrupts in the guest. Fixes: b02386eb ("arm64: KVM: Add PMU overflow interrupt routing") Reviewed-by: Andrew Murray <andrew.murray@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
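Conceptually, the overflow handler reprograms the backing perf event so that the next sample period covers the remaining range of the emulated counter (its full architected width) rather than the guest's original small value. A hedged sketch of that calculation, assuming 32-bit event counters and a 64-bit cycle counter; the names and the index constant are illustrative only:

```c
#include <stdint.h>
#include <stdio.h>

#define DEMO_CYCLE_COUNTER_IDX 31   /* stand-in for the cycle counter's index */

/* On overflow, the next sample period is "how many counts until the counter
 * wraps again", i.e. the two's complement of the current value, clamped to
 * 32 bits for everything except the 64-bit cycle counter. */
static uint64_t demo_next_sample_period(uint64_t counter_value, unsigned int idx)
{
    uint64_t period = -counter_value;

    if (idx != DEMO_CYCLE_COUNTER_IDX)
        period &= 0xffffffffULL;

    return period;
}

int main(void)
{
    /* Counter just wrapped past zero (value 0x10): the next period is almost
     * the full 32-bit width, not whatever short period the guest programmed. */
    printf("next period: 0x%llx\n",
           (unsigned long long)demo_next_sample_period(0x10, 0));
    return 0;
}
```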
-
Committed by Marc Zyngier
The current convention for KVM to request a chained event from the host PMU is to set bit[0] in attr.config1 (PERF_ATTR_CFG1_KVM_PMU_CHAINED). But as it turns out, this bit gets set *after* we create the kernel event that backs our virtual counter, meaning that we never get a 64bit counter. Moving the setting to an earlier point solves the problem. Fixes: 80f393a2 ("KVM: arm/arm64: Support chained PMU counters") Reviewed-by: Andrew Murray <andrew.murray@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
Committed by Marc Zyngier
When a counter is disabled, its value is sampled before the event is disabled, and that value is written back to the shadow register. In that process, the value gets truncated to 32 bits, which is adequate for any counter but the cycle counter (defined as a 64-bit counter). This obviously results in a corrupted counter, and things like "perf record -e cycles" not working at all when run in a guest... A similar, but less critical bug exists in kvm_pmu_get_counter_value. Make the truncation conditional on the counter not being the cycle counter, which results in a minor code reorganisation. Fixes: 80f393a2 ("KVM: arm/arm64: Support chained PMU counters") Reviewed-by: Andrew Murray <andrew.murray@arm.com> Reported-by: Julien Thierry <julien.thierry.kdev@gmail.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
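The fix amounts to making the 32-bit truncation conditional on the counter index: event counters are 32 bits wide, but the cycle counter is architecturally 64 bits and must keep its full value when sampled back. A small standalone sketch of that idea (the index constant is invented, not the kernel's):

```c
#include <stdint.h>
#include <stdio.h>

#define DEMO_CYCLE_COUNTER_IDX 31    /* illustrative, not the kernel constant */

static uint64_t demo_sample_counter(uint64_t hw_value, unsigned int idx)
{
    /* Event counters are 32 bits wide; only the cycle counter keeps its
     * full 64-bit value when written back to the shadow register. */
    if (idx != DEMO_CYCLE_COUNTER_IDX)
        hw_value = (uint32_t)hw_value;

    return hw_value;
}

int main(void)
{
    uint64_t cycles = 0x123456789abcULL;

    printf("event counter: 0x%llx\n",
           (unsigned long long)demo_sample_counter(cycles, 0));
    printf("cycle counter: 0x%llx\n",
           (unsigned long long)demo_sample_counter(cycles, DEMO_CYCLE_COUNTER_IDX));
    return 0;
}
```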
-
- 01 October 2019, 1 commit
-
-
Committed by Paolo Bonzini
The largepages debugfs entry is incremented/decremented as shadow pages are created or destroyed. Clearing it will result in an underflow, which is harmless to KVM but ugly (and could be misinterpreted by tools that use debugfs information), so make this particular statistic read-only. Cc: kvm-ppc@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 18 September 2019, 1 commit
-
-
Committed by Matt Delco
The first/last indexes are typically shared with a user app. The app can change the 'last' index that the kernel uses to store the next result. This change sanity-checks the index before using it for writing to a potentially arbitrary address. This fixes CVE-2019-14821. Cc: stable@vger.kernel.org Fixes: 5f94c174 ("KVM: Add coalesced MMIO support (common part)") Signed-off-by: Matt Delco <delco@chromium.org> Signed-off-by: Jim Mattson <jmattson@google.com> Reported-by: syzbot+983c866c3dd6efa3662a@syzkaller.appspotmail.com [Use READ_ONCE. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
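The core of the fix: the ring's 'last' index lives in memory shared with userspace, so it must be read exactly once and range-checked before being used as an array offset. A simplified userspace model of that check follows; the structure layout and DEMO_RING_MAX are stand-ins for the real KVM coalesced-MMIO ABI, and the volatile read stands in for READ_ONCE():

```c
#include <stdint.h>
#include <string.h>

#define DEMO_RING_MAX 32   /* stand-in for KVM_COALESCED_MMIO_MAX */

struct demo_entry { uint64_t phys_addr; uint32_t len; uint8_t data[8]; };

struct demo_ring {
    uint32_t first;
    uint32_t last;                 /* shared with (and writable by) userspace */
    struct demo_entry slot[DEMO_RING_MAX];
};

static int demo_ring_insert(struct demo_ring *ring, uint64_t addr,
                            const void *val, uint32_t len)
{
    /* Read the untrusted index exactly once (READ_ONCE() in the kernel)... */
    uint32_t insert = *(volatile uint32_t *)&ring->last;

    /* ...and refuse to use it if it points outside the array. */
    if (insert >= DEMO_RING_MAX || len > sizeof(ring->slot[0].data))
        return -1;

    ring->slot[insert].phys_addr = addr;
    ring->slot[insert].len = len;
    memcpy(ring->slot[insert].data, val, len);
    ring->last = (insert + 1) % DEMO_RING_MAX;
    return 0;
}

int main(void)
{
    struct demo_ring ring = { .last = DEMO_RING_MAX };   /* hostile index */
    uint8_t v = 0xab;

    /* The out-of-range index is rejected instead of being used as an offset. */
    return demo_ring_insert(&ring, 0x1000, &v, 1) == -1 ? 0 : 1;
}
```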
-
- 11 September 2019, 1 commit
-
-
Committed by Zenghui Yu
Commit 49dfe94f ("KVM: arm/arm64: Fix TRACE_INCLUDE_PATH") fixes TRACE_INCLUDE_PATH to be the correct relative path to define_trace.h and explains why the old one happened to work. The same fix should be applied to virt/kvm/arm/vgic/trace.h. Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
- 09 September 2019, 1 commit
-
-
Committed by Marc Zyngier
While parts of the VGIC support a large number of vcpus (we bravely allow up to 512), other parts are more limited. One of these limits is visible in the KVM_IRQ_LINE ioctl, which only allows 256 vcpus to be signalled when using the CPU or PPI types. Unfortunately, we've cornered ourselves badly by allocating all the bits in the irq field. Since the irq_type subfield (8 bits wide) currently only takes the values 0, 1 and 2 (and we have been careful not to allow anything else), let's reduce this field to only 4 bits, and allocate the remaining 4 bits to a vcpu2_index, which acts as a multiplier: vcpu_id = 256 * vcpu2_index + vcpu_index. With that, and a new capability (KVM_CAP_ARM_IRQ_LINE_LAYOUT_2) allowing this to be discovered, it becomes possible to inject PPIs to up to 4096 vcpus. But please just don't. Whilst we're there, add a clarification about the use of KVM_IRQ_LINE on arm, which is not completely conditioned by KVM_CAP_IRQCHIP. Reported-by: Zenghui Yu <yuzenghui@huawei.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
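The new layout is easiest to see as plain arithmetic on the ioctl's irq field: 4 bits of irq_type, 4 bits of vcpu2_index, 8 bits of vcpu_index, with the addressed vcpu being 256 * vcpu2_index + vcpu_index. A small sketch of encoding and decoding under that layout; the bit positions follow the description above, so treat the KVM documentation as authoritative for the real ABI:

```c
#include <stdio.h>

/* Field layout as described above (check the KVM documentation for the
 * authoritative bit positions):
 *   [31:28] irq_type  [27:24] vcpu2_index  [23:16] vcpu_index  [15:0] irq_num
 */
static unsigned int demo_encode_irq(unsigned int irq_type, unsigned int vcpu_id,
                                    unsigned int irq_num)
{
    unsigned int vcpu2_index = vcpu_id / 256;   /* high part of the vcpu number */
    unsigned int vcpu_index  = vcpu_id % 256;   /* low part, as before */

    return (irq_type << 28) | (vcpu2_index << 24) | (vcpu_index << 16) | irq_num;
}

static unsigned int demo_decode_vcpu(unsigned int field)
{
    unsigned int vcpu2_index = (field >> 24) & 0xf;
    unsigned int vcpu_index  = (field >> 16) & 0xff;

    return 256 * vcpu2_index + vcpu_index;      /* the formula from the commit */
}

int main(void)
{
    unsigned int field = demo_encode_irq(1, 300, 27);   /* PPI 27 on vcpu 300 */

    printf("vcpu_id round-trips to %u\n", demo_decode_vcpu(field));
    return 0;
}
```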
-
- 28 August 2019, 1 commit
-
-
Committed by Marc Zyngier
A guest is not allowed to inject an SGI (or clear its pending state) by writing to GICD_ISPENDR0 (resp. GICD_ICPENDR0), as these bits are defined as WI (as per ARM IHI 0048B 4.3.7 and 4.3.8). Make sure we correctly emulate the architecture. Fixes: 96b29800 ("KVM: arm/arm64: vgic-new: Add PENDING registers handlers") Cc: stable@vger.kernel.org # 4.7+ Reported-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
-
- 27 August 2019, 1 commit
-
-
Committed by Heyi Guo
If the ap_list is longer than 256 entries, merge_final() in list_sort() will call the comparison callback with the same element twice, causing a deadlock in vgic_irq_cmp(). Fix it by returning early when irqa == irqb. Cc: stable@vger.kernel.org # 4.7+ Fixes: 8e444745 ("KVM: arm/arm64: vgic-new: Add IRQ sorting") Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Heyi Guo <guoheyi@huawei.com> [maz: massaged commit log and patch, added Fixes and Cc-stable] Signed-off-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Will Deacon <will@kernel.org>
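The fix itself is a one-line early return in the sort callback: when list_sort() hands the callback the same element for both arguments (which merge_final() does for long lists), comparing an element against itself must bail out before it tries to take the same per-IRQ lock twice. A standalone sketch of that shape, with the locking only described in a comment rather than modelled:

```c
struct demo_irq {
    unsigned int priority;
    /* The real vgic_irq also carries a per-IRQ lock that the comparison
     * callback takes on both of its arguments, hence the self-deadlock. */
};

/* list_sort()-style comparison callback: it must tolerate being handed
 * the same element for both arguments. */
static int demo_irq_cmp(const struct demo_irq *a, const struct demo_irq *b)
{
    if (a == b)
        return 0;   /* the fix: bail out before "locking both" elements */

    return (a->priority > b->priority) - (a->priority < b->priority);
}

int main(void)
{
    struct demo_irq irq = { .priority = 5 };

    return demo_irq_cmp(&irq, &irq);   /* 0: no attempt to double-lock */
}
```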
-
- 25 August 2019, 2 commits
-
-
Committed by Eric Auger
At the moment we use 2 IO devices per GICv3 redistributor: one for the RD_base frame and one for the SGI_base frame. Instead we can use a single IO device per redistributor (the 2 frames are contiguous). This saves slots on the KVM_MMIO_BUS, which is currently limited to NR_IOBUS_DEVS (1000). This change allows instantiating up to 512 redistributors and may speed up guest boot with a large number of VCPUs. Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
Committed by Marc Zyngier
Detected by Coccinelle (and Will Deacon) using scripts/coccinelle/misc/semicolon.cocci. Reported-by: Will Deacon <will@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
- 24 August 2019, 1 commit
-
-
Committed by Andre Przywara
At the moment we initialise the target *mask* of a virtual IRQ to the VCPU it belongs to, even though this mask is only defined for GICv2 and quickly runs out of bits for many GICv3 guests. This behaviour triggers a UBSAN complaint for more than 32 VCPUs:
  ------
  [ 5659.462377] UBSAN: Undefined behaviour in virt/kvm/arm/vgic/vgic-init.c:223:21
  [ 5659.471689] shift exponent 32 is too large for 32-bit type 'unsigned int'
  ------
Also for GICv3 guests the reporting of TARGET in the "vgic-state" debugfs dump is wrong, due to this very same problem. Because there is no requirement to create the VGIC device before the VCPUs (and QEMU actually does it the other way round), we can't safely initialise mpidr or targets in kvm_vgic_vcpu_init(). But since we touch every private IRQ for each VCPU anyway later (in vgic_init()), we can just move the initialisation of those fields into there, where we definitely know the VGIC type. On the way make sure we really have either a VGICv2 or a VGICv3 device, since the existing code is just checking for "VGICv3 or not", silently ignoring the uninitialised case. Signed-off-by: Andre Przywara <andre.przywara@arm.com> Reported-by: Dave Martin <dave.martin@arm.com> Tested-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
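The undefined behaviour comes from computing a GICv2-style target mask as "1 << vcpu_id" in a 32-bit type, which stops being meaningful (and becomes undefined) once vcpu_id reaches 32. A minimal illustration of the problem and an explicit range guard; this only models the bad shift, it is not the actual vgic-init rework:

```c
#include <stdio.h>

/* GICv2 has an 8-bit per-IRQ target mask, so "1 << vcpu_id" is only
 * meaningful for small vcpu_id values; for a GICv3 guest this field is
 * simply not defined and must not be computed like this at all. */
static unsigned int demo_gicv2_targets(unsigned int vcpu_id)
{
    if (vcpu_id >= 8)
        return 0;            /* out of range for GICv2's target byte */

    return 1u << vcpu_id;    /* well-defined only thanks to the check above */
}

int main(void)
{
    printf("targets for vcpu 3:  0x%x\n", demo_gicv2_targets(3));
    printf("targets for vcpu 40: 0x%x\n", demo_gicv2_targets(40));
    return 0;
}
```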
-
- 22 August 2019, 1 commit
-
-
Committed by Andrew Jones
If, after an MMIO exit to userspace, a VCPU is immediately run with an immediate_exit request, such as when a signal is delivered or an MMIO emulation completion is needed, then the VCPU completes the MMIO emulation and immediately returns to userspace. As the exit_reason does not get changed from KVM_EXIT_MMIO in these cases, we have to be careful not to complete the MMIO emulation again when the VCPU is eventually run again, because the emulation does an instruction skip (and doing too many skips would be a waste of guest code :-). We need to use additional VCPU state to track if the emulation is complete. As luck would have it, we already have 'mmio_needed', which even appears to be used in this way by other architectures already. Fixes: 0d640732 ("arm64: KVM: Skip MMIO insn after emulation") Acked-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Andrew Jones <drjones@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
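The essence of the fix is a per-VCPU flag recording whether an MMIO load is still waiting to be completed, so the completion (and its single instruction skip) runs at most once per exit. A hedged sketch of that little state machine with invented names:

```c
#include <stdbool.h>
#include <stdio.h>

struct demo_vcpu {
    bool mmio_needed;     /* an MMIO emulation is still pending completion */
    unsigned long pc;
};

static void demo_mmio_exit(struct demo_vcpu *v)
{
    v->mmio_needed = true;        /* set when exiting to userspace for MMIO */
}

static void demo_complete_mmio(struct demo_vcpu *v)
{
    if (!v->mmio_needed)
        return;                   /* already completed on an earlier re-entry */

    v->mmio_needed = false;
    v->pc += 4;                   /* the one-and-only instruction skip */
}

int main(void)
{
    struct demo_vcpu v = { false, 0 };

    demo_mmio_exit(&v);
    demo_complete_mmio(&v);   /* immediate_exit re-entry: completes and skips */
    demo_complete_mmio(&v);   /* later run with stale KVM_EXIT_MMIO: a no-op */
    printf("pc advanced by %lu bytes\n", v.pc);
    return 0;
}
```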
-
- 19 August 2019, 12 commits
-
-
Committed by Marc Zyngier
When a vcpu is about to block by calling kvm_vcpu_block, we call back into the arch code to allow any form of synchronization that may be required at this point (SVM stops the AVIC, ARM synchronises the VMCR and enables GICv4 doorbells). But this synchronization comes in quite late, as we've potentially waited for halt_poll_ns to expire. Instead, let's move kvm_arch_vcpu_blocking() to the beginning of kvm_vcpu_block(), which on ARM has several benefits:
  - VMCR gets synchronised early, meaning that any interrupt delivered during the polling window will be evaluated with the correct guest PMR
  - GICv4 doorbells are enabled, which means that any guest interrupt directly injected during that window will be immediately recognised
Tang Nianyao ran some tests on a GICv4 machine to evaluate such a change, and reported up to a 10% improvement for netperf:
  <quote>
  netperf result:
  D06 as server, intel 8180 server as client
  with change:
    package 512 bytes - 5500 Mbits/s
    package 64 bytes  - 760 Mbits/s
  without change:
    package 512 bytes - 5000 Mbits/s
    package 64 bytes  - 710 Mbits/s
  </quote>
Acked-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
Committed by Alexandru Elisei
Since commit 503a6286 ("KVM: arm/arm64: vgic: Rely on the GIC driver to parse the firmware tables"), the vgic_v{2,3}_probe functions stopped using a DT node. Commit 90977732 ("KVM: arm/arm64: vgic-new: vgic_init: implement kvm_vgic_hyp_init") changed the functions again, and now they require exactly one argument, a struct gic_kvm_info populated by the GIC driver. Unfortunately the comments regressed and state that a DT node is used instead. Change the function comments to reflect the current prototypes. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
Committed by Marc Zyngier
Now that we have a cache of MSI->LPI translations, it is pretty easy to implement kvm_arch_set_irq_inatomic (this cache can be parsed without sleeping). Hopefully, this will improve some LPI-heavy workloads. Tested-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
Committed by Marc Zyngier
When performing an MSI injection, let's first check if the translation is already in the cache. If so, let's inject it quickly without going through the whole translation process. Tested-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
Committed by Marc Zyngier
On a successful translation, preserve the parameters in the LPI translation cache. Each new translation reuses the last slot in the list, naturally evicting the least recently used entry. Tested-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
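The cache behaves as an LRU list: a hit moves the entry towards the head, and a new translation overwrites the tail entry, so the least recently used translation is the one evicted. A compact standalone model of that policy follows, using a fixed-size array instead of the kernel's locked list; sizes and field names are illustrative only:

```c
#include <stdio.h>
#include <string.h>

#define DEMO_CACHE_SIZE 4   /* the real cache is sized 16 * nr_vcpus */

struct demo_xlat { unsigned int devid, eventid, intid; int valid; };

/* Head of the array = most recently used, tail = the eviction candidate. */
static struct demo_xlat cache[DEMO_CACHE_SIZE];

static void demo_move_to_head(int i)
{
    struct demo_xlat hit = cache[i];

    memmove(&cache[1], &cache[0], i * sizeof(cache[0]));
    cache[0] = hit;
}

static int demo_cache_lookup(unsigned int devid, unsigned int eventid,
                             unsigned int *intid)
{
    for (int i = 0; i < DEMO_CACHE_SIZE; i++) {
        if (cache[i].valid && cache[i].devid == devid &&
            cache[i].eventid == eventid) {
            *intid = cache[i].intid;
            demo_move_to_head(i);   /* keep hot translations near the head */
            return 1;
        }
    }
    return 0;
}

static void demo_cache_insert(unsigned int devid, unsigned int eventid,
                              unsigned int intid)
{
    /* Reuse the last slot: whatever sat there was the LRU entry. */
    cache[DEMO_CACHE_SIZE - 1] = (struct demo_xlat){ devid, eventid, intid, 1 };
    demo_move_to_head(DEMO_CACHE_SIZE - 1);
}

int main(void)
{
    unsigned int intid = 0;
    int hit;

    demo_cache_insert(0, 1, 8192);
    hit = demo_cache_lookup(0, 1, &intid);
    printf("hit: %d, intid: %u\n", hit, intid);
    return 0;
}
```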
-
Committed by Marc Zyngier
In order to avoid leaking vgic_irq structures on teardown, we need to drop all references to LPIs before deallocating the cache itself. This is done by invalidating the cache on vgic teardown. Tested-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
Committed by Marc Zyngier
If an ITS gets disabled, we need to make sure that further interrupts won't hit in the cache. For that, we invalidate the translation cache when the ITS is disabled. Tested-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
Committed by Marc Zyngier
If a vcpu disables LPIs at its redistributor level, we need to make sure we won't pend more interrupts. For this, we need to invalidate the LPI translation cache. Tested-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
Committed by Marc Zyngier
The LPI translation cache needs to be discarded when an ITS command may affect the translation of an LPI (DISCARD, MAPC and MAPD with V=0) or the routing of an LPI to a redistributor with disabled LPIs (MOVI, MOVALL). We decide to perform a full invalidation of the cache, irrespective of the LPI that is affected. Commands are supposed to be rare enough that it doesn't matter. Tested-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
Committed by Marc Zyngier
There are a number of cases where we need to invalidate the caching of translations, so let's add basic support for that. Tested-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
Committed by Marc Zyngier
Our LPI translation cache needs to be able to drop the refcount on an LPI whilst already holding the lpi_list_lock. Let's add a new primitive for this. Tested-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
Committed by Marc Zyngier
Add the basic data structure that expresses an MSI to LPI translation as well as the allocation/release hooks. The size of the cache is arbitrarily defined as 16*nr_vcpus. Tested-by: Andre Przywara <andre.przywara@arm.com> Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
- 09 August 2019, 2 commits
-
-
Committed by Paolo Bonzini
The same check is already done in kvm_is_reserved_pfn. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Alexandru Elisei
A HW-mapped, level-sensitive interrupt asserted by a device will not be put into the ap_list if it is disabled at the VGIC level. When it is enabled again, it will be inserted into the ap_list and written to a list register on guest entry regardless of the state of the device. We could argue that this can also happen on real hardware, when the command to enable the interrupt reached the GIC before the device had the chance to de-assert the interrupt signal; however, we emulate the distributor and redistributors in software and we can do better than that. Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
-
- 05 August 2019, 5 commits
-
-
Committed by Marc Zyngier
Since commit 328e5664 ("KVM: arm/arm64: vgic: Defer touching GICH_VMCR to vcpu_load/put"), we leave ICH_VMCR_EL2 (or its GICv2 equivalent) loaded as long as we can, only syncing it back when we're scheduled out. There is a small snag with that though: kvm_vgic_vcpu_pending_irq(), which is indirectly called from kvm_vcpu_check_block(), needs to evaluate the guest's view of ICC_PMR_EL1. At the point where we call kvm_vcpu_check_block(), the vcpu is still loaded, and whatever change was made to PMR is not visible in memory until we do a vcpu_put(). Things go really south if the guest does the following:
  mov x0, #0           // or any small value masking interrupts
  msr ICC_PMR_EL1, x0
  [vcpu preempted, then rescheduled, VMCR sampled]
  mov x0, #ff          // allow all interrupts
  msr ICC_PMR_EL1, x0
  wfi                  // traps to EL2, so sampling of VMCR
  [interrupt arrives just after WFI]
Here, the hypervisor's view of PMR is zero, while the guest has enabled its interrupts. kvm_vgic_vcpu_pending_irq() will then say that no interrupts are pending (despite an interrupt being received) and we'll block for no reason. If the guest doesn't have a periodic interrupt firing once it has blocked, it will stay there forever. To avoid this unfortunate situation, let's resync VMCR from kvm_arch_vcpu_blocking(), ensuring that a following kvm_vcpu_check_block() will observe the latest value of PMR. This has been found by booting an arm64 Linux guest with the pseudo-NMI feature, and thus using interrupt priorities to mask interrupts instead of the usual PSTATE masking. Cc: stable@vger.kernel.org # 4.12 Fixes: 328e5664 ("KVM: arm/arm64: vgic: Defer touching GICH_VMCR to vcpu_load/put") Signed-off-by: Marc Zyngier <maz@kernel.org>
-
Committed by Greg KH
When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Also, when doing this, change kvm_arch_create_vcpu_debugfs() to return void instead of an integer, as we should not care at all about whether this function actually does anything or not. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <x86@kernel.org> Cc: <kvm@vger.kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Paolo Bonzini
There is no need for this function, as all arches have to implement kvm_arch_create_vcpu_debugfs() no matter what. A #define symbol lets us actually simplify the code. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Wanpeng Li
After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a five-year-old bug was exposed. Running the ebizzy benchmark in three 80-vCPU VMs on one 80-pCPU Skylake server, a lot of rcu_sched stall warnings splat in the VMs after stress testing:
  INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
  Call Trace:
   flush_tlb_mm_range+0x68/0x140
   tlb_flush_mmu.part.75+0x37/0xe0
   tlb_finish_mmu+0x55/0x60
   zap_page_range+0x142/0x190
   SyS_madvise+0x3cd/0x9c0
   system_call_fastpath+0x1c/0x21
swait_active() remains true before finish_swait() is called in kvm_vcpu_block(), and because voluntarily preempted vCPUs are taken into account by the kvm_vcpu_on_spin() loop, the probability that kvm_arch_vcpu_runnable(vcpu) is checked and returns true increases greatly. When APICv is enabled, the yield-candidate vCPU's VMCS RVI field leaks (via vmx_sync_pir_to_irr()) into the spinning-on-a-taken-lock vCPU's current VMCS. This patch fixes it by conservatively checking a subset of events. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Marc Zyngier <Marc.Zyngier@arm.com> Cc: stable@vger.kernel.org Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop) Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Wanpeng Li
preempted_in_kernel is updated in the preempt notifier when involuntary preemption occurs, so it can be stale when voluntarily preempted vCPUs are taken into account by the kvm_vcpu_on_spin() loop. This patch lets it just check preempted_in_kernel for involuntary preemption. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 26 July 2019, 1 commit
-
-
Committed by Anders Roxell
When fall-through warnings were enabled by default, the following warnings started to show up:
  ../virt/kvm/arm/hyp/vgic-v3-sr.c: In function ‘__vgic_v3_save_aprs’:
  ../virt/kvm/arm/hyp/vgic-v3-sr.c:351:24: warning: this statement may fall through [-Wimplicit-fallthrough=]
    cpu_if->vgic_ap0r[2] = __vgic_v3_read_ap0rn(2);
    ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
  ../virt/kvm/arm/hyp/vgic-v3-sr.c:352:2: note: here
    case 6:
    ^~~~
  ../virt/kvm/arm/hyp/vgic-v3-sr.c:353:24: warning: this statement may fall through [-Wimplicit-fallthrough=]
    cpu_if->vgic_ap0r[1] = __vgic_v3_read_ap0rn(1);
    ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
  ../virt/kvm/arm/hyp/vgic-v3-sr.c:354:2: note: here
    default:
    ^~~~~~~
Rework the code so that the compiler doesn't warn about fall-through. Fixes: d93512ef0f0e ("Makefile: Globally enable fall-through warning") Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Signed-off-by: Marc Zyngier <maz@kernel.org>
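The warning fires because the register save is written as a descending switch in which each case deliberately continues into the next; the compiler just needs that intent spelled out (or the code restructured so it no longer falls through). The sketch below shows the annotated fall-through shape in a self-contained form; it is illustrative and not the actual __vgic_v3_save_aprs() rework:

```c
#include <stdio.h>

static unsigned int demo_read_apr(int n) { return 0x10u + (unsigned int)n; }

/* Save as many active-priority registers as the implementation has;
 * each case deliberately continues into the one below it. */
static void demo_save_aprs(unsigned int ap0r[4], int nr_apr_regs)
{
    switch (nr_apr_regs) {
    case 7:
        ap0r[3] = demo_read_apr(3);
        /* fall through */
    case 6:
        ap0r[2] = demo_read_apr(2);
        /* fall through */
    default:
        ap0r[1] = demo_read_apr(1);
        ap0r[0] = demo_read_apr(0);
    }
}

int main(void)
{
    unsigned int ap0r[4] = { 0 };

    demo_save_aprs(ap0r, 7);
    printf("%x %x %x %x\n", ap0r[0], ap0r[1], ap0r[2], ap0r[3]);
    return 0;
}
```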
-
- 24 July 2019, 1 commit
-
-
Committed by Christoph Hellwig
Renaming docs seems to be en vogue at the moment, so fix one of the grossly misnamed directories. We usually never use "virtual" as a shortcut for virtualization in the kernel, but always virt, as seen in the virt/ top-level directory. Fix up the documentation to match that. Fixes: ed16648e ("Move kvm, uml, and lguest subdirectories under a common "virtual" directory, I.E:") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 23 July 2019, 1 commit
-
-
Committed by Zenghui Yu
We use "pmc->idx" and the "chained" bitmap to determine if the pmc is chained, in kvm_pmu_pmc_is_chained(). But idx might be uninitialized (and random) when we make this decision, through a KVM_ARM_VCPU_INIT ioctl -> kvm_pmu_vcpu_reset(). And the test_bit() against this random idx will potentially hit a KASAN BUG [1]. In general, idx is a static property of a PMU counter that is not expected to be modified across resets, as suggested by Julien. It looks more reasonable if we can set up the PMU counter idx for a vcpu at its creation time. Introduce a new function - kvm_pmu_vcpu_init() - for this basic setup. Oh, and the KASAN BUG will get fixed this way. [1] https://www.spinics.net/lists/kvm-arm/msg36700.html Fixes: 80f393a2 ("KVM: arm/arm64: Support chained PMU counters") Suggested-by: Andrew Murray <andrew.murray@arm.com> Suggested-by: Julien Thierry <julien.thierry@arm.com> Acked-by: Julien Thierry <julien.thierry@arm.com> Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
-
- 20 July 2019, 1 commit
-
-
Committed by Wanpeng Li
Inspired by commit 9cac38dd (KVM/s390: Set preempted flag during vcpu wakeup and interrupt delivery), we want to boost not just lock holders but also vCPUs that are delivering interrupts. Most smp_call_function_many calls are synchronous, so the IPI target vCPUs are also good yield candidates. This patch introduces vcpu->ready to boost vCPUs during wakeup and interrupt delivery time; unlike s390 we do not reuse vcpu->preempted, so that voluntarily preempted vCPUs are taken into account by kvm_vcpu_on_spin, but vmx_vcpu_pi_put is not affected (VT-d PI handles voluntary preemption separately, in pi_pre_block). Testing on an 80-HT, 2-socket Xeon Skylake server, with an 80-vCPU, 80GB-RAM VM:
  ebizzy -M
         vanilla    boosting    improved
  1VM    21443      23520       9%
  2VM    2800       8000        180%
  3VM    1800       3100        72%
Testing on my Haswell desktop (8 HT), with an 8-vCPU, 8GB-RAM VM, two VMs, one running ebizzy -M, the other running 'stress --cpu 2':
  w/ boosting + w/o pv sched yield (vanilla)
         vanilla    boosting    improved
         1570       4000        155%
  w/ boosting + w/ pv sched yield (vanilla)
         vanilla    boosting    improved
         1844       5157        179%
  w/o boosting, perf top in VM:
    72.33%  [kernel]       [k] smp_call_function_many
     4.22%  [kernel]       [k] call_function_i
     3.71%  [kernel]       [k] async_page_fault
  w/ boosting, perf top in VM:
    38.43%  [kernel]       [k] smp_call_function_many
     6.31%  [kernel]       [k] async_page_fault
     6.13%  libc-2.23.so   [.] __memcpy_avx_unaligned
     4.88%  [kernel]       [k] call_function_interrupt
Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Marc Zyngier <maz@kernel.org> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
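The mechanism is simple: wakeup and interrupt delivery mark the target VCPU as ready, and the directed-yield loop treats a ready VCPU as a boost candidate even though it was not involuntarily preempted. A toy model of that candidate selection with invented names and no real scheduler involved:

```c
#include <stdbool.h>
#include <stdio.h>

struct demo_vcpu {
    int id;
    bool preempted;   /* involuntarily descheduled while running */
    bool ready;       /* just woken up / has an interrupt to deliver */
};

/* Rough model of the directed-yield candidate check: boost VCPUs that
 * were preempted *or* are ready to deliver an interrupt. */
static bool demo_is_yield_candidate(const struct demo_vcpu *v)
{
    return v->preempted || v->ready;
}

static void demo_deliver_interrupt(struct demo_vcpu *v)
{
    v->ready = true;   /* set at wakeup / interrupt-delivery time */
}

int main(void)
{
    struct demo_vcpu target = { .id = 2 };

    demo_deliver_interrupt(&target);
    printf("vcpu %d is %sa yield candidate\n",
           target.id, demo_is_yield_candidate(&target) ? "" : "not ");
    return 0;
}
```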
-