1. 19 7月, 2019 1 次提交
  2. 13 7月, 2019 1 次提交
  3. 11 7月, 2019 2 次提交
    • S
      KVM: x86: Unconditionally enable irqs in guest context · d7a08882
      Sean Christopherson 提交于
      On VMX, KVM currently does not re-enable irqs until after it has exited
      the guest context.  As a result, a tick that fires in the window between
      VM-Exit and guest_exit_irqoff() will be accounted as system time.  While
      said window is relatively small, it's large enough to be problematic in
      some configurations, e.g. if VM-Exits are consistently occurring a hair
      earlier than the tick irq.
      
      Intentionally toggle irqs back off so that guest_exit_irqoff() can be
      used in lieu of guest_exit() in order to avoid the save/restore of flags
      in guest_exit().  On my Haswell system, "nop; cli; sti" is ~6 cycles,
      versus ~28 cycles for "pushf; pop <reg>; cli; push <reg>; popf".
      
      Fixes: f2485b3e ("KVM: x86: use guest_exit_irqoff")
      Reported-by: NWei Yang <w90p710@gmail.com>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d7a08882
    • E
      KVM: x86: PMU Event Filter · 66bb8a06
      Eric Hankland 提交于
      Some events can provide a guest with information about other guests or the
      host (e.g. L3 cache stats); providing the capability to restrict access
      to a "safe" set of events would limit the potential for the PMU to be used
      in any side channel attacks. This change introduces a new VM ioctl that
      sets an event filter. If the guest attempts to program a counter for
      any blacklisted or non-whitelisted event, the kernel counter won't be
      created, so any RDPMC/RDMSR will show 0 instances of that event.
      Signed-off-by: NEric Hankland <ehankland@google.com>
      [Lots of changes. All remaining bugs are probably mine. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      66bb8a06
  4. 10 7月, 2019 1 次提交
  5. 06 7月, 2019 1 次提交
  6. 05 7月, 2019 14 次提交
  7. 03 7月, 2019 13 次提交
    • W
      KVM: LAPIC: remove the trailing newline used in the fmt parameter of TP_printk · 7be373b6
      Wanpeng Li 提交于
      The trailing newlines will lead to extra newlines in the trace file
      which looks like the following output, so remove it.
      
      qemu-system-x86-15695 [002] ...1 15774.839240: kvm_hv_timer_state: vcpu_id 0 hv_timer 1
      
      qemu-system-x86-15695 [002] ...1 15774.839309: kvm_hv_timer_state: vcpu_id 0 hv_timer 1
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7be373b6
    • P
      KVM: svm: add nrips module parameter · d647eb63
      Paolo Bonzini 提交于
      Allow testing code for old processors that lack the next RIP save
      feature, by disabling usage of the next_rip field.
      
      Nested hypervisors however get the feature unconditionally.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d647eb63
    • M
      clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic · dd2cb348
      Michael Kelley 提交于
      Continue consolidating Hyper-V clock and timer code into an ISA
      independent Hyper-V clocksource driver.
      
      Move the existing clocksource code under drivers/hv and arch/x86 to the new
      clocksource driver while separating out the ISA dependencies. Update
      Hyper-V initialization to call initialization and cleanup routines since
      the Hyper-V synthetic clock is not independently enumerated in ACPI.
      
      Update Hyper-V clocksource users in KVM and VDSO to get definitions from
      the new include file.
      
      No behavior is changed and no new functionality is added.
      Suggested-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NMichael Kelley <mikelley@microsoft.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Cc: "bp@alien8.de" <bp@alien8.de>
      Cc: "will.deacon@arm.com" <will.deacon@arm.com>
      Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>
      Cc: "mark.rutland@arm.com" <mark.rutland@arm.com>
      Cc: "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>
      Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
      Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
      Cc: "olaf@aepfle.de" <olaf@aepfle.de>
      Cc: "apw@canonical.com" <apw@canonical.com>
      Cc: "jasowang@redhat.com" <jasowang@redhat.com>
      Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
      Cc: Sunil Muthuswamy <sunilmut@microsoft.com>
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: "sashal@kernel.org" <sashal@kernel.org>
      Cc: "vincenzo.frascino@arm.com" <vincenzo.frascino@arm.com>
      Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
      Cc: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>
      Cc: "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org>
      Cc: "arnd@arndb.de" <arnd@arndb.de>
      Cc: "linux@armlinux.org.uk" <linux@armlinux.org.uk>
      Cc: "ralf@linux-mips.org" <ralf@linux-mips.org>
      Cc: "paul.burton@mips.com" <paul.burton@mips.com>
      Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
      Cc: "salyzyn@android.com" <salyzyn@android.com>
      Cc: "pcc@google.com" <pcc@google.com>
      Cc: "shuah@kernel.org" <shuah@kernel.org>
      Cc: "0x7f454c46@gmail.com" <0x7f454c46@gmail.com>
      Cc: "linux@rasmusvillemoes.dk" <linux@rasmusvillemoes.dk>
      Cc: "huw@codeweavers.com" <huw@codeweavers.com>
      Cc: "sfr@canb.auug.org.au" <sfr@canb.auug.org.au>
      Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>
      Cc: "rkrcmar@redhat.com" <rkrcmar@redhat.com>
      Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
      Link: https://lkml.kernel.org/r/1561955054-1838-3-git-send-email-mikelley@microsoft.com
      dd2cb348
    • W
      KVM: LAPIC: Fix pending interrupt in IRR blocked by software disable LAPIC · bb34e690
      Wanpeng Li 提交于
      Thomas reported that:
      
       | Background:
       |
       |    In preparation of supporting IPI shorthands I changed the CPU offline
       |    code to software disable the local APIC instead of just masking it.
       |    That's done by clearing the APIC_SPIV_APIC_ENABLED bit in the APIC_SPIV
       |    register.
       |
       | Failure:
       |
       |    When the CPU comes back online the startup code triggers occasionally
       |    the warning in apic_pending_intr_clear(). That complains that the IRRs
       |    are not empty.
       |
       |    The offending vector is the local APIC timer vector who's IRR bit is set
       |    and stays set.
       |
       | It took me quite some time to reproduce the issue locally, but now I can
       | see what happens.
       |
       | It requires apicv_enabled=0, i.e. full apic emulation. With apicv_enabled=1
       | (and hardware support) it behaves correctly.
       |
       | Here is the series of events:
       |
       |     Guest CPU
       |
       |     goes down
       |
       |       native_cpu_disable()
       |
       | 			apic_soft_disable();
       |
       |     play_dead()
       |
       |     ....
       |
       |     startup()
       |
       |       if (apic_enabled())
       |         apic_pending_intr_clear()	<- Not taken
       |
       |      enable APIC
       |
       |         apic_pending_intr_clear()	<- Triggers warning because IRR is stale
       |
       | When this happens then the deadline timer or the regular APIC timer -
       | happens with both, has fired shortly before the APIC is disabled, but the
       | interrupt was not serviced because the guest CPU was in an interrupt
       | disabled region at that point.
       |
       | The state of the timer vector ISR/IRR bits:
       |
       |     	     	       	        ISR     IRR
       | before apic_soft_disable()    0	      1
       | after apic_soft_disable()     0	      1
       |
       | On startup		      		 0	      1
       |
       | Now one would assume that the IRR is cleared after the INIT reset, but this
       | happens only on CPU0.
       |
       | Why?
       |
       | Because our CPU0 hotplug is just for testing to make sure nothing breaks
       | and goes through an NMI wakeup vehicle because INIT would send it through
       | the boots-trap code which is not really working if that CPU was not
       | physically unplugged.
       |
       | Now looking at a real world APIC the situation in that case is:
       |
       |     	     	       	      	ISR     IRR
       | before apic_soft_disable()    0	      1
       | after apic_soft_disable()     0	      1
       |
       | On startup		      		 0	      0
       |
       | Why?
       |
       | Once the dying CPU reenables interrupts the pending interrupt gets
       | delivered as a spurious interupt and then the state is clear.
       |
       | While that CPU0 hotplug test case is surely an esoteric issue, the APIC
       | emulation is still wrong, Even if the play_dead() code would not enable
       | interrupts then the pending IRR bit would turn into an ISR .. interrupt
       | when the APIC is reenabled on startup.
      
      From SDM 10.4.7.2 Local APIC State After It Has Been Software Disabled
      * Pending interrupts in the IRR and ISR registers are held and require
        masking or handling by the CPU.
      
      In Thomas's testing, hardware cpu will not respect soft disable LAPIC
      when IRR has already been set or APICv posted-interrupt is in flight,
      so we can skip soft disable APIC checking when clearing IRR and set ISR,
      continue to respect soft disable APIC when attempting to set IRR.
      Reported-by: NRong Chen <rong.a.chen@intel.com>
      Reported-by: NFeng Tang <feng.tang@intel.com>
      Reported-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rong Chen <rong.a.chen@intel.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bb34e690
    • L
      KVM: nVMX: Change KVM_STATE_NESTED_EVMCS to signal vmcs12 is copied from eVMCS · 323d73a8
      Liran Alon 提交于
      Currently KVM_STATE_NESTED_EVMCS is used to signal that eVMCS
      capability is enabled on vCPU.
      As indicated by vmx->nested.enlightened_vmcs_enabled.
      
      This is quite bizarre as userspace VMM should make sure to expose
      same vCPU with same CPUID values in both source and destination.
      In case vCPU is exposed with eVMCS support on CPUID, it is also
      expected to enable KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability.
      Therefore, KVM_STATE_NESTED_EVMCS is redundant.
      
      KVM_STATE_NESTED_EVMCS is currently used on restore path
      (vmx_set_nested_state()) only to enable eVMCS capability in KVM
      and to signal need_vmcs12_sync such that on next VMEntry to guest
      nested_sync_from_vmcs12() will be called to sync vmcs12 content
      into eVMCS in guest memory.
      However, because restore nested-state is rare enough, we could
      have just modified vmx_set_nested_state() to always signal
      need_vmcs12_sync.
      
      From all the above, it seems that we could have just removed
      the usage of KVM_STATE_NESTED_EVMCS. However, in order to preserve
      backwards migration compatibility, we cannot do that.
      (vmx_get_nested_state() needs to signal flag when migrating from
      new kernel to old kernel).
      
      Returning KVM_STATE_NESTED_EVMCS when just vCPU have eVMCS enabled
      have a bad side-effect of userspace VMM having to send nested-state
      from source to destination as part of migration stream. Even if
      guest have never used eVMCS as it doesn't even run a nested
      hypervisor workload. This requires destination userspace VMM and
      KVM to support setting nested-state. Which make it more difficult
      to migrate from new host to older host.
      To avoid this, change KVM_STATE_NESTED_EVMCS to signal eVMCS is
      not only enabled but also active. i.e. Guest have made some
      eVMCS active via an enlightened VMEntry. i.e. vmcs12 is copied
      from eVMCS and therefore should be restored into eVMCS resident
      in memory (by copy_vmcs12_to_enlightened()).
      Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: NMaran Wilson <maran.wilson@oracle.com>
      Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      323d73a8
    • L
      KVM: nVMX: Allow restore nested-state to enable eVMCS when vCPU in SMM · 65b712f1
      Liran Alon 提交于
      As comment in code specifies, SMM temporarily disables VMX so we cannot
      be in guest mode, nor can VMLAUNCH/VMRESUME be pending.
      
      However, code currently assumes that these are the only flags that can be
      set on kvm_state->flags. This is not true as KVM_STATE_NESTED_EVMCS
      can also be set on this field to signal that eVMCS should be enabled.
      
      Therefore, fix code to check for guest-mode and pending VMLAUNCH/VMRESUME
      explicitly.
      Reviewed-by: NJoao Martins <joao.m.martins@oracle.com>
      Signed-off-by: NLiran Alon <liran.alon@oracle.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      65b712f1
    • P
      KVM: x86: degrade WARN to pr_warn_ratelimited · 3f16a5c3
      Paolo Bonzini 提交于
      This warning can be triggered easily by userspace, so it should certainly not
      cause a panic if panic_on_warn is set.
      
      Reported-by: syzbot+c03f30b4f4c46bdf8575@syzkaller.appspotmail.com
      Suggested-by: NAlexander Potapenko <glider@google.com>
      Acked-by: NAlexander Potapenko <glider@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3f16a5c3
    • J
      kvm: x86: Pass through AMD_STIBP_ALWAYS_ON in GET_SUPPORTED_CPUID · c550505b
      Jim Mattson 提交于
      This bit is purely advisory. Passing it through to the guest indicates
      that the virtual processor, like the physical processor, prefers that
      STIBP is only set once during boot and not changed.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c550505b
    • J
      kvm: nVMX: Remove unnecessary sync_roots from handle_invept · b1190198
      Jim Mattson 提交于
      When L0 is executing handle_invept(), the TDP MMU is active. Emulating
      an L1 INVEPT does require synchronizing the appropriate shadow EPT
      root(s), but a call to kvm_mmu_sync_roots in this context won't do
      that. Similarly, the hardware TLB and paging-structure-cache entries
      associated with the appropriate shadow EPT root(s) must be flushed,
      but requesting a TLB_FLUSH from this context won't do that either.
      
      How did this ever work? KVM always does a sync_roots and TLB flush (in
      the correct context) when transitioning from L1 to L2. That isn't the
      best choice for nested VM performance, but it effectively papers over
      the mistakes here.
      
      Remove the unnecessary operations and leave a comment to try to do
      better in the future.
      Reported-by: NJunaid Shahid <junaids@google.com>
      Fixes: bfd0a56b ("nEPT: Nested INVEPT")
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Nadav Har'El <nyh@il.ibm.com>
      Cc: Jun Nakajima <jun.nakajima@intel.com>
      Cc: Xinhao Xu <xinhao.xu@intel.com>
      Cc: Yang Zhang <yang.z.zhang@Intel.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by Peter Shier <pshier@google.com>
      Reviewed-by: NJunaid Shahid <junaids@google.com>
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b1190198
    • W
      KVM: X86: Expose PV_SCHED_YIELD CPUID feature bit to guest · 32b72ecc
      Wanpeng Li 提交于
      Expose PV_SCHED_YIELD feature bit to guest, the guest can check this
      feature bit before using paravirtualized sched yield.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      32b72ecc
    • W
      KVM: X86: Implement PV sched yield hypercall · 71506297
      Wanpeng Li 提交于
      The target vCPUs are in runnable state after vcpu_kick and suitable
      as a yield target. This patch implements the sched yield hypercall.
      
      17% performance increasement of ebizzy benchmark can be observed in an
      over-subscribe environment. (w/ kvm-pv-tlb disabled, testing TLB flush
      call-function IPI-many since call-function is not easy to be trigged
      by userspace workload).
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      71506297
    • V
      x86/kvm/nVMX: fix VMCLEAR when Enlightened VMCS is in use · 11e34914
      Vitaly Kuznetsov 提交于
      When Enlightened VMCS is in use, it is valid to do VMCLEAR and,
      according to TLFS, this should "transition an enlightened VMCS from the
      active to the non-active state". It is, however, wrong to assume that
      it is only valid to do VMCLEAR for the eVMCS which is currently active
      on the vCPU performing VMCLEAR.
      
      Currently, the logic in handle_vmclear() is broken: in case, there is no
      active eVMCS on the vCPU doing VMCLEAR we treat the argument as a 'normal'
      VMCS and kvm_vcpu_write_guest() to the 'launch_state' field irreversibly
      corrupts the memory area.
      
      So, in case the VMCLEAR argument is not the current active eVMCS on the
      vCPU, how can we know if the area it is pointing to is a normal or an
      enlightened VMCS?
      Thanks to the bug in Hyper-V (see commit 72aeb60c ("KVM: nVMX: Verify
      eVMCS revision id match supported eVMCS version on eVMCS VMPTRLD")) we can
      not, the revision can't be used to distinguish between them. So let's
      assume it is always enlightened in case enlightened vmentry is enabled in
      the assist page. Also, check if vmx->nested.enlightened_vmcs_enabled to
      minimize the impact for 'unenlightened' workloads.
      
      Fixes: b8bbab92 ("KVM: nVMX: implement enlightened VMPTRLD and VMCLEAR")
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      11e34914
    • V
      x86/KVM/nVMX: don't use clean fields data on enlightened VMLAUNCH · a21a39c2
      Vitaly Kuznetsov 提交于
      Apparently, Windows doesn't maintain clean fields data after it does
      VMCLEAR for an enlightened VMCS so we can only use it on VMRESUME.
      The issue went unnoticed because currently we do nested_release_evmcs()
      in handle_vmclear() and the consecutive enlightened VMPTRLD invalidates
      clean fields when a new eVMCS is mapped but we're going to change the
      logic.
      Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a21a39c2
  8. 02 7月, 2019 3 次提交
  9. 22 6月, 2019 1 次提交
  10. 21 6月, 2019 1 次提交
    • P
      KVM: nVMX: reorganize initial steps of vmx_set_nested_state · 9fd58877
      Paolo Bonzini 提交于
      Commit 332d0797 ("KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS
      state before setting new state", 2019-05-02) broke evmcs_test because the
      eVMCS setup must be performed even if there is no VMXON region defined,
      as long as the eVMCS bit is set in the assist page.
      
      While the simplest possible fix would be to add a check on
      kvm_state->flags & KVM_STATE_NESTED_EVMCS in the initial "if" that
      covers kvm_state->hdr.vmx.vmxon_pa == -1ull, that is quite ugly.
      
      Instead, this patch moves checks earlier in the function and
      conditionalizes them on kvm_state->hdr.vmx.vmxon_pa, so that
      vmx_set_nested_state always goes through vmx_leave_nested
      and nested_enable_evmcs.
      
      Fixes: 332d0797 ("KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state")
      Cc: Aaron Lewis <aaronlewis@google.com>
      Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9fd58877
  11. 20 6月, 2019 2 次提交