1. 12 September 2019, 1 commit
  2. 11 September 2019, 5 commits
    • KVM: x86: always stop emulation on page fault · 8530a79c
      Jan Dakinevich authored
      inject_emulated_exception() returns true if and only if a nested page
      fault happens.  However, a page fault can also come from a guest page
      table walk, nested or not.  In both cases we should stop the attempt to
      read the instruction at RIP and let the guest step over its own page
      fault handler.
      
      This is also visible when an emulated instruction causes a #GP fault
      and the VMware backdoor is enabled.  To handle the VMware backdoor,
      KVM intercepts #GP faults; with only the next patch applied,
      x86_emulate_instruction() injects a #GP but returns EMULATE_FAIL
      instead of EMULATE_DONE.   EMULATE_FAIL causes handle_exception_nmi()
      (or gp_interception() for SVM) to re-inject the original #GP because it
      thinks emulation failed due to a non-VMware opcode.  This patch prevents
      the issue as x86_emulate_instruction() will return EMULATE_DONE after
      injecting the #GP.
      
      Fixes: 6ea6e843 ("KVM: x86: inject exceptions produced by x86_decode_insn")
      Cc: stable@vger.kernel.org
      Cc: Denis Lunev <den@virtuozzo.com>
      Cc: Roman Kagan <rkagan@virtuozzo.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Signed-off-by: Jan Dakinevich <jan.dakinevich@virtuozzo.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: add tracepoint for failed nested VM-Enter · 5497b955
      Sean Christopherson authored
      Debugging a failed VM-Enter is often like searching for a needle in a
      haystack, e.g. there are over 80 consistency checks that funnel into
      the "invalid control field" error code.  One way to expedite debug is
      to run the buggy code as an L1 guest under KVM (and pray that the
      failing check is detected by KVM).  However, extracting useful debug
      information out of L0 KVM requires attaching a debugger to KVM and/or
      modifying the source, e.g. to log which check is failing.
      
      Make life a little less painful for VMM developers and add a tracepoint
      for failed VM-Enter consistency checks.  Ideally the tracepoint would
      capture both what check failed and precisely why it failed, but logging
      why a check failed is difficult to do in a generic tracepoint without
      resorting to invasive techniques, e.g. generating a custom string on
      failure.  That being said, for the vast majority of VM-Enter failures
      the most difficult step is figuring out exactly what to look at, e.g.
      figuring out which bit was incorrectly set in a control field is usually
      not too painful once the guilty field has been identified.
      
      To reach a happy medium between precision and ease of use, simply log
      the code that detected a failed check, using a macro to execute the
      check and log the trace event on failure.  This approach enables tracing
      arbitrary code, e.g. it's not limited to function calls or specific
      formats of checks, and the changes to the existing code are minimally
      invasive.  A macro with a two-character name is desirable as usage of
      the macro doesn't result in overly long lines or confusing alignment,
      while still retaining some amount of readability.  I.e. a one-character
      name is a little too terse, and a three-character name results in the
      contents being passed to the macro aligning with an indented line when
      the macro is used in an if-statement, e.g.:
      
              if (VCC(nested_vmx_check_long_line_one(...) &&
                      nested_vmx_check_long_line_two(...)))
                      return -EINVAL;
      
      And that is the story of how the CC(), a.k.a. Consistency Check, macro
      got its name.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
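
      To illustrate the pattern described above, here is a minimal, self-contained
      C sketch of such a macro; printf() stands in for the real tracepoint and the
      checks themselves are made up:

        #include <stdio.h>

        /* Evaluate the check, log the failing expression and its location, and
         * return the result so the macro composes inside if-statements
         * (GCC statement expression, as commonly used in kernel code). */
        #define CC(check)                                                    \
        ({                                                                   \
                int __failed = !!(check);                                    \
                if (__failed)                                                \
                        printf("consistency check failed: %s:%d %s\n",       \
                               __FILE__, __LINE__, #check);                  \
                __failed;                                                    \
        })

        static int check_vmentry_controls(unsigned int ctl)
        {
                /* Each wrapped condition is true when the check FAILS. */
                if (CC(ctl & 0x1) || CC(ctl > 0xffff))
                        return -1;
                return 0;
        }

        int main(void)
        {
                return check_vmentry_controls(0x10001) ? 1 : 0;
        }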
    • KVM: x86: Add kvm_emulate_{rd,wr}msr() to consolidate VMX/SVM code · 1edce0a9
      Sean Christopherson authored
      Move RDMSR and WRMSR emulation into common x86 code to consolidate
      nearly identical SVM and VMX code.
      
      Note, consolidating RDMSR introduces an extra indirect call, i.e.
      retpoline, due to reaching {svm,vmx}_get_msr() via kvm_x86_ops, but a
      guest kernel likely has bigger problems if increasing the latency of
      RDMSR VM-Exits by ~70 cycles has a measurable impact on overall VM
      performance.  E.g. the only recurring RDMSR VM-Exits (after booting) on
      my system running Linux 5.2 in the guest are for MSR_IA32_TSC_ADJUST via
      arch_cpu_idle_enter().
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
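
      A self-contained sketch of the consolidation idea; the names (x86_ops,
      emulate_rdmsr(), the vendor stubs) are illustrative rather than the actual
      kernel symbols, and real MSR handling is elided:

        #include <stdint.h>
        #include <stdio.h>

        /* Per-vendor hook table; reaching get_msr through it is the extra
         * indirect call (retpoline) mentioned above. */
        struct x86_ops {
                int (*get_msr)(uint32_t index, uint64_t *data);
        };

        static int vmx_get_msr(uint32_t index, uint64_t *data)
        {
                (void)index; *data = 0; return 0;
        }

        static int svm_get_msr(uint32_t index, uint64_t *data)
        {
                (void)index; *data = 0; return 0;
        }

        /* Common code both exit handlers call, instead of each carrying a
         * nearly identical copy. */
        static int emulate_rdmsr(const struct x86_ops *ops, uint32_t ecx,
                                 uint32_t *eax, uint32_t *edx)
        {
                uint64_t data;

                if (ops->get_msr(ecx, &data))
                        return -1;      /* the real code injects #GP here */

                *eax = (uint32_t)data;
                *edx = (uint32_t)(data >> 32);
                return 0;
        }

        int main(void)
        {
                struct x86_ops vmx = { .get_msr = vmx_get_msr };
                uint32_t eax, edx;

                printf("rdmsr ok: %d\n", emulate_rdmsr(&vmx, 0x10, &eax, &edx) == 0);
                (void)svm_get_msr;
                return 0;
        }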
    • KVM: x86: Refactor up kvm_{g,s}et_msr() to simplify callers · f20935d8
      Sean Christopherson authored
      Refactor the top-level MSR accessors to take/return the index and value
      directly instead of requiring the caller to dump them into a msr_data
      struct.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
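
      The shape of the refactor, as a small self-contained sketch (the struct and
      helpers here are simplified stand-ins, not the kernel's definitions):

        #include <stdint.h>

        struct msr_data {
                uint32_t index;
                uint64_t data;
                int host_initiated;
        };

        /* Inner accessor still takes the full struct... */
        static int __set_msr(struct msr_data *msr)
        {
                (void)msr;              /* vendor hook elided */
                return 0;
        }

        /* ...but callers now pass the index and value directly instead of
         * filling in a msr_data struct themselves. */
        static int kvm_set_msr(uint32_t index, uint64_t data)
        {
                struct msr_data msr = {
                        .index = index,
                        .data = data,
                        .host_initiated = 0,
                };

                return __set_msr(&msr);
        }

        int main(void)
        {
                return kvm_set_msr(0x10, 0);    /* 0x10 = IA32_TIME_STAMP_COUNTER */
        }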
    • KVM: X86: Tune PLE Window tracepoint · 4f75bcc3
      Peter Xu authored
      The PLE window tracepoint triggers even if the window is not changed,
      and the wording can be a bit confusing too.  One example line:
      
        kvm_ple_window: vcpu 0: ple_window 4096 (shrink 4096)
      
      It easily leads people to think the window is now 4096 and was just
      shrunk, but in fact the value did not change (it was already 4096).
      
      Let's only emit this message if the value really changed, and make the
      message even simpler, like:
      
        kvm_ple_window: vcpu 4 old 4096 new 8192 (growed)
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
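
      A self-contained sketch of the "only trace on change" behavior; printf()
      stands in for the tracepoint and the output format merely mirrors the
      example above:

        #include <stdio.h>

        static void trace_ple_window_update(int vcpu_id, unsigned int old,
                                            unsigned int new)
        {
                /* Skip the event entirely when the window did not change. */
                if (old == new)
                        return;

                printf("kvm_ple_window: vcpu %d old %u new %u (%s)\n",
                       vcpu_id, old, new, new > old ? "grow" : "shrink");
        }

        int main(void)
        {
                trace_ple_window_update(0, 4096, 4096);   /* silent: unchanged */
                trace_ple_window_update(4, 4096, 8192);   /* traced: grew */
                return 0;
        }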
  3. 10 September 2019, 1 commit
    • KVM: x86: Manually calculate reserved bits when loading PDPTRS · 16cfacc8
      Sean Christopherson authored
      Manually generate the PDPTR reserved bit mask when explicitly loading
      PDPTRs.  The reserved bits that are being tracked by the MMU reflect the
      current paging mode, which is unlikely to be PAE paging in the vast
      majority of flows that use load_pdptrs(), e.g. CR0 and CR4 emulation,
      __set_sregs(), etc...  This can cause KVM to incorrectly signal a bad
      PDPTR, or more likely, miss a reserved bit check and subsequently fail
      a VM-Enter due to a bad VMCS.GUEST_PDPTR.
      
      Add a one off helper to generate the reserved bits instead of sharing
      code across the MMU's calculations and the PDPTR emulation.  The PDPTR
      reserved bits are basically set in stone, and pushing a helper into
      the MMU's calculation adds unnecessary complexity without improving
      readability.
      
      Opportunistically fix/update the comment for load_pdptrs().
      
      Note, the buggy commit also introduced a deliberate functional change,
      "Also remove bit 5-6 from rsvd_bits_mask per latest SDM.", which was
      effectively (and correctly) reverted by commit cd9ae5fe ("KVM: x86:
      Fix page-tables reserved bits").  A bit of SDM archaeology shows that
      the SDM from late 2008 had a bug (likely a copy+paste error) where it
      listed bits 6:5 as AVL and A for PDPTEs used for 4k entries but reserved
      for 2mb entries.  I.e. the SDM contradicted itself, and bits 6:5 are and
      always have been reserved.
      
      Fixes: 20c466b5 ("KVM: Use rsvd_bits_mask in load_pdptrs()")
      Cc: stable@vger.kernel.org
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Reported-by: Doug Reiland <doug.reiland@intel.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
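
      A standalone sketch of generating such a mask for PAE PDPTEs; the bit
      positions follow the SDM description referenced above, and rsvd_bits()
      below is a local helper rather than the kernel's:

        #include <stdint.h>
        #include <stdio.h>

        /* Mask covering bits hi..lo, inclusive. */
        static uint64_t rsvd_bits(int lo, int hi)
        {
                return ((2ULL << (hi - lo)) - 1) << lo;
        }

        /* PAE PDPTEs reserve bits 2:1, bits 8:5, and all bits at or above
         * MAXPHYADDR; the MMU's per-mode masks can't be reused because the
         * vCPU is usually not in PAE paging when load_pdptrs() runs. */
        static uint64_t pdptr_rsvd_bits(int maxphyaddr)
        {
                return rsvd_bits(maxphyaddr, 63) | rsvd_bits(5, 8) |
                       rsvd_bits(1, 2);
        }

        int main(void)
        {
                uint64_t pdpte = 0x12345001ULL;   /* present, clean low bits */
                int bad = (pdpte & 1) && (pdpte & pdptr_rsvd_bits(46));

                printf("reserved-bit violation: %d\n", bad);
                return 0;
        }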
  4. 22 August 2019, 7 commits
  5. 05 August 2019, 1 commit
    • KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5
      Wanpeng Li authored
      After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a
      five-year-old bug was exposed. Running the ebizzy benchmark in three 80-vCPU
      VMs on one 80-pCPU Skylake server produced a lot of rcu_sched stall warnings
      in the VMs after stress testing:
      
       INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
       Call Trace:
         flush_tlb_mm_range+0x68/0x140
         tlb_flush_mmu.part.75+0x37/0xe0
         tlb_finish_mmu+0x55/0x60
         zap_page_range+0x142/0x190
         SyS_madvise+0x3cd/0x9c0
         system_call_fastpath+0x1c/0x21
      
      swait_active() remains true until finish_swait() is called in
      kvm_vcpu_block(), so voluntarily preempted vCPUs are taken into account
      by the kvm_vcpu_on_spin() loop. This greatly increases the probability
      that kvm_arch_vcpu_runnable(vcpu) is checked and returns true; when APICv
      is enabled, the yield-candidate vCPU's VMCS RVI field then leaks (via
      vmx_sync_pir_to_irr()) into the current VMCS of the vCPU that is spinning
      on a taken lock.
      
      This patch fixes it by conservatively checking only a subset of events.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Marc Zyngier <Marc.Zyngier@arm.com>
      Cc: stable@vger.kernel.org
      Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 24 July 2019, 1 commit
    • KVM: X86: Boost queue head vCPU to mitigate lock waiter preemption · 266e85a5
      Wanpeng Li authored
      Commit 11752adb (locking/pvqspinlock: Implement hybrid PV queued/unfair locks)
      introduces hybrid PV queued/unfair locks
       - queued mode (no starvation)
       - unfair mode (good performance on not heavily contended lock)
      The lock waiter goes into the unfair mode especially in VMs with
      over-committed vCPUs, since increasing over-commitment increases the
      likelihood that the queue head vCPU has been preempted and is not
      actively spinning.
      
      However, rescheduling the queue head vCPU in time so it can acquire the
      lock still gives better performance than depending solely on lock stealing
      in the over-subscribed scenario.
      
      Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
      ebizzy -M
                   vanilla     boosting    improved
       1VM          23520        25040         6%
       2VM           8000        13600        70%
       3VM           3100         5400        74%
      
      On unlock, the lock holder vCPU yields to the queue head vCPU, boosting a
      queue head that was either involuntarily preempted or voluntarily halted
      after failing to acquire the lock within a short spin in the guest.
      
      Cc: Waiman Long <longman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 22 July 2019, 3 commits
    • KVM: X86: Dynamically allocate user_fpu · d9a710e5
      Wanpeng Li authored
      After reverting commit 240c35a3 (kvm: x86: Use task structs fpu field
      for user), struct kvm_vcpu is 19456 bytes on my server;
      PAGE_ALLOC_COSTLY_ORDER (3) is the order at which allocations are deemed
      costly to service. In a serverless scenario, one host can service hundreds
      or thousands of firecracker/kata-container instances; however, a new
      instance will fail to launch once memory is too fragmented to allocate the
      kvm_vcpu struct on the host. This has been observed in some cloud
      providers' production environments.
      
      This patch dynamically allocates user_fpu; kvm_vcpu is now 15168 bytes on
      my Skylake server.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Fix fpu state crash in kvm guest · e7517324
      Wanpeng Li authored
      The idea before commit 240c35a3 (which has just been reverted)
      was that we have the following FPU states:
      
                     userspace (QEMU)             guest
      ---------------------------------------------------------------------------
                     processor                    vcpu->arch.guest_fpu
      >>> KVM_RUN: kvm_load_guest_fpu
                     vcpu->arch.user_fpu          processor
      >>> preempt out
                     vcpu->arch.user_fpu          current->thread.fpu
      >>> preempt in
                     vcpu->arch.user_fpu          processor
      >>> back to userspace
      >>> kvm_put_guest_fpu
                     processor                    vcpu->arch.guest_fpu
      ---------------------------------------------------------------------------
      
      With the new lazy model we want to get the state back to the processor
      from current->thread.fpu when the task is scheduled in.
      Reported-by: Thomas Lambertz <mail@thomaslambertz.de>
      Reported-by: anthony <antdev66@gmail.com>
      Tested-by: anthony <antdev66@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Lambertz <mail@thomaslambertz.de>
      Cc: anthony <antdev66@gmail.com>
      Cc: stable@vger.kernel.org
      Fixes: 5f409e20 (x86/fpu: Defer FPU state load until return to userspace)
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      [Add a comment in front of the warning. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • Revert "kvm: x86: Use task structs fpu field for user" · ec269475
      Paolo Bonzini authored
      This reverts commit 240c35a3
      ("kvm: x86: Use task structs fpu field for user", 2018-11-06).
      The commit is broken and causes QEMU's FPU state to be destroyed
      when KVM_RUN is preempted.
      
      Fixes: 240c35a3 ("kvm: x86: Use task structs fpu field for user")
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 20 July 2019, 1 commit
    • KVM: LAPIC: Inject timer interrupt via posted interrupt · 0c5f81da
      Wanpeng Li authored
      Dedicated instances are currently disturbed by unnecessary jitter due
      to the emulated lapic timers firing on the same pCPUs where the
      vCPUs reside.  Unlike ARM, Intel has no hardware virtual timer for the
      guest, so both programming the timer in the guest and the emulated timer
      firing incur vmexits.  This patch tries to avoid the vmexit when the
      emulated timer fires, at least in the dedicated-instance scenario when
      nohz_full is enabled.

      In that case, the emulated timers can be offloaded to the nearest busy
      housekeeping cpus, since APICv has been available for several years in
      server processors.  The guest timer interrupt can then be injected via
      posted interrupts, which are delivered by the housekeeping cpu once the
      emulated timer fires.

      The host should be tuned so that vCPUs are placed on isolated physical
      processors, with several surplus pCPUs for busy housekeeping.  If
      mwait/hlt/pause vmexits are disabled to keep the vCPUs in non-root mode,
      a ~3% redis performance benefit can be observed on a Skylake server, and
      the number of external interrupt vmexits drops substantially.  Without
      the patch:
      
                  VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time   Avg time
      EXTERNAL_INTERRUPT    42916    49.43%   39.30%   0.47us   106.09us   0.71us ( +-   1.09% )
      
      While with patch:
      
                  VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time         Avg time
      EXTERNAL_INTERRUPT    6871     9.29%     2.96%   0.44us    57.88us   0.72us ( +-   4.02% )
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 18 July 2019, 1 commit
    • KVM: LAPIC: Make lapic timer unpinned · 4d151bf3
      Wanpeng Li authored
      Commit 61abdbe0 ("kvm: x86: make lapic hrtimer pinned") pinned the
      lapic timer to avoid waiting until the next kvm exit for the guest to
      see KVM_REQ_PENDING_TIMER set. Another solution is to kick the vCPU
      after setting the KVM_REQ_PENDING_TIMER bit; making the lapic timer
      unpinned will be used in follow-up patches.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
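
      A rough sketch of the alternative described above (pseudo-kernel C,
      simplified; not the exact code from the follow-up patches):

        /* Arm the emulated lapic timer without pinning it to the vCPU's CPU. */
        hrtimer_start(&apic->lapic_timer.timer, expire, HRTIMER_MODE_ABS);

        /* In the expiry path: flag the pending timer, then kick the vCPU so it
         * notices the request even though the timer may have fired on a
         * different pCPU. */
        kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
        kvm_vcpu_kick(vcpu);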
  10. 15 July 2019, 1 commit
  11. 11 July 2019, 2 commits
    • KVM: x86: Unconditionally enable irqs in guest context · d7a08882
      Sean Christopherson authored
      On VMX, KVM currently does not re-enable irqs until after it has exited
      the guest context.  As a result, a tick that fires in the window between
      VM-Exit and guest_exit_irqoff() will be accounted as system time.  While
      said window is relatively small, it's large enough to be problematic in
      some configurations, e.g. if VM-Exits are consistently occurring a hair
      earlier than the tick irq.
      
      Intentionally toggle irqs back off so that guest_exit_irqoff() can be
      used in lieu of guest_exit() in order to avoid the save/restore of flags
      in guest_exit().  On my Haswell system, "nop; cli; sti" is ~6 cycles,
      versus ~28 cycles for "pushf; pop <reg>; cli; push <reg>; popf".
      
      Fixes: f2485b3e ("KVM: x86: use guest_exit_irqoff")
      Reported-by: Wei Yang <w90p710@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
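
      The pattern being described, as a simplified pseudo-kernel sketch (the
      exact placement within vcpu_enter_guest() differs):

        /* Briefly enable irqs so a pending tick is taken while time is still
         * being accounted to the guest, then disable them again so the cheaper
         * guest_exit_irqoff() can replace guest_exit(). */
        local_irq_enable();
        ++vcpu->stat.exits;
        local_irq_disable();
        guest_exit_irqoff();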
    • KVM: x86: PMU Event Filter · 66bb8a06
      Eric Hankland authored
      Some events can provide a guest with information about other guests or the
      host (e.g. L3 cache stats); providing the capability to restrict access
      to a "safe" set of events would limit the potential for the PMU to be used
      in any side channel attacks. This change introduces a new VM ioctl that
      sets an event filter. If the guest attempts to program a counter for
      any blacklisted or non-whitelisted event, the kernel counter won't be
      created, so any RDPMC/RDMSR will show 0 instances of that event.
      Signed-off-by: Eric Hankland <ehankland@google.com>
      [Lots of changes. All remaining bugs are probably mine. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
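
      A self-contained sketch of the allow/deny check; the struct layout and
      names below are illustrative only and do not reflect the actual ioctl ABI:

        #include <stdint.h>
        #include <stdbool.h>
        #include <stdio.h>

        enum filter_action { FILTER_DENY = 0, FILTER_ALLOW = 1 };

        struct pmu_event_filter {
                enum filter_action action;   /* list is a whitelist or a blacklist */
                int nevents;
                uint64_t events[8];          /* raw event select + unit mask values */
        };

        /* Decide whether a guest-programmed event may back a real counter. */
        static bool event_is_allowed(const struct pmu_event_filter *f, uint64_t event)
        {
                bool listed = false;

                for (int i = 0; i < f->nevents; i++)
                        if (f->events[i] == event)
                                listed = true;

                return f->action == FILTER_ALLOW ? listed : !listed;
        }

        int main(void)
        {
                /* Whitelist only "instructions retired" (event select 0xC0). */
                struct pmu_event_filter f = { FILTER_ALLOW, 1, { 0xC0 } };

                printf("0xC0 allowed: %d\n", event_is_allowed(&f, 0xC0));
                printf("0x2E (LLC refs) allowed: %d\n", event_is_allowed(&f, 0x2E));
                return 0;
        }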
  12. 03 July 2019, 3 commits
    • clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic · dd2cb348
      Michael Kelley authored
      Continue consolidating Hyper-V clock and timer code into an ISA
      independent Hyper-V clocksource driver.
      
      Move the existing clocksource code under drivers/hv and arch/x86 to the new
      clocksource driver while separating out the ISA dependencies. Update
      Hyper-V initialization to call initialization and cleanup routines since
      the Hyper-V synthetic clock is not independently enumerated in ACPI.
      
      Update Hyper-V clocksource users in KVM and VDSO to get definitions from
      the new include file.
      
      No behavior is changed and no new functionality is added.
      Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Michael Kelley <mikelley@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: "bp@alien8.de" <bp@alien8.de>
      Cc: "will.deacon@arm.com" <will.deacon@arm.com>
      Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>
      Cc: "mark.rutland@arm.com" <mark.rutland@arm.com>
      Cc: "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>
      Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
      Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
      Cc: "olaf@aepfle.de" <olaf@aepfle.de>
      Cc: "apw@canonical.com" <apw@canonical.com>
      Cc: "jasowang@redhat.com" <jasowang@redhat.com>
      Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
      Cc: Sunil Muthuswamy <sunilmut@microsoft.com>
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: "sashal@kernel.org" <sashal@kernel.org>
      Cc: "vincenzo.frascino@arm.com" <vincenzo.frascino@arm.com>
      Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
      Cc: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>
      Cc: "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org>
      Cc: "arnd@arndb.de" <arnd@arndb.de>
      Cc: "linux@armlinux.org.uk" <linux@armlinux.org.uk>
      Cc: "ralf@linux-mips.org" <ralf@linux-mips.org>
      Cc: "paul.burton@mips.com" <paul.burton@mips.com>
      Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
      Cc: "salyzyn@android.com" <salyzyn@android.com>
      Cc: "pcc@google.com" <pcc@google.com>
      Cc: "shuah@kernel.org" <shuah@kernel.org>
      Cc: "0x7f454c46@gmail.com" <0x7f454c46@gmail.com>
      Cc: "linux@rasmusvillemoes.dk" <linux@rasmusvillemoes.dk>
      Cc: "huw@codeweavers.com" <huw@codeweavers.com>
      Cc: "sfr@canb.auug.org.au" <sfr@canb.auug.org.au>
      Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>
      Cc: "rkrcmar@redhat.com" <rkrcmar@redhat.com>
      Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
      Link: https://lkml.kernel.org/r/1561955054-1838-3-git-send-email-mikelley@microsoft.com
    • KVM: x86: degrade WARN to pr_warn_ratelimited · 3f16a5c3
      Paolo Bonzini authored
      This warning can be triggered easily by userspace, so it should certainly not
      cause a panic if panic_on_warn is set.
      
      Reported-by: syzbot+c03f30b4f4c46bdf8575@syzkaller.appspotmail.com
      Suggested-by: Alexander Potapenko <glider@google.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
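
      A minimal illustration of this kind of change (generic kernel pattern;
      user_reachable_condition is a placeholder, not the actual KVM call site):

        /* Before: a condition that userspace can trigger will panic the host
         * whenever panic_on_warn is set. */
        WARN_ON_ONCE(user_reachable_condition);

        /* After: the report still lands in the log, is rate-limited, and can
         * no longer take the machine down. */
        if (user_reachable_condition)
                pr_warn_ratelimited("kvm: unexpected state in ioctl path\n");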
    • KVM: X86: Implement PV sched yield hypercall · 71506297
      Wanpeng Li authored
      The target vCPUs are in a runnable state after vcpu_kick and are suitable
      as yield targets. This patch implements the sched yield hypercall.

      A 17% performance improvement in the ebizzy benchmark can be observed in
      an over-subscribed environment (with kvm-pv-tlb disabled, testing the
      TLB-flush call-function IPI-many path, since call-function is not easy to
      trigger from a userspace workload).
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
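
      A rough sketch of the host-side handling (pseudo-kernel C; the hypercall
      number is real, but kvm_vcpu_by_apicid() is a hypothetical stand-in for
      the actual lookup):

        case KVM_HC_SCHED_YIELD: {
                /* The guest's unlock path passes the APIC ID of the vCPU it
                 * just kicked; donate the yielding vCPU's timeslice to it. */
                struct kvm_vcpu *target = kvm_vcpu_by_apicid(vcpu->kvm, a0);

                if (target)
                        kvm_vcpu_yield_to(target);
                ret = 0;
                break;
        }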
  13. 02 July 2019, 1 commit
    • KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LIST · 95c5c7c7
      Paolo Bonzini authored
      This allows userspace to know which MSRs are supported by the hypervisor.
      Unfortunately userspace must resort to tricks for everything except
      MSR_IA32_VMX_VMFUNC (which was just added in the previous patch).
      One possibility is to use the feature control MSR, which is tied to nested
      VMX as well and is present on all KVM versions that support feature MSRs.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  14. 22 June 2019, 1 commit
  15. 19 June 2019, 1 commit
  16. 18 June 2019, 6 commits
  17. 14 June 2019, 1 commit
    • KVM: x86: clean up conditions for asynchronous page fault handling · 1dfdb45e
      Paolo Bonzini authored
      Even when asynchronous page fault is disabled, KVM does not want to pause
      the host if a guest triggers a page fault; instead it will put it into
      an artificial HLT state that allows running other host processes while
      allowing interrupt delivery into the guest.
      
      However, the way this feature is triggered is a bit confusing.
      First, it is not used for page faults while a nested guest is
      running: but this is not an issue since the artificial halt
      is completely invisible to the guest, either L1 or L2.  Second,
      it is used even if kvm_halt_in_guest() returns true; in this case,
      the guest probably should not pay the additional latency cost of the
      artificial halt, and thus we should handle the page fault in a
      completely synchronous way.
      
      By introducing a new function kvm_can_deliver_async_pf, this patch
      consolidates the code that chooses whether to deliver an async page fault
      (kvm_arch_async_page_not_present) and the code that chooses whether a
      page fault should be handled synchronously (kvm_can_do_async_pf).
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  18. 05 June 2019, 3 commits
    • kvm: Convert kvm_lock to a mutex · 0d9ce162
      Junaid Shahid authored
      It doesn't seem as if there is any particular need for kvm_lock to be a
      spinlock, so convert the lock to a mutex so that sleepable functions (in
      particular cond_resched()) can be called while holding it.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
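
      The shape of the conversion, as a before/after sketch (fragments only,
      shown for illustration):

        /* Before: a spinlock, so sleeping primitives such as cond_resched()
         * are forbidden while vm_list is being walked. */
        static DEFINE_SPINLOCK(kvm_lock);
        spin_lock(&kvm_lock);   /* ... walk vm_list ... */   spin_unlock(&kvm_lock);

        /* After: a mutex, so long walks over vm_list may reschedule. */
        static DEFINE_MUTEX(kvm_lock);
        mutex_lock(&kvm_lock);  /* ... cond_resched() as needed ... */  mutex_unlock(&kvm_lock);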
    • KVM: X86: Emulate MSR_IA32_MISC_ENABLE MWAIT bit · 511a8556
      Wanpeng Li authored
      MSR IA32_MISC_ENABLE bit 18, according to SDM:
      
      | When this bit is set to 0, the MONITOR feature flag is not set (CPUID.01H:ECX[bit 3] = 0).
      | This indicates that MONITOR/MWAIT are not supported.
      |
      | Software attempts to execute MONITOR/MWAIT will cause #UD when this bit is 0.
      |
      | When this bit is set to 1 (default), MONITOR/MWAIT are supported (CPUID.01H:ECX[bit 3] = 1).
      
      CPUID.01H:ECX[bit 3] ought to mirror the value of the MSR bit;
      CPUID.01H:ECX[bit 3] is a better guard than kvm_mwait_in_guest(), since
      kvm_mwait_in_guest() affects the behavior of MONITOR/MWAIT, not their
      guest visibility.
      
      This patch implements toggling of the CPUID bit based on guest writes
      to the MSR.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      [Fixes for backwards compatibility - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
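
      A self-contained sketch of the mirroring logic; the bit numbers follow the
      SDM text quoted above, while the struct and helper are illustrative, not
      KVM's:

        #include <stdint.h>
        #include <stdio.h>

        #define MISC_ENABLE_MWAIT_BIT   (1ULL << 18)   /* IA32_MISC_ENABLE[18] */
        #define CPUID_1_ECX_MONITOR     (1u << 3)      /* CPUID.01H:ECX[3] */

        struct guest_cpuid { uint32_t leaf1_ecx; };

        /* On a guest WRMSR to IA32_MISC_ENABLE, keep the MONITOR/MWAIT CPUID
         * feature bit in sync with bit 18 of the written MSR value. */
        static void misc_enable_write(struct guest_cpuid *c, uint64_t val)
        {
                if (val & MISC_ENABLE_MWAIT_BIT)
                        c->leaf1_ecx |= CPUID_1_ECX_MONITOR;
                else
                        c->leaf1_ecx &= ~CPUID_1_ECX_MONITOR;
        }

        int main(void)
        {
                struct guest_cpuid c = { .leaf1_ecx = CPUID_1_ECX_MONITOR };

                misc_enable_write(&c, 0);       /* guest clears bit 18 */
                printf("MONITOR visible: %d\n",
                       !!(c.leaf1_ecx & CPUID_1_ECX_MONITOR));
                return 0;
        }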
    • KVM: X86: Provide a capability to disable cstate msr read intercepts · b5170063
      Wanpeng Li authored
      Allow the guest to read CORE cstate when exposing host CPU power management
      capabilities to the guest. PKG cstate remains restricted, to avoid a guest
      getting whole-package information in a multi-tenant scenario.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>