1. 19 April 2021, 1 commit
  2. 09 March 2021, 3 commits
  3. 09 February 2021, 3 commits
  4. 12 January 2021, 2 commits
  5. 27 November 2020, 1 commit
    • KVM: x86: Fix split-irqchip vs interrupt injection window request · 71cc849b
      Paolo Bonzini committed
      kvm_cpu_accept_dm_intr and kvm_vcpu_ready_for_interrupt_injection are
      a hodge-podge of conditions, hacked together to get something that
      more or less works.  But what is actually needed is much simpler;
      in both cases the fundamental question is, do we have a place to stash
      an interrupt if userspace does KVM_INTERRUPT?
      
      In userspace irqchip mode, that is !vcpu->arch.interrupt.injected.
      Currently kvm_event_needs_reinjection(vcpu) covers it, but it is
      unnecessarily restrictive.
      
      In split irqchip mode it's a bit more complicated: we need to check
      kvm_apic_accept_pic_intr(vcpu) (the IRQ window exit is basically an INTACK
      cycle and thus requires ExtINTs not to be masked) as well as
      !pending_userspace_extint(vcpu).  However, there is no need to
      check kvm_event_needs_reinjection(vcpu), since split irqchip keeps
      pending ExtINT state separate from event injection state, and checking
      kvm_cpu_has_interrupt(vcpu) is wrong too since ExtINT has higher
      priority than APIC interrupts.  In fact the latter fixes a bug:
      when userspace requests an IRQ window vmexit, an interrupt in the
      local APIC can cause kvm_cpu_has_interrupt() to be true and thus
      kvm_vcpu_ready_for_interrupt_injection() to return false.  When this
      happens, vcpu_run does not exit to userspace but the interrupt window
      vmexits keep occurring.  The VM loops without any hope of making progress.
      
      Once we try to fix these with something like
      
           return kvm_arch_interrupt_allowed(vcpu) &&
      -        !kvm_cpu_has_interrupt(vcpu) &&
      -        !kvm_event_needs_reinjection(vcpu) &&
      -        kvm_cpu_accept_dm_intr(vcpu);
      +        (!lapic_in_kernel(vcpu)
      +         ? !vcpu->arch.interrupt.injected
      +         : (kvm_apic_accept_pic_intr(vcpu)
       +            && !pending_userspace_extint(vcpu)));
      
      we realize two things.  First, thanks to the previous patch the complex
      conditional can reuse !kvm_cpu_has_extint(vcpu).  Second, the interrupt
      window request in vcpu_enter_guest()
      
              bool req_int_win =
                      dm_request_for_irq_injection(vcpu) &&
                      kvm_cpu_accept_dm_intr(vcpu);
      
      should be kept in sync with kvm_vcpu_ready_for_interrupt_injection():
      it is unnecessary to ask the processor for an interrupt window
      if we would not be able to return to userspace.  Therefore,
      kvm_cpu_accept_dm_intr(vcpu) is basically !kvm_cpu_has_extint(vcpu)
      ANDed with the existing check for masked ExtINT.  It all makes sense:
      
      - we can accept an interrupt from userspace if there is a place
        to stash it (and, for irqchip split, ExtINTs are not masked).
        Interrupts from userspace _can_ be accepted even if right now
        EFLAGS.IF=0.
      
      - in order to tell userspace we will inject its interrupt ("IRQ
        window open" i.e. kvm_vcpu_ready_for_interrupt_injection), both
        KVM and the vCPU need to be ready to accept the interrupt.
      
      ... and this is what the patch implements.
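
      As a minimal sketch of the resulting logic (paraphrasing the patch,
      using the helper names discussed above):

           /* "IRQ window open": both KVM and the vCPU can take the IRQ. */
           static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
           {
                   return kvm_arch_interrupt_allowed(vcpu) &&
                           kvm_cpu_accept_dm_intr(vcpu);
           }

           /* "Place to stash it": no pending ExtINT, and for split
            * irqchip the ExtINT must not be masked. */
           int kvm_cpu_accept_dm_intr(struct kvm_vcpu *vcpu)
           {
                   if (kvm_cpu_has_extint(vcpu))
                           return false;
                   return !lapic_in_kernel(vcpu) ||
                          kvm_apic_accept_pic_intr(vcpu);
           }
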
      Reported-by: David Woodhouse <dwmw@amazon.co.uk>
      Analyzed-by: David Woodhouse <dwmw@amazon.co.uk>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
      Tested-by: David Woodhouse <dwmw@amazon.co.uk>
  6. 13 November 2020, 1 commit
    • KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch · 0107973a
      Babu Moger committed
      SEV guests fail to boot on a system that supports the PCID feature.
      
      While emulating the RSM instruction, KVM reads the guest CR3
      and calls kvm_set_cr3(). If the vCPU is in long mode,
      kvm_set_cr3() does a sanity check on the CR3 value, validating
      that no reserved bits are set. The reserved bit range is
      63:cpuid_maxphyaddr(). When AMD memory encryption is enabled,
      the memory encryption bit is set in the CR3 value and may fall
      within this reserved bit range, causing the emulation to fail.
      
      Introduce a new field cr3_lm_rsvd_bits in kvm_vcpu_arch which will
      cache the reserved bits in the CR3 value. This will be initialized
      to rsvd_bits(cpuid_maxphyaddr(vcpu), 63).
      
      If the architecture has any special bits (like the AMD SEV encryption
      bit) that need to be masked from the reserved bits, they should be
      cleared in the vendor-specific kvm_x86_ops.vcpu_after_set_cpuid
      handler, as sketched below.
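
      A minimal sketch of the two pieces (the CPUID 0x8000001F lookup for
      the SEV C-bit position is this sketch's assumption, not spelled out
      in the message):

           /* Common code: bits 63:MAXPHYADDR are reserved by default. */
           vcpu->arch.cr3_lm_rsvd_bits = rsvd_bits(cpuid_maxphyaddr(vcpu), 63);

           /* SVM's vcpu_after_set_cpuid hook: the memory encryption bit,
            * CPUID 0x8000001F:EBX[5:0], must not be treated as reserved. */
           best = kvm_find_cpuid_entry(vcpu, 0x8000001F, 0);
           if (best)
                   vcpu->arch.cr3_lm_rsvd_bits &= ~(1UL << (best->ebx & 0x3f));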
      
      Fixes: a780a3ea ("KVM: X86: Fix reserved bits check for MOV to CR3")
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Message-Id: <160521947657.32054.3264016688005356563.stgit@bmoger-ubuntu>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 08 November 2020, 5 commits
  8. 31 October 2020, 1 commit
  9. 22 October 2020, 9 commits
  10. 28 September 2020, 13 commits
    • KVM: x86: do not attempt TSC synchronization on guest writes · 0c899c25
      Paolo Bonzini committed
      KVM special-cases writes to MSR_IA32_TSC so that all CPUs have
      the same base for the TSC.  This logic is complicated, and we
      do not want it to have any effect once the VM is started.
      
      In particular, if any guest started to synchronize its TSCs
      with writes to MSR_IA32_TSC rather than MSR_IA32_TSC_ADJUST,
      the additional effects of the kvm_write_tsc code would be
      uncharted territory.
      
      Therefore, this patch makes writes to MSR_IA32_TSC behave
      essentially the same as writes to MSR_IA32_TSC_ADJUST when
      they come from the guest.  A new selftest (which passes
      both before and after the patch) checks the current semantics
      of writes to MSR_IA32_TSC and MSR_IA32_TSC_ADJUST originating
      from both the host and the guest.
      
      Upcoming work to remove the special side effects
      of host-initiated writes to MSR_IA32_TSC and MSR_IA32_TSC_ADJUST
      will be able to build onto this test, adjusting the host side
      to use the new APIs and achieve the same effect.
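
      A sketch of the resulting MSR_IA32_TSC write path (the helper names
      are assumptions based on KVM's existing TSC code):

           case MSR_IA32_TSC:
                   if (msr_info->host_initiated) {
                           /* Only the host keeps the synchronization logic. */
                           kvm_synchronize_tsc(vcpu, data);
                   } else {
                           /* Guest writes behave like MSR_IA32_TSC_ADJUST. */
                           u64 adj = kvm_compute_tsc_offset(vcpu, data) -
                                     vcpu->arch.l1_tsc_offset;
                           adjust_tsc_offset_guest(vcpu, adj);
                           vcpu->arch.ia32_tsc_adjust_msr += adj;
                   }
                   break;
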
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: rename KVM_REQ_GET_VMCS12_PAGES · 729c15c2
      Paolo Bonzini committed
      We are going to use it for SVM too, so use a more generic name.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Introduce MSR filtering · 1a155254
      Alexander Graf committed
      It's not desirable to have all MSRs always handled by KVM kernel space.
      Some MSRs would be useful to handle in user space, either to emulate
      behavior (like uCode updates) or to decide whether they are valid based
      on the CPU model.
      
      To allow user space to specify which MSRs it wants to see handled by KVM,
      this patch introduces a new ioctl to push filter rules with bitmaps into
      KVM. Based on these bitmaps, KVM can then decide whether to reject MSR access.
      With the addition of KVM_CAP_X86_USER_SPACE_MSR it can also deflect the
      denied MSR events to user space to operate on.
      
      If no filter is populated, MSR handling stays identical to before.
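
      A hedged userspace sketch of installing a filter (structure and flag
      names as they eventually appeared in <linux/kvm.h>; treat the exact
      layout as an assumption):

           /* Deny guest writes to MSRs 0xc0000000..0xc00003ff unless the
            * corresponding bitmap bit is 1; unlisted MSRs stay allowed. */
           __u8 bitmap[0x400 / 8] = { 0 };          /* 1 bit per MSR */
           struct kvm_msr_filter filter = {
                   .flags = KVM_MSR_FILTER_DEFAULT_ALLOW,
                   .ranges = { {
                           .flags  = KVM_MSR_FILTER_WRITE,
                           .base   = 0xc0000000,
                           .nmsrs  = 0x400,
                           .bitmap = bitmap,
                   } },
           };
           ioctl(vm_fd, KVM_X86_SET_MSR_FILTER, &filter);
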
      Signed-off-by: Alexander Graf <graf@amazon.com>
      
      Message-Id: <20200925143422.21718-8-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add infrastructure for MSR filtering · 51de8151
      Alexander Graf committed
      In the following commits we will add pieces of MSR filtering.
      To ensure that code compiles even with the feature half-merged, let's add
      a few stubs and struct definitions before the real patches start.
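
      For illustration only, a stub in the spirit of this commit might look
      like (hypothetical shape, not the literal merged code):

           /* No filter installed yet: permit every MSR access. */
           bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
           {
                   return true;
           }
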
      Signed-off-by: Alexander Graf <graf@amazon.com>
      
      Message-Id: <20200925143422.21718-4-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Allow deflecting unknown MSR accesses to user space · 1ae09954
      Alexander Graf committed
      MSRs are weird. Some of them are normal control registers, such as EFER.
      Some, however, really are model specific, not very interesting to
      virtualization workloads, and not performance critical. Others again
      are really just windows into package configuration.
      
      Out of these MSRs, only the first category is necessary to implement in
      kernel space. Rarely accessed MSRs, MSRs that should be fine-tuned
      against certain CPU models, and MSRs that contain information on the
      package level are much better suited for user space to process. However,
      over time we have accumulated a lot of MSRs that are not in the first
      category, but are still handled by in-kernel KVM code.
      
      This patch adds a generic interface to handle WRMSR and RDMSR from user
      space. With this, any future MSR that is part of the latter categories can
      be handled in user space.
      
      Furthermore, it allows us to replace the existing "ignore_msrs" logic with
      something that applies per-VM rather than on the full system. That way you
      can run productive VMs in parallel to experimental ones where you don't care
      about proper MSR handling.
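
      A hedged sketch of the userspace side (capability, flag, and exit
      names as merged upstream; the kvm_run layout here is from memory):

           /* Opt in: exit to userspace for MSRs unknown to KVM. */
           struct kvm_enable_cap cap = {
                   .cap  = KVM_CAP_X86_USER_SPACE_MSR,
                   .args = { KVM_MSR_EXIT_REASON_UNKNOWN },
           };
           ioctl(vm_fd, KVM_ENABLE_CAP, &cap);

           /* In the run loop, emulate the access or report an error;
            * emulate_rdmsr() is a hypothetical userspace helper. */
           if (run->exit_reason == KVM_EXIT_X86_RDMSR) {
                   run->msr.data  = emulate_rdmsr(run->msr.index);
                   run->msr.error = 0;     /* set to 1 to inject a #GP */
           }
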
      Signed-off-by: Alexander Graf <graf@amazon.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      
      Message-Id: <20200925143422.21718-3-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Return -ENOENT on unimplemented MSRs · 90218e43
      Alexander Graf committed
      When we find an MSR that we cannot handle, bubble that up as the MSR
      error return code. Follow-up patches will use it to expose to user
      space the fact that an MSR is not handled by KVM.
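
      Conceptually (a sketch of the default case, not the literal diff):

           default:
                   if (!ignore_msrs) {
                           vcpu_debug_ratelimited(vcpu,
                                                  "unhandled rdmsr: 0x%x\n",
                                                  msr_info->index);
                           return -ENOENT;  /* was "return 1" */
                   }
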
      Suggested-by: Aaron Lewis <aaronlewis@google.com>
      Signed-off-by: Alexander Graf <graf@amazon.com>
      Message-Id: <20200925143422.21718-2-graf@amazon.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Rename "shared_msrs" to "user_return_msrs" · 7e34fbd0
      Sean Christopherson committed
      Rename the "shared_msrs" mechanism, which is used to defer restoring
      MSRs that are only consumed when running in userspace, to a more banal
      but less likely to be confusing "user_return_msrs".
      
      The "shared" nomenclature is confusing as it's not obvious who is
      sharing what, e.g. reasonable interpretations are that the guest value
      is shared by vCPUs in a VM, or that the MSR value is shared/common to
      guest and host, both of which are wrong.
      
      "shared" is also misleading as the MSR value (in hardware) is not
      guaranteed to be shared/reused between VMs (if that's indeed the correct
      interpretation of the name), as the ability to share values between VMs
      is simply a side effect (albeit a very nice side effect) of deferring
      restoration of the host value until returning from userspace.
      
      "user_return" avoids the above confusion by describing the mechanism
      itself instead of its effects.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923180409.32255-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add RIP to the kvm_entry, i.e. VM-Enter, tracepoint · b2d52255
      Sean Christopherson committed
      Add RIP to the kvm_entry tracepoint to help debug if the kvm_exit
      tracepoint is disabled or if VM-Enter fails, in which case the kvm_exit
      tracepoint won't be hit.
      
      Read RIP from within the tracepoint itself to avoid a potential VMREAD
      and retpoline if the guest's RIP isn't available.
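
      A sketch of the reworked tracepoint (abridged):

           TRACE_EVENT(kvm_entry,
                   TP_PROTO(struct kvm_vcpu *vcpu),
                   TP_ARGS(vcpu),

                   TP_STRUCT__entry(
                           __field(unsigned int,  vcpu_id)
                           __field(unsigned long, rip)
                   ),

                   TP_fast_assign(
                           __entry->vcpu_id = vcpu->vcpu_id;
                           /* Read RIP only when the tracepoint actually
                            * fires, avoiding a VMREAD on the hot path. */
                           __entry->rip = kvm_rip_read(vcpu);
                   ),

                   TP_printk("vcpu %u, rip 0x%lx",
                             __entry->vcpu_id, __entry->rip)
           );
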
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923201349.16097-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Add kvm_x86_ops hook to short circuit emulation · 09e3e2a1
      Sean Christopherson committed
      Replace the existing kvm_x86_ops.need_emulation_on_page_fault() with a
      more generic is_emulatable(), and unconditionally call the new function
      in x86_emulate_instruction().
      
      KVM will use the generic hook to support multiple security related
      technologies that prevent emulation in one way or another.  Similar to
      the existing AMD #NPF case where emulation of the current instruction is
      not possible due to lack of information, AMD's SEV-ES and Intel's SGX
      and TDX will introduce scenarios where emulation is impossible due to
      the guest's register state being inaccessible.  And again similar to the
      existing #NPF case, emulation can be initiated by kvm_mmu_page_fault(),
      i.e. outside of the control of vendor-specific code.
      
      While the cause and architecturally visible behavior of the various
      cases are different, e.g. SGX will inject a #UD, AMD #NPF is a clean
      resume or complete shutdown, and SEV-ES and TDX "return" an error, the
      impact on the common emulation code is identical: KVM must stop
      emulation immediately and resume the guest.
      
      Query is_emulatable() in handle_ud() as well so that the
      force_emulation_prefix code doesn't incorrectly modify RIP before
      calling emulate_instruction() in the absurdly unlikely scenario that
      KVM encounters forced emulation in conjunction with "do not emulate".
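
      A sketch of the call site (using the hook name from this message;
      the merged code may spell it differently):

           int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
                                       int emulation_type, void *insn,
                                       int insn_len)
           {
                   /* Vendor code (SEV-ES, SGX, TDX, ...) can veto emulation
                    * outright; resume the guest instead of failing later. */
                   if (!kvm_x86_ops.is_emulatable(vcpu, insn, insn_len))
                           return 1;
                   /* ... normal emulation continues ... */
           }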
      
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200915232702.15945-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: fix MSR_IA32_TSC read for nested migration · cc5b54dd
      Maxim Levitsky committed
      MSR reads/writes should always access the L1 state, since the (nested)
      hypervisor should intercept all the MSRs it wants to adjust, and those
      that it doesn't should be read by the guest as if the host had read them.
      
      However, IA32_TSC is an exception. Even when not intercepted, the guest
      still reads the value + TSC offset.
      The write, however, does not take any TSC offset into account.
      
      This is documented in Intel's SDM and seems also to happen on AMD as well.
      
      This creates a problem when userspace wants to read the IA32_TSC value
      and then write it back (e.g. for migration): it reads the L2 value, but
      the write is interpreted as an L1 value.
      
      To fix this, make userspace-initiated reads of IA32_TSC return the L1
      value as well.
      
      Huge thanks to Dave Gilbert for helping me understand this very confusing
      semantic of MSR writes.
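
      A sketch of the fixed read path (field names follow KVM's TSC offset
      bookkeeping):

           case MSR_IA32_TSC: {
                   /* Host-initiated reads see L1's TSC; guest reads get
                    * the currently active (possibly L2) offset. */
                   u64 offset = msr_info->host_initiated ?
                                vcpu->arch.l1_tsc_offset :
                                vcpu->arch.tsc_offset;
                   msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + offset;
                   break;
           }
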
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20200921103805.9102-2-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Move handling of INVPCID types to x86 · 9715092f
      Babu Moger committed
      INVPCID instruction handling is mostly the same across both VMX and
      SVM, so move the code to the common x86.c, as sketched below.
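
      The shared handler might be shaped like this (a sketch; the INVPCID
      type constants come from the x86 headers):

           int kvm_handle_invpcid(struct kvm_vcpu *vcpu, unsigned long type,
                                  gva_t gva)
           {
                   switch (type) {
                   case INVPCID_TYPE_INDIV_ADDR:
                   case INVPCID_TYPE_SINGLE_CTXT:
                           /* ... validate the descriptor, flush one PCID ... */
                           break;
                   case INVPCID_TYPE_ALL_INCL_GLOBAL:
                   case INVPCID_TYPE_ALL_NON_GLOBAL:
                           kvm_mmu_unload(vcpu);
                           break;
                   default:
                           kvm_inject_gp(vcpu, 0);
                           return 1;
                   }
                   return kvm_skip_emulated_instruction(vcpu);
           }
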
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <159985255212.11252.10322694343971983487.stgit@bmoger-ubuntu>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Rename and move the function vmx_handle_memory_failure to x86.c · 3f3393b3
      Babu Moger committed
      Handling of kvm_read/write_guest_virt*() errors can be moved to common
      code. The same code can be used by both VMX and SVM.
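
      An abridged sketch of the moved function (error constants as in the
      existing emulator code):

           int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,
                                         struct x86_exception *e)
           {
                   if (r == X86EMUL_PROPAGATE_FAULT) {
                           /* Guest-visible fault: inject it and re-enter. */
                           kvm_inject_emulated_page_fault(vcpu, e);
                           return 1;
                   }
                   /* Anything else is a host-side problem: surface an
                    * internal error to userspace. */
                   vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
                   vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
                   return 0;
           }
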
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <159985254493.11252.6603092560732507607.stgit@bmoger-ubuntu>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: Narrow down the kick target vCPU · 68ca7663
      Wanpeng Li committed
      The kick after setting KVM_REQ_PENDING_TIMER handles the case where the
      timer fires on a different pCPU from the one the vCPU is running on.
      This kick costs about 1000 clock cycles, and we don't need it when
      injecting an already-expired timer or when using the VMX preemption
      timer, because kvm_lapic_expired_hv_timer() is called from the target
      vCPU; see the sketch below.
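
      A sketch of the narrowed kick (the from_timer_fn parameter name is
      an assumption):

           static void apic_timer_expired(struct kvm_lapic *apic,
                                          bool from_timer_fn)
           {
                   struct kvm_vcpu *vcpu = apic->vcpu;

                   /* ... record the expired timer ... */
                   atomic_inc(&apic->lapic_timer.pending);
                   kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
                   /* Only pay ~1000 cycles for the kick when called from
                    * the timer callback, i.e. possibly on another pCPU. */
                   if (from_timer_fn)
                           kvm_vcpu_kick(vcpu);
           }
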
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1599731444-3525-6-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  11. 25 September 2020, 1 commit