1. 15 December 2020, 9 commits
  2. 15 November 2020, 6 commits
    • kvm: x86: Sink cpuid update into vendor-specific set_cr4 functions · 2259c17f
      Jim Mattson authored
      On emulated VM-entry and VM-exit, update the CPUID bits that reflect
      CR4.OSXSAVE and CR4.PKE.
      
      This fixes a bug where the CPUID bits could continue to reflect L2 CR4
      values after emulated VM-exit to L1. It also fixes a related bug where
      the CPUID bits could continue to reflect L1 CR4 values after emulated
      VM-entry to L2. The latter bug is mainly relevant to SVM, wherein
      CPUID is not a required intercept. However, it could also be relevant
      to VMX, because the code to conditionally update these CPUID bits
      assumes that the guest CPUID and the guest CR4 are always in sync.
      
      Fixes: 8eb3f87d ("KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exit")
      Fixes: 2acf923e ("KVM: VMX: Enable XSAVE/XRSTOR for guest")
      Fixes: b9baba86 ("KVM, pkeys: expose CPUID/CR4 to guest")
      Reported-by: Abhiroop Dabral <adabral@paloaltonetworks.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Cc: Haozhong Zhang <haozhong.zhang@intel.com>
      Cc: Dexuan Cui <dexuan.cui@intel.com>
      Cc: Huaitong Han <huaitong.han@intel.com>
      Message-Id: <20201029170648.483210-1-jmattson@google.com>
      2259c17f
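      For context, a minimal sketch of the invariant this commit restores:
      CPUID.01H:ECX.OSXSAVE mirrors CR4.OSXSAVE and CPUID.07H:ECX.OSPKE
      mirrors CR4.PKE, so every path that loads a new CR4, including
      emulated VM-entry and VM-exit, must refresh those bits. The helper
      below is illustrative only, not KVM's actual code; the bit positions
      are the architectural ones.

          #include <stdint.h>

          #define CR4_OSXSAVE (1ull << 18)
          #define CR4_PKE     (1ull << 22)

          #define CPUID_1_ECX_OSXSAVE (1u << 27)
          #define CPUID_7_ECX_OSPKE   (1u << 4)

          struct cpuid_leaf { uint32_t eax, ebx, ecx, edx; };

          /* Refresh the CPUID bits that mirror CR4 state; must run on
           * every CR4 load, including emulated VM-entry and VM-exit. */
          static void sync_cpuid_with_cr4(uint64_t cr4,
                                          struct cpuid_leaf *leaf1,
                                          struct cpuid_leaf *leaf7)
          {
              if (cr4 & CR4_OSXSAVE)
                  leaf1->ecx |= CPUID_1_ECX_OSXSAVE;
              else
                  leaf1->ecx &= ~CPUID_1_ECX_OSXSAVE;

              if (cr4 & CR4_PKE)
                  leaf7->ecx |= CPUID_7_ECX_OSPKE;
              else
                  leaf7->ecx &= ~CPUID_7_ECX_OSPKE;
          }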
    • KVM: X86: Implement ring-based dirty memory tracking · fb04a1ed
      Peter Xu authored
      This patch is heavily based on previous work from Lei Cao
      <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
      
      KVM currently uses large bitmaps to track dirty memory.  These bitmaps
      are copied to userspace when userspace queries KVM for its dirty page
      information.  The use of bitmaps is mostly sufficient for live
      migration, as large parts of memory are dirtied from one log-dirty
      pass to another.  However, in a checkpointing system, the number of
      dirty pages is small and in fact it is often bounded---the VM is
      paused when it has dirtied a pre-defined number of pages. Traversing a
      large, sparsely populated bitmap to find set bits is time-consuming,
      as is copying the bitmap to user-space.
      
      A similar issue exists for live migration when the guest memory is
      huge but the page-dirtying workload is small.  In that case each
      dirty sync still needs to pull the whole dirty bitmap to userspace
      and analyse every bit even though it is mostly zeros.
      
      The preferred data structure for the above scenarios is a dense list of
      guest frame numbers (GFN).  This patch series stores the dirty list in
      kernel memory that can be memory mapped into userspace to allow speedy
      harvesting.
      
      This patch enables the dirty ring for x86 only.  However, it should
      be easy to extend to other architectures as well.
      
      [1] https://patchwork.kernel.org/patch/10471409/
      Signed-off-by: Lei Cao <lei.cao@stratus.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012222.5767-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fb04a1ed
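      A hedged sketch of the userspace side of this interface, using the
      dirty-ring UAPI merged with this series (struct kvm_dirty_gfn and
      the KVM_DIRTY_GFN_F_* flags in <linux/kvm.h>, available from kernel
      headers that include this feature).  The ring pointer and the
      'fetch' read index are bookkeeping assumed to be maintained by the
      VMM, and memory-ordering details (load-acquire of the flags) are
      omitted for brevity.

          #include <stdint.h>
          #include <stdio.h>
          #include <linux/kvm.h>   /* struct kvm_dirty_gfn, KVM_DIRTY_GFN_F_* */

          /* Walk one vCPU's mmap()ed dirty ring from the VMM's read index
           * until the first non-dirty slot, marking each harvested entry
           * for reset.  Returns the advanced read index. */
          static uint32_t harvest_ring(struct kvm_dirty_gfn *ring,
                                       uint32_t ring_size, uint32_t fetch)
          {
              for (;;) {
                  struct kvm_dirty_gfn *e = &ring[fetch % ring_size];

                  if (!(e->flags & KVM_DIRTY_GFN_F_DIRTY))
                      break;                          /* ring drained */
                  printf("slot %u, gfn offset %llu is dirty\n",
                         e->slot, (unsigned long long)e->offset);
                  e->flags |= KVM_DIRTY_GFN_F_RESET;  /* hand back to KVM */
                  fetch++;
              }
              /* After all rings are drained, recycle the entries with:
               *   ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);            */
              return fetch;
          }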
    • KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] · ff5a983c
      Peter Xu authored
      Originally, we have three code paths that can dirty a page without
      vcpu context for X86:
      
        - init_rmode_identity_map
        - init_rmode_tss
        - kvmgt_rw_gpa
      
      init_rmode_identity_map and init_rmode_tss will be set up on the
      destination VM no matter what (and the guest cannot even see them),
      so it does not make sense to track them at all.
      
      To do this, allow __x86_set_memory_region() to return the userspace
      address it just allocated to the caller.  Then in both of the
      functions we directly write to the userspace address instead of
      calling kvm_write_*() APIs.
      
      Another trivial change is that we don't need to explicitly clear the
      identity page table root in init_rmode_identity_map(), because we
      will overwrite the whole page with 4M huge page entries anyway.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012044.5151-4-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ff5a983c
    • KVM: x86: hyper-v: allow KVM_GET_SUPPORTED_HV_CPUID as a system ioctl · c21d54f0
      Vitaly Kuznetsov authored
      KVM_GET_SUPPORTED_HV_CPUID is a vCPU ioctl, but its output is now
      independent of the vCPU, and in some cases VMMs may want to use it as
      a system ioctl instead. In particular, QEMU does CPU feature expansion
      before any vCPU gets created, so KVM_GET_SUPPORTED_HV_CPUID can't be
      used at that point.
      
      Convert KVM_GET_SUPPORTED_HV_CPUID to a 'dual' system/vCPU ioctl with the
      same meaning.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200929150944.1235688-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c21d54f0
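      A minimal usage sketch, assuming a kernel that advertises the new
      system-ioctl capability (KVM_CAP_SYS_HYPERV_CPUID): the same
      KVM_GET_SUPPORTED_HV_CPUID call can now be issued on the /dev/kvm fd
      before any VM or vCPU exists.  Error handling is trimmed, and the
      entry count is an assumed-sufficient example value.

          #include <fcntl.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>

          int main(void)
          {
              int kvm = open("/dev/kvm", O_RDWR);
              int nent = 10;   /* assumed enough for the Hyper-V leaves */
              struct kvm_cpuid2 *cpuid =
                  calloc(1, sizeof(*cpuid) +
                            nent * sizeof(struct kvm_cpuid_entry2));

              cpuid->nent = nent;
              /* System-level ioctl: no VM or vCPU fd needed any more. */
              if (ioctl(kvm, KVM_GET_SUPPORTED_HV_CPUID, cpuid) == 0)
                  for (unsigned i = 0; i < cpuid->nent; i++)
                      printf("leaf 0x%08x: eax=0x%08x\n",
                             cpuid->entries[i].function,
                             cpuid->entries[i].eax);
              free(cpuid);
              return 0;
          }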
    • KVM: x86: Return bool instead of int for CR4 and SREGS validity checks · ee69c92b
      Sean Christopherson authored
      Rework the common CR4 and SREGS checks to return a bool instead of an
      int, i.e. true/false instead of 0/-EINVAL, and add "is" to the name to
      clarify the polarity of the return value (which is effectively inverted
      by this change).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20201007014417.29276-6-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ee69c92b
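      In miniature, the refactoring pattern this commit applies.  The
      names below are illustrative stand-ins for kvm_valid_cr4() and
      kvm_is_valid_cr4(), and the reserved-bit constant is made up for
      the example:

          #include <errno.h>
          #include <stdbool.h>
          #include <stdint.h>

          #define CR4_RESERVED_BITS (1ull << 23)   /* made-up reserved bit */

          /* Before: 0 on success, -EINVAL on failure; the polarity of
           * "if (valid_cr4(...))" is easy to misread at call sites. */
          static int valid_cr4(uint64_t cr4)
          {
              return (cr4 & CR4_RESERVED_BITS) ? -EINVAL : 0;
          }

          /* After: a bool predicate whose name states its polarity. */
          static bool is_valid_cr4(uint64_t cr4)
          {
              return !(cr4 & CR4_RESERVED_BITS);
          }

          static int set_cr4(uint64_t cr4)
          {
              if (!is_valid_cr4(cr4))   /* was: if (valid_cr4(cr4)) */
                  return -EINVAL;
              /* ... commit the new CR4 value ... */
              return 0;
          }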
    • KVM: x86: Move vendor CR4 validity check to dedicated kvm_x86_ops hook · c2fe3cd4
      Sean Christopherson authored
      Split out VMX's checks on CR4.VMXE to a dedicated hook, .is_valid_cr4(),
      and invoke the new hook from kvm_valid_cr4().  This fixes an issue where
      KVM_SET_SREGS would return success while failing to actually set CR4.
      
      Fixing the issue by explicitly checking kvm_x86_ops.set_cr4()'s return
      in __set_sregs() is not a viable option as KVM has already stuffed a
      variety of vCPU state.
      
      Note, kvm_valid_cr4() and is_valid_cr4() have different return types and
      inverted semantics.  This will be remedied in a future patch.
      
      Fixes: 5e1746d6 ("KVM: nVMX: Allow setting the VMXE bit in CR4")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20201007014417.29276-5-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c2fe3cd4
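      A sketch of the resulting split, with illustrative types: the vendor
      hook that only checks CR4 is separate from the hook that commits it,
      so common code can reject a bad value before any vCPU state is
      touched.  Only the hook names follow the commit text; the rest is a
      stand-in.

          #include <stdbool.h>
          #include <stdint.h>

          struct vcpu { uint64_t cr4; };

          struct x86_ops {
              bool (*is_valid_cr4)(struct vcpu *v, uint64_t cr4); /* check */
              void (*set_cr4)(struct vcpu *v, uint64_t cr4);      /* commit */
          };

          /* Vendor side: e.g. reject CR4.VMXE (bit 13) when nested VMX
           * is unsupported. */
          static bool vmx_is_valid_cr4(struct vcpu *v, uint64_t cr4)
          {
              (void)v;
              return !(cr4 & (1ull << 13));
          }

          static void vmx_set_cr4(struct vcpu *v, uint64_t cr4)
          {
              v->cr4 = cr4;   /* value is known-good by now */
          }

          static const struct x86_ops ops = { vmx_is_valid_cr4, vmx_set_cr4 };

          /* Common side: validate before any vCPU state is modified. */
          static int common_set_cr4(struct vcpu *v, uint64_t cr4)
          {
              if (!ops.is_valid_cr4(v, cr4))
                  return -1;
              ops.set_cr4(v, cr4);
              return 0;
          }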
  3. 13 November 2020, 1 commit
    • KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch · 0107973a
      Babu Moger authored
      SEV guests fail to boot on a system that supports the PCID feature.
      
      While emulating the RSM instruction, KVM reads the guest CR3
      and calls kvm_set_cr3(). If the vCPU is in long mode,
      kvm_set_cr3() does a sanity check on the CR3 value. In this case,
      it validates whether the value has any reserved bits set. The
      reserved bit range is 63:cpuid_maxphyaddr(). When AMD memory
      encryption is enabled, the memory encryption bit is set in the CR3
      value. The memory encryption bit may fall within the KVM reserved
      bit range, causing a KVM emulation failure.
      
      Introduce a new field cr3_lm_rsvd_bits in kvm_vcpu_arch which will
      cache the reserved bits in the CR3 value. This will be initialized
      to rsvd_bits(cpuid_maxphyaddr(vcpu), 63).
      
      If the architecture has any special bits (like the AMD SEV encryption
      bit) that need to be masked from the reserved bits, they should be
      cleared in the vendor-specific kvm_x86_ops.vcpu_after_set_cpuid
      handler.
      
      Fixes: a780a3ea ("KVM: X86: Fix reserved bits check for MOV to CR3")
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Message-Id: <160521947657.32054.3264016688005356563.stgit@bmoger-ubuntu>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0107973a
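      Illustrative arithmetic for the new cache, using the same formula as
      KVM's rsvd_bits() helper.  The MAXPHYADDR value and the C-bit
      position below are example numbers for the sketch, not taken from
      real hardware enumeration.

          #include <stdint.h>
          #include <stdio.h>

          /* Mask with bits e:s set, mirroring KVM's rsvd_bits() helper. */
          static uint64_t rsvd_bits(int s, int e)
          {
              return ((2ull << (e - s)) - 1) << s;
          }

          int main(void)
          {
              int maxphyaddr = 43;               /* example guest MAXPHYADDR */
              uint64_t sev_c_bit = 1ull << 47;   /* example SEV C-bit position */

              /* Default: everything above MAXPHYADDR is reserved in CR3. */
              uint64_t cr3_lm_rsvd_bits = rsvd_bits(maxphyaddr, 63);

              /* Vendor (SVM) hook: clear the C-bit so encrypted roots pass. */
              cr3_lm_rsvd_bits &= ~sev_c_bit;

              uint64_t cr3 = sev_c_bit | 0x10000; /* encrypted page-table root */
              printf("CR3 %#llx %s\n", (unsigned long long)cr3,
                     (cr3 & cr3_lm_rsvd_bits) ? "rejected" : "accepted");
              return 0;
          }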
  4. 08 November 2020, 5 commits
  5. 31 October 2020, 1 commit
  6. 22 October 2020, 9 commits
  7. 28 September 2020, 9 commits
    • KVM: x86: do not attempt TSC synchronization on guest writes · 0c899c25
      Paolo Bonzini authored
      KVM special-cases writes to MSR_IA32_TSC so that all CPUs have
      the same base for the TSC.  This logic is complicated, and we
      do not want it to have any effect once the VM is started.
      
      In particular, if any guest started to synchronize its TSCs
      with writes to MSR_IA32_TSC rather than MSR_IA32_TSC_ADJUST,
      the additional effect of kvm_write_tsc code would be uncharted
      territory.
      
      Therefore, this patch makes writes to MSR_IA32_TSC behave
      essentially the same as writes to MSR_IA32_TSC_ADJUST when
      they come from the guest.  A new selftest (which passes
      both before and after the patch) checks the current semantics
      of writes to MSR_IA32_TSC and MSR_IA32_TSC_ADJUST originating
      from both the host and the guest.
      
      Upcoming work to remove the special side effects
      of host-initiated writes to MSR_IA32_TSC and MSR_IA32_TSC_ADJUST
      will be able to build onto this test, adjusting the host side
      to use the new APIs and achieve the same effect.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0c899c25
    • KVM: x86: rename KVM_REQ_GET_VMCS12_PAGES · 729c15c2
      Paolo Bonzini authored
      We are going to use it for SVM too, so use a more generic name.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      729c15c2
    • KVM: x86: Introduce MSR filtering · 1a155254
      Alexander Graf authored
      It's not desirable to have all MSRs always handled by KVM kernel space. Some
      MSRs would be useful to handle in user space to either emulate behavior (like
      uCode updates) or differentiate whether they are valid based on the CPU model.
      
      To allow user space to specify which MSRs it wants to see handled by KVM,
      this patch introduces a new ioctl to push filter rules with bitmaps into
      KVM. Based on these bitmaps, KVM can then decide whether to reject MSR access.
      With the addition of KVM_CAP_X86_USER_SPACE_MSR it can also deflect the
      denied MSR events to user space to operate on.
      
      If no filter is populated, MSR handling stays identical to before.
      Signed-off-by: Alexander Graf <graf@amazon.com>
      
      Message-Id: <20200925143422.21718-8-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1a155254
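      A hedged usage sketch of the new vm ioctl: with default-allow
      semantics, a single one-MSR range whose bitmap bit is left clear
      denies guest WRMSR to IA32_BIOS_UPDT_TRIG (MSR 0x79), the uCode
      update trigger mentioned above.  'vm_fd' is assumed to be an
      existing VM file descriptor, and error handling is trimmed.

          #include <sys/ioctl.h>
          #include <linux/kvm.h>

          #define MSR_IA32_BIOS_UPDT_TRIG 0x79

          static int deny_ucode_writes(int vm_fd)
          {
              /* One-bit-per-MSR bitmap covering a single MSR; with
               * KVM_MSR_FILTER_DEFAULT_ALLOW, a clear bit means denied. */
              static unsigned char bitmap[1];  /* bit for MSR 0x79 left 0 */
              struct kvm_msr_filter filter = {
                  .flags = KVM_MSR_FILTER_DEFAULT_ALLOW,
                  .ranges[0] = {
                      .flags  = KVM_MSR_FILTER_WRITE,
                      .base   = MSR_IA32_BIOS_UPDT_TRIG,
                      .nmsrs  = 1,
                      .bitmap = bitmap,   /* clear bit => WRMSR denied */
                  },
              };
              return ioctl(vm_fd, KVM_X86_SET_MSR_FILTER, &filter);
          }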
    • KVM: x86: Add infrastructure for MSR filtering · 51de8151
      Alexander Graf authored
      In the following commits we will add pieces of MSR filtering.
      To ensure that code compiles even with the feature half-merged, let's add
      a few stubs and struct definitions before the real patches start.
      Signed-off-by: Alexander Graf <graf@amazon.com>
      
      Message-Id: <20200925143422.21718-4-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      51de8151
    • KVM: x86: Allow deflecting unknown MSR accesses to user space · 1ae09954
      Alexander Graf authored
      MSRs are weird. Some of them are normal control registers, such as EFER.
      Some however are registers that really are model specific, not very
      interesting to virtualization workloads, and not performance critical.
      Others again are really just windows into package configuration.
      
      Out of these MSRs, only the first category is necessary to implement in
      kernel space. Rarely accessed MSRs, MSRs that should be fine tuned against
      certain CPU models and MSRs that contain information on the package level
      are much better suited for user space to process. However, over time we have
      accumulated a lot of MSRs that are not the first category, but still handled
      by in-kernel KVM code.
      
      This patch adds a generic interface to handle WRMSR and RDMSR from user
      space. With this, any future MSR that is part of the latter categories can
      be handled in user space.
      
      Furthermore, it allows us to replace the existing "ignore_msrs" logic with
      something that applies per-VM rather than on the full system. That way you
      can run productive VMs in parallel to experimental ones where you don't care
      about proper MSR handling.
      Signed-off-by: Alexander Graf <graf@amazon.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      
      Message-Id: <20200925143422.21718-3-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1ae09954
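      The userspace half of that interface, sketched: after enabling
      KVM_CAP_X86_USER_SPACE_MSR, deflected MSR accesses surface as
      KVM_EXIT_X86_RDMSR/KVM_EXIT_X86_WRMSR exits, and userspace either
      completes or fails them.  The MSR index tested below is an arbitrary
      example, not a real register, and the handler assumes 'run' is the
      vCPU's mmap()ed struct kvm_run.

          #include <linux/kvm.h>

          /* Complete or fail deflected MSR accesses in the run loop. */
          static void handle_msr_exit(struct kvm_run *run)
          {
              switch (run->exit_reason) {
              case KVM_EXIT_X86_RDMSR:
                  if (run->msr.index == 0x40000200) { /* example MSR */
                      run->msr.data = 0;              /* reads as zero */
                      run->msr.error = 0;
                  } else {
                      run->msr.error = 1;  /* KVM injects #GP in guest */
                  }
                  break;
              case KVM_EXIT_X86_WRMSR:
                  run->msr.error = 0;      /* accept and ignore write */
                  break;
              }
          }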
    • KVM: x86: Return -ENOENT on unimplemented MSRs · 90218e43
      Alexander Graf authored
      When we find an MSR that we cannot handle, bubble up that error code as
      the MSR error return code. Follow-up patches will use that to expose the fact
      that an MSR is not handled by KVM to user space.
      Suggested-by: Aaron Lewis <aaronlewis@google.com>
      Signed-off-by: Alexander Graf <graf@amazon.com>
      Message-Id: <20200925143422.21718-2-graf@amazon.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      90218e43
    • KVM: x86: Rename "shared_msrs" to "user_return_msrs" · 7e34fbd0
      Sean Christopherson authored
      Rename the "shared_msrs" mechanism, which is used to defer restoring
      MSRs that are only consumed when running in userspace, to a more banal
      but less likely to be confusing "user_return_msrs".
      
      The "shared" nomenclature is confusing as it's not obvious who is
      sharing what, e.g. reasonable interpretations are that the guest value
      is shared by vCPUs in a VM, or that the MSR value is shared/common to
      guest and host, both of which are wrong.
      
      "shared" is also misleading as the MSR value (in hardware) is not
      guaranteed to be shared/reused between VMs (if that's indeed the correct
      interpretation of the name), as the ability to share values between VMs
      is simply a side effect (albeit a very nice side effect) of deferring
      restoration of the host value until returning from userspace.
      
      "user_return" avoids the above confusion by describing the mechanism
      itself instead of its effects.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923180409.32255-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7e34fbd0
    • KVM: x86: Add RIP to the kvm_entry, i.e. VM-Enter, tracepoint · b2d52255
      Sean Christopherson authored
      Add RIP to the kvm_entry tracepoint to help debug if the kvm_exit
      tracepoint is disabled or if VM-Enter fails, in which case the kvm_exit
      tracepoint won't be hit.
      
      Read RIP from within the tracepoint itself to avoid a potential VMREAD
      and retpoline if the guest's RIP isn't available.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923201349.16097-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b2d52255
    • KVM: x86: Add kvm_x86_ops hook to short circuit emulation · 09e3e2a1
      Sean Christopherson authored
      Replace the existing kvm_x86_ops.need_emulation_on_page_fault() with a
      more generic is_emulatable(), and unconditionally call the new function
      in x86_emulate_instruction().
      
      KVM will use the generic hook to support multiple security related
      technologies that prevent emulation in one way or another.  Similar to
      the existing AMD #NPF case where emulation of the current instruction is
      not possible due to lack of information, AMD's SEV-ES and Intel's SGX
      and TDX will introduce scenarios where emulation is impossible due to
      the guest's register state being inaccessible.  And again similar to the
      existing #NPF case, emulation can be initiated by kvm_mmu_page_fault(),
      i.e. outside of the control of vendor-specific code.
      
      While the cause and architecturally visible behavior of the various
      cases are different, e.g. SGX will inject a #UD, AMD #NPF is a clean
      resume or complete shutdown, and SEV-ES and TDX "return" an error, the
      impact on the common emulation code is identical: KVM must stop
      emulation immediately and resume the guest.
      
      Query is_emulatable() in handle_ud() as well so that the
      force_emulation_prefix code doesn't incorrectly modify RIP before
      calling emulate_instruction() in the absurdly unlikely scenario that
      KVM encounters forced emulation in conjunction with "do not emulate".
      
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200915232702.15945-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      09e3e2a1
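      The shape of the short circuit, sketched with illustrative types;
      only the hook name follows the commit text, and the return values
      are stand-ins for KVM's actual emulation-result convention.

          #include <stdbool.h>

          struct vcpu { int dummy; };

          struct x86_ops {
              /* False when emulation is impossible, e.g. guest register
               * state is inaccessible (SEV-ES, TDX) or the instruction
               * bytes are unavailable (AMD #NPF). */
              bool (*is_emulatable)(struct vcpu *v, void *insn, int insn_len);
          };

          static bool vendor_is_emulatable(struct vcpu *v, void *insn,
                                           int insn_len)
          {
              (void)v; (void)insn; (void)insn_len;
              return true;   /* this vendor places no restriction */
          }

          static const struct x86_ops ops = { vendor_is_emulatable };

          static int x86_emulate_instruction(struct vcpu *v, void *insn,
                                             int len)
          {
              if (!ops.is_emulatable(v, insn, len))
                  return 1;  /* stop immediately, resume the guest */
              /* ... normal emulation path ... */
              return 0;
          }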