1. 31 Oct, 2020 1 commit
  2. 22 Oct, 2020 9 commits
  3. 28 Sep, 2020 13 commits
    • P
      KVM: x86: do not attempt TSC synchronization on guest writes · 0c899c25
      Paolo Bonzini committed
      KVM special-cases writes to MSR_IA32_TSC so that all CPUs have
      the same base for the TSC.  This logic is complicated, and we
      do not want it to have any effect once the VM is started.
      
      In particular, if any guest started to synchronize its TSCs
      with writes to MSR_IA32_TSC rather than MSR_IA32_TSC_ADJUST,
      the additional effect of kvm_write_tsc code would be uncharted
      territory.
      
      Therefore, this patch makes writes to MSR_IA32_TSC behave
      essentially the same as writes to MSR_IA32_TSC_ADJUST when
      they come from the guest.  A new selftest (which passes
      both before and after the patch) checks the current semantics
      of writes to MSR_IA32_TSC and MSR_IA32_TSC_ADJUST originating
      from both the host and the guest.
      
      Upcoming work to remove the special side effects
      of host-initiated writes to MSR_IA32_TSC and MSR_IA32_TSC_ADJUST
      will be able to build onto this test, adjusting the host side
      to use the new APIs and achieve the same effect.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0c899c25
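The semantics described above, where a guest write to MSR_IA32_TSC moves the TSC offset and MSR_IA32_TSC_ADJUST by the same delta, can be sketched roughly as follows. This is a simplified model, not KVM's code: `struct vcpu_tsc` and all function names are hypothetical, and TSC scaling is ignored.

```c
#include <stdint.h>

/* Hypothetical, simplified per-vCPU TSC state (not KVM's real struct). */
struct vcpu_tsc {
    uint64_t tsc_offset;  /* guest TSC = host TSC + tsc_offset */
    uint64_t tsc_adjust;  /* current value of MSR_IA32_TSC_ADJUST */
};

/* Guest-visible TSC for a given host TSC reading (no TSC scaling). */
uint64_t guest_tsc(const struct vcpu_tsc *v, uint64_t host_tsc)
{
    return host_tsc + v->tsc_offset;
}

/* Guest write to MSR_IA32_TSC: shift the offset and TSC_ADJUST by one delta. */
void guest_write_tsc(struct vcpu_tsc *v, uint64_t host_tsc, uint64_t value)
{
    uint64_t delta = value - guest_tsc(v, host_tsc);

    v->tsc_offset += delta;
    v->tsc_adjust += delta;
}

/* Guest write to MSR_IA32_TSC_ADJUST: the TSC moves by the change in ADJUST. */
void guest_write_tsc_adjust(struct vcpu_tsc *v, uint64_t value)
{
    uint64_t delta = value - v->tsc_adjust;

    v->tsc_offset += delta;
    v->tsc_adjust = value;
}
```

In this model both write paths keep the offset and TSC_ADJUST moving together, and no cross-vCPU synchronization is attempted.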
    • P
      KVM: x86: rename KVM_REQ_GET_VMCS12_PAGES · 729c15c2
      Paolo Bonzini committed
      We are going to use it for SVM too, so use a more generic name.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      729c15c2
    • A
      KVM: x86: Introduce MSR filtering · 1a155254
      Alexander Graf committed
      It's not desirable to have all MSRs always handled by KVM kernel space. Some
      MSRs would be useful to handle in user space to either emulate behavior (like
      uCode updates) or differentiate whether they are valid based on the CPU model.
      
      To allow user space to specify which MSRs it wants to see handled by KVM,
      this patch introduces a new ioctl to push filter rules with bitmaps into
      KVM. Based on these bitmaps, KVM can then decide whether to reject MSR access.
      With the addition of KVM_CAP_X86_USER_SPACE_MSR it can also deflect the
      denied MSR events to user space to operate on.
      
      If no filter is populated, MSR handling stays identical to before.
      Signed-off-by: Alexander Graf <graf@amazon.com>

      Message-Id: <20200925143422.21718-8-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1a155254
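A bitmap-based filter decision of the kind described above might look like the sketch below. The `struct msr_range` layout, the `msr_allowed()` name, and the `default_allow` parameter are assumptions made for illustration; KVM's actual uAPI structures and ioctl are not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical filter range; KVM's real uAPI structs differ. */
struct msr_range {
    uint32_t base;          /* first MSR index covered by this range */
    uint32_t nmsrs;         /* number of consecutive MSRs covered */
    const uint8_t *bitmap;  /* bit i set => MSR (base + i) is allowed */
};

/*
 * Decide whether access to MSR `index` is allowed.
 * With no ranges populated, every access is allowed (behavior unchanged).
 * An MSR outside all ranges falls back to `default_allow`.
 */
bool msr_allowed(const struct msr_range *ranges, size_t nranges,
                 uint32_t index, bool default_allow)
{
    if (nranges == 0)
        return true;

    for (size_t i = 0; i < nranges; i++) {
        const struct msr_range *r = &ranges[i];

        if (index < r->base || index - r->base >= r->nmsrs)
            continue;

        uint32_t bit = index - r->base;
        return (r->bitmap[bit / 8] >> (bit % 8)) & 1;
    }
    return default_allow;
}
```

A denied access would then either be rejected or, with KVM_CAP_X86_USER_SPACE_MSR enabled, deflected to user space, as the commit message explains.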
    • A
      KVM: x86: Add infrastructure for MSR filtering · 51de8151
      Alexander Graf committed
      In the following commits we will add pieces of MSR filtering.
      To ensure that code compiles even with the feature half-merged, let's add
      a few stubs and struct definitions before the real patches start.
      Signed-off-by: Alexander Graf <graf@amazon.com>

      Message-Id: <20200925143422.21718-4-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      51de8151
    • A
      KVM: x86: Allow deflecting unknown MSR accesses to user space · 1ae09954
      Alexander Graf committed
      MSRs are weird. Some of them are normal control registers, such as EFER.
      Some however are registers that really are model specific, not very
      interesting to virtualization workloads, and not performance critical.
      Others again are really just windows into package configuration.
      
      Out of these MSRs, only the first category is necessary to implement in
      kernel space. Rarely accessed MSRs, MSRs that should be fine-tuned against
      certain CPU models and MSRs that contain information on the package level
      are much better suited for user space to process. However, over time we have
      accumulated a lot of MSRs that are not the first category, but still handled
      by in-kernel KVM code.
      
      This patch adds a generic interface to handle WRMSR and RDMSR from user
      space. With this, any future MSR that is part of the latter categories can
      be handled in user space.
      
      Furthermore, it allows us to replace the existing "ignore_msrs" logic with
      something that applies per-VM rather than on the full system. That way you
      can run production VMs in parallel to experimental ones where you don't care
      about proper MSR handling.
      Signed-off-by: Alexander Graf <graf@amazon.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>

      Message-Id: <20200925143422.21718-3-graf@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1ae09954
    • A
      KVM: x86: Return -ENOENT on unimplemented MSRs · 90218e43
      Alexander Graf committed
      When we find an MSR that we cannot handle, bubble up that error code as
      MSR error return code. Follow up patches will use that to expose the fact
      that an MSR is not handled by KVM to user space.
      Suggested-by: Aaron Lewis <aaronlewis@google.com>
      Signed-off-by: Alexander Graf <graf@amazon.com>
      Message-Id: <20200925143422.21718-2-graf@amazon.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      90218e43
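The flow these MSR commits describe, where an unimplemented MSR yields -ENOENT and that result can be deflected to user space, can be sketched as below. All names here (`kernel_read_msr`, `handle_rdmsr`, `enum msr_outcome`) are hypothetical, and the toy in-kernel handler knows only EFER; this is an illustration of the idea, not KVM's implementation.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MSR_EFER 0xC0000080u

/* Hypothetical outcomes of a guest RDMSR. */
enum msr_outcome {
    MSR_DONE,          /* handled in kernel space */
    MSR_EXIT_TO_USER,  /* deflected to user space */
    MSR_INJECT_GP      /* access fails, guest gets #GP */
};

/* Toy in-kernel handler: implements EFER only, -ENOENT for the rest. */
int kernel_read_msr(uint32_t index, uint64_t *data)
{
    if (index == MSR_EFER) {
        *data = 0;
        return 0;
    }
    return -ENOENT;
}

/* Bubble up the error code and deflect unknown MSRs if user space asked. */
enum msr_outcome handle_rdmsr(uint32_t index, uint64_t *data,
                              bool user_space_msr_enabled)
{
    int ret = kernel_read_msr(index, data);

    if (ret == 0)
        return MSR_DONE;
    if (ret == -ENOENT && user_space_msr_enabled)
        return MSR_EXIT_TO_USER;
    return MSR_INJECT_GP;
}
```

Because the decision depends on a per-VM setting rather than a module-wide switch, one VM can deflect unknown MSRs while another keeps strict handling.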
    • S
      KVM: x86: Rename "shared_msrs" to "user_return_msrs" · 7e34fbd0
      Sean Christopherson committed
      Rename the "shared_msrs" mechanism, which is used to defer restoring
      MSRs that are only consumed when running in userspace, to a more banal
      but less likely to be confusing "user_return_msrs".
      
      The "shared" nomenclature is confusing as it's not obvious who is
      sharing what, e.g. reasonable interpretations are that the guest value
      is shared by vCPUs in a VM, or that the MSR value is shared/common to
      guest and host, both of which are wrong.
      
      "shared" is also misleading as the MSR value (in hardware) is not
      guaranteed to be shared/reused between VMs (if that's indeed the correct
      interpretation of the name), as the ability to share values between VMs
      is simply a side effect (albeit a very nice side effect) of deferring
      restoration of the host value until returning from userspace.
      
      "user_return" avoids the above confusion by describing the mechanism
      itself instead of its effects.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923180409.32255-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7e34fbd0
    • S
      KVM: x86: Add RIP to the kvm_entry, i.e. VM-Enter, tracepoint · b2d52255
      Sean Christopherson committed
      Add RIP to the kvm_entry tracepoint to help debug if the kvm_exit
      tracepoint is disabled or if VM-Enter fails, in which case the kvm_exit
      tracepoint won't be hit.
      
      Read RIP from within the tracepoint itself to avoid a potential VMREAD
      and retpoline if the guest's RIP isn't available.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923201349.16097-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b2d52255
    • S
      KVM: x86: Add kvm_x86_ops hook to short circuit emulation · 09e3e2a1
      Sean Christopherson committed
      Replace the existing kvm_x86_ops.need_emulation_on_page_fault() with a
      more generic is_emulatable(), and unconditionally call the new function
      in x86_emulate_instruction().
      
      KVM will use the generic hook to support multiple security related
      technologies that prevent emulation in one way or another.  Similar to
      the existing AMD #NPF case where emulation of the current instruction is
      not possible due to lack of information, AMD's SEV-ES and Intel's SGX
      and TDX will introduce scenarios where emulation is impossible due to
      the guest's register state being inaccessible.  And again similar to the
      existing #NPF case, emulation can be initiated by kvm_mmu_page_fault(),
      i.e. outside of the control of vendor-specific code.
      
      While the cause and architecturally visible behavior of the various
      cases are different, e.g. SGX will inject a #UD, AMD #NPF is a clean
      resume or complete shutdown, and SEV-ES and TDX "return" an error, the
      impact on the common emulation code is identical: KVM must stop
      emulation immediately and resume the guest.
      
      Query is_emulatable() in handle_ud() as well so that the
      force_emulation_prefix code doesn't incorrectly modify RIP before
      calling emulate_instruction() in the absurdly unlikely scenario that
      KVM encounters forced emulation in conjunction with "do not emulate".
      
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200915232702.15945-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      09e3e2a1
    • M
      KVM: x86: fix MSR_IA32_TSC read for nested migration · cc5b54dd
      Maxim Levitsky committed
      MSR reads/writes should always access the L1 state, since the (nested)
      hypervisor should intercept all the MSRs it wants to adjust, and those
      that it doesn't should be read by the guest as if the host had read them.
      
      However, IA32_TSC is an exception. Even when not intercepted, the guest still
      reads the value + TSC offset.
      The write, however, does not take any TSC offset into account.
      
      This is documented in Intel's SDM and also seems to happen on AMD.
      
      This creates a problem when userspace wants to read the IA32_TSC value and then
      write it (e.g. for migration).
      
      In this case it reads the L2 value, but the write is interpreted as an L1 value.
      To fix this, make userspace-initiated reads of IA32_TSC return the L1 value
      as well.
      
      Huge thanks to Dave Gilbert for helping me understand this very confusing
      semantic of MSR writes.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20200921103805.9102-2-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cc5b54dd
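The read-side asymmetry and the fix can be illustrated with a small model. The struct and function names below are hypothetical, TSC scaling is ignored, and the sketch assumes the L2 offset simply adds on top of the L1 offset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified nested TSC state. */
struct tsc_state {
    uint64_t l1_offset;  /* offset the host applies for L1 */
    uint64_t l2_offset;  /* extra offset L1 applies for L2 */
    bool in_l2;          /* vCPU is currently running L2 */
};

/*
 * Read IA32_TSC. A guest read while in L2 sees the L2 value; a
 * host-initiated (userspace) read always returns the L1 value, so that
 * writing it back, which is interpreted as an L1 value, round-trips.
 */
uint64_t read_ia32_tsc(const struct tsc_state *s, uint64_t host_tsc,
                       bool host_initiated)
{
    uint64_t offset = s->l1_offset;

    if (s->in_l2 && !host_initiated)
        offset += s->l2_offset;

    return host_tsc + offset;
}
```

With an L1 offset of 100 and an L2 offset of 20, a guest read in L2 at host TSC 1000 yields 1120, while a userspace read yields 1100, which is the value a later write expects.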
    • B
      KVM: X86: Move handling of INVPCID types to x86 · 9715092f
      Babu Moger committed
      INVPCID instruction handling is mostly the same across both VMX and
      SVM. So, move the code to common x86.c.
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <159985255212.11252.10322694343971983487.stgit@bmoger-ubuntu>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9715092f
    • B
      KVM: X86: Rename and move the function vmx_handle_memory_failure to x86.c · 3f3393b3
      Babu Moger committed
      Handling of kvm_read/write_guest_virt*() errors can be moved to common
      code. The same code can be used by both VMX and SVM.
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <159985254493.11252.6603092560732507607.stgit@bmoger-ubuntu>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3f3393b3
    • W
      KVM: LAPIC: Narrow down the kick target vCPU · 68ca7663
      Wanpeng Li committed
      The kick after setting KVM_REQ_PENDING_TIMER handles the case where the
      timer fires on a different pCPU than the one the vCPU is running on.  This
      kick costs about 1000 clock cycles, and it is not needed when injecting an
      already-expired timer or when using the VMX preemption timer, because
      kvm_lapic_expired_hv_timer() is called from the target vCPU.
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1599731444-3525-6-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      68ca7663
  4. 25 Sep, 2020 2 commits
    • S
      KVM: x86: Reset MMU context if guest toggles CR4.SMAP or CR4.PKE · 8d214c48
      Sean Christopherson committed
      Reset the MMU context during kvm_set_cr4() if SMAP or PKE is toggled.
      Recent commits to (correctly) not reload PDPTRs when SMAP/PKE are
      toggled inadvertently skipped the MMU context reset due to the mask
      of bits that triggers PDPTR loads also being used to trigger MMU context
      resets.
      
      Fixes: 427890af ("kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode")
      Fixes: cb957adb ("kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode")
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Peter Shier <pshier@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923215352.17756-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8d214c48
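The fix amounts to using two distinct masks: one for CR4 bits that trigger a PDPTR load and a wider one for bits that require an MMU context reset. The sketch below illustrates that split; the bit positions are the architectural ones, while `struct cr4_actions` and `cr4_write_actions()` are hypothetical names, and the function is a model rather than KVM's `kvm_set_cr4()`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR4 bit positions. */
#define CR4_PSE  (UINT64_C(1) << 4)
#define CR4_PAE  (UINT64_C(1) << 5)
#define CR4_PGE  (UINT64_C(1) << 7)
#define CR4_SMEP (UINT64_C(1) << 20)
#define CR4_SMAP (UINT64_C(1) << 21)
#define CR4_PKE  (UINT64_C(1) << 22)

/* Hypothetical summary of what a CR4 write requires. */
struct cr4_actions {
    bool load_pdptrs;  /* reload PDPTEs from CR3 (PAE paging only) */
    bool reset_mmu;    /* reset the MMU context */
};

struct cr4_actions cr4_write_actions(uint64_t old_cr4, uint64_t new_cr4,
                                     bool pae_paging)
{
    /* Bits whose modification loads PDPTEs, per SDM vol. 3, 4.4.1. */
    const uint64_t pdptr_bits = CR4_PGE | CR4_PSE | CR4_PAE | CR4_SMEP;
    /* SMAP and PKE do not load PDPTEs but still change MMU behavior. */
    const uint64_t mmu_bits = pdptr_bits | CR4_SMAP | CR4_PKE;
    const uint64_t changed = old_cr4 ^ new_cr4;

    struct cr4_actions a = {
        .load_pdptrs = pae_paging && (changed & pdptr_bits) != 0,
        .reset_mmu = (changed & mmu_bits) != 0,
    };
    return a;
}
```

Toggling only CR4.SMAP or CR4.PKE therefore skips the PDPTR load but still resets the MMU context, which is the behavior this commit restores.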
    • M
      KVM: x86: fix MSR_IA32_TSC read for nested migration · ee6fa053
      Maxim Levitsky committed
      MSR reads/writes should always access the L1 state, since the (nested)
      hypervisor should intercept all the MSRs it wants to adjust, and those
      that it doesn't should be read by the guest as if the host had read them.
      
      However, IA32_TSC is an exception. Even when not intercepted, the guest still
      reads the value + TSC offset.
      The write, however, does not take any TSC offset into account.
      
      This is documented in Intel's SDM and also seems to happen on AMD.
      
      This creates a problem when userspace wants to read the IA32_TSC value and then
      write it (e.g. for migration).
      
      In this case it reads the L2 value, but the write is interpreted as an L1 value.
      To fix this, make userspace-initiated reads of IA32_TSC return the L1 value
      as well.
      
      Huge thanks to Dave Gilbert for helping me understand this very confusing
      semantic of MSR writes.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20200921103805.9102-2-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ee6fa053
  5. 23 Sep, 2020 1 commit
  6. 12 Sep, 2020 1 commit
  7. 24 Aug, 2020 1 commit
  8. 21 Aug, 2020 1 commit
  9. 18 Aug, 2020 3 commits
    • J
      kvm: x86: Toggling CR4.PKE does not load PDPTEs in PAE mode · cb957adb
      Jim Mattson committed
      See the SDM, volume 3, section 4.4.1:
      
      If PAE paging would be in use following an execution of MOV to CR0 or
      MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
      CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
      the PDPTEs are loaded from the address in CR3.
      
      Fixes: b9baba86 ("KVM, pkeys: expose CPUID/CR4 to guest")
      Cc: Huaitong Han <huaitong.han@intel.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Message-Id: <20200817181655.3716509-1-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cb957adb
    • J
      kvm: x86: Toggling CR4.SMAP does not load PDPTEs in PAE mode · 427890af
      Jim Mattson committed
      See the SDM, volume 3, section 4.4.1:
      
      If PAE paging would be in use following an execution of MOV to CR0 or
      MOV to CR4 (see Section 4.1.1) and the instruction is modifying any of
      CR0.CD, CR0.NW, CR0.PG, CR4.PAE, CR4.PGE, CR4.PSE, or CR4.SMEP; then
      the PDPTEs are loaded from the address in CR3.
      
      Fixes: 0be0226f ("KVM: MMU: fix SMAP virtualization")
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Reviewed-by: Oliver Upton <oupton@google.com>
      Message-Id: <20200817181655.3716509-2-jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      427890af
    • P
      KVM: x86: fix access code passed to gva_to_gpa · 19cf4b7e
      Paolo Bonzini committed
      The PK bit of the error code is computed dynamically in permission_fault
      and therefore need not be passed to gva_to_gpa: only the access bits
      (fetch, user, write) need to be passed down.
      
      Not doing so causes a splat in the pku test:
      
         WARNING: CPU: 25 PID: 5465 at arch/x86/kvm/mmu.h:197 paging64_walk_addr_generic+0x594/0x750 [kvm]
         Hardware name: Intel Corporation WilsonCity/WilsonCity, BIOS WLYDCRB1.SYS.0014.D62.2001092233 01/09/2020
         RIP: 0010:paging64_walk_addr_generic+0x594/0x750 [kvm]
         Code: <0f> 0b e9 db fe ff ff 44 8b 43 04 4c 89 6c 24 30 8b 13 41 39 d0 89
         RSP: 0018:ff53778fc623fb60 EFLAGS: 00010202
         RAX: 0000000000000001 RBX: ff53778fc623fbf0 RCX: 0000000000000007
         RDX: 0000000000000001 RSI: 0000000000000002 RDI: ff4501efba818000
         RBP: 0000000000000020 R08: 0000000000000005 R09: 00000000004000e7
         R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000007
         R13: ff4501efba818388 R14: 10000000004000e7 R15: 0000000000000000
         FS:  00007f2dcf31a700(0000) GS:ff4501f1c8040000(0000) knlGS:0000000000000000
         CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
         CR2: 0000000000000000 CR3: 0000001dea475005 CR4: 0000000000763ee0
         DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
         DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
         PKRU: 55555554
         Call Trace:
          paging64_gva_to_gpa+0x3f/0xb0 [kvm]
          kvm_fixup_and_inject_pf_error+0x48/0xa0 [kvm]
          handle_exception_nmi+0x4fc/0x5b0 [kvm_intel]
          kvm_arch_vcpu_ioctl_run+0x911/0x1c10 [kvm]
          kvm_vcpu_ioctl+0x23e/0x5d0 [kvm]
          ksys_ioctl+0x92/0xb0
          __x64_sys_ioctl+0x16/0x20
          do_syscall_64+0x3e/0xb0
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
         ---[ end trace d17eb998aee991da ]---
      Reported-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Fixes: 89786147 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      19cf4b7e
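Masking the page-fault error code down to the access bits, as the message describes, could be written along these lines. The bit positions are the architectural x86 page-fault error code bits; the macro and function names are chosen for this sketch and are not claimed to match the kernel's.

```c
#include <stdint.h>

/* Architectural x86 page-fault error code bits. */
#define PFERR_PRESENT (1u << 0)
#define PFERR_WRITE   (1u << 1)
#define PFERR_USER    (1u << 2)
#define PFERR_RSVD    (1u << 3)
#define PFERR_FETCH   (1u << 4)
#define PFERR_PK      (1u << 5)

/*
 * Keep only the access bits (write, user, fetch). PK is dropped because
 * the permission check computes it dynamically.
 */
uint32_t gva_access_from_error_code(uint32_t error_code)
{
    return error_code & (PFERR_WRITE | PFERR_USER | PFERR_FETCH);
}
```

For an error code with present, write, user, and PK set (0x27), only write and user (0x6) survive the mask.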
  10. 10 Aug, 2020 1 commit
    • S
      KVM: x86: Don't attempt to load PDPTRs when 64-bit mode is enabled · 05487215
      Sean Christopherson committed
      Don't attempt to load PDPTRs if EFER.LME=1, i.e. if 64-bit mode is
      enabled.  A recent change to reload the PDPTRs when CR0.CD or CR0.NW is
      toggled botched the EFER.LME handling and sends KVM down the PDPTR path
      when is_paging() is true, i.e. when the guest toggles CD/NW in 64-bit
      mode.
      
      Split the CR0 checks for 64-bit vs. 32-bit PAE into separate paths.  The
      64-bit path is specifically checking state when paging is toggled on,
      i.e. CR0.PG transitions from 0->1.  The PDPTR path now needs to run if
      the new CR0 state has paging enabled, irrespective of whether paging was
      already enabled.  Trying to shave a few cycles to make the PDPTR path an
      "else if" case is a mess.
      
      Fixes: d42e3fae ("kvm: x86: Read PDPTEs on CR0.CD and CR0.NW changes")
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Peter Shier <pshier@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20200714015732.32426-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      05487215
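The PDPTR half of the split described above can be modeled as a pure predicate. The CR0 bit positions are architectural; `cr0_needs_pdptr_load()` and its parameters are hypothetical, and the separate 64-bit path (checks made when CR0.PG goes 0->1 with EFER.LME=1) is deliberately left out of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR0 bit positions. */
#define CR0_NW (UINT64_C(1) << 29)
#define CR0_CD (UINT64_C(1) << 30)
#define CR0_PG (UINT64_C(1) << 31)

/*
 * Decide whether a CR0 write must reload the PDPTRs.
 * Never in 64-bit mode (EFER.LME=1). Otherwise, only when the new CR0
 * has paging enabled with PAE and one of CD, NW, or PG changed,
 * regardless of whether paging was already enabled before the write.
 */
bool cr0_needs_pdptr_load(bool efer_lme, bool cr4_pae,
                          uint64_t old_cr0, uint64_t new_cr0)
{
    const uint64_t pdptr_bits = CR0_CD | CR0_NW | CR0_PG;

    if (efer_lme)
        return false;
    if (!(new_cr0 & CR0_PG) || !cr4_pae)
        return false;

    return ((old_cr0 ^ new_cr0) & pdptr_bits) != 0;
}
```

Keeping this check independent of the 64-bit path, instead of chaining it as an "else if", mirrors the structure the commit message argues for.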
  11. 05 Aug, 2020 1 commit
  12. 31 Jul, 2020 2 commits
  13. 30 Jul, 2020 1 commit
    • T
      x86/kvm: Use __xfer_to_guest_mode_work_pending() in kvm_run_vcpu() · f3020b88
      Thomas Gleixner committed
      The comments explicitly explain that the work flags check and handling in
      kvm_run_vcpu() is done with preemption and interrupts enabled as KVM
      invokes the check again right before entering guest mode with interrupts
      disabled which guarantees that the work flags are observed and handled
      before VMENTER.
      
      Nevertheless the flag pending check in kvm_run_vcpu() uses the helper
      variant which requires interrupts to be disabled triggering an instant
      lockdep splat. This was caught in testing before and then not fixed up in
      the patch before applying. :(
      
      Use the relaxed and intentionally racy __xfer_to_guest_mode_work_pending()
      instead.
      
      Fixes: 72c3c0fe ("x86/kvm: Use generic xfer to guest work function")
      Reported-by: Qian Cai <cai@lca.pw> writes:
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/87bljxa2sa.fsf@nanos.tec.linutronix.de
      
      f3020b88
  14. 24 Jul, 2020 1 commit
  15. 17 Jul, 2020 1 commit
  16. 11 Jul, 2020 1 commit
    • M
      KVM: x86: Add a capability for GUEST_MAXPHYADDR < HOST_MAXPHYADDR support · 3edd6839
      Mohammed Gamal committed
      This patch adds a new capability KVM_CAP_SMALLER_MAXPHYADDR which
      allows userspace to query if the underlying architecture would
      support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly
      (e.g. qemu can decide if it should warn for -cpu ..,phys-bits=X)
      
      The complications in this patch are due to unexpected (but documented)
      behaviour we see with NPF vmexit handling on AMD processors.  If
      SVM is modified to add guest physical address checks in the NPF
      and guest #PF paths, we see the following error multiple times in
      the 'access' test in kvm-unit-tests:
      
                  test pte.p pte.36 pde.p: FAIL: pte 2000021 expected 2000001
                  Dump mapping: address: 0x123400000000
                  ------L4: 24c3027
                  ------L3: 24c4027
                  ------L2: 24c5021
                  ------L1: 1002000021
      
      This is because the PTE's accessed bit is set by the CPU hardware before
      the NPF vmexit. This is handled completely by hardware and cannot be fixed
      in software.
      
      Therefore, availability of the new capability depends on a boolean variable
      allow_smaller_maxphyaddr which is set individually by VMX and SVM init
      routines. On VMX it's always set to true, on SVM it's only set to true
      when NPT is not enabled.
      
      CC: Tom Lendacky <thomas.lendacky@amd.com>
      CC: Babu Moger <babu.moger@amd.com>
      Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
      Message-Id: <20200710154811.418214-10-mgamal@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3edd6839