1. 09 Jan 2023 (1 commit)
    • KVM: nSVM: clarify recalc_intercepts() wrt CR8 · 74905e3d
      Committed by Paolo Bonzini
      The mysterious comment "We only want the cr8 intercept bits of L1"
      dates back to basically the introduction of nested SVM, back when
      the handling of "less typical" hypervisors was very haphazard.
      With the development of kvm-unit-tests for interrupt handling,
      the same code grew another vmcb_clr_intercept for the interrupt
      window (VINTR) vmexit, this time with a comment that is at least
      decent.
      
      It turns out however that the same comment applies to the CR8 write
      intercept, which is also a "recheck if an interrupt should be
      injected" intercept.  The CR8 read intercept instead has not
      been used by KVM for 14 years (commit 649d6864, "KVM: SVM:
      sync TPR value to V_TPR field in the VMCB"), so do not bother
      clearing it and let one comment describe both CR8 write and VINTR
      handling.
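The resulting logic can be sketched as follows; the helper names and bit positions are simplified stand-ins for KVM's real vmcb_clr_intercept()/vmcb_is_intercept() machinery, not the kernel's actual definitions:

```c
#include <assert.h>
#include <stdint.h>

enum {
    INTERCEPT_CR8_WRITE = 0,   /* placeholder bit positions */
    INTERCEPT_VINTR     = 1,
};

struct vmcb_control {
    uint64_t intercepts;
};

static void vmcb_set_intercept(struct vmcb_control *c, int bit)
{
    c->intercepts |= 1ULL << bit;
}

static void vmcb_clr_intercept(struct vmcb_control *c, int bit)
{
    c->intercepts &= ~(1ULL << bit);
}

static int vmcb_is_intercept(const struct vmcb_control *c, int bit)
{
    return (int)((c->intercepts >> bit) & 1);
}

/* Merge L0's and L1's intercepts for L2, then drop the two "recheck
 * whether an interrupt should be injected" intercepts unless L1 itself
 * asked for them: while L2 runs, only L1's choice matters. */
static void recalc_intercepts(struct vmcb_control *vmcb02,
                              const struct vmcb_control *vmcb01,
                              const struct vmcb_control *vmcb12)
{
    vmcb02->intercepts = vmcb01->intercepts | vmcb12->intercepts;

    if (!vmcb_is_intercept(vmcb12, INTERCEPT_CR8_WRITE))
        vmcb_clr_intercept(vmcb02, INTERCEPT_CR8_WRITE);
    if (!vmcb_is_intercept(vmcb12, INTERCEPT_VINTR))
        vmcb_clr_intercept(vmcb02, INTERCEPT_VINTR);
}
```

The single comment now covers both clears, since both intercepts exist only so L0 can recheck interrupt injection on behalf of L1.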
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      74905e3d
  2. 19 Nov 2022 (6 commits)
  3. 18 Nov 2022 (3 commits)
  4. 10 Nov 2022 (2 commits)
  5. 27 Sep 2022 (6 commits)
  6. 29 Jul 2022 (2 commits)
    • KVM: x86: Split kvm_is_valid_cr4() and export only the non-vendor bits · c33f6f22
      Committed by Sean Christopherson
      Split the common x86 parts of kvm_is_valid_cr4(), i.e. the reserved bits
      checks, into a separate helper, __kvm_is_valid_cr4(), and export only the
      inner helper to vendor code in order to prevent nested VMX from calling
      back into vmx_is_valid_cr4() via kvm_is_valid_cr4().
      
      On SVM, this is a nop as SVM doesn't place any additional restrictions on
      CR4.
      
      On VMX, this is also currently a nop, but only because nested VMX is
      missing checks on reserved CR4 bits for nested VM-Enter.  That bug will
      be fixed in a future patch, and could simply use kvm_is_valid_cr4() as-is,
      but nVMX has _another_ bug where VMXON emulation doesn't enforce VMX's
      restrictions on CR0/CR4.  The cleanest and most intuitive way to fix the
      VMXON bug is to use nested_host_cr{0,4}_valid().  If the CR4 variant
      routes through kvm_is_valid_cr4(), using nested_host_cr4_valid() won't do
      the right thing for the VMXON case as vmx_is_valid_cr4() enforces VMX's
      restrictions if and only if the vCPU is post-VMXON.
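A minimal sketch of the split; the reserved-bit mask and the function-pointer vendor hook are illustrative stand-ins (real KVM derives the reserved bits from guest CPUID and uses the static_call vendor ops):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for KVM's per-vCPU reserved-bit computation. */
#define CR4_RESERVED_BITS 0xffffffff00000000ULL

/* Common x86 reserved-bit checks only.  Vendor code (e.g. nested VMX's
 * VM-Enter consistency checks) calls this inner helper directly, so it
 * can never recurse back into vmx_is_valid_cr4(). */
static bool __kvm_is_valid_cr4(uint64_t cr4)
{
    return !(cr4 & CR4_RESERVED_BITS);
}

/* Hypothetical stand-in for the vendor callback (vmx/svm_is_valid_cr4);
 * NULL models SVM, which adds no restrictions of its own. */
static bool (*vendor_is_valid_cr4)(uint64_t cr4);

/* Full check = common reserved bits plus vendor-specific restrictions. */
static bool kvm_is_valid_cr4(uint64_t cr4)
{
    return __kvm_is_valid_cr4(cr4) &&
           (!vendor_is_valid_cr4 || vendor_is_valid_cr4(cr4));
}
```

The key property is directional: kvm_is_valid_cr4() calls into vendor code, but vendor code only ever calls __kvm_is_valid_cr4().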
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220607213604.3346000-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c33f6f22
    • KVM: nSVM: Pull CS.Base from actual VMCB12 for soft int/ex re-injection · da0b93d6
      Committed by Maciej S. Szmigiero
      enter_svm_guest_mode() first calls nested_vmcb02_prepare_control() to copy
      control fields from VMCB12 to the current VMCB, then
      nested_vmcb02_prepare_save() to perform a similar copy of the save area.
      
      This means that nested_vmcb02_prepare_control() still runs with the
      previous save area values in the current VMCB so it shouldn't take the L2
      guest CS.Base from this area.
      
      Explicitly pull CS.Base from the actual VMCB12 instead in
      enter_svm_guest_mode().
      
      Granted, having a non-zero CS.Base is a very rare thing (and even
      impossible in 64-bit mode), having it change between nested VMRUNs is
      probably even rarer, but if it happens it would create a really subtle bug
      so it's better to fix it upfront.
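The fix boils down to computing the L2 linear IP from the source VMCB12 rather than the current VMCB. A toy sketch with a hypothetical, flattened VMCB layout:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, flattened VMCB layout for illustration only. */
struct vmcb {
    struct {
        uint64_t cs_base;
        uint64_t rip;
    } save;
};

/* When control fields are prepared before the save area, the current
 * VMCB's save area still holds stale (pre-VMRUN) values, so the L2
 * linear RIP used for soft int/ex re-injection must come straight
 * from VMCB12. */
static uint64_t l2_injection_linear_ip(const struct vmcb *vmcb12)
{
    return vmcb12->save.cs_base + vmcb12->save.rip;
}
```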
      
      Fixes: 6ef88d6e ("KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction")
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <4caa0f67589ae3c22c311ee0e6139496902f2edc.1658159083.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      da0b93d6
  7. 25 Jun 2022 (1 commit)
  8. 09 Jun 2022 (1 commit)
    • KVM: x86: SVM: fix nested PAUSE filtering when L0 intercepts PAUSE · e3cdaab5
      Committed by Paolo Bonzini
      Commit 74fd41ed ("KVM: x86: nSVM: support PAUSE filtering when L0
      doesn't intercept PAUSE") introduced passthrough support for nested
      PAUSE filtering when the host doesn't intercept PAUSE (i.e. when
      filtering is disabled via the kvm module parameter or with
      '-overcommit cpu-pm=on').
      
      Before this commit, L1 KVM didn't intercept PAUSE at all; afterwards,
      the feature was unconditionally exposed as supported by KVM cpuid, so
      L1 could try to use it even when L0 KVM can't really support it.
      
      In this case the fallback caused KVM to intercept each PAUSE
      instruction; in some cases, such interception can slow down the nested
      guest so much that it fails to boot.  Before the problematic commit,
      KVM was already setting both thresholds to 0 in vmcb02, but after the
      first userspace VM exit shrink_ple_window was called and would reset
      pause_filter_count to the default value.
      
      To fix this, change the fallback strategy - ignore the guest threshold
      values, but use/update the host threshold values unless the guest
      specifically requests disabling PAUSE filtering (either simple or
      advanced).
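The fallback described above can be sketched like this; the field and parameter names approximate the VMCB control fields, and the "L1 disabled filtering" condition is deliberately simplified:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct pause_ctrl {
    uint16_t pause_filter_count;
    uint16_t pause_filter_thresh;
};

/* If L1 explicitly disabled PAUSE filtering, honor that; otherwise
 * ignore L1's thresholds and use/track the host's (L0's), so that
 * shrink_ple_window()-style adjustments keep working and KVM does not
 * end up intercepting every PAUSE the nested guest executes. */
static void nested_setup_pause_filter(struct pause_ctrl *vmcb02,
                                      const struct pause_ctrl *host,
                                      const struct pause_ctrl *vmcb12,
                                      bool l1_intercepts_pause)
{
    bool l1_disabled = l1_intercepts_pause && !vmcb12->pause_filter_count;

    if (l1_disabled) {
        vmcb02->pause_filter_count = 0;
        vmcb02->pause_filter_thresh = 0;
    } else {
        vmcb02->pause_filter_count = host->pause_filter_count;
        vmcb02->pause_filter_thresh = host->pause_filter_thresh;
    }
}
```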
      
      Also fix a minor bug: on nested VM exit, when the PAUSE filter counter
      was copied back to vmcb01, a dirty bit was not set.
      
      Thanks a lot to Suravee Suthikulpanit for debugging this!
      
      Fixes: 74fd41ed ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE")
      Reported-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Co-developed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220518072709.730031-1-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e3cdaab5
  9. 08 Jun 2022 (5 commits)
    • KVM: x86: Introduce "struct kvm_caps" to track misc caps/settings · 938c8745
      Committed by Sean Christopherson
      Add kvm_caps to hold a variety of capabilities and defaults that
      aren't handled by kvm_cpu_caps because they aren't CPUID bits, in
      order to reduce the amount of boilerplate code required to add a new
      feature.  The vast
      majority (all?) of the caps interact with vendor code and are written
      only during initialization, i.e. should be tagged __read_mostly, declared
      extern in x86.h, and exported.
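The shape of the change can be sketched as below; the field names follow the commit's description of "misc caps and defaults" but are illustrative, not the kernel's exact layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One struct gathers what used to be a pile of individually declared,
 * individually exported globals. */
struct kvm_caps {
    bool has_tsc_control;
    uint64_t max_tsc_scaling_ratio;
    uint8_t tsc_scaling_ratio_frac_bits;
    uint64_t supported_xcr0;
};

/* A single __read_mostly-style instance, written only during hardware
 * setup; adding a new capability is now one field, not several
 * declarations plus an export. */
static struct kvm_caps kvm_caps;

static void vendor_hardware_setup(void)
{
    kvm_caps.has_tsc_control = true;
    kvm_caps.max_tsc_scaling_ratio = 1ULL << 32;
    kvm_caps.tsc_scaling_ratio_frac_bits = 32;
}
```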
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220524135624.22988-4-chenyi.qiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      938c8745
    • KVM: nSVM: Transparently handle L1 -> L2 NMI re-injection · 159fc6fa
      Committed by Maciej S. Szmigiero
      A NMI that L1 wants to inject into its L2 should be directly re-injected,
      without causing L0 side effects like engaging NMI blocking for L1.
      
      It's also worth noting that in this case it is L1's responsibility
      to track the NMI window status for its L2 guest.
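A compact sketch of the distinction; the EVTINJ encoding matches SVM's EVENTINJ type field, but the struct and flag plumbing are simplified stand-ins:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EVTINJ_VALID    (1u << 31)
#define EVTINJ_TYPE_NMI (2u << 8)   /* SVM EVENTINJ type field, NMI = 2 */

struct vcpu {
    uint32_t event_inj;
    bool nmi_masked;
};

/* Deliver an NMI to the guest.  For an L1 -> L2 re-injection, skip the
 * L0-only side effect of engaging NMI blocking: tracking the NMI
 * window for L2 is L1's job, not L0's. */
static void svm_inject_nmi(struct vcpu *v, bool reinjected)
{
    v->event_inj = EVTINJ_VALID | EVTINJ_TYPE_NMI;
    if (!reinjected)
        v->nmi_masked = true;
}
```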
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <f894d13501cd48157b3069a4b4c7369575ddb60e.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      159fc6fa
    • KVM: SVM: Re-inject INTn instead of retrying the insn on "failure" · 7e5b5ef8
      Committed by Sean Christopherson
      Re-inject INTn software interrupts instead of retrying the instruction if
      the CPU encountered an intercepted exception while vectoring the INTn,
      e.g. if KVM intercepted a #PF when utilizing shadow paging.  Retrying the
      instruction is architecturally wrong, e.g. it will result in a spurious
      #DB if there's a code breakpoint on the INTn, and lack of re-injection also
      breaks nested virtualization, e.g. if L1 injects a software interrupt and
      vectoring the injected interrupt encounters an exception that is
      intercepted by L0 but not L1.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <1654ad502f860948e4f2d57b8bd881d67301f785.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7e5b5ef8
    • KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction · 6ef88d6e
      Committed by Sean Christopherson
      Re-inject INT3/INTO instead of retrying the instruction if the CPU
      encountered an intercepted exception while vectoring the software
      exception, e.g. if vectoring INT3 encounters a #PF and KVM is using
      shadow paging.  Retrying the instruction is architecturally wrong, e.g.
      will result in a spurious #DB if there's a code breakpoint on the INT3/INTO,
      and lack of re-injection also breaks nested virtualization, e.g. if L1
      injects a software exception and vectoring the injected exception
      encounters an exception that is intercepted by L0 but not L1.
      
      Due to, ahem, deficiencies in the SVM architecture, acquiring the next
      RIP may require flowing through the emulator even if NRIPS is supported,
      as the CPU clears next_rip if the VM-Exit is due to an exception other
      than "exceptions caused by the INT3, INTO, and BOUND instructions".  To
      deal with this, "skip" the instruction to calculate next_rip (if it's
      not already known), and then unwind the RIP write and any side effects
      (RFLAGS updates).
      
      Save the computed next_rip and use it to re-stuff next_rip if injection
      doesn't complete.  This allows KVM to do the right thing if next_rip was
      known prior to injection, e.g. if L1 injects a soft event into L2, and
      there is no backing INTn instruction, e.g. if L1 is injecting an
      arbitrary event.
      
      Note, it's impossible to guarantee architectural correctness given SVM's
      architectural flaws.  E.g. if the guest executes INTn (no KVM injection),
      an exit occurs while vectoring the INTn, and the guest modifies the code
      stream while the exit is being handled, KVM will compute the incorrect
      next_rip due to "skipping" the wrong instruction.  A future enhancement
      to make this less awful would be for KVM to detect that the decoded
      instruction is not the correct INTn and drop the to-be-injected soft
      event (retrying is a lesser evil compared to shoving the wrong RIP on the
      exception stack).
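The "skip, record, unwind" dance can be sketched as follows; the one-byte skip stub and the field names are illustrative stand-ins for KVM's emulator-backed instruction skip:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct vcpu {
    uint64_t rip;
    uint64_t rflags;
    uint64_t soft_int_next_rip;
};

/* Stand-in for KVM's emulator-backed instruction skip; here the
 * decoded INT3 is simply assumed to be one byte long. */
static bool skip_emulated_instruction(struct vcpu *v)
{
    v->rip += 1;
    v->rflags &= ~(1ULL << 16);   /* skip clears RFLAGS.RF as a side effect */
    return true;
}

/* "Skip" the instruction only to learn next_rip, then unwind the RIP
 * write and the RFLAGS side effects so the injected event replays from
 * the original state.  The saved value re-stuffs next_rip if injection
 * doesn't complete. */
static bool svm_update_soft_interrupt_rip(struct vcpu *v)
{
    uint64_t old_rip = v->rip, old_rflags = v->rflags;

    if (!skip_emulated_instruction(v))
        return false;
    v->soft_int_next_rip = v->rip;

    v->rip = old_rip;
    v->rflags = old_rflags;
    return true;
}
```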
      Reported-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <65cb88deab40bc1649d509194864312a89bbe02e.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6ef88d6e
    • KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 · 00f08d99
      Committed by Maciej S. Szmigiero
      The next_rip field of a VMCB is *not* an output-only field for VMRUN.
      This field's value (instead of the saved guest RIP) is used by the CPU
      as the return address pushed on the stack when injecting a software
      interrupt or an INT3 or INTO exception.
      
      Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or
      loading a nested state and NRIPS is exposed to L1.  If NRIPS is supported
      in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff
      vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which
      saves RIP on the stack as-is).
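The sync rule can be sketched like this, using a hypothetical, flattened VMCB layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, flattened VMCB layout for illustration only. */
struct vmcb {
    struct { uint64_t next_rip; } control;
    struct { uint64_t rip; } save;
};

/* next_rip is an *input* for event injection: it is what the CPU
 * pushes as the return address for a software interrupt/exception.
 * Sync it from vmcb12 when L1 sees NRIPS; otherwise stuff the L2 RIP
 * to mimic a !NRIPS CPU, which pushes RIP as-is. */
static void nested_sync_next_rip(struct vmcb *vmcb02,
                                 const struct vmcb *vmcb12,
                                 bool nrips_exposed_to_l1)
{
    if (nrips_exposed_to_l1)
        vmcb02->control.next_rip = vmcb12->control.next_rip;
    else
        vmcb02->control.next_rip = vmcb12->save.rip;
}
```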
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <c2e0a3d78db3ae30530f11d4e9254b452a89f42b.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      00f08d99
  10. 07 Jun 2022 (1 commit)
    • KVM: SVM: fix tsc scaling cache logic · 11d39e8c
      Committed by Maxim Levitsky
      SVM uses a per-CPU variable to cache the current value of the
      TSC scaling multiplier MSR on each CPU.
      
      Commit 1ab9287a
      ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
      broke this caching logic.
      
      Refactor the code so that all TSC scaling multiplier writes go through
      a single function which checks and updates the cache.
      
      This fixes the following scenario:
      
      1. A CPU runs a guest with some tsc scaling ratio.
      
      2. New guest with different tsc scaling ratio starts on this CPU
         and terminates almost immediately.
      
         The short-running guest sets the TSC scaling ratio just once,
         when it is configured via KVM_SET_TSC_KHZ.  Due to the bug,
         the per-cpu cache is not updated.
      
      3. The original guest continues to run; it doesn't restore the MSR
         value to its own value because the cache matches, and thus keeps
         running with a wrong TSC scaling ratio.
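The refactor funnels every write through one cache-aware function; this sketch uses a plain global and an instrumented MSR-write stub instead of the kernel's per-CPU variable and wrmsrl():

```c
#include <assert.h>
#include <stdint.h>

static uint64_t tsc_ratio_cache;   /* per-CPU in the real code */
static int wrmsr_count;            /* test instrumentation only */

/* Stand-in for wrmsrl(MSR_AMD64_TSC_RATIO, ratio). */
static void write_tsc_ratio_msr(uint64_t ratio)
{
    (void)ratio;
    wrmsr_count++;
}

/* Every TSC-multiplier write goes through this one function, so the
 * cache and the MSR can never disagree: the short-lived guest in
 * scenario step 2 now updates the cache, and step 3's comparison does
 * the right thing. */
static void svm_write_tsc_multiplier(uint64_t ratio)
{
    if (tsc_ratio_cache != ratio) {
        write_tsc_ratio_msr(ratio);
        tsc_ratio_cache = ratio;
    }
}
```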
      
      Fixes: 1ab9287a ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220606181149.103072-1-mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      11d39e8c
  11. 30 Apr 2022 (1 commit)
  12. 14 Apr 2022 (1 commit)
    • KVM: x86: Drop WARNs that assert a triple fault never "escapes" from L2 · 45846661
      Committed by Sean Christopherson
      Remove WARNs that sanity check that KVM never lets a triple fault for L2
      escape and incorrectly end up in L1.  In normal operation, the sanity
      check is perfectly valid, but it incorrectly assumes that it's impossible
      for userspace to induce KVM_REQ_TRIPLE_FAULT without bouncing through
      KVM_RUN (which guarantees kvm_check_nested_state() will see and handle
      the triple fault).
      
      The WARN can currently be triggered if userspace injects a machine check
      while L2 is active and CR4.MCE=0.  And a future fix to allow save/restore
      of KVM_REQ_TRIPLE_FAULT, e.g. so that a synthesized triple fault isn't
      lost on migration, will make it trivially easy for userspace to trigger
      the WARN.
      
      Clearing KVM_REQ_TRIPLE_FAULT when forcibly leaving guest mode is
      tempting, but wrong, especially if/when the request is saved/restored,
      e.g. if userspace restores events (including a triple fault) and then
      restores nested state (which may forcibly leave guest mode).  Ignoring
      the fact that KVM doesn't currently provide the necessary APIs, it's
      userspace's responsibility to manage pending events during save/restore.
      
        ------------[ cut here ]------------
        WARNING: CPU: 7 PID: 1399 at arch/x86/kvm/vmx/nested.c:4522 nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel]
        Modules linked in: kvm_intel kvm irqbypass
        CPU: 7 PID: 1399 Comm: state_test Not tainted 5.17.0-rc3+ #808
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel]
        Call Trace:
         <TASK>
         vmx_leave_nested+0x30/0x40 [kvm_intel]
         vmx_set_nested_state+0xca/0x3e0 [kvm_intel]
         kvm_arch_vcpu_ioctl+0xf49/0x13e0 [kvm]
         kvm_vcpu_ioctl+0x4b9/0x660 [kvm]
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        ---[ end trace 0000000000000000 ]---
      
      Fixes: cb6a32c2 ("KVM: x86: Handle triple fault in L2 without killing L1")
      Cc: stable@vger.kernel.org
      Cc: Chenyi Qiang <chenyi.qiang@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220407002315.78092-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      45846661
  13. 02 Apr 2022 (7 commits)
  14. 25 Feb 2022 (1 commit)
    • KVM: x86/mmu: load new PGD after the shadow MMU is initialized · 3cffc89d
      Committed by Paolo Bonzini
      Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
      shadow_root_level anymore, pull the PGD load after the initialization of
      the shadow MMUs.
      
      Besides being more intuitive, this enables future simplifications
      and optimizations because it is no longer necessary to compute the
      role outside kvm_init_mmu.  In particular, kvm_mmu_reset_context was not
      attempting to use a cached PGD to avoid having to figure out the new role.
      With this change, it could follow what nested_{vmx,svm}_load_cr3 are doing,
      and avoid unloading all the cached roots.
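A toy model of the ordering; the function names mirror the commit, but the bodies and the explicit role parameter are stubs (the real kvm_init_mmu computes the role internally):

```c
#include <assert.h>
#include <stdint.h>

struct mmu { uint32_t role; };
struct vcpu { struct mmu mmu; uint64_t root_pgd; uint32_t root_role; };

static void kvm_init_mmu(struct vcpu *v, uint32_t new_role)
{
    v->mmu.role = new_role;
}

static void __kvm_mmu_new_pgd(struct vcpu *v, uint64_t pgd)
{
    /* a real implementation would look up cached roots by (pgd, role) */
    v->root_pgd = pgd;
    v->root_role = v->mmu.role;
}

/* The point of the reordering: initialize the shadow MMU first, so the
 * PGD load sees the *new* role and can reuse cached roots keyed by it
 * instead of recomputing the role up front. */
static void kvm_mmu_reset_context(struct vcpu *v, uint32_t new_role,
                                  uint64_t pgd)
{
    kvm_init_mmu(v, new_role);
    __kvm_mmu_new_pgd(v, pgd);
}
```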
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3cffc89d
  15. 11 Feb 2022 (2 commits)
    • KVM: nSVM: Implement Enlightened MSR-Bitmap feature · 66c03a92
      Committed by Vitaly Kuznetsov
      Similar to nVMX commit 502d2bf5 ("KVM: nVMX: Implement Enlightened MSR
      Bitmap feature"), add support for the feature for nSVM (Hyper-V on KVM).
      
      Notable differences from nVMX implementation:
      - As the feature uses SW reserved fields in VMCB control, KVM needs to
      make sure it's dealing with a Hyper-V guest (kvm_hv_hypercall_enabled()).
      
      - 'msrpm_base_pa' always needs to be overwritten in
      nested_svm_vmrun_msrpm(), even when the update is skipped.  As an
      optimization, nested_vmcb02_prepare_control() copies it from VMCB01,
      so when the MSR-Bitmap feature for L2 is disabled nothing needs to be
      done.
      
      - 'struct vmcb_ctrl_area_cached' needs to be extended with clean
      fields/sw reserved data and __nested_copy_vmcb_control_to_cache() needs to
      copy it so nested_svm_vmrun_msrpm() can use it later.
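The skip condition can be sketched as below; the struct, field, and flag names are hypothetical stand-ins modeled on the commit description, not the actual Hyper-V SW-reserved VMCB layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical clean bit for the nested-enlightenments group. */
#define HV_VMCB_NESTED_ENLIGHTENMENTS (1u << 31)

struct hv_vmcb_enlightenments {
    uint32_t clean_fields;
    bool enlightened_msr_bitmap;
};

/* The merge of L1's MSR bitmap into vmcb02 may be skipped only for a
 * Hyper-V L1 (hypercalls enabled) that turned on the enlightenment and
 * left the corresponding clean bit set, i.e. its bitmap is unchanged
 * since the previous VMRUN. */
static bool nested_svm_merge_msrpm_needed(const struct hv_vmcb_enlightenments *hve,
                                          bool hv_hypercall_enabled)
{
    if (hv_hypercall_enabled && hve->enlightened_msr_bitmap &&
        (hve->clean_fields & HV_VMCB_NESTED_ENLIGHTENMENTS))
        return false;
    return true;
}
```

Note that even on the skip path, the real code still rewrites 'msrpm_base_pa' as described above.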
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220202095100.129834-5-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      66c03a92
    • KVM: nSVM: Track whether changes in L0 require MSR bitmap for L2 to be rebuilt · 73c25546
      Committed by Vitaly Kuznetsov
      Similar to nVMX commit ed2a4800 ("KVM: nVMX: Track whether changes in
      L0 require MSR bitmap for L2 to be rebuilt"), introduce a flag to keep
      track of whether MSR bitmap for L2 needs to be rebuilt due to changes in
      MSR bitmap for L1 or switching to a different L2.
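A minimal sketch of the flag's lifecycle; the names are hypothetical (the kernel keeps this state per-vCPU) and the global here exists only for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Set whenever an L0-side change to L1's MSR interception makes the
 * merged L2 bitmap stale. */
static bool force_msr_bitmap_recalc;

static void set_msr_interception_for_l1(void)
{
    /* ... update L1's bitmap ... */
    force_msr_bitmap_recalc = true;
}

/* On nested VMRUN, rebuild the merged bitmap only when flagged or when
 * switching to a different L2; otherwise the cached merge is reused. */
static bool nested_vmrun_needs_msrpm_rebuild(bool switched_l2)
{
    bool rebuild = force_msr_bitmap_recalc || switched_l2;

    force_msr_bitmap_recalc = false;
    return rebuild;
}
```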
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220202095100.129834-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      73c25546