1. 04 Feb 2021 (3 commits)
  2. 03 Feb 2021 (1 commit)
    • KVM: x86: cleanup CR3 reserved bits checks · c1c35cf7
      Paolo Bonzini committed
      If not in long mode, the low bits of CR3 are reserved but not enforced to
      be zero, so remove those checks.  If in long mode, however, the MBZ bits
      extend down to the highest physical address bit of the guest, excluding
      the encryption bit.
      
      Make the checks consistent with the above, and match them between
      nested_vmcb_checks and KVM_SET_SREGS.
      
      Cc: stable@vger.kernel.org
      Fixes: 761e4169 ("KVM: nSVM: Check that MBZ bits in CR3 and CR4 are not set on vmrun of nested guests")
      Fixes: a780a3ea ("KVM: X86: Fix reserved bits check for MOV to CR3")
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
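      As a sketch, the long-mode check described above reduces to something
      like the following (assuming a kvm_vcpu_is_illegal_gpa()-style helper
      that tests bits 63:maxphyaddr with the encryption bit already carved
      out of the mask):
      
        /* Long mode: MBZ bits run from bit 63 down to the guest's
         * MAXPHYADDR, excluding the encryption bit. */
        if (is_long_mode(vcpu) && kvm_vcpu_is_illegal_gpa(vcpu, cr3))
                return 1;
        /* Outside long mode, the low CR3 bits are reserved but not
         * enforced to be zero, so no check is performed. */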
  3. 02 Feb 2021 (2 commits)
    • KVM/x86: assign hva with the right value to vm_munmap the pages · b66f9bab
      Zheng Zhan Liang committed
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Zheng Zhan Liang <zhengzhanliang@huorong.cn>
      Message-Id: <20210201055310.267029-1-zhengzhanliang@huorong.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Allow guests to see MSR_IA32_TSX_CTRL even if tsx=off · 7131636e
      Paolo Bonzini committed
      Userspace that does not know about KVM_GET_MSR_FEATURE_INDEX_LIST
      will generally use the default value for MSR_IA32_ARCH_CAPABILITIES.
      When this happens and the host has tsx=on, it is possible to end up with
      virtual machines that have HLE and RTM disabled, but TSX_CTRL available.
      
      If the fleet is then switched to tsx=off, kvm_get_arch_capabilities()
      will clear the ARCH_CAP_TSX_CTRL_MSR bit and it will not be possible to
      use the tsx=off hosts as migration destinations, even though the guests
      do not have TSX enabled.
      
      To allow this migration, allow guests to write to their TSX_CTRL MSR,
      while keeping the host MSR unchanged for the entire life of the guests.
      This ensures that TSX remains disabled and also saves MSR reads and
      writes, and it's okay to do because with tsx=off we know that guests will
      not have the HLE and RTM features in their CPUID.  (If userspace sets
      bogus CPUID data, we do not expect HLE and RTM to work in guests anyway).
      
      Cc: stable@vger.kernel.org
      Fixes: cbbaa272 ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
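      A minimal sketch of the write side, assuming a purely virtual per-vCPU
      field (guest_tsx_ctrl is illustrative; the real patch routes the value
      through the user-return MSR machinery so the host MSR stays untouched):
      
        case MSR_IA32_TSX_CTRL:
                /* Only the architectural bits are accepted. */
                if (data & ~(TSX_CTRL_RTM_DISABLE | TSX_CTRL_CPUID_CLEAR))
                        return 1;
                /* Keep the value virtual: with tsx=off the guest lacks
                 * HLE/RTM in CPUID, so TSX stays effectively disabled. */
                vmx->guest_tsx_ctrl = data;
                break;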
  4. 26 Jan 2021 (3 commits)
    • KVM: x86: allow KVM_REQ_GET_NESTED_STATE_PAGES outside guest mode for VMX · 9a78e158
      Paolo Bonzini committed
      VMX also uses KVM_REQ_GET_NESTED_STATE_PAGES for the Hyper-V eVMCS,
      which may need to be loaded outside guest mode.  Therefore we cannot
      WARN in that case.
      
      However, that part of nested_get_vmcs12_pages is _not_ needed at
      vmentry time.  Split it out of KVM_REQ_GET_NESTED_STATE_PAGES handling,
      so that both vmentry and migration (and in the latter case, independent
      of is_guest_mode) do the parts that are needed.
      
      Cc: <stable@vger.kernel.org> # 5.10.x: f2c7ef3b: KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES
      Cc: <stable@vger.kernel.org> # 5.10.x
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
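      The resulting handler is, in shape, close to the following sketch
      (helper names follow the description above):
      
        static bool vmx_get_nested_state_pages(struct kvm_vcpu *vcpu)
        {
                /* The eVMCS may need (re)mapping even outside guest mode. */
                if (!nested_get_evmcs_page(vcpu))
                        return false;
      
                /* The rest is vmentry-only work. */
                if (is_guest_mode(vcpu) && !nested_get_vmcs12_pages(vcpu))
                        return false;
      
                return true;
        }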
    • kvm: tracing: Fix unmatched kvm_entry and kvm_exit events · d95df951
      Lorenzo Brescia committed
      On VMX, if we exit and then re-enter immediately without leaving
      the vmx_vcpu_run() function, the kvm_entry event is not logged.
      That means we will see one (or more) kvm_exit, without its (their)
      corresponding kvm_entry, as shown here:
      
       CPU-1979 [002] 89.871187: kvm_entry: vcpu 1
       CPU-1979 [002] 89.871218: kvm_exit:  reason MSR_WRITE
       CPU-1979 [002] 89.871259: kvm_exit:  reason MSR_WRITE
      
      It also seems possible for a kvm_entry event to be logged, but then
      we leave vmx_vcpu_run() right away (if vmx->emulation_required is
      true). In this case, we will have a spurious kvm_entry event in the
      trace.
      
      Fix these situations by moving trace_kvm_entry() inside vmx_vcpu_run()
      (where trace_kvm_exit() already is).
      
      A trace obtained with this patch applied looks like this:
      
       CPU-14295 [000] 8388.395387: kvm_entry: vcpu 0
       CPU-14295 [000] 8388.395392: kvm_exit:  reason MSR_WRITE
       CPU-14295 [000] 8388.395393: kvm_entry: vcpu 0
       CPU-14295 [000] 8388.395503: kvm_exit:  reason EXTERNAL_INTERRUPT
      
      Of course, not calling trace_kvm_entry() in common x86 code any
      longer means that we need to adjust the SVM side of things too.
      Signed-off-by: Lorenzo Brescia <lorenzo.brescia@edu.unito.it>
      Signed-off-by: Dario Faggioli <dfaggioli@suse.com>
      Message-Id: <160873470698.11652.13483635328769030605.stgit@Wayrath>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
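      Schematically, the entry tracepoint now fires on every (re)entry; a
      sketch, with the label standing in for the fastpath re-entry loop in
      vmx.c:
      
        static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
        {
                /* ... */
        reenter_guest:
                /* Fires on every entry, including immediate re-entries,
                 * pairing 1:1 with trace_kvm_exit(). */
                trace_kvm_entry(vcpu);
                /* ... enter the guest, handle the exit ... */
        }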
    • KVM: x86: get smi pending status correctly · 1f7becf1
      Jay Zhou committed
      The SMI injection process has two steps:
      
          Qemu                        KVM
      Step1:
          cpu->interrupt_request &= \
              ~CPU_INTERRUPT_SMI;
          kvm_vcpu_ioctl(cpu, KVM_SMI)
      
                                      call kvm_vcpu_ioctl_smi() and
                                      kvm_make_request(KVM_REQ_SMI, vcpu);
      
      Step2:
          kvm_vcpu_ioctl(cpu, KVM_RUN, 0)
      
                                      call process_smi() if
                                      kvm_check_request(KVM_REQ_SMI, vcpu) is
                                      true, mark vcpu->arch.smi_pending = true;
      
      vcpu->arch.smi_pending will be set true in step 2. Unfortunately, if the
      vcpu is paused between step 1 and step 2, kvm_run->immediate_exit will be
      set and the vcpu has to exit to Qemu immediately during step 2, before
      vcpu->arch.smi_pending is marked true.
      During VM migration, Qemu gets the smi pending status from KVM using the
      KVM_GET_VCPU_EVENTS ioctl at the downtime, and the smi pending status
      is then lost.
      Signed-off-by: Jay Zhou <jianjay.zhou@huawei.com>
      Signed-off-by: Shengen Zhuang <zhuangshengen@huawei.com>
      Message-Id: <20210118084720.1585-1-jianjay.zhou@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
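      The fix, sketched: fold a not-yet-processed KVM_REQ_SMI into the
      reported state, so a migration between step 1 and step 2 cannot drop
      the SMI:
      
        static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
                                                struct kvm_vcpu_events *events)
        {
                process_nmi(vcpu);
                /* Process a latched KVM_REQ_SMI before reporting. */
                if (kvm_check_request(KVM_REQ_SMI, vcpu))
                        process_smi(vcpu);
                /* ... */
                events->smi.pending = vcpu->arch.smi_pending;
                /* ... */
        }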
  5. 08 Jan 2021 (3 commits)
    • KVM: x86: __kvm_vcpu_halt can be static · 872f36eb
      Paolo Bonzini committed
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Add support for booting APs in an SEV-ES guest · 647daca2
      Tom Lendacky committed
      Typically under KVM, an AP is booted using the INIT-SIPI-SIPI sequence,
      where the guest vCPU register state is updated and VMRUN is then issued
      to begin execution of the AP. For an SEV-ES guest, this won't work because
      the guest register state is encrypted.
      
      Following the GHCB specification, the hypervisor must not alter the guest
      register state, so KVM must track an AP/vCPU boot. Should the guest want
      to park the AP, it must use the AP Reset Hold exit event in place of, for
      example, a HLT loop.
      
      First AP boot (first INIT-SIPI-SIPI sequence):
        Execute the AP (vCPU) as it was initialized and measured by the SEV-ES
        support. It is up to the guest to transfer control of the AP to the
        proper location.
      
      Subsequent AP boot:
        KVM will expect to receive an AP Reset Hold exit event indicating that
        the vCPU is being parked and will require an INIT-SIPI-SIPI sequence to
        awaken it. When the AP Reset Hold exit event is received, KVM will place
        the vCPU into a simulated HLT mode. Upon receiving the INIT-SIPI-SIPI
        sequence, KVM will make the vCPU runnable. It is again up to the guest
        to then transfer control of the AP to the proper location.
      
        To differentiate between an actual HLT and an AP Reset Hold, a new MP
        state is introduced, KVM_MP_STATE_AP_RESET_HOLD, which the vCPU is
        placed in upon receiving the AP Reset Hold exit event. Additionally, to
        communicate the AP Reset Hold exit event up to userspace (if needed), a
        new exit reason is introduced, KVM_EXIT_AP_RESET_HOLD.
      
      A new x86 ops function is introduced, vcpu_deliver_sipi_vector, in order
      to accomplish AP booting. For VMX, vcpu_deliver_sipi_vector is set to the
      original SIPI delivery function, kvm_vcpu_deliver_sipi_vector(). SVM adds
      a new function that, for non SEV-ES guests, invokes the original SIPI
      delivery function, kvm_vcpu_deliver_sipi_vector(), but for SEV-ES guests,
      implements the logic above.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <e8fbebe8eb161ceaabdad7c01a5859a78b424d5e.1609791600.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
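      A condensed sketch of the SVM hook (the SEV-ES branch mirrors the logic
      above; names are close to, but not verbatim, the patch):
      
        static void svm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
        {
                if (!sev_es_guest(vcpu->kvm)) {
                        kvm_vcpu_deliver_sipi_vector(vcpu, vector);
                        return;
                }
                /* Encrypted register state: do not touch the VMSA, just
                 * release the vCPU from its AP Reset Hold "HLT". */
                if (vcpu->arch.mp_state == KVM_MP_STATE_AP_RESET_HOLD)
                        vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
        }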
    • KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES on nested vmexit · f2c7ef3b
      Maxim Levitsky committed
      It is possible to exit nested guest mode (entered by svm_set_nested_state)
      prior to the first VM entry into it (e.g. due to a pending event), if the
      nested run was not pending during the migration.
      
      In this case we must not switch to the nested MSR permission bitmap.
      Also add a warning to catch similar cases in the future.
      
      Fixes: a7d5c7ce ("KVM: nSVM: delay MSR permission processing to first nested VM run")
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210107093854.882483-2-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 20 Dec 2020 (1 commit)
  7. 15 Dec 2020 (15 commits)
    • KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests · 16809ecd
      Tom Lendacky committed
      The run sequence is different for an SEV-ES guest compared to a legacy or
      even an SEV guest. The guest vCPU register state of an SEV-ES guest will
      be restored on VMRUN and saved on VMEXIT. There is no need to restore the
      guest registers directly and through VMLOAD before VMRUN and no need to
      save the guest registers directly and through VMSAVE on VMEXIT.
      
      Update the svm_vcpu_run() function to skip register state saving and
      restoring, and provide an alternative function for running an SEV-ES
      guest in vmenter.S.
      
      Additionally, certain host state is restored across an SEV-ES VMRUN. As
      a result certain register states are not required to be restored upon
      VMEXIT (e.g. FS, GS, etc.), so only do that if the guest is not an SEV-ES
      guest.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <fb1c66d32f2194e171b95fc1a8affd6d326e10c1.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
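      The run path then splits roughly as follows (function names as in the
      series; surrounding bookkeeping elided):
      
        if (sev_es_guest(vcpu->kvm)) {
                /* Hardware restores/saves guest state across VMRUN/VMEXIT. */
                __svm_sev_es_vcpu_run(svm->vmcb_pa);
        } else {
                __svm_vcpu_run(svm->vmcb_pa, (unsigned long *)&vcpu->arch.regs);
                /* Legacy path: restore host FS/GS etc. after VMEXIT. */
        }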
    • KVM: SVM: Provide support for SEV-ES vCPU loading · 86137773
      Tom Lendacky committed
      An SEV-ES vCPU requires additional VMCB vCPU load/put requirements. SEV-ES
      hardware will restore certain registers on VMEXIT, but not save them on
      VMRUN (see Table B-3 and Table B-4 of the AMD64 APM Volume 2), so make the
      following changes:
      
      General vCPU load changes:
        - During vCPU loading, perform a VMSAVE to the per-CPU SVM save area and
          save the current values of XCR0, XSS and PKRU to the per-CPU SVM save
          area as these registers will be restored on VMEXIT.
      
      General vCPU put changes:
        - Do not attempt to restore registers that SEV-ES hardware has already
          restored on VMEXIT.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <019390e9cb5e93cd73014fa5a040c17d42588733.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
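      Sketched, the SEV-ES load path stashes the host values that hardware
      will later restore on VMEXIT (this_cpu_host_save_area() and
      host_save_area_pa() are hypothetical helpers standing in for the
      per-CPU SVM save area lookup):
      
        static void sev_es_vcpu_load(struct vcpu_svm *svm, int cpu)
        {
                struct vmcb_save_area *hostsa = this_cpu_host_save_area(cpu);
      
                /* VMEXIT restores these from the save area, so the current
                 * host values must be captured at vCPU load time. */
                vmsave(host_save_area_pa(cpu));
                hostsa->xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
                hostsa->pkru = read_pkru();
                hostsa->xss  = host_xss;
        }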
    • KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest · ed02b213
      Tom Lendacky committed
      The guest FPU state is automatically restored on VMRUN and saved on VMEXIT
      by the hardware, so there is no reason to do this in KVM. Eliminate the
      allocation of the guest_fpu save area and key off that to skip operations
      related to the guest FPU state.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <173e429b4d0d962c6a443c4553ffdaf31b7665a4.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Do not report support for SMM for an SEV-ES guest · 5719455f
      Tom Lendacky committed
      SEV-ES guests do not currently support SMM. Update the has_emulated_msr()
      kvm_x86_ops function to take a struct kvm parameter so that the capability
      can be reported at a VM level.
      
      Since this op is also called during KVM initialization and before a struct
      kvm instance is available, comments will be added to each implementation
      of has_emulated_msr() to indicate the kvm parameter can be null.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <75de5138e33b945d2fb17f81ae507bda381808e3.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
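      The SVM side then looks roughly like this sketch (note the NULL check
      for the init-time call):
      
        static bool svm_has_emulated_msr(struct kvm *kvm, u32 index)
        {
                switch (index) {
                case MSR_IA32_SMBASE:
                        /* kvm may be NULL when called during KVM init.
                         * SEV-ES guests do not support SMM. */
                        if (kvm && sev_es_guest(kvm))
                                return false;
                        break;
                default:
                        break;
                }
                return true;
        }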
    • KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES · 5265713a
      Tom Lendacky committed
      Since many of the registers used by an SEV-ES guest are encrypted and
      cannot be read or written, adjust __get_sregs() / __set_sregs() to take
      into account whether the VMSA/guest state is encrypted.
      
      For __get_sregs(), return the actual value that is in use by the guest
      for all registers being tracked using the write trap support.
      
      For __set_sregs(), skip setting of all guest registers values.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <23051868db76400a9b07a2020525483a1e62dbcf.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
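      A sketch of the read side, keyed off the encrypted-state flag used by
      this series (guest_state_protected is assumed here; only registers
      mirrored via write traps are reported):
      
        static void __get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
        {
                if (vcpu->arch.guest_state_protected) {
                        /* Report only values tracked via write traps,
                         * e.g. CR0/CR4 and EFER. */
                        sregs->cr0  = kvm_read_cr0(vcpu);
                        sregs->cr4  = kvm_read_cr4(vcpu);
                        sregs->efer = vcpu->arch.efer;
                        return;
                }
                /* ... unencrypted guests: read the full register state ... */
        }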
    • KVM: SVM: Add support for CR4 write traps for an SEV-ES guest · 5b51cb13
      Tom Lendacky committed
      For SEV-ES guests, the interception of control register write access
      is not recommended. Control register interception occurs prior to the
      control register being modified and the hypervisor is unable to modify
      the control register itself because the register is located in the
      encrypted register state.
      
      SEV-ES guests introduce new control register write traps. These traps
      provide intercept support of a control register write after the control
      register has been modified. The new control register value is provided in
      the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
      of the guest control registers.
      
      Add support to track the value of the guest CR4 register using the control
      register write trap so that the hypervisor understands the guest operating
      mode.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <c3880bf2db8693aa26f648528fbc6e967ab46e25.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Add support for CR0 write traps for an SEV-ES guest · f27ad38a
      Tom Lendacky committed
      For SEV-ES guests, the interception of control register write access
      is not recommended. Control register interception occurs prior to the
      control register being modified and the hypervisor is unable to modify
      the control register itself because the register is located in the
      encrypted register state.
      
      SEV-ES support introduces new control register write traps. These traps
      provide intercept support of a control register write after the control
      register has been modified. The new control register value is provided in
      the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
      of the guest control registers.
      
      Add support to track the value of the guest CR0 register using the control
      register write trap so that the hypervisor understands the guest operating
      mode.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <182c9baf99df7e40ad9617ff90b84542705ef0d7.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
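      Both the CR0 and CR4 traps (this commit and the previous one) funnel
      into one handler shape; a condensed sketch, with the SVM_EXIT_CR*_WRITE_TRAP
      exit codes from the series and illustrative kvm_track_cr* helpers:
      
        static int cr_trap(struct vcpu_svm *svm)
        {
                /* The trap fires after the write: EXITINFO1 already holds
                 * the new control register value. */
                unsigned long val = svm->vmcb->control.exit_info_1;
      
                switch (svm->vmcb->control.exit_code) {
                case SVM_EXIT_CR0_WRITE_TRAP:
                        kvm_track_cr0(&svm->vcpu, val);
                        break;
                case SVM_EXIT_CR4_WRITE_TRAP:
                        kvm_track_cr4(&svm->vcpu, val);
                        break;
                }
                return kvm_complete_insn_gp(&svm->vcpu, 0);
        }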
    • KVM: SVM: Support string IO operations for an SEV-ES guest · 7ed9abfe
      Tom Lendacky committed
      For an SEV-ES guest, string-based port IO is performed to a shared
      (un-encrypted) page so that both the hypervisor and guest can read or
      write to it and each see the contents.
      
      For string-based port IO operations, invoke SEV-ES specific routines that
      can complete the operation using common KVM port IO support.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <9d61daf0ffda496703717218f415cdc8fd487100.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Support MMIO for an SEV-ES guest · 8f423a80
      Tom Lendacky committed
      For an SEV-ES guest, MMIO is performed to a shared (un-encrypted) page
      so that both the hypervisor and guest can read or write to it and each
      see the contents.
      
      The GHCB specification provides software-defined VMGEXIT exit codes to
      indicate a request for an MMIO read or an MMIO write. Add support to
      recognize the MMIO requests and invoke SEV-ES specific routines that
      can complete the MMIO operation. These routines use common KVM support
      to complete the MMIO operation.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <af8de55127d5bcc3253d9b6084a0144c12307d4d.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
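      The VMGEXIT handler dispatches the software-defined codes along these
      lines (a sketch; the shared GHCB scratch area carries the data):
      
        case SVM_VMGEXIT_MMIO_READ:
                ret = kvm_sev_es_mmio_read(vcpu, control->exit_info_1,
                                           control->exit_info_2,
                                           svm->ghcb_sa);
                break;
        case SVM_VMGEXIT_MMIO_WRITE:
                ret = kvm_sev_es_mmio_write(vcpu, control->exit_info_1,
                                            control->exit_info_2,
                                            svm->ghcb_sa);
                break;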
    • KVM: SVM: Create trace events for VMGEXIT MSR protocol processing · 59e38b58
      Tom Lendacky committed
      Add trace events for entry to and exit from VMGEXIT MSR protocol
      processing. The vCPU will be common for the trace events. The MSR
      protocol processing is guided by the GHCB GPA in the VMCB, so the GHCB
      GPA will represent the input and output values for the entry and exit
      events, respectively. Additionally, the exit event will contain the
      return code for the event.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <c5b3b440c3e0db43ff2fc02813faa94fa54896b0.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Create trace events for VMGEXIT processing · d523ab6b
      Tom Lendacky committed
      Add trace events for entry to and exit from VMGEXIT processing. The vCPU
      id and the exit reason will be common for the trace events. The exit info
      fields will represent the input and output values for the entry and exit
      events, respectively.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <25357dca49a38372e8f483753fb0c1c2a70a6898.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Prevent debugging under SEV-ES · 8d4846b9
      Tom Lendacky committed
      Since the guest register state of an SEV-ES guest is encrypted, debugging
      is not supported. Update the code to prevent guest debugging when the
      guest has protected state.
      
      Additionally, an SEV-ES guest must only and always intercept DR7 reads and
      writes. Update set_dr_intercepts() and clr_dr_intercepts() to account for
      this.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <8db966fa2f9803d6454ce773863025d0e2e7f3cc.1607620209.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Add required changes to support intercepts under SEV-ES · f1c6366e
      Tom Lendacky committed
      When a guest is running under SEV-ES, the hypervisor cannot access the
      guest register state. There are numerous places in the KVM code where
      certain registers are accessed that are not allowed to be accessed (e.g.
      RIP, CR0, etc.). Add checks to prevent register accesses and add intercept
      update support at various points within the KVM code.
      
      Also, when handling a VMGEXIT, exceptions are passed back through the
      GHCB. Since the RDMSR/WRMSR intercepts (may) inject a #GP on error,
      update the SVM intercepts to handle this for SEV-ES guests.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      [Redo MSR part using the .complete_emulated_msr callback. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: introduce complete_emulated_msr callback · f9a4d621
      Paolo Bonzini committed
      This will be used by SEV-ES to inject MSR failure via the GHCB.
      Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: use kvm_complete_insn_gp in emulating RDMSR/WRMSR · 8b474427
      Paolo Bonzini committed
      Simplify the four functions that handle {kernel,user} {rd,wr}msr. There
      is still some repetition between the two instances of rdmsr, but the
      whole business of calling kvm_inject_gp and kvm_skip_emulated_instruction
      can be unified nicely.
      
      Because complete_emulated_wrmsr now becomes essentially a call to
      kvm_complete_insn_gp, remove complete_emulated_msr.
      Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
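      After the cleanup, the rdmsr completion reduces to a sketch like the
      following, with kvm_complete_insn_gp() deciding between #GP injection
      and the instruction skip:
      
        static int complete_emulated_rdmsr(struct kvm_vcpu *vcpu)
        {
                if (!vcpu->run->msr.error) {
                        kvm_rax_write(vcpu, (u32)vcpu->run->msr.data);
                        kvm_rdx_write(vcpu, vcpu->run->msr.data >> 32);
                }
                return kvm_complete_insn_gp(vcpu, vcpu->run->msr.error);
        }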
  8. 27 Nov 2020 (1 commit)
    • KVM: x86: Fix split-irqchip vs interrupt injection window request · 71cc849b
      Paolo Bonzini committed
      kvm_cpu_accept_dm_intr and kvm_vcpu_ready_for_interrupt_injection are
      a hodge-podge of conditions, hacked together to get something that
      more or less works.  But what is actually needed is much simpler;
      in both cases the fundamental question is, do we have a place to stash
      an interrupt if userspace does KVM_INTERRUPT?
      
      In userspace irqchip mode, that is !vcpu->arch.interrupt.injected.
      Currently kvm_event_needs_reinjection(vcpu) covers it, but it is
      unnecessarily restrictive.
      
      In split irqchip mode it's a bit more complicated, we need to check
      kvm_apic_accept_pic_intr(vcpu) (the IRQ window exit is basically an INTACK
      cycle and thus requires ExtINTs not to be masked) as well as
      !pending_userspace_extint(vcpu).  However, there is no need to
      check kvm_event_needs_reinjection(vcpu), since split irqchip keeps
      pending ExtINT state separate from event injection state, and checking
      kvm_cpu_has_interrupt(vcpu) is wrong too since ExtINT has higher
      priority than APIC interrupts.  In fact the latter fixes a bug:
      when userspace requests an IRQ window vmexit, an interrupt in the
      local APIC can cause kvm_cpu_has_interrupt() to be true and thus
      kvm_vcpu_ready_for_interrupt_injection() to return false.  When this
      happens, vcpu_run does not exit to userspace but the interrupt window
      vmexits keep occurring.  The VM loops without any hope of making progress.
      
      Once we try to fix these with something like
      
           return kvm_arch_interrupt_allowed(vcpu) &&
      -        !kvm_cpu_has_interrupt(vcpu) &&
      -        !kvm_event_needs_reinjection(vcpu) &&
      -        kvm_cpu_accept_dm_intr(vcpu);
      +        (!lapic_in_kernel(vcpu)
      +         ? !vcpu->arch.interrupt.injected
      +         : (kvm_apic_accept_pic_intr(vcpu)
      +            && !pending_userspace_extint(v)));
      
      we realize two things.  First, thanks to the previous patch the complex
      conditional can reuse !kvm_cpu_has_extint(vcpu).  Second, the interrupt
      window request in vcpu_enter_guest()
      
              bool req_int_win =
                      dm_request_for_irq_injection(vcpu) &&
                      kvm_cpu_accept_dm_intr(vcpu);
      
      should be kept in sync with kvm_vcpu_ready_for_interrupt_injection():
      it is unnecessary to ask the processor for an interrupt window
      if we would not be able to return to userspace.  Therefore,
      kvm_cpu_accept_dm_intr(vcpu) is basically !kvm_cpu_has_extint(vcpu)
      ANDed with the existing check for masked ExtINT.  It all makes sense:
      
      - we can accept an interrupt from userspace if there is a place
        to stash it (and, for irqchip split, ExtINTs are not masked).
        Interrupts from userspace _can_ be accepted even if right now
        EFLAGS.IF=0.
      
      - in order to tell userspace we will inject its interrupt ("IRQ
        window open" i.e. kvm_vcpu_ready_for_interrupt_injection), both
        KVM and the vCPU need to be ready to accept the interrupt.
      
      ... and this is what the patch implements.
      Reported-by: David Woodhouse <dwmw@amazon.co.uk>
      Analyzed-by: David Woodhouse <dwmw@amazon.co.uk>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Nikos Tsironis <ntsironis@arrikto.com>
      Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
      Tested-by: David Woodhouse <dwmw@amazon.co.uk>
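      Reconstructed from the description, the resulting checks are close to:
      
        static int kvm_cpu_accept_dm_intr(struct kvm_vcpu *vcpu)
        {
                /* A pending ExtINT leaves no place to stash the interrupt. */
                if (kvm_cpu_has_extint(vcpu))
                        return false;
      
                /* Acknowledging ExtINT does not happen if LINT0 is masked. */
                return !lapic_in_kernel(vcpu) ||
                        kvm_apic_accept_pic_intr(vcpu);
        }
      
        static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
        {
                return kvm_arch_interrupt_allowed(vcpu) &&
                        kvm_cpu_accept_dm_intr(vcpu);
        }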
  9. 15 Nov 2020 (6 commits)
    • kvm: x86: Sink cpuid update into vendor-specific set_cr4 functions · 2259c17f
      Jim Mattson committed
      On emulated VM-entry and VM-exit, update the CPUID bits that reflect
      CR4.OSXSAVE and CR4.PKE.
      
      This fixes a bug where the CPUID bits could continue to reflect L2 CR4
      values after emulated VM-exit to L1. It also fixes a related bug where
      the CPUID bits could continue to reflect L1 CR4 values after emulated
      VM-entry to L2. The latter bug is mainly relevant to SVM, wherein
      CPUID is not a required intercept. However, it could also be relevant
      to VMX, because the code to conditionally update these CPUID bits
      assumes that the guest CPUID and the guest CR4 are always in sync.
      
      Fixes: 8eb3f87d ("KVM: nVMX: fix guest CR4 loading when emulating L2 to L1 exit")
      Fixes: 2acf923e ("KVM: VMX: Enable XSAVE/XRSTOR for guest")
      Fixes: b9baba86 ("KVM, pkeys: expose CPUID/CR4 to guest")
      Reported-by: Abhiroop Dabral <adabral@paloaltonetworks.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Cc: Haozhong Zhang <haozhong.zhang@intel.com>
      Cc: Dexuan Cui <dexuan.cui@intel.com>
      Cc: Huaitong Han <huaitong.han@intel.com>
      Message-Id: <20201029170648.483210-1-jmattson@google.com>
    • KVM: X86: Implement ring-based dirty memory tracking · fb04a1ed
      Peter Xu committed
      This patch is heavily based on previous work from Lei Cao
      <lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
      
      KVM currently uses large bitmaps to track dirty memory.  These bitmaps
      are copied to userspace when userspace queries KVM for its dirty page
      information.  The use of bitmaps is mostly sufficient for live
      migration, as large parts of memory are dirtied from one log-dirty
      pass to another.  However, in a checkpointing system, the number of
      dirty pages is small and in fact it is often bounded: the VM is
      paused when it has dirtied a pre-defined number of pages. Traversing a
      large, sparsely populated bitmap to find set bits is time-consuming,
      as is copying the bitmap to user-space.
      
      A similar issue will be there for live migration when the guest memory
      is huge while the page dirty procedure is trivial.  In that case for
      each dirty sync we need to pull the whole dirty bitmap to userspace
      and analyse every bit even if it's mostly zeros.
      
      The preferred data structure for the above scenarios is a dense list of
      guest frame numbers (GFN).  This patch series stores the dirty list in
      kernel memory that can be memory mapped into userspace to allow speedy
      harvesting.
      
      This patch enables dirty ring for X86 only.  However it should be
      easily extended to other archs as well.
      
      [1] https://patchwork.kernel.org/patch/10471409/
      Signed-off-by: Lei Cao <lei.cao@stratus.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012222.5767-1-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
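      The mmap'able ring entry is a small fixed-size record per dirtied guest
      frame (uapi layout as introduced by this series):
      
        struct kvm_dirty_gfn {
                __u32 flags;    /* KVM_DIRTY_GFN_F_DIRTY / _F_RESET */
                __u32 slot;     /* (as_id << 16) | slot_id */
                __u64 offset;   /* offset, in pages, within the memslot */
        };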
    • KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] · ff5a983c
      Peter Xu committed
      Originally, we have three code paths that can dirty a page without a
      vcpu context on X86:
      
        - init_rmode_identity_map
        - init_rmode_tss
        - kvmgt_rw_gpa
      
      init_rmode_identity_map and init_rmode_tss will be set up on the
      destination VM no matter what (and the guest cannot even see them), so
      it does not make sense to track them at all.
      
      To do this, allow __x86_set_memory_region() to return the userspace
      address that was just allocated to the caller.  Then in both of the
      functions we directly write to the userspace address instead of
      calling kvm_write_*() APIs.
      
      Another trivial change is that we don't need to explicitly clear the
      identity page table root in init_rmode_identity_map() because no
      matter what we'll write to the whole page with 4M huge page entries.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20201001012044.5151-4-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyper-v: allow KVM_GET_SUPPORTED_HV_CPUID as a system ioctl · c21d54f0
      Vitaly Kuznetsov committed
      KVM_GET_SUPPORTED_HV_CPUID is a vCPU ioctl, but its output is now
      independent of the vCPU, and in some cases VMMs may want to use it as a
      system ioctl instead. In particular, QEMU does CPU feature expansion
      before any vCPU gets created, so KVM_GET_SUPPORTED_HV_CPUID can't be used.
      
      Convert KVM_GET_SUPPORTED_HV_CPUID to 'dual' system/vCPU ioctl with the
      same meaning.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200929150944.1235688-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Return bool instead of int for CR4 and SREGS validity checks · ee69c92b
      Sean Christopherson committed
      Rework the common CR4 and SREGS checks to return a bool instead of an
      int, i.e. true/false instead of 0/-EINVAL, and add "is" to the name to
      clarify the polarity of the return value (which is effectively inverted
      by this change).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20201007014417.29276-6-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move vendor CR4 validity check to dedicated kvm_x86_ops hook · c2fe3cd4
      Sean Christopherson committed
      Split out VMX's checks on CR4.VMXE to a dedicated hook, .is_valid_cr4(),
      and invoke the new hook from kvm_valid_cr4().  This fixes an issue where
      KVM_SET_SREGS would return success while failing to actually set CR4.
      
      Fixing the issue by explicitly checking kvm_x86_ops.set_cr4()'s return
      in __set_sregs() is not a viable option as KVM has already stuffed a
      variety of vCPU state.
      
      Note, kvm_valid_cr4() and is_valid_cr4() have different return types and
      inverted semantics.  This will be remedied in a future patch.
      
      Fixes: 5e1746d6 ("KVM: nVMX: Allow setting the VMXE bit in CR4")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20201007014417.29276-5-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  10. 13 Nov 2020 (1 commit)
    • KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch · 0107973a
      Babu Moger committed
      SEV guests fail to boot on a system that supports the PCID feature.
      
      While emulating the RSM instruction, KVM reads the guest CR3
      and calls kvm_set_cr3(). If the vCPU is in the long mode,
      kvm_set_cr3() does a sanity check for the CR3 value. In this case,
      it validates whether the value has any reserved bits set. The
      reserved bit range is 63:cpuid_maxphyaddr(). When AMD memory
      encryption is enabled, the memory encryption bit is set in the CR3
      value. The memory encryption bit may fall within the KVM reserved
      bit range, causing a KVM emulation failure.
      
      Introduce a new field, cr3_lm_rsvd_bits, in kvm_vcpu_arch which will
      cache the reserved bits in the CR3 value. It will be initialized
      to rsvd_bits(cpuid_maxphyaddr(vcpu), 63).
      
      Any architecture-specific bits (like the AMD SEV encryption bit) that
      need to be masked from the reserved bits should be cleared in the
      vendor-specific kvm_x86_ops.vcpu_after_set_cpuid handler.
      
      Fixes: a780a3ea ("KVM: X86: Fix reserved bits check for MOV to CR3")
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Message-Id: <160521947657.32054.3264016688005356563.stgit@bmoger-ubuntu>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
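      Sketched from the description, the default mask and the SVM-side
      adjustment look like this (the C-bit position comes from CPUID
      0x8000001F EBX[5:0]):
      
        /* Common code: cache the default long-mode CR3 reserved bits. */
        vcpu->arch.cr3_lm_rsvd_bits = rsvd_bits(cpuid_maxphyaddr(vcpu), 63);
      
        /* SVM's vcpu_after_set_cpuid: the encryption bit is not reserved. */
        if (sev_guest(vcpu->kvm)) {
                best = kvm_find_cpuid_entry(vcpu, 0x8000001F, 0);
                if (best)
                        vcpu->arch.cr3_lm_rsvd_bits &=
                                ~(1UL << (best->ebx & 0x3f));
        }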
  11. 08 Nov 2020 (4 commits)