1. 18 Jun, 2019 17 commits
    • KVM: nVMX: Intercept VMWRITEs to GUEST_{CS,SS}_AR_BYTES · b6437805
      Sean Christopherson committed
      VMMs frequently read the guest's CS and SS AR bytes to detect 64-bit
      mode and CPL respectively, but effectively never write said fields once
      the VM is initialized.  Intercepting VMWRITEs for the two fields saves
      ~55 cycles in copy_shadow_to_vmcs12().
      
      Because some Intel CPUs, e.g. Haswell, drop the reserved bits of the
      guest access rights fields on VMWRITE, exposing the fields to L1 for
      VMREAD but not VMWRITE leads to inconsistent behavior between L1 and L2.
      On hardware that drops the bits, L1 will see the stripped down value due
      to reading the value from hardware, while L2 will see the full original
      value as stored by KVM.  To avoid such an inconsistency, emulate the
      behavior on all CPUs, but only for intercepted VMWRITEs so as to avoid
      introducing pointless latency into copy_shadow_to_vmcs12(), e.g. if the
      emulation were added to vmcs12_write_any().
      
      Since the AR_BYTES emulation is done only for intercepted VMWRITE, if a
      future patch (re)exposed AR_BYTES for both VMWRITE and VMREAD, then KVM
      would end up with inconsistent behavior on pre-Haswell hardware, e.g. KVM
      would drop the reserved bits on intercepted VMWRITE, but direct VMWRITE
      to the shadow VMCS would not drop the bits.  Add a WARN in the shadow
      field initialization to detect any attempt to expose an AR_BYTES field
      without updating vmcs12_write_any().
      
      Note, emulation of the AR_BYTES reserved bit behavior is based on a
      patch[1] from Jim Mattson that applied the emulation to all writes to
      vmcs12 so that live migration across different generations of hardware
      would not introduce divergent behavior.  But given that live migration
      of nested state has already been enabled, that ship has sailed (not to
      mention that no sane VMM will be affected by this behavior).
      
      [1] https://patchwork.kernel.org/patch/10483321/
      
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
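The reserved-bit emulation described in this commit can be illustrated with a minimal sketch. It is not the patch itself: the helper name `emulate_vmwrite_value()` is hypothetical, and the mask is derived from the access-rights layout in the Intel SDM (bits 7:0 and 16:12 defined, bits 11:8 and 31:17 reserved). The field encodings are the SDM values for the guest segment access-rights fields.

```c
#include <assert.h>
#include <stdint.h>

/* VMCS encodings of guest segment access-rights fields (Intel SDM). */
#define GUEST_ES_AR_BYTES 0x4814u
#define GUEST_CS_AR_BYTES 0x4816u
#define GUEST_SS_AR_BYTES 0x4818u
#define GUEST_TR_AR_BYTES 0x4822u

/*
 * Defined access-rights bits: 7:0 (type, S, DPL, P), 15:12 (AVL, L, D/B, G)
 * and 16 (unusable). Everything else is reserved.
 */
#define AR_BYTES_DEFINED_MASK 0x1f0ffu

/*
 * Hypothetical helper: compute the value to store for an intercepted
 * VMWRITE, dropping reserved bits of access-rights fields the way some
 * CPUs (e.g. Haswell, per the commit message) do in hardware.
 */
static uint32_t emulate_vmwrite_value(uint32_t field, uint32_t value)
{
	if (field >= GUEST_ES_AR_BYTES && field <= GUEST_TR_AR_BYTES)
		value &= AR_BYTES_DEFINED_MASK;
	return value;
}
```

With this sketch, writing all ones to GUEST_CS_AR_BYTES stores 0x1f0ff, while a field outside the access-rights range is stored unmodified.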
    • KVM: nVMX: Intercept VMWRITEs to read-only shadow VMCS fields · fadcead0
      Sean Christopherson committed
      Allowing L1 to VMWRITE read-only fields is only beneficial in a double
      nesting scenario, e.g. no sane VMM will VMWRITE VM_EXIT_REASON in normal
      non-nested operation.  Intercepting RO fields means KVM doesn't need to
      sync them from the shadow VMCS to vmcs12 when running L2.  The obvious
      downside is that L1 will VM-Exit more often when running L3, but it's
      likely safe to assume most folks would happily sacrifice a bit of L3
      performance, which may not even be noticeable in the grand scheme, to
      improve L2 performance across the board.
      
      Not intercepting fields tagged read-only also allows for additional
      optimizations, e.g. marking GUEST_{CS,SS}_AR_BYTES as SHADOW_FIELD_RO
      since those fields are rarely written by VMMs, but read frequently.
      
      When utilizing a shadow VMCS with asymmetric R/W and R/O bitmaps, fields
      that cause VM-Exit on VMWRITE but not VMREAD need to be propagated to
      the shadow VMCS during VMWRITE emulation, otherwise a subsequent VMREAD
      from L1 will consume a stale value.
      
      Note, KVM currently utilizes asymmetric bitmaps when "VMWRITE any field"
      is not exposed to L1, but only so that it can reject the VMWRITE, i.e.
      propagating the VMWRITE to the shadow VMCS is a new requirement, not a
      bug fix.
      
      Eliminating the copying of RO fields reduces the latency of nested
      VM-Entry (copy_shadow_to_vmcs12()) by ~100 cycles (plus 40-50 cycles
      if/when the AR_BYTES fields are exposed RO).
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
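The propagation requirement for asymmetric bitmaps can be sketched with a toy model. Everything here is an assumption made for illustration: the `toy_` names, the four-entry field arrays, and the boolean "shadowed" flags stand in for the real VMREAD/VMWRITE bitmaps and are not KVM's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { NR_FIELDS = 4 };  /* toy field indices, not real VMCS encodings */

struct toy_vmcs_pair {
	uint64_t vmcs12[NR_FIELDS];        /* software copy of L1's VMCS */
	uint64_t shadow[NR_FIELDS];        /* shadow VMCS backing exit-less access */
	bool vmread_shadowed[NR_FIELDS];   /* VMREAD does not VM-Exit */
	bool vmwrite_shadowed[NR_FIELDS];  /* VMWRITE does not VM-Exit */
};

/* Emulate a VMWRITE from L1 that was intercepted (caused a VM-Exit). */
static void toy_handle_vmwrite(struct toy_vmcs_pair *v, unsigned int field,
			       uint64_t value)
{
	v->vmcs12[field] = value;

	/*
	 * Read-shadowed but write-intercepted: push the new value into the
	 * shadow VMCS so a later exit-less VMREAD does not see stale data.
	 */
	if (v->vmread_shadowed[field] && !v->vmwrite_shadowed[field])
		v->shadow[field] = value;
}

/* VMREAD as seen by L1: served from the shadow VMCS when shadowed. */
static uint64_t toy_vmread(const struct toy_vmcs_pair *v, unsigned int field)
{
	return v->vmread_shadowed[field] ? v->shadow[field] : v->vmcs12[field];
}
```

Without the propagation step in `toy_handle_vmwrite()`, a read-only shadowed field would keep returning its old shadow value after an intercepted write.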
    • KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn · 95b5a48c
      Sean Christopherson committed
      Per commit 1b6269db ("KVM: VMX: Handle NMIs before enabling
      interrupts and preemption"), NMIs are handled directly in vmx_vcpu_run()
      to "make sure we handle NMI on the current cpu, and that we don't
      service maskable interrupts before non-maskable ones".  The other
      exceptions handled by complete_atomic_exit(), e.g. async #PF and #MC,
      have similar requirements, and are located there to avoid extra VMREADs
      since VMX bins hardware exceptions and NMIs into a single exit reason.
      
      Clean up the code and eliminate the vaguely named complete_atomic_exit()
      by moving the interrupts-disabled exception and NMI handling into the
      existing handle_external_intrs() callback, and rename the callback to
      a more appropriate name.  Rename VMexit handlers throughout so that the
      atomic and non-atomic counterparts have similar names.
      
      In addition to improving code readability, this also ensures the NMI
      handler is run with the host's debug registers loaded in the unlikely
      event that the user is debugging NMIs.  Accuracy of the last_guest_tsc
      field is also improved when handling NMIs (and #MCs) as the handler
      will run after updating said field.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Naming cleanups. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move kvm_{before,after}_interrupt() calls to vendor code · 165072b0
      Sean Christopherson committed
      VMX can conditionally call kvm_{before,after}_interrupt() since KVM
      always uses "ack interrupt on exit" and therefore explicitly handles
      interrupts as opposed to blindly enabling irqs.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Store the host kernel's IDT base in a global variable · 2342080c
      Sean Christopherson committed
      Although the kernel may use multiple IDTs, KVM should only ever see the
      "real" IDT, e.g. the early init IDT is long gone by the time KVM runs
      and the debug stack IDT is only used for small windows of time in very
      specific flows.
      
      Before commit a547c6db ("KVM: VMX: Enable acknowledge interupt on
      vmexit"), the kernel's IDT base was consumed by KVM only when setting
      constant VMCS state, i.e. to set VMCS.HOST_IDTR_BASE.  Because constant
      host state is done once per vCPU, there was ostensibly no need to cache
      the kernel's IDT base.
      
      When support for "ack interrupt on exit" was introduced, KVM added a
      second consumer of the IDT base as handling already-acked interrupts
      requires directly calling the interrupt handler, i.e. KVM uses the IDT
      base to find the address of the handler.  Because interrupts are a fast
      path, KVM cached the IDT base to avoid having to VMREAD HOST_IDTR_BASE.
      Presumably, the IDT base was cached on a per-vCPU basis simply because
      the existing code grabbed the IDT base on a per-vCPU (VMCS) basis.
      
      Note, all post-boot IDTs use the same handlers for external interrupts,
      i.e. the "ack interrupt on exit" use of the IDT base would be unaffected
      even if the cached IDT somehow did not match the current IDT.  And as
      for the original use case of setting VMCS.HOST_IDTR_BASE, if any of the
      above analysis is wrong then KVM has had a bug since the beginning of
      time since KVM has effectively been caching the IDT at vCPU creation
      since commit a8b732ca01c ("[PATCH] kvm: userspace interface").
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Read cached VM-Exit reason to detect external interrupt · 49def500
      Sean Christopherson committed
      Generic x86 code invokes the kvm_x86_ops external interrupt handler on
      all VM-Exits regardless of the actual exit type.  Use the already-cached
      EXIT_REASON to determine if the VM-Exit was due to an interrupt, thus
      avoiding an extra VMREAD (to query VM_EXIT_INTR_INFO) for all other
      types of VM-Exit.
      
      In addition to avoiding the extra VMREAD, checking the EXIT_REASON
      instead of VM_EXIT_INTR_INFO makes it more obvious that
      vmx_handle_external_intr() is called for all VM-Exits, e.g. someone
      unfamiliar with the flow might wonder under what condition(s)
      VM_EXIT_INTR_INFO does not contain a valid interrupt, which is
      simply not possible since KVM always runs with "ack interrupt on exit".
      
      WARN once if VM_EXIT_INTR_INFO doesn't contain a valid interrupt on
      an EXTERNAL_INTERRUPT VM-Exit, as such a condition would indicate a
      hardware bug.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
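The saving from checking the cached exit reason first can be shown with a small model. The `toy_` structure and functions are hypothetical; the VMREAD is simulated with a counter so the effect is observable. The only value taken from the SDM is basic exit reason 1 for an external interrupt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXIT_REASON_EXTERNAL_INTERRUPT 1u  /* basic exit reason per the SDM */

struct toy_vcpu {
	uint32_t exit_reason;  /* cached once after every VM-Exit */
	uint32_t intr_info;    /* stands in for VMCS.VM_EXIT_INTR_INFO */
	unsigned int vmreads;  /* number of simulated VMREADs */
};

/* Simulated VMREAD of VM_EXIT_INTR_INFO; counts how often it is issued. */
static uint32_t toy_vmread_intr_info(struct toy_vcpu *vcpu)
{
	vcpu->vmreads++;
	return vcpu->intr_info;
}

/*
 * Called on every VM-Exit. Returns true and fills *intr_info only when the
 * exit was caused by an external interrupt; all other exits return early
 * without touching the VMCS.
 */
static bool toy_handle_external_intr(struct toy_vcpu *vcpu, uint32_t *intr_info)
{
	if (vcpu->exit_reason != EXIT_REASON_EXTERNAL_INTERRUPT)
		return false;

	*intr_info = toy_vmread_intr_info(vcpu);
	return true;
}
```

In this model, an exit for any other reason leaves the VMREAD counter at zero, which is the extra read the commit avoids.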
    • kvm: nVMX: small cleanup in handle_exception · 2ea72039
      Paolo Bonzini committed
      The reason for skipping handling of NMI and #MC in handle_exception is
      the same, namely they are handled earlier by vmx_complete_atomic_exit.
      Calling the machine check handler (which just returns 1) is misleading;
      don't do it.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Fix handling of #MC that occurs during VM-Entry · beb8d93b
      Sean Christopherson committed
      A previous fix to prevent KVM from consuming stale VMCS state after a
      failed VM-Entry inadvertently blocked KVM's handling of machine checks
      that occur during VM-Entry.
      
      Per Intel's SDM, a #MC during VM-Entry is handled in one of three ways,
      depending on when the #MC is recognized.  As it pertains to this bug
      fix, the third case explicitly states EXIT_REASON_MCE_DURING_VMENTRY
      is handled like any other VM-Exit during VM-Entry, i.e. sets bit 31 to
      indicate the VM-Entry failed.
      
      If a machine-check event occurs during a VM entry, one of the following occurs:
       - The machine-check event is handled as if it occurred before the VM entry:
              ...
       - The machine-check event is handled after VM entry completes:
              ...
       - A VM-entry failure occurs as described in Section 26.7. The basic
         exit reason is 41, for "VM-entry failure due to machine-check event".
      
      Explicitly handle EXIT_REASON_MCE_DURING_VMENTRY as a one-off case in
      vmx_vcpu_run() instead of binning it into vmx_complete_atomic_exit().
      Doing so allows vmx_vcpu_run() to handle VMX_EXIT_REASONS_FAILED_VMENTRY
      in a sane fashion and also simplifies vmx_complete_atomic_exit() since
      VMCS.VM_EXIT_INTR_INFO is guaranteed to be fresh.
      
      Fixes: b060ca3b ("kvm: vmx: Handle VMLAUNCH/VMRESUME failure properly")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
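The interaction between the failed-VM-Entry flag and the machine-check exit reason can be sketched as follows. The two constants follow the commit message and the SDM excerpt (basic exit reason 41, bit 31 set on VM-entry failure); the `toy_` helpers are hypothetical and only illustrate why the #MC check has to happen before bailing out on a failed entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXIT_REASON_MCE_DURING_VMENTRY  41u
#define VMX_EXIT_REASONS_FAILED_VMENTRY 0x80000000u

/* The basic exit reason lives in the low 16 bits of the exit reason. */
static bool toy_is_mce_during_vmentry(uint32_t exit_reason)
{
	return (uint16_t)exit_reason == EXIT_REASON_MCE_DURING_VMENTRY;
}

/* Bit 31 indicates that the VM-Entry itself failed. */
static bool toy_is_failed_vmentry(uint32_t exit_reason)
{
	return (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) != 0;
}

/*
 * Returns true if the machine-check handler should run. The #MC check is
 * made first, so a failed VM-Entry caused by a #MC is not silently skipped
 * by the early return that follows.
 */
static bool toy_should_handle_mce(uint32_t exit_reason)
{
	if (toy_is_mce_during_vmentry(exit_reason))
		return true;
	if (toy_is_failed_vmentry(exit_reason))
		return false;  /* other failed entries: nothing more to inspect */
	return false;
}
```

An exit reason of `0x80000000 | 41` is both a failed VM-Entry and a machine-check event, which is the case the earlier fix had inadvertently masked.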
    • KVM: x86: move MSR_IA32_POWER_CTL handling to common code · 73f624f4
      Paolo Bonzini committed
      Make it available to AMD hosts as well, just in case someone is trying
      to use an Intel processor's CPUID setup.
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: offset is ensure to be in range · 4cb8b116
      Wei Yang committed
      In function apic_mmio_write(), the offset has been checked in:
      
         * apic_mmio_in_range()
         * offset & 0xf
      
      These two checks ensure the offset is in the range [0x010, 0xff0].
      Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: use same convention to name kvm_lapic_{set,clear}_vector() · ee171d2f
      Wei Yang committed
      apic_clear_vector() is the counterpart of kvm_lapic_set_vector(),
      but the two follow different naming conventions.
      
      Rename it and keep the two together in arch/x86/kvm/lapic.h. Also fix
      one typo in a comment by hand.
      Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: check kvm_apic_sw_enabled() is enough · 7d2296bf
      Wei Yang committed
      When delivering an irq to an apic, we iterate over the vcpus and
      perform the checks like this:
      
          kvm_apic_present(vcpu)
          kvm_lapic_enabled(vcpu)
              kvm_apic_present(vcpu) && kvm_apic_sw_enabled(vcpu->arch.apic)
      
      Since we have already checked kvm_apic_present(), it is reasonable to
      replace kvm_lapic_enabled() with kvm_apic_sw_enabled().
      Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
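The redundancy argument in this commit can be checked exhaustively with a toy model. The `toy_` names and the two boolean flags are assumptions for illustration; only the relationship "enabled means present and software-enabled" is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_vcpu {
	bool apic_present;     /* models kvm_apic_present() */
	bool apic_sw_enabled;  /* models kvm_apic_sw_enabled() */
};

/* Models kvm_lapic_enabled() as described above: present && sw_enabled. */
static bool toy_lapic_enabled(const struct toy_vcpu *vcpu)
{
	return vcpu->apic_present && vcpu->apic_sw_enabled;
}

/*
 * Compare the old check (present, then lapic_enabled) with the new one
 * (present, then sw_enabled) over all four flag combinations.
 */
static bool toy_checks_equivalent(void)
{
	for (int present = 0; present <= 1; present++) {
		for (int sw = 0; sw <= 1; sw++) {
			struct toy_vcpu v = { present != 0, sw != 0 };
			bool old_check = v.apic_present && toy_lapic_enabled(&v);
			bool new_check = v.apic_present && v.apic_sw_enabled;

			if (old_check != new_check)
				return false;
		}
	}
	return true;
}
```

Because the presence check already guards the second test, both forms accept exactly the same vcpus in this model.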
    • kvm: x86: add host poll control msrs · 2d5ba19b
      Marcelo Tosatti committed
      Add an MSR which allows the guest to disable
      host polling (specifically, the cpuidle-haltpoll driver,
      when performing polling in the guest, disables
      host-side polling).
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: vmx: segment limit check: use access length · fdb28619
      Eugene Korenevsky committed
      There is an imperfection in get_vmx_mem_address(): access length is ignored
      when checking the limit. To fix this, pass access length as a function argument.
      The access length is usually obvious since it is used by callers after
      get_vmx_mem_address() call, but for vmread/vmwrite it depends on the
      state of 64-bit mode.
      Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: vmx: fix limit checking in get_vmx_mem_address() · c1a9acbc
      Eugene Korenevsky committed
      Intel SDM vol. 3, 5.3:
      The processor causes a
      general-protection exception (or, if the segment is SS, a stack-fault
      exception) any time an attempt is made to access the following addresses
      in a segment:
      - A byte at an offset greater than the effective limit
      - A word at an offset greater than the (effective-limit – 1)
      - A doubleword at an offset greater than the (effective-limit – 3)
      - A quadword at an offset greater than the (effective-limit – 7)
      
      Therefore, the generic limit checking error condition must be
      
      exn = (off > limit + 1 - access_len) = (off + access_len - 1 > limit)
      
      but not
      
      exn = (off + access_len > limit)
      
      as is currently done.
      
      Also avoid integer overflow of `off` on 32-bit KVM by casting it to u64.
      
      Note: access length is currently sizeof(u64) which is incorrect. This
      will be fixed in the subsequent patch.
      Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
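The two limit conditions from this commit message can be compared directly in a short sketch. The function names are hypothetical, `uint32_t` is used to model the offset on a 32-bit build, and `access_len` is assumed to be at least 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Corrected condition: fault if the last byte accessed lies beyond the
 * limit. Widening to 64 bits keeps off + access_len - 1 from wrapping.
 */
static bool toy_limit_violation(uint32_t off, uint32_t limit,
				uint32_t access_len)
{
	return (uint64_t)off + access_len - 1 > limit;
}

/*
 * Previous condition, kept for comparison: off by one, and the 32-bit sum
 * can wrap around for offsets near the top of the address space.
 */
static bool toy_limit_violation_old(uint32_t off, uint32_t limit,
				    uint32_t access_len)
{
	return (uint32_t)(off + access_len) > limit;
}
```

For a doubleword at offset 0xfffc with limit 0xffff, the last byte sits exactly at the limit, so the corrected check allows the access while the old one rejects it. For an offset of 0xfffffffc and an 8-byte access, the old 32-bit sum wraps to 4 and the violation goes undetected.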
    • KVM: x86: Add Intel CPUID.1F cpuid emulation support · a87f2d3a
      Like Xu committed
      Add support to expose the Intel V2 Extended Topology Enumeration Leaf
      for some new systems with multiple software-visible dies within each
      package.
      
      Because unimplemented and unexposed leaves should be explicitly reported
      as zero, there is no need to limit cpuid.0.eax to the maximum value of
      the feature configuration; instead, limit it to the highest leaf
      implemented in the current code. A single clamping seems sufficient and
      cheaper.
      Co-developed-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com>
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use DR_TRAP_BITS instead of hard-coded 15 · 1fc5d194
      Liran Alon committed
      Make all code consistent with kvm_deliver_exception_payload() by using
      appropriate symbolic constant instead of hard-coded number.
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
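As a brief illustration of why the symbolic constant equals the hard-coded 15, the sketch below rebuilds DR_TRAP_BITS from the four DR6 breakpoint-hit bits. The definitions mirror the usual debug-register layout (B0 to B3 in bits 3:0); the `toy_clear_dr6_trap_bits()` helper is hypothetical and is not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* DR6 breakpoint-condition-detected bits B0..B3. */
#define DR_TRAP0 0x1u
#define DR_TRAP1 0x2u
#define DR_TRAP2 0x4u
#define DR_TRAP3 0x8u
#define DR_TRAP_BITS (DR_TRAP0 | DR_TRAP1 | DR_TRAP2 | DR_TRAP3)

/* Hypothetical helper: clear the trap bits, leaving the rest of DR6 intact. */
static uint64_t toy_clear_dr6_trap_bits(uint64_t dr6)
{
	return dr6 & ~(uint64_t)DR_TRAP_BITS;
}
```

Using the named mask makes it clear at the call site that only the four breakpoint-hit bits are being cleared.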
  2. 14 Jun, 2019 1 commit
    • KVM: x86: clean up conditions for asynchronous page fault handling · 1dfdb45e
      Paolo Bonzini committed
      Even when asynchronous page fault is disabled, KVM does not want to pause
      the host if a guest triggers a page fault; instead it will put it into
      an artificial HLT state that allows running other host processes while
      allowing interrupt delivery into the guest.
      
      However, the way this feature is triggered is a bit confusing.
      First, it is not used for page faults while a nested guest is
      running: but this is not an issue since the artificial halt
      is completely invisible to the guest, either L1 or L2.  Second,
      it is used even if kvm_halt_in_guest() returns true; in this case,
      the guest probably should not pay the additional latency cost of the
      artificial halt, and thus we should handle the page fault in a
      completely synchronous way.
      
      By introducing a new function kvm_can_deliver_async_pf, this patch
      commonizes the code that chooses whether to deliver an async page fault
      (kvm_arch_async_page_not_present) and the code that chooses whether a
      page fault should be handled synchronously (kvm_can_do_async_pf).
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 05 Jun, 2019 20 commits
  4. 01 Jun, 2019 2 commits