1. 14 May 2020 (6 commits)
    • KVM: nVMX: Open a window for pending nested VMX preemption timer · d2060bd4
      By Sean Christopherson
      Add a kvm_x86_ops hook to detect a nested pending "hypervisor timer" and
      use it to effectively open a window for servicing the expired timer.
      Like pending SMIs on VMX, opening a window simply means requesting an
      immediate exit.
      
      This fixes a bug where an expired VMX preemption timer (for L2) will be
      delayed and/or lost if a pending exception is injected into L2.  The
      pending exception is rightly prioritized by vmx_check_nested_events()
      and injected into L2, with the preemption timer left pending.  Because
      no window opened, L2 is free to run uninterrupted.
      
      Fixes: f4124500 ("KVM: nVMX: Fully emulate preemption timer")
      Reported-by: Jim Mattson <jmattson@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Peter Shier <pshier@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200423022550.15113-3-sean.j.christopherson@intel.com>
      [Check it in kvm_vcpu_has_events too, to ensure that the preemption
       timer is serviced promptly even if the vCPU is halted and L1 is not
       intercepting HLT. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Preserve exception priority irrespective of exiting behavior · 6ce347af
      By Sean Christopherson
      Short circuit vmx_check_nested_events() if an exception is pending and
      needs to be injected into L2, as priority between coincident events is
      not dependent on exiting behavior.  This fixes a bug where a single-step
      #DB that is not intercepted by L1 is incorrectly dropped due to servicing
      a VMX Preemption Timer VM-Exit.
      
      Injected exceptions also need to be blocked if nested VM-Enter is
      pending or an exception was already injected, otherwise injecting the
      exception could overwrite an existing event injection from L1.
      Technically, this scenario should be impossible, i.e. KVM shouldn't
      inject its own exception during nested VM-Enter.  This will be addressed
      in a future patch.
      
      Note, event priority between SMI, NMI and INTR is incorrect for L2, e.g.
      SMI should take priority over VM-Exit on NMI/INTR, and NMI that is
      injected into L2 should take priority over VM-Exit INTR.  This will also
      be addressed in a future patch.
      
      Fixes: b6b8a145 ("KVM: nVMX: Rework interception of IRQs and NMIs")
      Reported-by: Jim Mattson <jmattson@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Peter Shier <pshier@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200423022550.15113-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Implement check_nested_events for NMI · 9c3d370a
      By Cathy Avery
      Migrate nested guest NMI intercept processing to the new
      check_nested_events mechanism.
      Signed-off-by: Cathy Avery <cavery@redhat.com>
      Message-Id: <20200414201107.22952-2-cavery@redhat.com>
      [Reorder clauses as NMIs have higher priority than IRQs; inject
       immediate vmexit as is now done for IRQ vmexits. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: immediately inject INTR vmexit · 6e085cbf
      By Paolo Bonzini
      We can immediately leave SVM guest mode in svm_check_nested_events
      now that we have the nested_run_pending mechanism.  This makes
      things easier because we can run the rest of inject_pending_event
      with GIF=0, and KVM will naturally end up requesting the next
      interrupt window.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: leave halted state on vmexit · 38c0b192
      By Paolo Bonzini
      Similar to VMX, we need to leave the halted state when performing a vmexit.
      Failure to do so will cause a hang after vmexit.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: introduce nested_run_pending · f74f9414
      By Paolo Bonzini
      We want to inject vmexits immediately from svm_check_nested_events,
      so that the interrupt/NMI window requests happen in inject_pending_event
      right after it returns.
      
      This, however, has the same issue as vmx_check_nested_events, so
      introduce a nested_run_pending flag with the same purpose of delaying
      vmexit injection until after the vmentry.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 13 May 2020 (1 commit)
    • KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c · 37486135
      By Babu Moger
      Though rdpkru and wrpkru are contingent upon CR4.PKE, the PKRU
      resource isn't: it can be read with XSAVE and written with XRSTOR.
      So, if we don't set the guest PKRU value in kvm_load_guest_xsave_state,
      the guest can read the host value.
      
      Conversely, in kvm_load_host_xsave_state, a guest with CR4.PKE clear
      could potentially use XRSTOR to change the host PKRU value.
      
      While at it, move pkru state save/restore to common code and the
      host_pkru field to kvm_vcpu_arch.  This will let SVM support protection keys.
      
      Cc: stable@vger.kernel.org
      Reported-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Message-Id: <158932794619.44260.14508381096663848853.stgit@naples-babu.amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 08 May 2020 (5 commits)
    • KVM: SVM: Disable AVIC before setting V_IRQ · 7d611233
      By Suravee Suthikulpanit
      Commit 64b5bd27 ("KVM: nSVM: ignore L1 interrupt window
      while running L2 with V_INTR_MASKING=1") introduced a WARN_ON
      that checks whether AVIC is enabled when V_IRQ is set
      in the VMCB to open an interrupt window.
      
      The warning below is triggered because the vCPU requesting
      AVIC deactivation does not get to process the APICv update request
      for itself until the next #vmexit.
      
      WARNING: CPU: 0 PID: 118232 at arch/x86/kvm/svm/svm.c:1372 enable_irq_window+0x6a/0xa0 [kvm_amd]
       RIP: 0010:enable_irq_window+0x6a/0xa0 [kvm_amd]
       Call Trace:
        kvm_arch_vcpu_ioctl_run+0x6e3/0x1b50 [kvm]
        ? kvm_vm_ioctl_irq_line+0x27/0x40 [kvm]
        ? _copy_to_user+0x26/0x30
        ? kvm_vm_ioctl+0xb3e/0xd90 [kvm]
        ? set_next_entity+0x78/0xc0
        kvm_vcpu_ioctl+0x236/0x610 [kvm]
        ksys_ioctl+0x8a/0xc0
        __x64_sys_ioctl+0x1a/0x20
        do_syscall_64+0x58/0x210
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fix this by sending the APICv update request to all other vCPUs, and
      updating APICv immediately for the requesting vCPU itself.
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Link: https://lkml.org/lkml/2020/5/2/167
      Fixes: 64b5bd27 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1")
      Message-Id: <1588818939-54264-1-git-send-email-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Introduce kvm_make_all_cpus_request_except() · 54163a34
      By Suravee Suthikulpanit
      This allows making a request to all vCPUs except the one
      specified as a parameter.
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Message-Id: <1588771076-73790-2-git-send-email-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: pass correct DR6 for GD userspace exit · 45981ded
      By Paolo Bonzini
      When KVM_EXIT_DEBUG is raised for the disabled-breakpoints case (DR7.GD),
      DR6 was incorrectly copied from the value in the VM.  Instead,
      DR6.BD should be set in order to catch this case.
      
      On AMD this does not need any special code because the processor triggers
      a #DB exception that is intercepted.  However, the testcase would fail
      without the previous patch because both DR6.BS and DR6.BD would be set.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6 · d67668e9
      By Paolo Bonzini
      There are two issues with KVM_EXIT_DEBUG on AMD, whose root cause is the
      different handling of DR6 on intercepted #DB exceptions on Intel and AMD.
      
      On Intel, #DB exceptions transmit the DR6 value via the exit qualification
      field of the VMCS, and the exit qualification only contains the description
      of the precise event that caused a vmexit.
      
      On AMD, instead, the DR6 field of the VMCB is filled in as if the #DB
      exception were to be injected into the guest.  This has two effects
      when guest debugging is in use:
      
      * the guest DR6 is clobbered
      
      * the kvm_run->debug.arch.dr6 field can accumulate more debug events, rather
      than just the last one that happened (the testcase in the next patch covers
      this issue).
      
      This patch fixes both issues by emulating, so to speak, the Intel behavior
      on AMD processors.  The important observation is that (after the previous
      patches) the VMCB value of DR6 is only ever observable from the guest if
      KVM_DEBUGREG_WONT_EXIT is set.  Therefore we can actually set vmcb->save.dr6
      to any value we want as long as KVM_DEBUGREG_WONT_EXIT is clear, which it
      will be if guest debugging is enabled.
      
      Therefore it is possible to enter the guest with an all-zero DR6,
      reconstruct the #DB payload from the DR6 we get at exit time, and let
      kvm_deliver_exception_payload move the newly set bits into vcpu->arch.dr6.
      Some extra bits may be included in the payload if KVM_DEBUGREG_WONT_EXIT
      is set, but this is harmless.
      
      This may not be the most optimized way to deal with this, but it is
      simple and, being confined within SVM code, it gets rid of the set_dr6
      callback and kvm_update_dr6.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: keep DR6 synchronized with vcpu->arch.dr6 · 5679b803
      By Paolo Bonzini
      kvm_x86_ops.set_dr6 is only ever called with vcpu->arch.dr6 as the
      second argument.  Ensure that the VMCB value is synchronized to
      vcpu->arch.dr6 on #DB (both "normal" and nested) and nested vmentry, so
      that the current value of DR6 is always available in vcpu->arch.dr6.
      The get_dr6 callback can just access vcpu->arch.dr6 and becomes redundant.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 07 May 2020 (5 commits)
  5. 06 May 2020 (2 commits)
  6. 05 May 2020 (2 commits)
    • kvm: ioapic: Restrict lazy EOI update to edge-triggered interrupts · 8be8f932
      By Paolo Bonzini
      Commit f458d039 ("kvm: ioapic: Lazy update IOAPIC EOI") introduces
      the following infinite loop:
      
      BUG: stack guard page was hit at 000000008f595917 \
      (stack is 00000000bdefe5a4..00000000ae2b06f5)
      kernel stack overflow (double-fault): 0000 [#1] SMP NOPTI
      RIP: 0010:kvm_set_irq+0x51/0x160 [kvm]
      Call Trace:
       irqfd_resampler_ack+0x32/0x90 [kvm]
       kvm_notify_acked_irq+0x62/0xd0 [kvm]
       kvm_ioapic_update_eoi_one.isra.0+0x30/0x120 [kvm]
       ioapic_set_irq+0x20e/0x240 [kvm]
       kvm_ioapic_set_irq+0x5c/0x80 [kvm]
       kvm_set_irq+0xbb/0x160 [kvm]
       ? kvm_hv_set_sint+0x20/0x20 [kvm]
       irqfd_resampler_ack+0x32/0x90 [kvm]
       kvm_notify_acked_irq+0x62/0xd0 [kvm]
       kvm_ioapic_update_eoi_one.isra.0+0x30/0x120 [kvm]
       ioapic_set_irq+0x20e/0x240 [kvm]
       kvm_ioapic_set_irq+0x5c/0x80 [kvm]
       kvm_set_irq+0xbb/0x160 [kvm]
       ? kvm_hv_set_sint+0x20/0x20 [kvm]
      ....
      
      The re-entrancy happens because the irq state is the OR of
      the interrupt state and the resamplefd state.  That is, we don't
      want to show the state as 0 until we've had a chance to set the
      resamplefd.  But if the interrupt has _not_ gone low then
      ioapic_set_irq is invoked again, causing an infinite loop.
      
      This can only happen for a level-triggered interrupt, otherwise
      irqfd_inject would immediately set the KVM_USERSPACE_IRQ_SOURCE_ID high
      and then low.  Fortunately, in the case of level-triggered interrupts
      the VMEXIT already happens because TMR is set.  Thus, fix the bug by
      restricting the lazy invocation of the ack notifier to edge-triggered
      interrupts, the only ones that need it.
      Tested-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Reported-by: borisvk@bstnet.org
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Link: https://www.spinics.net/lists/kvm/msg213512.html
      Fixes: f458d039 ("kvm: ioapic: Lazy update IOAPIC EOI")
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207489
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Fixes posted interrupt check for IRQs delivery modes · 637543a8
      By Suravee Suthikulpanit
      Current logic incorrectly uses the enum ioapic_irq_destination_types
      to check the posted interrupt destination types. However, the value was
      set using APIC_DM_XXX macros, which are left-shifted by 8 bits.
      
      Fix this by using APIC_DM_FIXED and APIC_DM_LOWEST instead.
      
      Fixes: fdcf7562 ("KVM: x86: Disable posted interrupts for non-standard IRQs delivery modes")
      Cc: Alexander Graf <graf@amazon.com>
      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Message-Id: <1586239989-58305-1-git-send-email-suravee.suthikulpanit@amd.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 04 May 2020 (2 commits)
  8. 01 May 2020 (1 commit)
    • KVM: arm64: Fix 32bit PC wrap-around · 0225fd5e
      By Marc Zyngier
      In the unlikely event that a 32bit vcpu traps into the hypervisor
      on an instruction that is located right at the end of the 32bit
      range, the emulation of that instruction is going to increment
      PC past the 32bit range. This isn't great, as userspace can then
      observe this value and get a bit confused.
      
      Conversely, userspace can do things like (in the context of a 64bit
      guest that is capable of 32bit EL0) setting PSTATE to AArch64-EL0,
      set PC to a 64bit value, change PSTATE to AArch32-USR, and observe
      that PC hasn't been truncated. More confusion.
      
      Fix both by:
      - truncating PC increments for 32bit guests
      - sanitizing all 32bit regs every time a core reg is changed by
        userspace, and that PSTATE indicates a 32bit mode.
      
      Cc: stable@vger.kernel.org
      Acked-by: Will Deacon <will@kernel.org>
      Signed-off-by: Marc Zyngier <maz@kernel.org>
  9. 30 April 2020 (2 commits)
  10. 25 April 2020 (2 commits)
  11. 24 April 2020 (1 commit)
  12. 23 April 2020 (3 commits)
  13. 21 April 2020 (8 commits)
    • KVM: SVM: avoid infinite loop on NPF from bad address · e72436bc
      By Paolo Bonzini
      When a nested page fault is taken from an address that does not have
      a memslot associated with it, kvm_mmu_do_page_fault returns RET_PF_EMULATE
      (via mmu_set_spte) and kvm_mmu_page_fault then invokes svm_need_emulation_on_page_fault.
      
      The default answer there is to return false, but in this case this just
      causes the page fault to be retried ad libitum.  Since this is not a
      fast path, and the only other case where it is taken is an erratum,
      just stick a kvm_vcpu_gfn_to_memslot check in there to detect the
      common case where the erratum is not happening.
      
      This fixes an infinite loop in the new set_memory_region_test.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Remove redundant argument to kvm_arch_vcpu_ioctl_run · 1b94f6f8
      By Tianjia Zhang
      In earlier versions of kvm, 'kvm_run' was an independent structure
      and was not included in the vcpu structure. At present, 'kvm_run'
      is already included in the vcpu structure, so the parameter
      'kvm_run' is redundant.
      
      This patch simplifies the function definition, removes the extra
      'kvm_run' parameter, and extracts it from the 'kvm_vcpu' structure
      if necessary.
      Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
      Message-Id: <20200416051057.26526-1-tianjia.zhang@linux.alibaba.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: Check for CR0.CD and CR0.NW on VMRUN of nested guests · 4f233371
      By Krish Sadhukhan
      According to section "Canonicalization and Consistency Checks" in APM vol. 2,
      the following guest state combination is illegal:
      
      	"CR0.CD is zero and CR0.NW is set"
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Message-Id: <20200409205035.16830-2-krish.sadhukhan@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Improve latency for single target IPI fastpath · a9ab13ff
      By Wanpeng Li
      In our cloud environment we observe that IPIs and the timer account
      for most MSR-write vmexits, so optimize virtual IPI latency more
      aggressively to inject the target IPI as soon as possible.
      
      Running the kvm-unit-tests/vmexit.flat IPI test on an SKX server,
      with the adaptive advance lapic timer and adaptive halt-polling
      disabled to avoid interference, this patch gives another 7%
      improvement:
      
      w/o fastpath   -> x86.c fastpath      4238 -> 3543  16.4%
      x86.c fastpath -> vmx.c fastpath      3543 -> 3293     7%
      w/o fastpath   -> vmx.c fastpath      4238 -> 3293  22.3%
      
      Cc: Haiwei Li <lihaiwei@tencent.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200410174703.1138-3-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Optimize handling of VM-Entry failures in vmx_vcpu_run() · 873e1da1
      By Sean Christopherson
      Mark the VM-Fail, VM-Exit on VM-Enter, and #MC on VM-Enter paths as
      'unlikely' so as to improve code generation so that it favors successful
      VM-Enter.  The performance of successful VM-Enter is far more important,
      irrespective of whether or not success is actually likely.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200410174703.1138-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Remove non-functional "support" for CR3 target values · b8d295f9
      By Sean Christopherson
      Remove all references to cr3_target_value[0-3] and replace the fields
      in vmcs12 with "dead_space" to preserve the vmcs12 layout.  KVM doesn't
      support emulating CR3-target values, despite a variety of code that
      implies otherwise, as KVM unconditionally reports '0' for the number of
      supported CR3-target values.
      
      This technically fixes a bug where KVM would incorrectly allow VMREAD
      and VMWRITE to nonexistent fields, i.e. cr3_target_value[0-3].  Per
      Intel's SDM, the number of supported CR3-target values reported in
      VMX_MISC also enumerates the existence of the associated VMCS fields:
      
        If a future implementation supports more than 4 CR3-target values, they
        will be encoded consecutively following the 4 encodings given here.
      
      Alternatively, the "bug" could be fixed by actually advertising support
      for 4 CR3-target values, but that'd likely just enable kvm-unit-tests
      given that no one has complained about lack of support for going on ten
      years, e.g. KVM, Xen and HyperV don't use CR3-target values.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200416000739.9012-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Avoid an extra memslot lookup in try_async_pf() for L2 · c36b7150
      By Paolo Bonzini
      Create a new function kvm_is_visible_memslot() and use it from
      kvm_is_visible_gfn(); use the new function in try_async_pf() too,
      to avoid an extra memslot lookup.
      
      Opportunistically squish a multi-line comment into a single-line comment.
      
      Note, the end result, KVM_PFN_NOSLOT, is unchanged.
      
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Set @writable to false for non-visible accesses by L2 · c583eed6
      By Sean Christopherson
      Explicitly set @writable to false in try_async_pf() if the GFN->PFN
      translation is short-circuited due to the requested GFN not being
      visible to L2.
      
      Leaving @writable ('map_writable' in the callers) uninitialized is ok
      in that it's never actually consumed, but one has to track it all the
      way through set_spte() being short-circuited by set_mmio_spte() to
      understand that the uninitialized variable is benign, and relying on
      @writable being ignored is an unnecessary risk.  Explicitly setting
      @writable also aligns try_async_pf() with __gfn_to_pfn_memslot().
      
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200415214414.10194-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>