1. 18 Mar, 2021 1 commit
    • KVM: x86: hyper-v: Track Hyper-V TSC page status · cc9cfddb
      Vitaly Kuznetsov committed
      Create an infrastructure for tracking Hyper-V TSC page status, i.e. if it
      was updated from guest/host side or if we've failed to set it up (because
      e.g. guest wrote some garbage to HV_X64_MSR_REFERENCE_TSC) and there's no
      need to retry.
      
      Also, in a hypothetical situation when we are in 'always catchup' mode for
      TSC we can now avoid contending 'hv->hv_lock' on every guest enter by
      setting the state to HV_TSC_PAGE_BROKEN after compute_tsc_page_parameters()
      returns false.
      
      Check for HV_TSC_PAGE_SET state instead of '!hv->tsc_ref.tsc_sequence' in
      get_time_ref_counter() to properly handle the situation when we failed to
      write the updated TSC page values to the guest.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210316143736.964151-4-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
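
The status tracking described in this commit message can be pictured as a small state machine. The following is a minimal sketch, not the kernel code: only the SET and BROKEN states are named in the message, so the other states, the struct layout, and the helper `compute_params()` (a simplified stand-in for compute_tsc_page_parameters()) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Only SET and BROKEN are named in the commit message; the rest are assumed. */
enum tsc_page_status {
    TSC_PAGE_UNSET,         /* guest never configured the page (assumed) */
    TSC_PAGE_GUEST_CHANGED, /* guest wrote the MSR, setup pending (assumed) */
    TSC_PAGE_HOST_CHANGED,  /* host clock changed, update pending (assumed) */
    TSC_PAGE_SET,           /* page holds valid values */
    TSC_PAGE_BROKEN,        /* setup failed, do not retry */
};

struct tsc_page {
    enum tsc_page_status status;
    uint64_t scale;
    unsigned setup_attempts; /* counts how often setup was actually tried */
};

/* Simplified stand-in for compute_tsc_page_parameters(): fails for 0 kHz. */
static bool compute_params(uint32_t tsc_khz, struct tsc_page *p)
{
    if (tsc_khz == 0)
        return false;
    p->scale = (10000ULL << 32) / tsc_khz;
    return true;
}

/* Try to set up the page only when an update is pending. */
static void tsc_page_setup(struct tsc_page *p, uint32_t tsc_khz)
{
    assert(p != NULL);
    if (p->status != TSC_PAGE_GUEST_CHANGED &&
        p->status != TSC_PAGE_HOST_CHANGED)
        return; /* UNSET, SET, or BROKEN: nothing to do, no retry */

    p->setup_attempts++;
    p->status = compute_params(tsc_khz, p) ? TSC_PAGE_SET : TSC_PAGE_BROKEN;
}

/* Readers trust the page only in the SET state, not a sequence number. */
static bool tsc_page_usable(const struct tsc_page *p)
{
    return p->status == TSC_PAGE_SET;
}
```

Once the page is marked broken, later calls return early, which is the property that lets the caller avoid taking a lock on every guest entry.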
  2. 17 Mar, 2021 6 commits
  3. 13 Mar, 2021 2 commits
    • KVM: LAPIC: Advancing the timer expiration on guest initiated write · 35737d2d
      Wanpeng Li committed
      Advancing the timer expiration should only be necessary on guest initiated
      writes. When we cancel the timer and clear .pending during state restore,
      clear expired_tscdeadline as well.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1614818118-965-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Skip !MMU-present SPTEs when removing SP in exclusive mode · 8df9f1af
      Sean Christopherson committed
      If mmu_lock is held for write, don't bother setting !PRESENT SPTEs to
      REMOVED_SPTE when recursively zapping SPTEs as part of shadow page
      removal.  The concurrent write protections provided by REMOVED_SPTE are
      not needed, there are no backing page side effects to record, and MMIO
      SPTEs can be left as is since they are protected by the memslot
      generation, not by ensuring that the MMIO SPTE is unreachable (which
      is racy with respect to lockless walks regardless of zapping behavior).
      
      Skipping !PRESENT drastically reduces the number of updates needed to
      tear down sparsely populated MMUs, e.g. when tearing down a 6 GB VM that
      didn't touch much memory, 6929/7168 (~96.6%) of SPTEs were '0' and could
      be skipped.
      
      Avoiding the write itself is likely close to a wash, but avoiding
      __handle_changed_spte() is a clear-cut win as that involves saving and
      restoring all non-volatile GPRs (it's a subtly big function), as well as
      several conditional branches before bailing out.
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210310003029.1250571-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
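
The skip logic in this commit message can be sketched as a loop over one table of SPTEs. This is an illustration under stated assumptions rather than the kernel implementation: the present bit position, the marker value `REMOVED_MARKER`, and the names `zap_table()` and `struct zap_stats` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PRESENT_BIT    (1ULL << 0) /* hypothetical "present" bit */
#define REMOVED_MARKER 0x5a0ULL    /* hypothetical stand-in for REMOVED_SPTE */

struct zap_stats {
    size_t updated; /* entries overwritten with the marker */
    size_t skipped; /* non-present entries left untouched */
};

/*
 * Zap every entry of a table. When the caller holds the lock exclusively
 * (exclusive != 0), non-present entries are skipped entirely, so neither
 * the write nor the change-handling path runs for them.
 */
static struct zap_stats zap_table(uint64_t *sptes, size_t n, int exclusive)
{
    struct zap_stats st = { 0, 0 };

    assert(sptes != NULL);
    for (size_t i = 0; i < n; i++) {
        if (exclusive && !(sptes[i] & PRESENT_BIT)) {
            st.skipped++;
            continue;
        }
        sptes[i] = REMOVED_MARKER;
        st.updated++;
    }
    return st;
}
```

In shared mode the sketch still writes the marker to every entry, since concurrent walkers rely on it; only the exclusive case takes the shortcut.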
  4. 10 Mar, 2021 1 commit
  5. 06 Mar, 2021 1 commit
  6. 05 Mar, 2021 2 commits
  7. 03 Mar, 2021 4 commits
    • KVM: SVM: Clear the CR4 register on reset · 9e46f6c6
      Babu Moger committed
      This problem was reported on a SVM guest while executing kexec.
      Kexec fails to load the new kernel when the PCID feature is enabled.
      
      When kexec starts loading the new kernel, it starts the process by
      resetting the vCPUs and then bringing each vCPU online one by one.
      The vCPU reset is supposed to reset all the register states before the
      vCPUs are brought online. However, the CR4 register is not reset during
      this process. If this register was already set up during the last boot,
      all the flags can remain intact. The X86_CR4_PCIDE bit can only be
      enabled in long mode. So, it must be enabled much later in SMP
      initialization.  Having the X86_CR4_PCIDE bit set during SMP boot can
      cause boot failures.
      
      Fix the issue by resetting the CR4 register in init_vmcb().
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Message-Id: <161471109108.30811.6392805173629704166.stgit@bmoger-ubuntu>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Add support for vCPU runstate information · 30b5c851
      David Woodhouse committed
      This is how Xen guests do steal time accounting. The hypervisor records
      the amount of time spent in each of running/runnable/blocked/offline
      states.
      
      In the Xen accounting, a vCPU is still in state RUNSTATE_running while
      in Xen for a hypercall or I/O trap, etc. Only if Xen explicitly schedules
      does the state become RUNSTATE_blocked. In KVM this means that even when
      the vCPU exits the kvm_run loop, the state remains RUNSTATE_running.
      
      The VMM can explicitly set the vCPU to RUNSTATE_blocked by using the
      KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_CURRENT attribute, and can also use
      KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_ADJUST to retrospectively add a given
      amount of time to the blocked state and subtract it from the running
      state.
      
      The state_entry_time corresponds to get_kvmclock_ns() at the time the
      vCPU entered the current state, and the total times of all four states
      should always add up to state_entry_time.
      Co-developed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20210301125309.874953-2-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
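
The runstate bookkeeping in this commit message can be sketched as follows. This is a simplified model, not the KVM code: the type and function names are hypothetical, the clock is assumed to start at zero, and rejecting an adjustment that exceeds the accumulated running time is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>

enum runstate { RS_RUNNING, RS_RUNNABLE, RS_BLOCKED, RS_OFFLINE, RS_COUNT };

struct runstate_info {
    enum runstate state;       /* current state */
    uint64_t state_entry_time; /* clock value when that state was entered */
    uint64_t time[RS_COUNT];   /* accumulated nanoseconds per state */
};

/* Account the time spent in the current state, then enter the next one. */
static void runstate_switch(struct runstate_info *r, enum runstate next,
                            uint64_t now)
{
    assert(now >= r->state_entry_time);
    r->time[r->state] += now - r->state_entry_time;
    r->state = next;
    r->state_entry_time = now;
}

/*
 * Retrospectively move ns nanoseconds from "running" to "blocked".
 * Returns 0 on success, -1 if less running time than ns was recorded.
 */
static int runstate_adjust_blocked(struct runstate_info *r, uint64_t ns)
{
    if (r->time[RS_RUNNING] < ns)
        return -1;
    r->time[RS_RUNNING] -= ns;
    r->time[RS_BLOCKED] += ns;
    return 0;
}

/* Sum over all four states; should equal state_entry_time. */
static uint64_t runstate_total(const struct runstate_info *r)
{
    uint64_t sum = 0;

    for (int i = 0; i < RS_COUNT; i++)
        sum += r->time[i];
    return sum;
}
```

Because an adjustment only moves time between two buckets, the total stays equal to state_entry_time before and after it.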
    • KVM: x86/xen: Fix return code when clearing vcpu_info and vcpu_time_info · 7d7c5f76
      David Woodhouse committed
      When clearing the per-vCPU shared regions, set the return value to zero
      to indicate success. This was causing spurious errors to be returned to
      userspace on soft reset.
      
      Also add a paranoid BUILD_BUG_ON() for compat structure compatibility.
      
      Fixes: 0c165b3c ("KVM: x86/xen: Allow reset of Xen attributes")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20210301125309.874953-1-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: allow compiling out the Xen hypercall interface · b59b153d
      Paolo Bonzini committed
      The Xen hypercall interface adds to the attack surface of the hypervisor
      and will be used quite rarely.  Allow compiling it out.
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 26 Feb, 2021 3 commits
    • KVM: xen: flush deferred static key before checking it · c462f859
      Paolo Bonzini committed
      A missing flush would cause the static branch to trigger incorrectly.
      
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Set SPTE_AD_WRPROT_ONLY_MASK if and only if PML is enabled · 44ac5958
      Sean Christopherson committed
      Check that PML is actually enabled before setting the mask to force a
      SPTE to be write-protected.  The bits used for the !AD_ENABLED case are
      in the upper half of the SPTE.  With 64-bit paging and EPT, these bits
      are ignored, but with 32-bit PAE paging they are reserved.  Setting them
      for L2 SPTEs without checking PML breaks NPT on 32-bit KVM.
      
      Fixes: 1f4e5fc8 ("KVM: x86: fix nested guest live migration with PML")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210225204749.1512652-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyper-v: Fix Hyper-V context null-ptr-deref · 919f4ebc
      Wanpeng Li committed
      Reported by syzkaller:
      
          KASAN: null-ptr-deref in range [0x0000000000000140-0x0000000000000147]
          CPU: 1 PID: 8370 Comm: syz-executor859 Not tainted 5.11.0-syzkaller #0
          RIP: 0010:synic_get arch/x86/kvm/hyperv.c:165 [inline]
          RIP: 0010:kvm_hv_set_sint_gsi arch/x86/kvm/hyperv.c:475 [inline]
          RIP: 0010:kvm_hv_irq_routing_update+0x230/0x460 arch/x86/kvm/hyperv.c:498
          Call Trace:
           kvm_set_irq_routing+0x69b/0x940 arch/x86/kvm/../../../virt/kvm/irqchip.c:223
           kvm_vm_ioctl+0x12d0/0x2800 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3959
           vfs_ioctl fs/ioctl.c:48 [inline]
           __do_sys_ioctl fs/ioctl.c:753 [inline]
           __se_sys_ioctl fs/ioctl.c:739 [inline]
           __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739
           do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
           entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The Hyper-V context is allocated lazily, i.e. only once Hyper-V specific MSRs
      are accessed or SynIC is enabled. However, the syzkaller testcase sets the irq
      routing table directly w/o enabling SynIC. This results in a null-ptr-deref
      when accessing the SynIC Hyper-V context. This patch fixes it.
      
      syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=163342ccd00000
      
      Reported-by: syzbot+6987f3b2dbd9eda95f12@syzkaller.appspotmail.com
      Fixes: 8f014550 ("KVM: x86: hyper-v: Make Hyper-V emulation enablement conditional")
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1614326399-5762-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
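
The commit message does not spell out the fix, so the following is only one plausible shape for guarding a lazily allocated context, not the actual patch. All type and function names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-vCPU context that exists only after first use. */
struct hv_ctx {
    int synic_active;
};

struct vcpu {
    struct hv_ctx *hv; /* NULL until a Hyper-V feature is first used */
};

/* Allocate the context on demand and mark SynIC active. */
static int vcpu_enable_synic(struct vcpu *v)
{
    assert(v != NULL);
    if (!v->hv) {
        v->hv = calloc(1, sizeof(*v->hv));
        if (!v->hv)
            return -1;
    }
    v->hv->synic_active = 1;
    return 0;
}

/*
 * Lookup that tolerates a missing context: callers such as an irq routing
 * update get NULL instead of dereferencing an unallocated pointer.
 */
static struct hv_ctx *synic_get_checked(struct vcpu *v)
{
    if (!v || !v->hv || !v->hv->synic_active)
        return NULL;
    return v->hv;
}
```

The point of the sketch is the ordering of the checks: the context pointer is tested before any field behind it is read.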
  9. 25 Feb, 2021 1 commit
    • KVM: SVM: Fix nested VM-Exit on #GP interception handling · 2df8d380
      Sean Christopherson committed
      Fix the interpretation of nested_svm_vmexit()'s return value when
      synthesizing a nested VM-Exit after intercepting an SVM instruction while
      L2 was running.  The helper returns '0' on success, whereas a return
      value of '0' in the exit handler path means "exit to userspace".  The
      incorrect return value causes KVM to exit to userspace without filling
      the run state, e.g. QEMU logs "KVM: unknown exit, hardware reason 0".
      
      Fixes: 14c2bf81 ("KVM: SVM: Fix #GP handling for doubly-nested virtualization")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210224005627.657028-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
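
The mismatch between the two return-value conventions in this commit message can be shown in a few lines. This is a toy model, not the kernel code: `nested_vmexit_stub()` is a hypothetical stand-in for nested_svm_vmexit() that always succeeds, and the enum names are illustrative.

```c
#include <assert.h>

/* Exit-handler convention from the commit message (names are illustrative). */
enum { EXIT_TO_USERSPACE = 0, RESUME_GUEST = 1 };

/* Hypothetical stand-in for nested_svm_vmexit(): returns 0 on success. */
static int nested_vmexit_stub(void)
{
    return 0;
}

/* Buggy shape: the helper's "0 = success" leaks out as "0 = exit to userspace". */
static int handle_intercept_buggy(void)
{
    return nested_vmexit_stub();
}

/* Fixed shape: translate the helper's success into the handler's convention. */
static int handle_intercept_fixed(void)
{
    int rc = nested_vmexit_stub();

    assert(rc == 0); /* the stub in this sketch cannot fail */
    (void)rc;
    return RESUME_GUEST;
}
```

With the buggy shape the caller sees EXIT_TO_USERSPACE even though nothing was prepared for userspace, which matches the "unknown exit" symptom the message describes.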
  10. 24 Feb, 2021 1 commit
  11. 23 Feb, 2021 3 commits
    • KVM: x86/mmu: Consider the hva in mmu_notifier retry · 4a42d848
      David Stevens committed
      Track the range being invalidated by mmu_notifier and skip page fault
      retries if the fault address is not affected by the in-progress
      invalidation. Handle concurrent invalidations by finding the minimal
      range which includes all ranges being invalidated. Although the combined
      range may include unrelated addresses and cannot be shrunk as individual
      invalidation operations complete, it is unlikely the marginal gains of
      proper range tracking are worth the additional complexity.
      
      The primary benefit of this change is the reduction in the likelihood of
      extreme latency when handling a page fault due to another thread having
      been preempted while modifying host virtual addresses.
      Signed-off-by: David Stevens <stevensd@chromium.org>
      Message-Id: <20210222024522.1751719-3-stevensd@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
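
The range tracking described in this commit message can be sketched as below. This is a simplified model under stated assumptions: the names are hypothetical, ranges are half-open, and the sequence-counter check that a real retry decision would also need is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tracker for in-progress invalidations. */
struct inval_tracker {
    unsigned long start; /* lowest address of the combined range */
    unsigned long end;   /* one past the highest address (half-open) */
    int in_progress;     /* number of invalidations currently running */
};

/* Start an invalidation: grow the combined range to cover [start, end). */
static void inval_begin(struct inval_tracker *t, unsigned long start,
                        unsigned long end)
{
    assert(start < end);
    if (t->in_progress == 0) {
        t->start = start;
        t->end = end;
    } else {
        if (start < t->start)
            t->start = start;
        if (end > t->end)
            t->end = end;
    }
    t->in_progress++;
}

/* Finish an invalidation: the range is not shrunk, only the count drops. */
static void inval_end(struct inval_tracker *t)
{
    assert(t->in_progress > 0);
    t->in_progress--;
}

/* A fault retries only if an invalidation is running and covers its hva. */
static bool fault_must_retry(const struct inval_tracker *t, unsigned long hva)
{
    return t->in_progress > 0 && hva >= t->start && hva < t->end;
}
```

The sketch reproduces the trade-off named in the message: an address between two concurrent ranges is treated as affected until all invalidations finish.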
    • KVM: x86/mmu: Skip mmu_notifier check when handling MMIO page fault · 5f8a7cf2
      Sean Christopherson committed
      Don't retry a page fault due to an mmu_notifier invalidation when
      handling a page fault for a GPA that did not resolve to a memslot, i.e.
      an MMIO page fault.  Invalidations from the mmu_notifier signal a change
      in a host virtual address (HVA) mapping; without a memslot, there is no
      HVA and thus no possibility that the invalidation is relevant to the
      page fault being handled.
      
      Note, the MMIO vs. memslot generation checks handle the case where a
      pending memslot will create a memslot overlapping the faulting GPA.  The
      mmu_notifier checks are orthogonal to memslot updates.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210222024522.1751719-2-stevensd@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: prepare guest save area while is_guest_mode is true · d2df592f
      Paolo Bonzini committed
      Right now, enter_svm_guest_mode is calling nested_prepare_vmcb_save and
      nested_prepare_vmcb_control.  This results in is_guest_mode being false
      until the end of nested_prepare_vmcb_control.
      
      This is a problem because nested_prepare_vmcb_save can in turn cause
      changes to the intercepts and these have to be applied to the "host VMCB"
      (stored in svm->nested.hsave) and then merged with the VMCB12 intercepts
      into svm->vmcb.
      
      In particular, without this change we forget to set the CR0 read and CR0
      write intercepts when running a real mode L2 guest with NPT disabled.
      The guest is therefore able to see the CR0.PG bit that KVM sets to
      enable "paged real mode".  This patch fixes the svm.flat mode_switch
      test case with npt=0.  There are no other problematic calls in
      nested_prepare_vmcb_save.
      
      Moving is_guest_mode to the end is done since commit 06fc7772
      ("KVM: SVM: Activate nested state only when guest state is complete",
      2010-04-25).  However, back then KVM didn't grab a different VMCB
      when updating the intercepts, it had already copied/merged L1's stuff
      to L0's VMCB, and then updated L0's VMCB regardless of is_nested().
      Later recalc_intercepts was introduced in commit 384c6368
      ("KVM: SVM: Add function to recalculate intercept masks", 2011-01-12).
      This introduced the bug, because recalc_intercepts now throws away
      the intercept manipulations that svm_set_cr0 had done in the meanwhile
      to svm->vmcb.
      
      [1] https://lore.kernel.org/kvm/1266493115-28386-1-git-send-email-joerg.roedel@amd.com/
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  12. 19 Feb, 2021 13 commits
  13. 18 Feb, 2021 2 commits