  1. 22 Oct, 2021 (3 commits)
  2. 19 Oct, 2021 (5 commits)
    • KVM: x86: Expose TSC offset controls to userspace · 828ca896
      Oliver Upton authored
      To date, VMM-directed TSC synchronization and migration has been a bit
      messy. KVM has some baked-in heuristics around TSC writes to infer if
      the VMM is attempting to synchronize. This is problematic, as it depends
      on host userspace writing to the guest's TSC within 1 second of the last
      write.
      
      A much cleaner approach to configuring the guest's views of the TSC is to
      simply migrate the TSC offset for every vCPU. Offsets are idempotent,
      and thus not subject to change depending on when the VMM actually
      reads/writes values from/to KVM. The VMM can then read the TSC once with
      KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
      the guest is paused.
      
      Cc: David Matlack <dmatlack@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Oliver Upton <oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210916181538.968978-8-oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      828ca896
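      [A hedged userspace sketch of the per-vCPU offset interface described above:
       capture each vCPU's TSC offset on the source host and restore it on the
       destination.  The constant names (KVM_VCPU_TSC_CTRL, KVM_VCPU_TSC_OFFSET)
       follow this patch series and should be checked against the linux/kvm.h of
       the running kernel.]

      #include <stdint.h>
      #include <sys/ioctl.h>
      #include <linux/kvm.h>

      /* Read the current TSC offset of one vCPU (source side of a migration). */
      static int get_tsc_offset(int vcpu_fd, int64_t *offset)
      {
              struct kvm_device_attr attr = {
                      .group = KVM_VCPU_TSC_CTRL,
                      .attr  = KVM_VCPU_TSC_OFFSET,
                      .addr  = (uint64_t)(unsigned long)offset,
              };
              return ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr);
      }

      /* Restore a previously captured offset (destination side). */
      static int set_tsc_offset(int vcpu_fd, int64_t offset)
      {
              struct kvm_device_attr attr = {
                      .group = KVM_VCPU_TSC_CTRL,
                      .attr  = KVM_VCPU_TSC_OFFSET,
                      .addr  = (uint64_t)(unsigned long)&offset,
              };
              return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
      }
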
    • KVM: x86: Refactor tsc synchronization code · 58d4277b
      Oliver Upton authored
      Refactor kvm_synchronize_tsc to make a new function that allows callers
      to specify TSC parameters (offset, value, nanoseconds, etc.) explicitly
      for the sake of participating in TSC synchronization.
      Signed-off-by: Oliver Upton <oupton@google.com>
      Message-Id: <20210916181538.968978-7-oupton@google.com>
      [Make sure kvm->arch.cur_tsc_generation and vcpu->arch.this_tsc_generation are
       equal at the end of __kvm_synchronize_tsc, if matched is false. Reported by
       Maxim Levitsky. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      58d4277b
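      [A hedged sketch of the resulting split; the parameter list follows the
       commit text above and the final naming in arch/x86/kvm/x86.c may differ
       slightly.]

      /* Heuristic front end: decides whether a TSC write is an attempt to
       * synchronize, computes offset/ns, then calls the helper below. */
      static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data);

      /* New helper: callers pass the TSC parameters explicitly and thereby
       * participate in TSC synchronization without the write heuristics. */
      static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset,
                                        u64 tsc, u64 ns, bool matched);
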
    • kvm: x86: protect masterclock with a seqcount · 869b4421
      Paolo Bonzini authored
      Protect the reference point for kvmclock with a seqcount, so that
      kvmclock updates for all vCPUs can proceed in parallel.  Xen runstate
      updates will also run in parallel and not bounce the kvmclock cacheline.
      
      Of the variables that were protected by pvclock_gtod_sync_lock,
      nr_vcpus_matched_tsc is different because it is updated outside
      pvclock_update_vm_gtod_copy and read inside it.  Therefore, we
      need to keep it protected by a spinlock.  In fact it must now
      be a raw spinlock, because pvclock_update_vm_gtod_copy, being the
      write-side of a seqcount, is non-preemptible.  Since we already
      have tsc_write_lock which is a raw spinlock, we can just use
      tsc_write_lock as the lock that protects the write-side of the
      seqcount.
      Co-developed-by: Oliver Upton <oupton@google.com>
      Message-Id: <20210916181538.968978-6-oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      869b4421
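      [A minimal sketch of the locking pattern described above, using the kernel's
       seqcount-with-raw-spinlock primitives.  The struct and field names here are
       illustrative; the real state lives in struct kvm_arch and is written under
       tsc_write_lock.]

      #include <linux/seqlock.h>
      #include <linux/spinlock.h>

      struct clock_snapshot {
              raw_spinlock_t          tsc_write_lock;  /* also the seqcount's lock */
              seqcount_raw_spinlock_t sc;   /* seqcount_raw_spinlock_init(&sc, &tsc_write_lock) */
              u64                     master_kernel_ns;
              u64                     master_cycle_now;
      };

      /* Write side: serialized by the raw spinlock and non-preemptible while the
       * seqcount write section is open. */
      static void snapshot_update(struct clock_snapshot *cs, u64 ns, u64 cycles)
      {
              raw_spin_lock(&cs->tsc_write_lock);
              write_seqcount_begin(&cs->sc);
              cs->master_kernel_ns = ns;
              cs->master_cycle_now = cycles;
              write_seqcount_end(&cs->sc);
              raw_spin_unlock(&cs->tsc_write_lock);
      }

      /* Read side: any number of vCPUs can run this in parallel and simply retry
       * if an update overlapped, so kvmclock updates no longer serialize. */
      static void snapshot_read(struct clock_snapshot *cs, u64 *ns, u64 *cycles)
      {
              unsigned int seq;

              do {
                      seq     = read_seqcount_begin(&cs->sc);
                      *ns     = cs->master_kernel_ns;
                      *cycles = cs->master_cycle_now;
              } while (read_seqcount_retry(&cs->sc, seq));
      }
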
    • KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK · c68dc1b5
      Oliver Upton authored
      Handling the migration of TSCs correctly is difficult, in part because
      Linux does not provide userspace with the ability to retrieve a (TSC,
      realtime) clock pair for a single instant in time. In lieu of a more
      convenient facility, KVM can report similar information in the kvm_clock
      structure.
      
      Provide userspace with a host TSC & realtime pair iff the realtime clock
      is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid
      realtime value, advance the KVM clock by the amount of elapsed time. Do
      not step the KVM clock backwards, though, as it is a monotonic
      oscillator.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Oliver Upton <oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210916181538.968978-5-oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c68dc1b5
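      [A hedged userspace sketch of consuming the new information.  The flag and
       field names (KVM_CLOCK_REALTIME, KVM_CLOCK_HOST_TSC, realtime, host_tsc)
       follow this series and should be verified against the kernel's linux/kvm.h.]

      #include <stdio.h>
      #include <sys/ioctl.h>
      #include <linux/kvm.h>

      /* Snapshot kvmclock together with the host realtime/TSC pair, taken by KVM
       * at a single instant, typically right after pausing all vCPUs. */
      static int snapshot_clock(int vm_fd)
      {
              struct kvm_clock_data data = { 0 };

              if (ioctl(vm_fd, KVM_GET_CLOCK, &data))
                      return -1;

              printf("kvmclock ns: %llu\n", (unsigned long long)data.clock);
              if (data.flags & KVM_CLOCK_REALTIME)   /* realtime clock is TSC-based */
                      printf("realtime ns: %llu\n", (unsigned long long)data.realtime);
              if (data.flags & KVM_CLOCK_HOST_TSC)   /* host TSC at the same instant */
                      printf("host tsc:    %llu\n", (unsigned long long)data.host_tsc);
              return 0;
      }
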
    • KVM: X86: fix lazy allocation of rmaps · fa13843d
      Paolo Bonzini authored
      If allocation of rmaps fails, but some of the pointers have already been written,
      those pointers can be cleaned up when the memslot is freed, or even reused later
      for another attempt at allocating the rmaps.  Therefore there is no need to
      WARN, as done for example in memslot_rmap_alloc, but the allocation *must* be
      skipped, lest KVM overwrite the previous pointer and leak memory.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fa13843d
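      [A hedged sketch of the allocation rule described above, modeled on
       memslot_rmap_alloc(); details are simplified.]

      static int memslot_rmap_alloc_sketch(struct kvm_memory_slot *slot,
                                           unsigned long npages)
      {
              int i;

              for (i = 0; i < KVM_NR_PAGE_SIZES; i++) {
                      int level = i + 1;
                      unsigned long lpages =
                              gfn_to_index(slot->base_gfn + npages - 1,
                                           slot->base_gfn, level) + 1;

                      /* A pointer left over from an earlier, partially failed
                       * attempt is reused: skip it, no WARN, no overwrite. */
                      if (slot->arch.rmap[i])
                              continue;

                      slot->arch.rmap[i] = kvcalloc(lpages,
                                                    sizeof(*slot->arch.rmap[i]),
                                                    GFP_KERNEL_ACCOUNT);
                      if (!slot->arch.rmap[i])
                              return -ENOMEM;  /* earlier levels are freed with
                                                * the memslot, not here */
              }
              return 0;
      }
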
  3. 01 Oct, 2021 (6 commits)
  4. 30 Sep, 2021 (12 commits)
  5. 22 Sep, 2021 (6 commits)
    • kvm: x86: Add AMD PMU MSRs to msrs_to_save_all[] · e1fc1553
      Fares Mehanna authored
      Intel PMU MSRs are in msrs_to_save_all[], so add the AMD PMU MSRs as well to
      get consistent behavior between Intel and AMD when using KVM_GET_MSRS,
      KVM_SET_MSRS or KVM_GET_MSR_INDEX_LIST.
      
      We have to add legacy and new MSRs to handle guests running without
      X86_FEATURE_PERFCTR_CORE.
      Signed-off-by: Fares Mehanna <faresx@amazon.de>
      Message-Id: <20210915133951.22389-1-faresx@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e1fc1553
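      [An illustrative excerpt of the shape of the change; the full MSR list lives
       in arch/x86/kvm/x86.c and includes all six F15H control/counter pairs.]

      static const u32 msrs_to_save_all[] = {
              /* ... existing entries, including the Intel PMU MSRs ... */

              /* Legacy AMD counters, for guests without X86_FEATURE_PERFCTR_CORE */
              MSR_K7_EVNTSEL0, MSR_K7_EVNTSEL1, MSR_K7_EVNTSEL2, MSR_K7_EVNTSEL3,
              MSR_K7_PERFCTR0, MSR_K7_PERFCTR1, MSR_K7_PERFCTR2, MSR_K7_PERFCTR3,

              /* PERFCTR_CORE counters */
              MSR_F15H_PERF_CTL0, MSR_F15H_PERF_CTR0,
              MSR_F15H_PERF_CTL1, MSR_F15H_PERF_CTR1,
              /* ... MSR_F15H_PERF_CTL2..5 / MSR_F15H_PERF_CTR2..5 ... */
      };
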
    • KVM: x86: reset pdptrs_from_userspace when exiting smm · 37687c40
      Maxim Levitsky authored
      When exiting SMM, the PDPTRs are loaded again from guest memory.
      
      This fixes a theoretical bug: an exit from SMM that triggers entry into a
      nested guest reuses some of the migration code, which relies on this flag
      as a workaround for legacy userspace.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210913140954.165665-4-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      37687c40
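      [The fix itself is essentially one line; a hedged sketch of where it sits in
       the exit-from-SMM path:]

      /* The PDPTRs were just re-read from guest memory, so they no longer come
       * from userspace; clear the flag so a subsequent nested entry does not
       * skip reloading them. */
      vcpu->arch.pdptrs_from_userspace = false;
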
    • KVM: x86: Identify vCPU0 by its vcpu_idx instead of its vCPUs array entry · 94c245a2
      Sean Christopherson authored
      Use vcpu_idx to identify vCPU0 when updating HyperV's TSC page, which is
      shared by all vCPUs and "owned" by vCPU0 (because vCPU0 is the only vCPU
      that's guaranteed to exist).  Using kvm_get_vcpu() to find vCPU0 works,
      but it's a rather odd and suboptimal method to check the index of a given
      vCPU.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210910183220.2397812-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      94c245a2
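      [A hedged before/after sketch of the check; the surrounding context in the
       kvmclock update path is simplified.]

      /* Before: scan the vCPU array just to learn whether this is vCPU0. */
      if (kvm_get_vcpu(v->kvm, 0) == v)
              kvm_hv_setup_tsc_page(v->kvm, &v->arch.hv_clock);

      /* After: every vCPU already knows its own index. */
      if (v->vcpu_idx == 0)
              kvm_hv_setup_tsc_page(v->kvm, &v->arch.hv_clock);
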
    • KVM: x86: Handle SRCU initialization failure during page track init · eb7511bf
      Haimin Zhang authored
      Check the return of init_srcu_struct(), which can fail due to OOM, when
      initializing the page track mechanism.  Lack of checking leads to a NULL
      pointer deref found by a modified syzkaller.
      Reported-by: TCS Robot <tcs_robot@tencent.com>
      Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
      Message-Id: <1630636626-12262-1-git-send-email-tcs_kernel@tencent.com>
      [Move the call towards the beginning of kvm_arch_init_vm. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      eb7511bf
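      [A minimal sketch of the fix: return the init_srcu_struct() error instead of
       discarding it, and let kvm_arch_init_vm() fail cleanly.  Signatures follow
       the page-track code and are simplified here.]

      int kvm_page_track_init(struct kvm *kvm)
      {
              struct kvm_page_track_notifier_head *head;

              head = &kvm->arch.track_notifier_head;
              INIT_HLIST_HEAD(&head->track_notifier_list);
              /* Can fail with -ENOMEM under memory pressure; callers must now
               * check the result instead of assuming success. */
              return init_srcu_struct(&head->track_srcu);
      }
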
    • KVM: x86: Clear KVM's cached guest CR3 at RESET/INIT · 03a6e840
      Sean Christopherson authored
      Explicitly zero the guest's CR3 and mark it available+dirty at RESET/INIT.
      Per Intel's SDM and AMD's APM, CR3 is zeroed at both RESET and INIT.  For
      RESET, this is a nop as the vCPU is zero-allocated.  For INIT, the bug has
      likely escaped notice because no firmware/kernel puts its page tables root
      at PA=0, let alone relies on INIT to get the desired CR3 for such page
      tables.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210921000303.400537-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      03a6e840
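      [A hedged sketch of the RESET/INIT handling described above, as it would
       appear in kvm_vcpu_reset():]

      /* CR3 is architecturally zeroed at both RESET and INIT; also refresh KVM's
       * register cache so stale state is never read back from hardware. */
      vcpu->arch.cr3 = 0;
      kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
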
    • KVM: x86: Mark all registers as avail/dirty at vCPU creation · 7117003f
      Sean Christopherson authored
      Mark all registers as available and dirty at vCPU creation, as the vCPU has
      obviously not been loaded into hardware, let alone been given the chance to
      be modified in hardware.  On SVM, reading from "uninitialized" hardware is
      a non-issue as VMCBs are zero allocated (thus not truly uninitialized) and
      hardware does not allow for arbitrary field encoding schemes.
      
      On VMX, backing memory for VMCSes is also zero allocated, but true
      initialization of the VMCS _technically_ requires VMWRITEs, as the VMX
      architectural specification technically allows CPU implementations to
      encode fields with arbitrary schemes.  E.g. a CPU could theoretically store
      the inverted value of every field, which would result in a VMREAD of a
      zero-allocated field returning all ones.
      
      In practice, only the AR_BYTES fields are known to be manipulated by
      hardware during VMREAD/VMWRITE; no known hardware or VMM (for nested VMX)
      does fancy encoding of cacheable field values (CR0, CR3, CR4, etc...).  In
      other words, this is technically a bug fix, but practically speaking it's
      a glorified nop.
      
      Failure to mark registers as available has been a lurking bug for quite
      some time.  The original register caching supported only GPRs (+RIP, which
      is kinda sorta a GPR), with the masks initialized at ->vcpu_reset().  That
      worked because the two cacheable registers, RIP and RSP, are generally
      speaking not read as side effects in other flows.
      
      Arguably, commit aff48baa ("KVM: Fetch guest cr3 from hardware on
      demand") was the first instance of failure to mark regs available.  While
      _just_ marking CR3 available during vCPU creation wouldn't have fixed the
      VMREAD from an uninitialized VMCS bug because ept_update_paging_mode_cr0()
      unconditionally read vmcs.GUEST_CR3, marking CR3 _and_ intentionally not
      reading GUEST_CR3 when it's available would have avoided VMREAD to a
      technically-uninitialized VMCS.
      
      Fixes: aff48baa ("KVM: Fetch guest cr3 from hardware on demand")
      Fixes: 6de4f3ad ("KVM: Cache pdptrs")
      Fixes: 6de12732 ("KVM: VMX: Optimize vmx_get_rflags()")
      Fixes: 2fb92db1 ("KVM: VMX: Cache vmcs segment fields")
      Fixes: bd31fe49 ("KVM: VMX: Add proper cache tracking for CR0")
      Fixes: f98c1e77 ("KVM: VMX: Add proper cache tracking for CR4")
      Fixes: 5addc235 ("KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flags")
      Fixes: 87915858 ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210921000303.400537-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7117003f
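      [The creation-time change itself is tiny; a hedged sketch as it would appear
       in kvm_arch_vcpu_create():]

      /* Nothing has been loaded into hardware yet, so every cached register is
       * both available and dirty from KVM's point of view. */
      vcpu->arch.regs_avail = ~0;
      vcpu->arch.regs_dirty = ~0;
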
  6. 06 Sep, 2021 (1 commit)
  7. 21 Aug, 2021 (7 commits)
    • KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ · 61e5f69e
      Maxim Levitsky authored
      KVM_GUESTDBG_BLOCKIRQ will allow KVM to block all interrupts
      while the guest is running.
      
      This change is mostly intended for more robust single stepping
      of the guest and it has the following benefits when enabled:
      
      * Resuming from a breakpoint is much more reliable.
        When resuming execution from a breakpoint, with interrupts enabled,
        more often than not, KVM would inject an interrupt and make the CPU
        jump immediately to the interrupt handler and eventually return to
        the breakpoint, to trigger it again.
      
        From the user point of view it looks like the CPU never executed a
        single instruction and in some cases that can even prevent forward
        progress, for example, when the breakpoint is placed by an automated
        script (e.g. lx-symbols), which does something in response to the
        breakpoint and then continues the guest automatically.
        If the script execution takes enough time for another interrupt to
        arrive, the guest will be stuck on the same breakpoint RIP forever.
      
      * Normal single stepping is much more predictable, since it won't
        land the debugger into an interrupt handler.
      
      * RFLAGS.TF has less chance to be leaked to the guest:
      
        We set that flag behind the guest's back to do single stepping
        but if single step lands us into an interrupt/exception handler
        it will be leaked to the guest in the form of being pushed
        to the stack.
        This doesn't completely eliminate this problem as exceptions
        can still happen, but at least this reduces the chances
        of this happening.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210811122927.900604-6-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      61e5f69e
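      [A hedged userspace sketch of using the new flag; that KVM_CAP_SET_GUEST_DEBUG2
       reports the supported KVM_GUESTDBG_* bits is an assumption to verify against
       the KVM API documentation of the target kernel.]

      #include <sys/ioctl.h>
      #include <linux/kvm.h>

      /* Single-step one vCPU with interrupt injection blocked, so the step lands
       * on the next guest instruction instead of inside an interrupt handler. */
      static int single_step_no_irq(int kvm_fd, int vcpu_fd)
      {
              struct kvm_guest_debug dbg = {
                      .control = KVM_GUESTDBG_ENABLE |
                                 KVM_GUESTDBG_SINGLESTEP |
                                 KVM_GUESTDBG_BLOCKIRQ,
              };

              if (!(ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SET_GUEST_DEBUG2) &
                    KVM_GUESTDBG_BLOCKIRQ))
                      return -1;   /* kernel does not support IRQ blocking */

              return ioctl(vcpu_fd, KVM_SET_GUEST_DEBUG, &dbg);
      }
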
    • KVM: x86/mmu: Add detailed page size stats · 71f51d2c
      Mingwei Zhang authored
      Existing KVM code tracks the number of large pages regardless of their
      sizes. Therefore, when large pages of 1GB (or larger) are used, the
      information becomes less useful because lpages counts a mix of 1G and 2M
      pages.
      
      So remove lpages, since it is easy for userspace to aggregate the info.
      Instead, provide comprehensive page stats for all sizes from 4K to 512G.
      Suggested-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Cc: Jing Zhang <jingzhangos@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210803044607.599629-4-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      71f51d2c
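      [A hedged sketch of the accounting helper behind the new counters; the stat
       is an array with one entry per mapping level (4K through 512G), replacing
       the old lumped lpages counter.]

      static inline void kvm_update_page_stats(struct kvm *kvm, int level, int count)
      {
              /* level is PG_LEVEL_4K..PG_LEVEL_512G; count may be negative when
               * a mapping of that size is removed. */
              atomic64_add(count, &kvm->stat.pages[level - 1]);
      }
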
    • KVM: stats: Support linear and logarithmic histogram statistics · f95937cc
      Jing Zhang authored
      Add new types of KVM stats: linear and logarithmic histograms.
      Histograms are very useful for observing the value distribution
      of time- or size-related stats.
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210802165633.1866976-2-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f95937cc
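      [A hedged userspace sketch of how a consumer of the binary stats interface
       distinguishes the new types; the KVM_STATS_TYPE_* constants come from the
       uapi header, and bucket_size (linear histograms only) gives the bucket width.]

      #include <linux/kvm.h>

      static const char *stat_type_name(const struct kvm_stats_desc *desc)
      {
              switch (desc->flags & KVM_STATS_TYPE_MASK) {
              case KVM_STATS_TYPE_CUMULATIVE:  return "cumulative";
              case KVM_STATS_TYPE_INSTANT:     return "instant";
              case KVM_STATS_TYPE_PEAK:        return "peak";
              case KVM_STATS_TYPE_LINEAR_HIST: return "linear histogram";
              case KVM_STATS_TYPE_LOG_HIST:    return "log histogram"; /* power-of-two buckets */
              default:                         return "unknown";
              }
      }
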
    • KVM: SVM: avoid refreshing avic if its state didn't change · 06ef8134
      Maxim Levitsky authored
      Since AVIC can be inhibited and uninhibited rapidly, it is possible that
      we have nothing to do by the time svm_refresh_apicv_exec_ctrl
      is called.
      
      Detect and avoid this, which will be useful when we start calling
      avic_vcpu_load/avic_vcpu_put when the AVIC inhibition state changes.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210810205251.424103-14-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      06ef8134
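      [A hedged sketch of the early return; the field compared against is
       hypothetical here, and the real predicate in svm_refresh_apicv_exec_ctrl
       may differ.]

      bool activated = kvm_vcpu_apicv_active(vcpu);

      /* Inhibits can flip back and forth quickly; if the effective AVIC state
       * already matches what we are asked to program, there is nothing to do. */
      if (activated == svm->avic_active)      /* illustrative field name */
              return;
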
    • KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM · b0a1637f
      Maxim Levitsky authored
      Currently on SVM, the kvm_request_apicv_update toggles the APICv
      memslot without doing any synchronization.
      
      If there is a mismatch between that memslot state and the AVIC state,
      on one of the vCPUs, an APIC mmio access can be lost:
      
      For example:
      
      VCPU0: enable the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT
      VCPU1: access an APIC mmio register.
      
      Since AVIC is still disabled on VCPU1, the access will not be intercepted
      by it, nor will it cause an MMIO fault; instead it will just be
      read/written from/to the dummy page mapped into the
      APIC_ACCESS_PAGE_PRIVATE_MEMSLOT.
      
      Fix that by adding a lock guarding the AVIC state changes, and carefully
      order the operations of kvm_request_apicv_update to avoid this race:
      
      1. Take the lock
      2. Send KVM_REQ_APICV_UPDATE
      3. Update the apic inhibit reason
      4. Release the lock
      
      This ensures that at (2) all vCPUs are kicked out of guest mode, but do
      not yet see the new AVIC state.
      Then only after (4) can all other vCPUs update their AVIC state and resume.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210810205251.424103-10-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b0a1637f
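      [A hedged sketch of the ordering in steps (1)-(4).  kvm_make_all_cpus_request()
       and the apicv_inhibit_reasons bitmap are existing KVM primitives; the
       apicv_update_lock is what this patch introduces, and its exact type may
       differ.]

      void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit)
      {
              mutex_lock(&kvm->arch.apicv_update_lock);                /* (1) */

              /* (2) kick every vCPU out of guest mode; none can re-enter and
               * observe a half-updated state until the lock is released. */
              kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);

              if (activate)                                            /* (3) */
                      clear_bit(bit, &kvm->arch.apicv_inhibit_reasons);
              else
                      set_bit(bit, &kvm->arch.apicv_inhibit_reasons);

              mutex_unlock(&kvm->arch.apicv_update_lock);              /* (4) */
      }
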
    • KVM: x86: don't disable APICv memslot when inhibited · 36222b11
      Maxim Levitsky authored
      Thanks to the preceding patches, it is now possible to keep the APICv
      memslot always enabled; it will simply be invisible to the guest
      when APICv is inhibited.
      
      This code is based on a suggestion from Sean Christopherson:
      https://lkml.org/lkml/2021/7/19/2970
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210810205251.424103-9-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      36222b11
    • KVM: X86: Introduce kvm_mmu_slot_lpages() helpers · 4139b197
      Peter Xu authored
      Introduce kvm_mmu_slot_lpages() to calculate the lpage_info and rmap array
      sizes.  A second helper, __kvm_mmu_slot_lpages(), takes an extra npages
      parameter rather than fetching it from the memslot pointer.  Start using
      the latter in kvm_alloc_memslot_metadata().
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210730220455.26054-4-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4139b197
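      [A hedged sketch of the helper pair described above; the real definitions
       live in the x86 KVM MMU headers.]

      static inline unsigned long
      __kvm_mmu_slot_lpages(struct kvm_memory_slot *slot, unsigned long npages,
                            int level)
      {
              return gfn_to_index(slot->base_gfn + npages - 1,
                                  slot->base_gfn, level) + 1;
      }

      static inline unsigned long
      kvm_mmu_slot_lpages(struct kvm_memory_slot *slot, int level)
      {
              return __kvm_mmu_slot_lpages(slot, slot->npages, level);
      }
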