1. 06 September 2021, 3 commits
    • kvm: x86: Increase KVM_SOFT_MAX_VCPUS to 710 · 1dbaf04c
      Committed by Eduardo Habkost
      Support for 710 VCPUs was tested by Red Hat since RHEL-8.4,
      so increase KVM_SOFT_MAX_VCPUS to 710.
      Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
      Message-Id: <20210903211600.2002377-4-ehabkost@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1dbaf04c
    • kvm: x86: Increase MAX_VCPUS to 1024 · 074c82c8
      Committed by Eduardo Habkost
      Increase KVM_MAX_VCPUS to 1024, so we can test larger VMs.
      
      I'm not changing KVM_SOFT_MAX_VCPUS yet because I'm afraid it
      might involve complicated questions around the meaning of
      "supported" and "recommended" in the upstream tree.
      KVM_SOFT_MAX_VCPUS will be changed in a separate patch.
      
      For reference, visible effects of this change are:
      - KVM_CAP_MAX_VCPUS will now return 1024 (of course); a userspace
        query sketch follows this entry
      - Default value for CPUID[HYPERV_CPUID_IMPLEMENT_LIMITS (0x40000005)].EAX
        will now be 1024
      - KVM_MAX_VCPU_ID will change from 1151 to 4096
      - Size of struct kvm will increase from 19328 to 22272 bytes
        (in x86_64)
      - Size of struct kvm_ioapic will increase from 1780 to 5084 bytes
        (in x86_64)
      - Bitmap stack variables that will grow:
        - At kvm_hv_flush_tlb() and kvm_hv_send_ipi(),
          vp_bitmap[] and vcpu_bitmap[] will now be 128 bytes long
        - vcpu_bitmap at ioapic_write_indirect() will be 128 bytes long
          once patch "KVM: x86: Fix stack-out-of-bounds memory access
          from ioapic_write_indirect()" is applied
      Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
      Message-Id: <20210903211600.2002377-3-ehabkost@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      074c82c8
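      For reference, a minimal userspace sketch (not part of the patch) of how
      the new limit becomes visible. It assumes a host exposing /dev/kvm and
      uses only KVM_CHECK_EXTENSION, where KVM_CAP_NR_VCPUS reports the
      recommended (soft) limit and KVM_CAP_MAX_VCPUS the hard limit:

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        int main(void)
        {
                int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);

                if (kvm < 0)
                        return 1;

                /* KVM_CHECK_EXTENSION returns the capability's value directly. */
                int soft = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS);
                int hard = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS);

                printf("recommended vCPUs: %d, maximum vCPUs: %d\n", soft, hard);
                return 0;
        }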
    • kvm: x86: Set KVM_MAX_VCPU_ID to 4*KVM_MAX_VCPUS · 4ddacd52
      Committed by Eduardo Habkost
      Instead of requiring KVM_MAX_VCPU_ID to be manually increased every
      time we increase KVM_MAX_VCPUS, set it to 4*KVM_MAX_VCPUS (the macro
      relationship is sketched after this entry).
      This should be enough for CPU topologies where Cores-per-Package
      and Packages-per-Socket are not powers of 2.
      
      In practice, this increases KVM_MAX_VCPU_ID from 1023 to 1152.
      The only side effect of this change is making some fields in
      struct kvm_ioapic larger, increasing the struct size from 1628 to
      1780 bytes (in x86_64).
      Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
      Message-Id: <20210903211600.2002377-2-ehabkost@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4ddacd52
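      For reference, a hedged sketch of the macro relationship described above;
      the surrounding header context is omitted and the exact upstream hunk may
      differ in detail:

        /* KVM_MAX_VCPU_ID now scales with KVM_MAX_VCPUS instead of being   */
        /* hard-coded (illustrative values: 4 * 288 = 1152, as noted above). */
        #define KVM_MAX_VCPUS   288
        #define KVM_MAX_VCPU_ID (4 * KVM_MAX_VCPUS)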
  2. 21 August 2021, 8 commits
    • KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host · cb0f722a
      Committed by Wei Huang
      When the 5-level page table CPU flag is set in the host, but the guest
      has CR4.LA57=0 (including the case of a 32-bit guest), the top level of
      the shadow NPT page tables will be fixed, consisting of one pointer to
      a lower-level table and 511 non-present entries.  Extend the existing
      code that creates the fixed PML4 or PDP table, to provide a fixed PML5
      table if needed.
      
      This is not needed on EPT because the number of layers in the tables
      is specified in the EPTP instead of depending on the host CR4.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Wei Huang <wei.huang2@amd.com>
      Message-Id: <20210818165549.3771014-3-wei.huang2@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cb0f722a
    • KVM: x86: Allow CPU to force vendor-specific TDP level · 746700d2
      Committed by Wei Huang
      Future AMD CPUs will require a 5-level NPT if host CR4.LA57 is set.
      To prevent kvm_mmu_get_tdp_level() from incorrectly changing NPT level
      on behalf of CPUs, add a new parameter in kvm_configure_mmu() to force
      a fixed TDP level.
      Signed-off-by: Wei Huang <wei.huang2@amd.com>
      Message-Id: <20210818165549.3771014-2-wei.huang2@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      746700d2
    • KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ · 61e5f69e
      Committed by Maxim Levitsky
      When enabled, KVM_GUESTDBG_BLOCKIRQ makes KVM block all interrupts
      while the vCPU is running (a userspace usage sketch follows this entry).
      
      This change is mostly intended for more robust single stepping
      of the guest and it has the following benefits when enabled:
      
      * Resuming from a breakpoint is much more reliable.
        When resuming execution from a breakpoint, with interrupts enabled,
        more often than not, KVM would inject an interrupt and make the CPU
        jump immediately to the interrupt handler and eventually return to
        the breakpoint, to trigger it again.
      
        From the user point of view it looks like the CPU never executed a
        single instruction and in some cases that can even prevent forward
        progress, for example, when the breakpoint is placed by an automated
        script (e.g. lx-symbols), which does something in response to the
        breakpoint and then continues the guest automatically.
        If the script execution takes enough time for another interrupt to
        arrive, the guest will be stuck on the same breakpoint RIP forever.
      
      * Normal single stepping is much more predictable, since it won't
        land the debugger into an interrupt handler.
      
      * RFLAGS.TF has less chance to be leaked to the guest:
      
        We set that flag behind the guest's back to do single stepping
        but if single step lands us into an interrupt/exception handler
        it will be leaked to the guest in the form of being pushed
        to the stack.
        This doesn't completely eliminate this problem as exceptions
        can still happen, but at least this reduces the chances
        of this happening.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210811122927.900604-6-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      61e5f69e
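      A minimal userspace sketch of how the new flag might be combined with
      single-stepping; 'vcpu_fd' is assumed to be an already-created vCPU file
      descriptor and error handling is reduced to the ioctl return value:

        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        /* Single-step the guest while blocking interrupt injection. */
        static int singlestep_without_irqs(int vcpu_fd)
        {
                struct kvm_guest_debug dbg;

                memset(&dbg, 0, sizeof(dbg));
                dbg.control = KVM_GUESTDBG_ENABLE |
                              KVM_GUESTDBG_SINGLESTEP |
                              KVM_GUESTDBG_BLOCKIRQ;   /* flag added by this patch */

                return ioctl(vcpu_fd, KVM_SET_GUEST_DEBUG, &dbg);
        }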
    • KVM: x86/mmu: Add detailed page size stats · 71f51d2c
      Committed by Mingwei Zhang
      Existing KVM code tracks the number of large pages regardless of their
      sizes.  Therefore, when 1GB (or larger) pages are in use, the
      information becomes less useful because lpages counts a mix of 1G and
      2M pages.
      
      So remove lpages, since it is easy for user space to aggregate the info,
      and instead provide comprehensive page stats for all sizes from 4K to 512G.
      Suggested-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Cc: Jing Zhang <jingzhangos@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210803044607.599629-4-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      71f51d2c
    • KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use · 0f250a64
      Committed by Vitaly Kuznetsov
      APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
      SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
      however, possible to track whether the feature was actually used by the
      guest and only inhibit APICv/AVIC when needed.
      
      TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
      Windows know that the AutoEOI feature should be avoided. While it's up to
      KVM userspace to set the flag, KVM can help a bit by exposing global
      APICv/AVIC enablement.
      
      Maxim:
         - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
           since this feature can be used regardless of AVIC
      
      Paolo:
         - use arch.apicv_update_lock to protect the hv->synic_auto_eoi_used
           instead of atomic ops
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210810205251.424103-12-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0f250a64
    • KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM · b0a1637f
      Committed by Maxim Levitsky
      Currently on SVM, kvm_request_apicv_update toggles the APICv
      memslot without doing any synchronization.
      
      If there is a mismatch between that memslot state and the AVIC state,
      on one of the vCPUs, an APIC mmio access can be lost:
      
      For example:
      
      VCPU0: enable the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT
      VCPU1: access an APIC mmio register.
      
      Since AVIC is still disabled on VCPU1, the access will not be intercepted
      by it, nor will it cause an MMIO fault; instead it will just be
      read/written from/to the dummy page mapped into the
      APIC_ACCESS_PAGE_PRIVATE_MEMSLOT.
      
      Fix that by adding a lock guarding the AVIC state changes, and carefully
      order the operations of kvm_request_apicv_update to avoid this race:
      
      1. Take the lock
      2. Send KVM_REQ_APICV_UPDATE
      3. Update the apic inhibit reason
      4. Release the lock
      
      This ensures that at step (2) all vCPUs are kicked out of guest mode
      but do not yet see the new AVIC state; only after step (4) can the
      other vCPUs update their AVIC state and resume (a sketch of this
      ordering follows this entry).
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210810205251.424103-10-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b0a1637f
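      A simplified sketch of the four-step ordering described above; the lock
      and field names follow the commit text, but the real
      kvm_request_apicv_update() contains additional logic omitted here:

        static void request_apicv_update_sketch(struct kvm *kvm, bool activate,
                                                unsigned long bit)
        {
                mutex_lock(&kvm->arch.apicv_update_lock);               /* 1. take the lock     */

                kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);   /* 2. kick all vCPUs    */

                if (activate)                                           /* 3. update the reason */
                        clear_bit(bit, &kvm->arch.apicv_inhibit_reasons);
                else
                        set_bit(bit, &kvm->arch.apicv_inhibit_reasons);

                mutex_unlock(&kvm->arch.apicv_update_lock);             /* 4. release the lock  */
        }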
    • KVM: x86: don't disable APICv memslot when inhibited · 36222b11
      Committed by Maxim Levitsky
      Thanks to the previous patches, it is now possible to keep the APICv
      memslot always enabled; it will simply be invisible to the guest
      when APICv is inhibited.
      
      This code is based on a suggestion from Sean Christopherson:
      https://lkml.org/lkml/2021/7/19/2970
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210810205251.424103-9-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      36222b11
    • KVM: X86: Introduce kvm_mmu_slot_lpages() helpers · 4139b197
      Committed by Peter Xu
      Introduce kvm_mmu_slot_lpages() helpers to calculate the lpage_info and
      rmap array sizes.  The __kvm_mmu_slot_lpages() variant takes an extra
      npages parameter rather than fetching it from the memslot pointer.
      Start using the latter in kvm_alloc_memslot_metadata().  (An arithmetic
      sketch follows this entry.)
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210730220455.26054-4-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4139b197
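      An arithmetic sketch of what such a helper computes, assuming the
      standard KVM_PAGES_PER_HPAGE() macro; the helper name is hypothetical
      and the upstream implementation may differ in form:

        /* Number of level-'level' large pages spanned by a slot of 'npages'
         * 4K pages starting at 'base_gfn', i.e. how many lpage_info/rmap
         * entries that level needs. */
        static inline unsigned long slot_lpages_sketch(gfn_t base_gfn,
                                                       unsigned long npages,
                                                       int level)
        {
                unsigned long per_lpage = KVM_PAGES_PER_HPAGE(level);

                return (base_gfn + npages - 1) / per_lpage -
                       base_gfn / per_lpage + 1;
        }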
  3. 13 August 2021, 4 commits
    • KVM: x86: Move declaration of kvm_spurious_fault() to x86.h · 65297341
      Committed by Uros Bizjak
      Move the declaration of kvm_spurious_fault() to KVM's "private" x86.h;
      it should never be called by anything other than low-level KVM code.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
      [sean: rebased to a series without __ex()/__kvm_handle_fault_on_reboot()]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210809173955.1710866-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      65297341
    • KVM: x86: Kill off __ex() and __kvm_handle_fault_on_reboot() · ad0577c3
      Committed by Sean Christopherson
      Remove the __kvm_handle_fault_on_reboot() and __ex() macros now that all
      VMX and SVM instructions use asm goto to handle the fault (or in the
      case of VMREAD, completely custom logic).  Drop kvm_spurious_fault()'s
      asmlinkage annotation as __kvm_handle_fault_on_reboot() was the only
      flow that invoked it from assembly code.
      
      Cc: Uros Bizjak <ubizjak@gmail.com>
      Cc: Like Xu <like.xu.linux@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210809173955.1710866-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ad0577c3
    • KVM: X86: Remove unneeded KVM_DEBUGREG_RELOAD · 34e9f860
      Committed by Lai Jiangshan
      Commit ae561ede ("KVM: x86: DR0-DR3 are not clear on reset") added code to
      ensure eff_db are updated when they're modified through non-standard paths.
      
      But there is no reason to also update hardware DRs unless hardware breakpoints
      are active or DR exiting is disabled, and in those cases updating hardware is
      handled by KVM_DEBUGREG_WONT_EXIT and KVM_DEBUGREG_BP_ENABLED.
      
      KVM_DEBUGREG_RELOAD just causes unnecessary loads of hardware DRs and is
      better removed.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20210809174307.145263-1-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      34e9f860
    • KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock · ce25681d
      Committed by Sean Christopherson
      Add yet another spinlock for the TDP MMU and take it when marking
      indirect shadow pages unsync (a simplified sketch follows this entry).
      When using the TDP MMU and L1 is running L2(s) with
      nested TDP, KVM may encounter shadow pages for the TDP entries managed by
      L1 (controlling L2) when handling a TDP MMU page fault.  The unsync logic
      is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
      misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
      which runs with mmu_lock held for read, not write.
      
      Lack of a critical section manifests most visibly as an underflow of
      unsync_children in clear_unsync_child_bit() due to unsync_children being
      corrupted when multiple CPUs write it without a critical section and
      without atomic operations.  But underflow is the best case scenario.  The
      worst case scenario is that unsync_children prematurely hits '0' and
      leads to guest memory corruption due to KVM neglecting to properly sync
      shadow pages.
      
      Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
      would functionally be ok.  Usurping the lock could degrade performance when
      building upper level page tables on different vCPUs, especially since the
      unsync flow could hold the lock for a comparatively long time depending on
      the number of indirect shadow pages and the depth of the paging tree.
      
      For simplicity, take the lock for all MMUs, even though KVM could fairly
      easily know that mmu_lock is held for write.  If mmu_lock is held for
      write, there cannot be contention for the inner spinlock, and marking
      shadow pages unsync across multiple vCPUs will be slow enough that
      bouncing the kvm_arch cacheline should be in the noise.
      
      Note, even though L2 could theoretically be given access to its own EPT
      entries, a nested MMU must hold mmu_lock for write and thus cannot race
      against a TDP MMU page fault.  I.e. the additional spinlock only _needs_ to
      be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
      that is running with the TDP MMU enabled.  Holding mmu_lock for read also
      prevents the indirect shadow page from being freed.  But as above, keep
      it simple and always take the lock.
      
      Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
      effectively disable unsync behavior for nested TDP.  Write protecting leaf
      shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
      VMMs typically don't modify TDP entries, but the same may not hold true for
      non-standard use cases and/or VMMs that are migrating physical pages (from
      L1's perspective).
      
      Alternative #2, the unsync logic could be made thread safe.  In theory,
      simply converting all relevant kvm_mmu_page fields to atomics and using
      atomic bitops for the bitmap would suffice.  However, (a) an in-depth audit
      would be required, (b) the code churn would be substantial, and (c) legacy
      shadow paging would incur additional atomic operations in performance
      sensitive paths for no benefit (to legacy shadow paging).
      
      Fixes: a2855afc ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181815.3378104-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ce25681d
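      A heavily simplified sketch of the pattern described above; the spinlock
      field name is taken from the commit context and the unsync bookkeeping is
      reduced to a placeholder helper:

        /* Serialize unsync bookkeeping against TDP MMU page faults that hold
         * mmu_lock only for read. */
        static void mark_sp_unsync_locked(struct kvm *kvm, struct kvm_mmu_page *sp)
        {
                spin_lock(&kvm->arch.mmu_unsync_pages_lock);
                if (!sp->unsync)
                        do_mark_unsync(kvm, sp);        /* placeholder for the real bookkeeping */
                spin_unlock(&kvm->arch.mmu_unsync_pages_lock);
        }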
  4. 05 August 2021, 1 commit
    • KVM: xen: do not use struct gfn_to_hva_cache · 319afe68
      Committed by Paolo Bonzini
      gfn_to_hva_cache is not thread-safe, so it is usually used only within
      a vCPU (whose code is protected by vcpu->mutex).  The Xen interface
      implementation has such a cache in kvm->arch, but it is not really
      used except to store the location of the shared info page.  Replace
      shinfo_set and shinfo_cache with just the value that is passed via
      KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the
      initialization value is not zero anymore and therefore kvm_xen_init_vm
      needs to be introduced.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      319afe68
  5. 04 August 2021, 1 commit
    • KVM: x86/pmu: Introduce pmc->is_paused to reduce the call time of perf interfaces · e79f49c3
      Committed by Like Xu
      Based on our observations, after any vm-exit associated with vPMU, at
      least two perf interfaces have to be called for guest counter
      emulation, such as perf_event_{pause, read_value, period}(), and each
      one will {lock, unlock} the same perf_event_ctx.  The calls become even
      more frequent when the guest uses counters in a multiplexed manner.
      
      Holding a lock once and completing the KVM request operations in the perf
      context would introduce a set of impractical new interfaces. So we can
      further optimize the vPMU implementation by avoiding repeated calls to
      these interfaces in the KVM context for at least one pattern:
      
      After we call perf_event_pause() once, the event is disabled and its
      internal count is reset to 0, so there is no need to pause it again or
      read its value.  Once the event is paused, its period will not be
      updated until the next time it is resumed or reprogrammed, and there is
      also no need to call perf_event_period() twice for a non-running
      counter, given that the perf_event for a running counter is never
      paused (a sketch of this pattern follows this entry).
      
      Based on this implementation, for the following common usage of
      sampling 4 events using perf on a 4u8g guest:
      
        echo 0 > /proc/sys/kernel/watchdog
        echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
        echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate
        echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
        for i in `seq 1 1 10`
        do
        taskset -c 0 perf record \
        -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \
        /root/br_instr a
        done
      
      the average latency of the guest NMI handler is reduced from
      37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server.
      Also, in addition to collecting more samples, no loss of sampling
      accuracy was observed compared to before the optimization.
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20210728120705.6855-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      e79f49c3
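      A hedged sketch of the pause-side pattern; the field and helper names
      mirror the commit title, while the real KVM helper also masks the counter
      and performs additional checks:

        static void pmc_pause_counter_sketch(struct kvm_pmc *pmc)
        {
                if (!pmc->perf_event || pmc->is_paused)
                        return;         /* nothing to pause, or already paused */

                /* perf_event_pause() disables the event and returns (and resets)
                 * its accumulated count. */
                pmc->counter += perf_event_pause(pmc->perf_event, true);
                pmc->is_paused = true;
        }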
  6. 03 August 2021, 1 commit
  7. 02 August 2021, 2 commits
  8. 25 June 2021, 10 commits
    • KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled · a01b45e9
      Committed by Maxim Levitsky
      This better reflects the purpose of this variable on AMD, since
      on AMD the AVIC's memory slot can be enabled and disabled dynamically.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210623113002.111448-4-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a01b45e9
    • kvm: x86: Allow userspace to handle emulation errors · 19238e75
      Committed by Aaron Lewis
      Add a fallback mechanism to the in-kernel instruction emulator that
      gives userspace the opportunity to process an instruction the emulator
      was unable to handle.  When the in-kernel emulator fails to process an
      instruction, it either injects a #UD into the guest or exits to
      userspace with exit reason KVM_INTERNAL_ERROR, because it does not know
      how to proceed in an appropriate manner.  This feature lets userspace
      get involved to see if it can figure out a better path forward.  (A
      capability-enable sketch follows this entry.)
      Signed-off-by: Aaron Lewis <aaronlewis@google.com>
      Reviewed-by: David Edmondson <david.edmondson@oracle.com>
      Message-Id: <20210510144834.658457-2-aaronlewis@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      19238e75
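      A minimal userspace sketch of opting in to the new behavior; it assumes
      'vm_fd' is an existing VM file descriptor and that the capability name
      introduced by this series is KVM_CAP_EXIT_ON_EMULATION_FAILURE:

        #include <string.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        static int enable_emulation_failure_exits(int vm_fd)
        {
                struct kvm_enable_cap cap;

                memset(&cap, 0, sizeof(cap));
                cap.cap = KVM_CAP_EXIT_ON_EMULATION_FAILURE;   /* assumed name */
                cap.args[0] = 1;

                /* With this enabled, emulation failures exit to userspace with
                 * instruction context instead of only injecting #UD or
                 * reporting an internal error. */
                return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
        }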
    • KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic · 7cd138db
      Committed by Sean Christopherson
      Drop the pre-computed last_nonleaf_level, which is arguably wrong and at
      best confusing.  Per the comment:
      
        Can have large pages at levels 2..last_nonleaf_level-1.
      
      the intent of the variable would appear to be to track what levels can
      _legally_ have large pages, but that intent doesn't align with reality.
      The computed value will be wrong for 5-level paging, or if 1gb pages are
      not supported.
      
      The flawed code is not a problem in practice, because except for 32-bit
      PSE paging, bit 7 is reserved if large pages aren't supported at the
      level.  Take advantage of this invariant and simply omit the level magic
      math for 64-bit page tables (including PAE).
      
      For 32-bit paging (non-PAE), the adjustments are needed purely because
      bit 7 is ignored if PSE=0.  Retain that logic as is, but make
      is_last_gpte() unique per PTTYPE so that the PSE check is avoided for
      PAE and EPT paging.  In the spirit of avoiding branches, bump the "last
      nonleaf level" for 32-bit PSE paging by adding the PSE bit itself.
      
      Note, bit 7 is ignored or has other meaning in CR3/EPTP, but despite
      FNAME(walk_addr_generic) briefly grabbing CR3/EPTP in "pte", they are
      not PTEs and will blow up all the other gpte helpers.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-51-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7cd138db
    • KVM: x86: Enhance comments for MMU roles and nested transition trickiness · 616007c8
      Committed by Sean Christopherson
      Expand the comments for the MMU roles.  The interactions with gfn_track
      and PGD reuse in particular are hairy.
      
      Regarding PGD reuse, add comments in the nested virtualization flows to
      call out why kvm_init_mmu() is unconditionally called even when nested
      TDP is used.
      
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-50-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      616007c8
    • KVM: x86/mmu: Drop "nx" from MMU context now that there are no readers · a4c93252
      Committed by Sean Christopherson
      Drop kvm_mmu.nx as there are no consumers left.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-39-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a4c93252
    • KVM: x86/mmu: Rename "nxe" role bit to "efer_nx" for macro shenanigans · 167f8a5c
      Committed by Sean Christopherson
      Rename "nxe" to "efer_nx" so that future macro magic can use the pattern
      <reg>_<bit> for all CR0, CR4, and EFER bits that are included in the role.
      Using "efer_nx" also makes it clear that the role bit reflects EFER.NX,
      not the NX bit in the corresponding PTE.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-25-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      167f8a5c
    • Revert "KVM: MMU: record maximum physical address width in kvm_mmu_extended_role" · 6c032f12
      Committed by Sean Christopherson
      Drop MAXPHYADDR from mmu_role now that all MMUs have their role
      invalidated after a CPUID update.  Invalidating the role forces all MMUs
      to re-evaluate the guest's MAXPHYADDR, and the guest's MAXPHYADDR can
      be changed only through a CPUID update.
      
      This reverts commit de3ccd26.
      
      Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6c032f12
    • KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken · 63f5a190
      Committed by Sean Christopherson
      Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest
      instability.  Initialize last_vmentry_cpu to -1 and use it to detect if
      the vCPU has been run at least once when its CPUID model is changed.
      
      KVM does not correctly handle changes to paging related settings in the
      guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc...  KVM
      could theoretically zap all shadow pages, but actually making that happen
      is a mess due to lock inversion (vcpu->mutex is held).  And even then,
      updating paging settings on the fly would only work if all vCPUs are
      stopped, updated in concert with identical settings, then restarted.
      
      To support running vCPUs with different vCPU models (that affect paging),
      KVM would need to track all relevant information in kvm_mmu_page_role.
      Note, that's the _page_ role, not the full mmu_role.  Updating mmu_role
      isn't sufficient as a vCPU can reuse a shadow page translation that was
      created by a vCPU with different settings and thus completely skip the
      reserved bit checks (that are tied to CPUID).
      
      Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as
      it would require doubling gfn_track from a u16 to a u32, i.e. would
      increase KVM's memory footprint by 2 bytes for every 4kb of guest memory.
      E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT
      would all need to be tracked.
      
      In practice, there is no remotely sane use case for changing any paging
      related CPUID entries on the fly, so just sweep it under the rug (after
      yelling at userspace).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      63f5a190
    • KVM: x86: Force all MMUs to reinitialize if guest CPUID is modified · 49c6f875
      Committed by Sean Christopherson
      Invalidate all MMUs' roles after a CPUID update to force reinitialization
      of the MMU context/helpers.  Despite the efforts of commit de3ccd26
      ("KVM: MMU: record maximum physical address width in kvm_mmu_extended_role"),
      there are still a handful of CPUID-based properties that affect MMU
      behavior but are not incorporated into mmu_role.  E.g. 1gb hugepage
      support, AMD vs. Intel handling of bit 8, and SEV's C-Bit location all
      factor into the guest's reserved PTE bits.
      
      The obvious alternative would be to add all such properties to mmu_role,
      but doing so provides no benefit over simply forcing a reinitialization
      on every CPUID update, as setting guest CPUID is a rare operation.
      
      Note, reinitializing all MMUs after a CPUID update does not fix all of
      KVM's woes.  Specifically, kvm_mmu_page_role doesn't track the CPUID
      properties, which means that a vCPU can reuse shadow pages that should
      not exist for the new vCPU model, e.g. that map GPAs that are now illegal
      (due to MAXPHYADDR changes) or that set bits that are now reserved
      (PAGE_SIZE for 1gb pages), etc...
      
      Tracking the relevant CPUID properties in kvm_mmu_page_role would address
      the majority of problems, but fully tracking that much state in the
      shadow page role comes with an unpalatable cost as it would require a
      non-trivial increase in KVM's memory footprint.  The GBPAGES case is even
      worse, as neither Intel nor AMD provides a way to disable 1gb hugepage
      support in the hardware page walker, i.e. it's a virtualization hole that
      can't be closed when using TDP.
      
      In other words, resetting the MMU after a CPUID update is largely a
      superficial fix.  But, it will allow reverting the tracking of MAXPHYADDR
      in the mmu_role, and that case in particular needs to mostly work because
      KVM's shadow_root_level depends on guest MAXPHYADDR when 5-level paging
      is supported.  For cases where KVM botches guest behavior, the damage is
      limited to that guest.  But for the shadow_root_level, a misconfigured
      MMU can cause KVM to incorrectly access memory, e.g. due to walking off
      the end of its shadow page tables.
      
      Fixes: 7dcd5755 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
      Cc: Yu Zhang <yu.c.zhang@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      49c6f875
    • Revert "KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack" · f71a53d1
      Committed by Sean Christopherson
      Restore CR4.LA57 to the mmu_role to fix an amusing edge case with nested
      virtualization.  When KVM (L0) is using TDP, CR4.LA57 is not reflected in
      mmu_role.base.level because that tracks the shadow root level, i.e. TDP
      level.  Normally, this is not an issue because LA57 can't be toggled
      while long mode is active, i.e. the guest has to first disable paging,
      then toggle LA57, then re-enable paging, thus ensuring an MMU
      reinitialization.
      
      But if L1 is crafty, it can load a new CR4 on VM-Exit and toggle LA57
      without having to bounce through an unpaged section.  L1 can also load a
      new CR3 on exit, i.e. it doesn't even need to play crazy paging games, a
      single entry PML5 is sufficient.  Such shenanigans are only problematic
      if L0 and L1 use TDP, otherwise L1 and L2 share an MMU that gets
      reinitialized on nested VM-Enter/VM-Exit due to mmu_role.base.guest_mode.
      
      Note, in the L2 case with nested TDP, even though L1 can switch between
      L2s with different LA57 settings, thus bypassing the paging requirement,
      in that case KVM's nested_mmu will track LA57 in base.level.
      
      This reverts commit 8053f924.
      
      Fixes: 8053f924 ("KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f71a53d1
  9. 24 June 2021, 1 commit
  10. 18 June 2021, 9 commits