1. 19 Nov 2022, 17 commits
  2. 18 Nov 2022, 8 commits
    • KVM: x86: remove exit_int_info warning in svm_handle_exit · 05311ce9
      Committed by Maxim Levitsky
      It is valid to receive an external interrupt and have a broken IDT entry,
      which will lead to a #GP with exit_int_info containing the index of the
      IDT entry (i.e. any value).
      
      Other exceptions can happen as well, like #NP or #SS
      (if stack switch fails).
      
      Thus this warning can be user-triggered and has very little value.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-10-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      05311ce9
    • KVM: x86: allow L1 to not intercept triple fault · 92e7d5c8
      Committed by Maxim Levitsky
      This is an SVM correctness fix: although a sane L1 would intercept the
      SHUTDOWN event, it doesn't have to, so we have to honour this.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-8-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      92e7d5c8
    • KVM: x86: forcibly leave nested mode on vCPU reset · ed129ec9
      Committed by Maxim Levitsky
      While not obvious, kvm_vcpu_reset() leaves nested mode by clearing
      'vcpu->arch.hflags', but it does so without all the required housekeeping.
      
      On SVM, it is possible to have a vCPU reset while in guest mode because,
      unlike VMX, INITs are not latched in SVM non-root mode. In addition,
      L1 doesn't have to intercept triple fault, which should also trigger
      L1's reset if it happens in L2 while L1 didn't intercept it.
      
      If one of the above conditions happens, KVM will continue to use vmcb02
      while not actually being in guest mode.
      
      Later, IA32_EFER will be cleared, which leads to freeing of the nested
      guest state; this (correctly) frees vmcb02, but since KVM (incorrectly)
      still uses it, the result is a use-after-free and a kernel crash.
      
      This issue is assigned CVE-2022-3344.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-5-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ed129ec9
    • KVM: x86: add kvm_leave_nested · f9697df2
      Committed by Maxim Levitsky
      Add kvm_leave_nested(), which wraps the call to nested_ops->leave_nested()
      in a named helper function; a minimal sketch of the pattern follows this
      entry.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-4-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f9697df2
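      For illustration only, here is a minimal, self-contained sketch of the
      wrapper pattern described above. The struct layouts, field names, and the
      leave_nested callback below are simplified stand-ins, not the actual KVM
      definitions.

      #include <stdio.h>

      /* Hypothetical stand-ins for the real KVM types. */
      struct kvm_vcpu;

      struct kvm_x86_nested_ops {
          /* Forces the vCPU out of nested (guest) mode. */
          void (*leave_nested)(struct kvm_vcpu *vcpu);
      };

      struct kvm_vcpu {
          int guest_mode;
          const struct kvm_x86_nested_ops *nested_ops;
      };

      /*
       * Wrapper in the spirit of kvm_leave_nested(): callers use one named
       * helper instead of open-coding the indirect call at every site.
       */
      static void kvm_leave_nested(struct kvm_vcpu *vcpu)
      {
          vcpu->nested_ops->leave_nested(vcpu);
      }

      static void demo_leave_nested(struct kvm_vcpu *vcpu)
      {
          vcpu->guest_mode = 0;
          printf("left nested mode\n");
      }

      static const struct kvm_x86_nested_ops demo_ops = {
          .leave_nested = demo_leave_nested,
      };

      int main(void)
      {
          struct kvm_vcpu vcpu = { .guest_mode = 1, .nested_ops = &demo_ops };

          kvm_leave_nested(&vcpu);   /* e.g. called from vCPU reset or free */
          return 0;
      }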
    • KVM: x86: nSVM: harden svm_free_nested against freeing vmcb02 while still in use · 16ae56d7
      Committed by Maxim Levitsky
      Make sure that KVM uses vmcb01 before freeing nested state, and warn if
      that is not the case (see the sketch after this entry).
      
      This is a minimal fix for CVE-2022-3344, making the kernel print a
      warning instead of panicking.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-3-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      16ae56d7
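      As a rough illustration of the hardening described above (not the real
      svm_free_nested(); the structures and the WARN_ON-style macro here are
      simplified stand-ins):

      #include <stdio.h>
      #include <stdlib.h>

      /* Stand-in for a WARN_ON-style macro: report, but don't crash. */
      #define WARN_ON(cond) \
          ((cond) ? (fprintf(stderr, "WARNING: %s\n", #cond), 1) : 0)

      struct vmcb { int dummy; };

      struct vcpu_svm {
          struct vmcb *vmcb;       /* the VMCB currently in use */
          struct vmcb *vmcb01;     /* L1's VMCB */
          struct vmcb *vmcb02;     /* nested guest's VMCB */
      };

      /*
       * Free nested state, but first make sure the vCPU is no longer running
       * on vmcb02.  If it is, warn and switch back to vmcb01 so that freeing
       * vmcb02 below cannot turn into a use-after-free.
       */
      static void svm_free_nested_sketch(struct vcpu_svm *svm)
      {
          if (WARN_ON(svm->vmcb == svm->vmcb02))
              svm->vmcb = svm->vmcb01;

          free(svm->vmcb02);
          svm->vmcb02 = NULL;
      }

      int main(void)
      {
          struct vcpu_svm svm;

          svm.vmcb01 = malloc(sizeof(*svm.vmcb01));
          svm.vmcb02 = malloc(sizeof(*svm.vmcb02));
          svm.vmcb = svm.vmcb02;          /* simulate the buggy condition */

          svm_free_nested_sketch(&svm);
          printf("vmcb now points at vmcb01: %d\n", svm.vmcb == svm.vmcb01);

          free(svm.vmcb01);
          return 0;
      }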
    • KVM: x86: nSVM: leave nested mode on vCPU free · 917401f2
      Committed by Maxim Levitsky
      If the VM was terminated while nested, we free the nested state
      while the vCPU is still in nested mode.
      
      Soon a warning will be added for this condition.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20221103141351.50662-2-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      917401f2
    • KVM: x86/mmu: Do not recover dirty-tracked NX Huge Pages · eb298605
      Committed by David Matlack
      Do not recover (i.e. zap) an NX Huge Page that is being dirty tracked,
      as it will just be faulted back in at the same 4KiB granularity when
      accessed by a vCPU. This may need to be changed if KVM ever supports
      2MiB (or larger) dirty tracking granularity, or faulting huge pages
      during dirty tracking for reads/executes. However, for now, these zaps
      are entirely wasteful.
      
      In order to check if this commit increases the CPU usage of the NX
      recovery worker thread I used a modified version of execute_perf_test
      [1] that supports splitting guest memory into multiple slots and reports
      /proc/pid/schedstat:se.sum_exec_runtime for the NX recovery worker just
      before tearing down the VM. The goal was to force a large number of NX
      Huge Page recoveries and see if the recovery worker used any more CPU.
      
      Test Setup:
      
        echo 1000 > /sys/module/kvm/parameters/nx_huge_pages_recovery_period_ms
        echo 10 > /sys/module/kvm/parameters/nx_huge_pages_recovery_ratio
      
      Test Command:
      
        ./execute_perf_test -v64 -s anonymous_hugetlb_1gb -x 16 -o
      
              | kvm-nx-lpage-re:se.sum_exec_runtime      |
              | ---------------------------------------- |
      Run     | Before             | After               |
      ------- | ------------------ | ------------------- |
      1       | 730.084105         | 724.375314          |
      2       | 728.751339         | 740.581988          |
      3       | 736.264720         | 757.078163          |
      
      Comparing the median results, this commit results in about a 1% increase
      in CPU usage of the NX recovery worker when testing a VM with 16 slots.
      However, the effect is negligible with the default halving time of NX
      pages, which is 1 hour rather than 10 seconds given by period_ms = 1000,
      ratio = 10.
      
      [1] https://lore.kernel.org/kvm/20221019234050.3919566-2-dmatlack@google.com/
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20221103204421.1146958-1-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      eb298605
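      A rough sketch of the new check in the commit above, as a standalone
      illustration (the memslot/shadow-page structures and helper names below
      are invented for the example, not KVM's real ones):

      #include <stdbool.h>
      #include <stdio.h>

      /* Invented, simplified stand-ins for the example. */
      struct memslot {
          bool dirty_logging_enabled;
      };

      struct nx_huge_page {
          struct memslot *slot;
          const char *name;
      };

      /*
       * Recovery loop sketch: skip (do not zap) any NX huge page whose memslot
       * is currently being dirty tracked, since it would be faulted back in at
       * 4KiB granularity anyway and zapping it is wasted work.
       */
      static void recover_nx_huge_pages(struct nx_huge_page *pages, int n)
      {
          for (int i = 0; i < n; i++) {
              if (pages[i].slot->dirty_logging_enabled) {
                  printf("skip %s (dirty tracked)\n", pages[i].name);
                  continue;
              }
              printf("zap %s\n", pages[i].name);
          }
      }

      int main(void)
      {
          struct memslot logged = { .dirty_logging_enabled = true };
          struct memslot plain  = { .dirty_logging_enabled = false };
          struct nx_huge_page pages[] = {
              { &logged, "sp0" },
              { &plain,  "sp1" },
          };

          recover_nx_huge_pages(pages, 2);
          return 0;
      }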
    • KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry · 63d28a25
      Committed by Paolo Bonzini
      A removed SPTE is never present, hence the "if" in kvm_tdp_mmu_map
      only fails in the exact same conditions that the earlier loop
      tested in order to issue a "break". So, instead of checking the
      condition twice (upper-level SPTEs could not be created or were
      frozen), just exit the loop with a goto---the usual poor man's C
      replacement for RAII early returns.
      
      While at it, do not use the "ret" variable for return values of
      functions that do not return a RET_PF_* enum.  This is clearer
      and also makes it possible to initialize ret to RET_PF_RETRY.
      Suggested-by: Robert Hoo <robert.hu@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      63d28a25
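      To illustrate the refactoring pattern (duplicated condition check vs. a
      single goto-based exit), here is a generic, self-contained sketch; it does
      not use KVM's real data structures or function names:

      #include <stdbool.h>
      #include <stdio.h>

      enum result { RET_RETRY, RET_DONE };

      static bool build_upper_level(int level, bool *frozen)
      {
          /* Pretend the level-2 entry is frozen by a concurrent operation. */
          *frozen = (level == 2);
          return !*frozen;
      }

      /*
       * Before: the loop breaks on failure, and the code after the loop checks
       * essentially the same condition again to decide whether to bail out.
       * After (shown here): failure paths jump straight to a single exit label,
       * so the condition is evaluated exactly once.
       */
      static enum result map_sketch(void)
      {
          enum result ret = RET_RETRY;   /* initialized to the "retry" value */
          bool frozen;

          for (int level = 4; level > 1; level--) {
              if (!build_upper_level(level, &frozen))
                  goto out;              /* could not create / entry frozen */
          }

          /* Only reached if every upper level was successfully created. */
          ret = RET_DONE;
      out:
          return ret;
      }

      int main(void)
      {
          printf("map_sketch() = %s\n",
                 map_sketch() == RET_DONE ? "RET_DONE" : "RET_RETRY");
          return 0;
      }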
  3. 17 Nov 2022, 1 commit
    • KVM: x86/mmu: Split huge pages mapped by the TDP MMU on fault · c4b33d28
      Committed by David Matlack
      Now that the TDP MMU has a mechanism to split huge pages, use it in the
      fault path when a huge page needs to be replaced with a mapping at a
      lower level.
      
      This change reduces the negative performance impact of NX HugePages.
      Prior to this change, if a vCPU executed from a huge page and NX
      HugePages was enabled, the vCPU would take a fault, zap the huge page,
      and map the faulting address at 4KiB with execute permissions enabled.
      The rest of the memory would be left *unmapped* and have to be
      faulted back in by the guest upon access (read, write, or execute). If
      the guest is backed by 1GiB pages, a single executed instruction can zap
      an entire GiB of its physical address space.
      
      For example, it can take a VM longer to execute from its memory than to
      populate that memory in the first place:
      
      $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
      
      Populating memory             : 2.748378795s
      Executing from memory         : 2.899670885s
      
      With this change, such faults split the huge page instead of zapping it,
      which avoids the non-present faults on the rest of the huge page:
      
      $ ./execute_perf_test -s anonymous_hugetlb_1gb -v96
      
      Populating memory             : 2.729544474s
      Executing from memory         : 0.111965688s   <---
      
      This change also reduces the performance impact of dirty logging when
      eager_page_split=N. eager_page_split=N (abbreviated "eps=N" below) can
      be desirable for read-heavy workloads, as it avoids allocating memory to
      split huge pages that are never written and avoids increasing the TLB
      miss cost on reads of those pages.
      
                   | Config: ept=Y, tdp_mmu=Y, 5% writes           |
                   | Iteration 1 dirty memory time                 |
                   | --------------------------------------------- |
      vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
      ------------ | -------------- | ------------- | ------------ |
      2            | 0.332305091s   | 0.019615027s  | 0.006108211s |
      4            | 0.353096020s   | 0.019452131s  | 0.006214670s |
      8            | 0.453938562s   | 0.019748246s  | 0.006610997s |
      16           | 0.719095024s   | 0.019972171s  | 0.007757889s |
      32           | 1.698727124s   | 0.021361615s  | 0.012274432s |
      64           | 2.630673582s   | 0.031122014s  | 0.016994683s |
      96           | 3.016535213s   | 0.062608739s  | 0.044760838s |
      
      Eager page splitting remains beneficial for write-heavy workloads, but
      the gap is now reduced.
      
                   | Config: ept=Y, tdp_mmu=Y, 100% writes         |
                   | Iteration 1 dirty memory time                 |
                   | --------------------------------------------- |
      vCPU Count   | eps=N (Before) | eps=N (After) | eps=Y        |
      ------------ | -------------- | ------------- | ------------ |
      2            | 0.317710329s   | 0.296204596s  | 0.058689782s |
      4            | 0.337102375s   | 0.299841017s  | 0.060343076s |
      8            | 0.386025681s   | 0.297274460s  | 0.060399702s |
      16           | 0.791462524s   | 0.298942578s  | 0.062508699s |
      32           | 1.719646014s   | 0.313101996s  | 0.075984855s |
      64           | 2.527973150s   | 0.455779206s  | 0.079789363s |
      96           | 2.681123208s   | 0.673778787s  | 0.165386739s |
      
      Further study is needed to determine if the remaining gap is acceptable
      for customer workloads or if eager_page_split=N still requires a-priori
      knowledge of the VM workload, especially when considering these costs
      extrapolated out to large VMs with e.g. 416 vCPUs and 12TB RAM.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20221109185905.486172-3-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c4b33d28
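      As a loose, generic illustration of split-on-fault versus zap-on-fault
      (the simplified page-table model below is invented for this example and
      is not the TDP MMU's actual splitting code):

      #include <stdio.h>
      #include <stdlib.h>

      #define ENTRIES_PER_TABLE 512   /* 2MiB huge page -> 512 x 4KiB entries */

      struct pte {
          int present;
          int huge;                    /* 1 = maps a 2MiB region directly */
          int exec;
          struct pte *children;        /* valid when !huge */
      };

      /*
       * Split-on-fault: instead of zapping the huge mapping (leaving the other
       * 511 small pages unmapped and forcing 511 extra faults), install a child
       * table whose entries keep the region mapped, then handle only the 4KiB
       * entry covering the faulting address (here: make it executable).
       */
      static void split_huge_page(struct pte *huge, int faulting_index)
      {
          struct pte *children = calloc(ENTRIES_PER_TABLE, sizeof(*children));

          for (int i = 0; i < ENTRIES_PER_TABLE; i++) {
              children[i].present = 1;   /* no extra non-present faults later */
              children[i].exec = 0;      /* NX mitigation still applies */
          }
          children[faulting_index].exec = 1;

          huge->huge = 0;
          huge->children = children;
      }

      int main(void)
      {
          struct pte huge = { .present = 1, .huge = 1, .exec = 1 };

          split_huge_page(&huge, 7);
          printf("entry 7 exec=%d, entry 8 exec=%d, entry 8 present=%d\n",
                 huge.children[7].exec, huge.children[8].exec,
                 huge.children[8].present);
          free(huge.children);
          return 0;
      }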
  4. 10 Nov 2022, 14 commits
    • KVM: replace direct irq.h inclusion · d663b8a2
      Committed by Paolo Bonzini
      virt/kvm/irqchip.c is including "irq.h" from the arch-specific KVM source
      directory (i.e. not from arch/*/include) for the sole purpose of retrieving
      irqchip_in_kernel.
      
      Making the function inline in a header that is already included,
      such as asm/kvm_host.h, is not possible because it needs to look at
      struct kvm which is defined after asm/kvm_host.h is included.  So add a
      kvm_arch_irqchip_in_kernel non-inline function; irqchip_in_kernel() is
      only performance critical on arm64 and x86, and the non-inline function
      is enough on all other architectures.
      
      irq.h can then be deleted from all architectures except x86.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d663b8a2
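      A condensed, single-file sketch of the pattern (generic code calls a
      non-inline per-arch function instead of including an arch-private header);
      the struct fields below are invented for the example:

      #include <stdbool.h>
      #include <stdio.h>

      /* --- generic header (think include/linux/kvm_host.h) declares: --- */
      struct kvm;
      bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);

      /* --- struct kvm is fully defined later, after arch headers: --- */
      struct kvm {
          bool irqchip_created;      /* invented field for the example */
      };

      /*
       * --- arch-specific source provides the out-of-line definition, so
       * generic code (think virt/kvm/irqchip.c) no longer needs the
       * arch-private "irq.h" just to see an inline irqchip_in_kernel().
       */
      bool kvm_arch_irqchip_in_kernel(struct kvm *kvm)
      {
          return kvm->irqchip_created;
      }

      /* --- generic caller --- */
      int main(void)
      {
          struct kvm vm = { .irqchip_created = true };

          if (kvm_arch_irqchip_in_kernel(&vm))
              printf("in-kernel irqchip\n");
          return 0;
      }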
    • KVM: x86/pmu: Defer counter emulated overflow via pmc->prev_counter · de0f6195
      Committed by Like Xu
      Defer reprogramming counters and handling overflow via KVM_REQ_PMU
      when incrementing counters.  KVM skips emulated WRMSR in the VM-Exit
      fastpath, the fastpath runs with IRQs disabled, skipping instructions
      can increment and reprogram counters, reprogramming counters can
      sleep, and sleeping is disallowed while IRQs are disabled.
      
       [*] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:580
       [*] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2981888, name: CPU 15/KVM
       [*] preempt_count: 1, expected: 0
       [*] RCU nest depth: 0, expected: 0
       [*] INFO: lockdep is turned off.
       [*] irq event stamp: 0
       [*] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
       [*] hardirqs last disabled at (0): [<ffffffff8121222a>] copy_process+0x146a/0x62d0
       [*] softirqs last  enabled at (0): [<ffffffff81212269>] copy_process+0x14a9/0x62d0
       [*] softirqs last disabled at (0): [<0000000000000000>] 0x0
       [*] Preemption disabled at:
       [*] [<ffffffffc2063fc1>] vcpu_enter_guest+0x1001/0x3dc0 [kvm]
       [*] CPU: 17 PID: 2981888 Comm: CPU 15/KVM Kdump: 5.19.0-rc1-g239111db364c-dirty #2
       [*] Call Trace:
       [*]  <TASK>
       [*]  dump_stack_lvl+0x6c/0x9b
       [*]  __might_resched.cold+0x22e/0x297
       [*]  __mutex_lock+0xc0/0x23b0
       [*]  perf_event_ctx_lock_nested+0x18f/0x340
       [*]  perf_event_pause+0x1a/0x110
       [*]  reprogram_counter+0x2af/0x1490 [kvm]
       [*]  kvm_pmu_trigger_event+0x429/0x950 [kvm]
       [*]  kvm_skip_emulated_instruction+0x48/0x90 [kvm]
       [*]  handle_fastpath_set_msr_irqoff+0x349/0x3b0 [kvm]
       [*]  vmx_vcpu_run+0x268e/0x3b80 [kvm_intel]
       [*]  vcpu_enter_guest+0x1d22/0x3dc0 [kvm]
      
      Add a field to kvm_pmc to track the previous counter value in order
      to defer overflow detection to kvm_pmu_handle_event() (the counter must
      be paused before handling overflow, and that may increment the counter).
      
      Opportunistically shrink sizeof(struct kvm_pmc) a bit.
      Suggested-by: Wanpeng Li <wanpengli@tencent.com>
      Fixes: 9cd803d4 ("KVM: x86: Update vPMCs when retiring instructions")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Link: https://lore.kernel.org/r/20220831085328.45489-6-likexu@tencent.com
      [sean: avoid re-triggering KVM_REQ_PMU on overflow, tweak changelog]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220923001355.3741194-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      de0f6195
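      A minimal stand-alone sketch of the deferral idea (the field and function
      names are simplified stand-ins, not the real struct kvm_pmc API): the fast
      path only records the increment, and overflow is detected later by
      comparing against the previous value.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      struct pmc {
          uint64_t counter;
          uint64_t prev_counter;   /* value before the emulated increment */
          bool overflow_pending;
      };

      /* Fast path (IRQs disabled): just bump the counter, don't touch perf. */
      static void pmc_count_instruction(struct pmc *pmc)
      {
          pmc->prev_counter = pmc->counter;
          pmc->counter++;
          pmc->counter &= 0xff;    /* pretend an 8-bit counter for the demo */
      }

      /* Deferred handler (safe context): detect overflow by comparison. */
      static void pmc_handle_event(struct pmc *pmc)
      {
          if (pmc->counter < pmc->prev_counter)
              pmc->overflow_pending = true;   /* wrapped -> overflowed */
          pmc->prev_counter = pmc->counter;
      }

      int main(void)
      {
          struct pmc pmc = { .counter = 0xff };

          pmc_count_instruction(&pmc);   /* 0xff -> 0x00: wraps */
          pmc_handle_event(&pmc);
          printf("overflow_pending = %d\n", pmc.overflow_pending);
          return 0;
      }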
    • KVM: x86/pmu: Defer reprogram_counter() to kvm_pmu_handle_event() · 68fb4757
      Committed by Like Xu
      Batch reprogramming of PMU counters by setting KVM_REQ_PMU and thus
      deferring reprogramming to kvm_pmu_handle_event(), to avoid reprogramming
      a counter multiple times during a single VM-Exit.
      
      Deferring programming will also allow KVM to fix a bug where immediately
      reprogramming a counter can result in sleeping (taking a mutex) while
      interrupts are disabled in the VM-Exit fastpath.
      
      Introduce kvm_pmu_request_counter_reprogam() to make it obvious that
      KVM is _requesting_ a reprogram and not actually doing the reprogram.
      
      Opportunistically refine related comments to avoid misunderstandings.
      Signed-off-by: Like Xu <likexu@tencent.com>
      Link: https://lore.kernel.org/r/20220831085328.45489-5-likexu@tencent.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220923001355.3741194-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      68fb4757
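      A simplified sketch of request-style batching (the bitmap, request flag,
      and handler names are invented for the illustration): callers only set a
      pending bit, and the actual reprogramming happens once, later.

      #include <stdint.h>
      #include <stdio.h>

      #define NR_COUNTERS 8

      struct pmu {
          uint64_t reprogram_bitmap;   /* one pending bit per counter */
          int request_pending;         /* stands in for KVM_REQ_PMU */
          int reprogram_calls;         /* how many real reprograms happened */
      };

      /* Request a reprogram; cheap, safe to call many times per VM-Exit. */
      static void pmu_request_counter_reprogram(struct pmu *pmu, int idx)
      {
          pmu->reprogram_bitmap |= 1ull << idx;
          pmu->request_pending = 1;
      }

      /* Later, in a context where sleeping is allowed, do the work once. */
      static void pmu_handle_event(struct pmu *pmu)
      {
          if (!pmu->request_pending)
              return;
          pmu->request_pending = 0;

          for (int idx = 0; idx < NR_COUNTERS; idx++) {
              if (pmu->reprogram_bitmap & (1ull << idx)) {
                  pmu->reprogram_bitmap &= ~(1ull << idx);
                  pmu->reprogram_calls++;   /* the "reprogram_counter()" */
              }
          }
      }

      int main(void)
      {
          struct pmu pmu = { 0 };

          /* The same counter is touched three times during one exit... */
          pmu_request_counter_reprogram(&pmu, 3);
          pmu_request_counter_reprogram(&pmu, 3);
          pmu_request_counter_reprogram(&pmu, 3);

          /* ...but is actually reprogrammed only once. */
          pmu_handle_event(&pmu);
          printf("reprogram_calls = %d\n", pmu.reprogram_calls);
          return 0;
      }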
    • KVM: x86/pmu: Clear "reprogram" bit if counter is disabled or disallowed · dcbb816a
      Committed by Sean Christopherson
      When reprogramming a counter, clear the counter's "reprogram pending" bit
      if the counter is disabled (by the guest) or is disallowed (by the
      userspace filter).  In both cases, there's no need to re-attempt
      programming on the next coincident KVM_REQ_PMU as enabling the counter by
      either method will trigger reprogramming.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220923001355.3741194-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      dcbb816a
    • KVM: x86/pmu: Force reprogramming of all counters on PMU filter change · f1c5651f
      Committed by Sean Christopherson
      Force vCPUs to reprogram all counters on a PMU filter change to provide
      a sane ABI for userspace.  Use the existing KVM_REQ_PMU to do the
      programming, and take advantage of the fact that the reprogram_pmi bitmap
      fits in a u64 to set all bits in a single atomic update.  Note, setting
      the bitmap and making the request needs to be done _after_ the SRCU
      synchronization to ensure that vCPUs will reprogram using the new filter.
      
      KVM's current "lazy" approach is confusing and non-deterministic.  It's
      confusing because, from a developer perspective, the code is buggy as it
      makes zero sense to let userspace modify the filter but then not actually
      enforce the new filter.  The lazy approach is non-deterministic because
      KVM enforces the filter whenever a counter is reprogrammed, not just on
      guest WRMSRs, i.e. a guest might gain/lose access to an event at random
      times depending on what is going on in the host.
      
      Note, the resulting behavior is still non-deterministic while the filter
      is in flux.  If userspace wants to guarantee deterministic behavior, all
      vCPUs should be paused during the filter update.
      
      Reported-by: Jim Mattson <jmattson@google.com>
      
      Fixes: 66bb8a06 ("KVM: x86: PMU Event Filter")
      Cc: Aaron Lewis <aaronlewis@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220923001355.3741194-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f1c5651f
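      To illustrate "set every pending bit in a single atomic update" (using C11
      atomics here rather than the kernel's primitives; the names are invented):

      #include <stdatomic.h>
      #include <stdint.h>
      #include <stdio.h>

      #define NR_COUNTERS 16          /* all pending bits fit in one u64 */

      struct pmu {
          _Atomic uint64_t reprogram_bitmap;
      };

      /*
       * On a PMU filter change, mark every counter as needing reprogramming
       * with one atomic OR instead of looping bit by bit, then raise the
       * per-vCPU request (the KVM_REQ_PMU equivalent) so the vCPU acts on it.
       */
      static void pmu_mark_all_for_reprogram(struct pmu *pmu)
      {
          uint64_t all = (1ull << NR_COUNTERS) - 1;   /* ~0ull if 64 counters */

          atomic_fetch_or(&pmu->reprogram_bitmap, all);
      }

      int main(void)
      {
          struct pmu pmu = { .reprogram_bitmap = 0 };

          pmu_mark_all_for_reprogram(&pmu);
          printf("bitmap = 0x%llx\n",
                 (unsigned long long)atomic_load(&pmu.reprogram_bitmap));
          return 0;
      }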
    • KVM: x86/mmu: WARN if TDP MMU SP disallows hugepage after being zapped · 3a056757
      Committed by Sean Christopherson
      Extend the accounting sanity check in kvm_recover_nx_huge_pages() to the
      TDP MMU, i.e. verify that zapping a shadow page unaccounts the disallowed
      NX huge page regardless of the MMU type.  Recovery runs while holding
      mmu_lock for write and so it should be impossible to get false positives
      on the WARN.
      Suggested-by: Yan Zhao <yan.y.zhao@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221019165618.927057-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3a056757
    • KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust() · 76901e56
      Committed by Mingwei Zhang
      Explicitly check if a NX huge page is disallowed when determining if a
      page fault needs to be forced to use a smaller sized page.  KVM currently
      assumes that the NX huge page mitigation is the only scenario where KVM
      will force a shadow page instead of a huge page, and so unnecessarily
      keeps an existing shadow page instead of replacing it with a huge page.
      
      Any scenario that causes KVM to zap leaf SPTEs may result in having a SP
      that can be made huge without violating the NX huge page mitigation.
      E.g. prior to commit 5ba7c4c6 ("KVM: x86/MMU: Zap non-leaf SPTEs when
      disabling dirty logging"), KVM would keep shadow pages after disabling
      dirty logging due to a live migration being canceled, resulting in
      degraded performance due to running with 4KiB pages instead of huge pages.
      
      Although the dirty logging case is "fixed", that fix is coincidental,
      i.e. is an implementation detail, and there are other scenarios where KVM
      will zap leaf SPTEs.  E.g. zapping leaf SPTEs in response to a host page
      migration (mmu_notifier invalidation) to create a huge page would yield a
      similar result; KVM would see the shadow-present non-leaf SPTE and assume
      a huge page is disallowed.
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      [sean: use spte_to_child_sp(), massage changelog, fold into if-statement]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
      Message-Id: <20221019165618.927057-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      76901e56
    • KVM: x86/mmu: Add helper to convert SPTE value to its shadow page · 5e3edd7e
      Committed by Sean Christopherson
      Add a helper to convert a SPTE to its shadow page to deduplicate a
      variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to
      get the shadow page for a SPTE without dropping high bits.
      
      Opportunistically add a comment in mmu_free_root_page() documenting why
      it treats the root HPA as a SPTE.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221019165618.927057-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5e3edd7e
    • KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages · d25ceb92
      Committed by Sean Christopherson
      Track the number of TDP MMU "shadow" pages instead of tracking the pages
      themselves. With the NX huge page list manipulation moved out of the common
      linking flow, eliminating the list-based tracking means the happy path of
      adding a shadow page doesn't need to acquire a spinlock and can instead
      inc/dec an atomic.
      
      Keep the tracking as the WARN during TDP MMU teardown on leaked shadow
      pages is very, very useful for detecting KVM bugs.
      
      Tracking the number of pages will also make it trivial to expose the
      counter to userspace as a stat in the future, which may or may not be
      desirable.
      
      Note, the TDP MMU needs to use a separate counter (and stat if that ever
      comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother
      supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the
      TDP MMU consumes so few pages relative to shadow paging), and including TDP
      MMU pages in that counter would break both the shrinker and shadow MMUs,
      e.g. if a VM is using nested TDP.
      
      Cc: Yan Zhao <yan.y.zhao@intel.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
      Message-Id: <20221019165618.927057-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d25ceb92
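      A generic sketch of the trade-off (C11 atomics stand in for the kernel's
      atomic64_t; no real KVM structures are used): counting pages needs no
      lock, whereas tracking them on a list does.

      #include <stdatomic.h>
      #include <stdio.h>

      /*
       * Before (conceptually): every page add/remove takes a spinlock to
       * link/unlink the page on a global list.  After: the happy path only
       * increments/decrements a counter, which is still enough to WARN about
       * leaks at teardown and to expose a stat.
       */
      static _Atomic long tdp_mmu_pages;

      static void account_tdp_mmu_page(void)
      {
          atomic_fetch_add(&tdp_mmu_pages, 1);
      }

      static void unaccount_tdp_mmu_page(void)
      {
          atomic_fetch_sub(&tdp_mmu_pages, 1);
      }

      int main(void)
      {
          account_tdp_mmu_page();
          account_tdp_mmu_page();
          unaccount_tdp_mmu_page();

          long leaked = atomic_load(&tdp_mmu_pages);
          if (leaked)   /* analogous to the WARN on leaked pages at teardown */
              fprintf(stderr, "WARNING: %ld TDP MMU pages leaked\n", leaked);
          return 0;
      }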
    • KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE · 61f94478
      Committed by Sean Christopherson
      Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP
      visible to other readers, i.e. before setting its SPTE.  This will allow
      KVM to query the flag when determining if a shadow page can be replaced
      by a NX huge page without violating the rules of the mitigation.
      
      Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible
      for another CPU to see a shadow page without an up-to-date
      nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated
      dance.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Yan Zhao <yan.y.zhao@intel.com>
      Message-Id: <20221019165618.927057-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      61f94478
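      A sketch of the ordering idea using C11 release/acquire (the kernel uses
      its own primitives; the struct and variable names here are invented): the
      flag is written before the page is published, so any reader that sees the
      page also sees an up-to-date flag.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>

      struct shadow_page {
          bool nx_huge_page_disallowed;
      };

      /* The slot through which other readers discover the shadow page. */
      static _Atomic(struct shadow_page *) visible_sp;

      static void publish_sp(struct shadow_page *sp, bool disallowed)
      {
          /* 1. Fill in the flag while the page is still private. */
          sp->nx_huge_page_disallowed = disallowed;

          /* 2. Only then make it visible; release pairs with acquire below. */
          atomic_store_explicit(&visible_sp, sp, memory_order_release);
      }

      static void reader(void)
      {
          struct shadow_page *sp =
              atomic_load_explicit(&visible_sp, memory_order_acquire);

          if (sp)   /* seeing sp guarantees seeing the flag written above */
              printf("disallowed = %d\n", sp->nx_huge_page_disallowed);
      }

      int main(void)
      {
          static struct shadow_page sp;

          publish_sp(&sp, true);
          reader();
          return 0;
      }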
    • KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs · b5b0977f
      Committed by Sean Christopherson
      Account and track NX huge pages for nonpaging MMUs so that a future
      enhancement to precisely check if a shadow page can't be replaced by a NX
      huge page doesn't get false positives.  Without correct tracking, KVM can
      get stuck in a loop if an instruction is fetching and writing data on the
      same huge page, e.g. KVM installs a small executable page on the fetch
      fault, replaces it with an NX huge page on the write fault, and faults
      again on the fetch.
      
      Alternatively, and perhaps ideally, KVM would simply not enforce the
      workaround for nonpaging MMUs.  The guest has no page tables to abuse
      and KVM is guaranteed to switch to a different MMU on CR0.PG being
      toggled so there's no security or performance concerns.  However, getting
      make_spte() to play nice now and in the future is unnecessarily complex.
      
      In the current code base, make_spte() can enforce the mitigation if TDP
      is enabled or the MMU is indirect, but make_spte() may not always have a
      vCPU/MMU to work with, e.g. if KVM were to support in-line huge page
      promotion when disabling dirty logging.
      
      Without a vCPU/MMU, KVM could either pass in the correct information
      and/or derive it from the shadow page, but the former is ugly and the
      latter subtly non-trivial due to the possibility of direct shadow pages
      in indirect MMUs.  Given that using shadow paging with an unpaged guest
      is far from top priority _and_ has been subjected to the workaround since
      its inception, keep it simple and just fix the accounting glitch.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20221019165618.927057-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b5b0977f
    • KVM: x86/mmu: Rename NX huge pages fields/functions for consistency · 55c510e2
      Committed by Sean Christopherson
      Rename most of the variables/functions involved in the NX huge page
      mitigation to provide consistency, e.g. lpage vs huge page, and NX huge
      vs huge NX, and also to provide clarity, e.g. to make it obvious the flag
      applies only to the NX huge page mitigation, not to any condition that
      prevents creating a huge page.
      
      Add a comment explaining what the newly named "possible_nx_huge_pages"
      tracks.
      
      Leave the nx_lpage_splits stat alone as the name is ABI and thus set in
      stone.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20221019165618.927057-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      55c510e2
    • KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked · 428e9216
      Committed by Sean Christopherson
      Tag shadow pages that cannot be replaced with an NX huge page regardless
      of whether or not zapping the page would allow KVM to immediately create
      a huge page, e.g. because something else prevents creating a huge page.
      
      I.e. track pages that are disallowed from being NX huge pages regardless
      of whether or not the page could have been huge at the time of fault.
      KVM currently tracks pages that were disallowed from being huge due to
      the NX workaround if and only if the page could otherwise be huge.  But
      that fails to handle the scenario where whatever restriction prevented
      KVM from installing a huge page goes away, e.g. if dirty logging is
      disabled, the host mapping level changes, etc...
      
      Failure to tag shadow pages appropriately could theoretically lead to
      false negatives, e.g. if a fetch fault requests a small page and thus
      isn't tracked, and a read/write fault later requests a huge page, KVM
      will not reject the huge page as it should.
      
      To avoid yet another flag, initialize the list_head and use list_empty()
      to determine whether or not a page is on the list of NX huge pages that
      should be recovered.
      
      Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is
      more involved due to mmu_lock being held for read.  This will be
      addressed in a future commit.
      
      Fixes: 5bcaf3e1 ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20221019165618.927057-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      428e9216
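      To illustrate "initialize the list_head and use list_empty() as the
      membership test" without adding a flag, here is a self-contained sketch
      with a tiny circular-list implementation standing in for the kernel's
      <linux/list.h>:

      #include <stdbool.h>
      #include <stdio.h>

      /* Minimal circular doubly-linked list, mimicking <linux/list.h>. */
      struct list_head { struct list_head *next, *prev; };

      static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }
      static bool list_empty(const struct list_head *h) { return h->next == h; }

      static void list_add(struct list_head *entry, struct list_head *head)
      {
          entry->next = head->next;
          entry->prev = head;
          head->next->prev = entry;
          head->next = entry;
      }

      static void list_del_init(struct list_head *entry)
      {
          entry->prev->next = entry->next;
          entry->next->prev = entry->prev;
          INIT_LIST_HEAD(entry);
      }

      struct shadow_page {
          struct list_head possible_nx_huge_page_link;
      };

      int main(void)
      {
          struct list_head possible_nx_huge_pages;
          struct shadow_page sp;

          INIT_LIST_HEAD(&possible_nx_huge_pages);
          INIT_LIST_HEAD(&sp.possible_nx_huge_page_link);

          /* An initialized-but-unlinked node reads as "not on the list"... */
          printf("tracked? %d\n", !list_empty(&sp.possible_nx_huge_page_link));

          /* ...and linking it flips the answer, with no extra flag needed. */
          list_add(&sp.possible_nx_huge_page_link, &possible_nx_huge_pages);
          printf("tracked? %d\n", !list_empty(&sp.possible_nx_huge_page_link));

          list_del_init(&sp.possible_nx_huge_page_link);
          return 0;
      }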
    • KVM: x86: Add a VALID_MASK for the flags in kvm_msr_filter_range · 8aff460f
      Committed by Aaron Lewis
      Add the mask KVM_MSR_FILTER_RANGE_VALID_MASK for the flags in the
      struct kvm_msr_filter_range.  This simplifies checks that validate
      these flags, and makes it easier to introduce new flags in the future.
      
      No functional change intended.
      Signed-off-by: Aaron Lewis <aaronlewis@google.com>
      Message-Id: <20220921151525.904162-5-aaronlewis@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8aff460f
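      A generic sketch of the valid-mask idiom (the flag values and check below
      are illustrative, not the real kvm_msr_filter_range UAPI definitions):

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      /* Illustrative flag bits, not the real UAPI values. */
      #define FILTER_READ   (1u << 0)
      #define FILTER_WRITE  (1u << 1)

      /* One mask names every currently-valid flag... */
      #define FILTER_RANGE_VALID_MASK (FILTER_READ | FILTER_WRITE)

      /*
       * ...so validation is a single expression instead of per-flag checks,
       * and introducing a new flag later only means extending the mask.
       */
      static bool range_flags_valid(uint32_t flags)
      {
          return !(flags & ~FILTER_RANGE_VALID_MASK);
      }

      int main(void)
      {
          printf("%d %d\n",
                 range_flags_valid(FILTER_READ | FILTER_WRITE),  /* 1 */
                 range_flags_valid(1u << 5));                     /* 0 */
          return 0;
      }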