  1. 28 Jan 2020: 1 commit
  2. 23 Jan 2020: 1 commit
    • KVM: x86: fix overlap between SPTE_MMIO_MASK and generation · 56871d44
      Paolo Bonzini committed
      The SPTE_MMIO_MASK overlaps with the bits used to track MMIO
      generation number.  A high enough generation number would overwrite the
      SPTE_SPECIAL_MASK region and cause the MMIO SPTE to be misinterpreted.
      
      Likewise, setting bits 52 and 53 would also cause an incorrect generation
      number to be read from the PTE, though this was partially mitigated by the
      (useless if it weren't for the bug) removal of SPTE_SPECIAL_MASK from
      the spte in get_mmio_spte_generation.  Drop that removal, and replace
      it with a compile-time assertion.
      
      Fixes: 6eeb4ef0 ("KVM: x86: assign two bits to track SPTE kinds")
      Reported-by: Ben Gardon <bgardon@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
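
      A standalone sketch of the kind of compile-time overlap check the commit adds (the mask values below are illustrative, not KVM's actual definitions):

        /* Illustrative layout: two kind-tracking bits at 52..53, generation above them. */
        #define SPTE_SPECIAL_MASK   (3ULL << 52)
        #define MMIO_SPTE_GEN_MASK  (0xffULL << 54)   /* hypothetical generation field */

        /* Fail the build if the generation field spills into the special bits. */
        _Static_assert((MMIO_SPTE_GEN_MASK & SPTE_SPECIAL_MASK) == 0,
                       "MMIO generation bits overlap SPTE_SPECIAL_MASK");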
  3. 21 Jan 2020: 2 commits
    • KVM: x86/mmu: Apply max PA check for MMIO sptes to 32-bit KVM · e30a7d62
      Sean Christopherson committed
      Remove the bogus 64-bit only condition from the check that disables MMIO
      spte optimization when the system supports the max PA, i.e. doesn't have
      any reserved PA bits.  32-bit KVM always uses PAE paging for the shadow
      MMU, and per Intel's SDM:
      
        PAE paging translates 32-bit linear addresses to 52-bit physical
        addresses.
      
      The kernel's restrictions on max physical addresses are limits on how
      much memory the kernel can reasonably use, not what physical addresses
      are supported by hardware.
      
      Fixes: ce88decf ("KVM: MMU: mmio page fault support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
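
      A minimal sketch of the resulting check with the bogus 64-bit-only condition dropped (shadow_phys_bits and the 52-bit limit follow the commit's reasoning; this is not the exact kernel code):

        #include <stdbool.h>

        /*
         * MMIO SPTE caching abuses reserved physical-address bits.  If the CPU
         * already supports the full 52-bit PA width there are no reserved bits
         * left to abuse, so the optimization is disabled -- for 32-bit and
         * 64-bit KVM alike, since PAE paging also produces 52-bit PAs.
         */
        static bool mmio_spte_caching_possible(unsigned int shadow_phys_bits)
        {
                return shadow_phys_bits < 52;
        }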
    • KVM: x86/mmu: Micro-optimize nEPT's bad memtype/XWR checks · b5c3c1b3
      Sean Christopherson committed
      Rework the handling of nEPT's bad memtype/XWR checks to micro-optimize
      the checks as much as possible.  Move the check to a separate helper,
      __is_bad_mt_xwr(), which allows the guest_rsvd_check usage in
      paging_tmpl.h to omit the check entirely for paging32/64 (bad_mt_xwr is
      always zero for non-nEPT) while retaining the bitwise-OR of the current
      code for the shadow_zero_check in walk_shadow_page_get_mmio_spte().
      
      Add a comment for the bitwise-OR usage in the mmio spte walk to avoid
      future attempts to "fix" the code, which is what prompted this
      optimization in the first place[*].
      
      Opportunistically remove the superfluous '!= 0' and parentheses, and
      use BIT_ULL() instead of open coding its equivalent.
      
      The net effect is that code generation is largely unchanged for
      walk_shadow_page_get_mmio_spte(), marginally better for
      ept_prefetch_invalid_gpte(), and significantly improved for
      paging32/64_prefetch_invalid_gpte().
      
      Note, walk_shadow_page_get_mmio_spte() can't use a templated version of
      the memtype/XWR check as it works on the host's shadow PTEs, e.g. checks
      that KVM hasn't borked its EPT tables.  Even if it could be templated,
      the benefits of having a single implementation far outweigh the few uops
      that would be saved for NPT or non-TDP paging, e.g. most compilers
      inline it all the way up to kvm_mmu_page_fault().
      
      [*] https://lkml.kernel.org/r/20200108001859.25254-1-sean.j.christopherson@intel.com
      
      Cc: Jim Mattson <jmattson@google.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Arvind Sankar <nivedita@alum.mit.edu>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
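
      A self-contained sketch of the split described above (simplified types; the real helper takes a struct rsvd_bits_validate, and BIT_ULL comes from the kernel headers):

        #include <stdbool.h>
        #include <stdint.h>

        #define BIT_ULL(n) (1ULL << (n))

        /*
         * bad_mt_xwr is a 64-entry bitmap indexed by the low 6 bits of an EPT
         * entry (X/W/R plus memtype); a set bit marks an illegal combination.
         * It is always zero for non-nEPT MMUs, which is what lets the guest
         * walker in paging_tmpl.h skip this check entirely for paging32/64.
         */
        static bool __is_bad_mt_xwr(uint64_t bad_mt_xwr, uint64_t pte)
        {
                return bad_mt_xwr & BIT_ULL(pte & 0x3f);
        }

        /* The MMIO SPTE walk keeps the bitwise-OR so one branch-free expression
         * covers both the reserved-bits check and the bad memtype/XWR check. */
        static bool is_rsvd_spte(uint64_t rsvd_bits, uint64_t bad_mt_xwr, uint64_t spte)
        {
                return (spte & rsvd_bits) | __is_bad_mt_xwr(bad_mt_xwr, spte);
        }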
  4. 09 Jan 2020: 18 commits
  5. 21 Nov 2019: 1 commit
  6. 15 Nov 2019: 1 commit
    • KVM: x86: Optimization: Request TLB flush in fast_cr3_switch() instead of doing it directly · 1924242b
      Liran Alon committed
      When KVM emulates a nested VMEntry (L1->L2 VMEntry), it switches the mmu
      root page. If nEPT is used, this happens from
      kvm_init_shadow_ept_mmu()->__kvm_mmu_new_cr3(); otherwise it happens
      from nested_vmx_load_cr3()->kvm_mmu_new_cr3(). In either case,
      __kvm_mmu_new_cr3() uses fast_cr3_switch() in an attempt to switch to a
      previously cached root page.
      
      In case fast_cr3_switch() finds a matching cached root page, it will
      set it in mmu->root_hpa and request KVM_REQ_LOAD_CR3 such that on
      next entry to guest, KVM will set root HPA in appropriate hardware
      fields (e.g. vmcs->eptp). In addition, fast_cr3_switch() calls
      kvm_x86_ops->tlb_flush() in order to flush TLB as MMU root page
      was replaced.
      
      This works because mmu->root_hpa, which vmx_flush_tlb() uses, was
      already replaced in cached_root_available(). However, this may
      result in an unnecessary INVEPT execution because a KVM_REQ_TLB_FLUSH
      may have already been requested, for example by prepare_vmcs02()
      when L1 doesn't use VPID.
      
      Therefore, change fast_cr3_switch() to just request TLB flush on
      next entry to guest.

      Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
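
      A simplified sketch of the change in spirit (the types and request bits below are stand-ins, not the kernel's): instead of flushing the TLB synchronously after switching to a cached root, fast_cr3_switch() only sets a request that is consumed on the next guest entry, so it can coalesce with an already-pending flush:

        #include <stdbool.h>
        #include <stdint.h>

        /* Stand-ins for KVM_REQ_LOAD_CR3 and KVM_REQ_TLB_FLUSH. */
        #define REQ_LOAD_CR3   (1u << 0)
        #define REQ_TLB_FLUSH  (1u << 1)

        struct vcpu {
                uint64_t root_hpa;
                uint32_t requests;
        };

        static void make_request(struct vcpu *v, uint32_t req)
        {
                v->requests |= req;
        }

        /* After switching to a cached root, defer the flush to the next entry.
         * If something (e.g. prepare_vmcs02) already requested a flush, the
         * two requests collapse into a single INVEPT instead of two. */
        static bool fast_cr3_switch_to_cached(struct vcpu *v, uint64_t cached_root)
        {
                v->root_hpa = cached_root;
                make_request(v, REQ_LOAD_CR3);
                make_request(v, REQ_TLB_FLUSH);   /* was: a direct tlb_flush() call */
                return true;
        }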
  7. 14 Nov 2019: 1 commit
    • KVM: x86/mmu: Take slots_lock when using kvm_mmu_zap_all_fast() · ed69a6cb
      Sean Christopherson committed
      Acquire the per-VM slots_lock when zapping all shadow pages as part of
      toggling nx_huge_pages.  The fast zap algorithm relies on exclusivity
      (via slots_lock) to identify obsolete vs. valid shadow pages, because it
      uses a single bit for its generation number. Holding slots_lock also
      obviates the need to acquire a read lock on the VM's srcu.
      
      Failing to take slots_lock when toggling nx_huge_pages allows multiple
      instances of kvm_mmu_zap_all_fast() to run concurrently, as the other
      user, KVM_SET_USER_MEMORY_REGION, does not take the global kvm_lock.
      (kvm_mmu_zap_all_fast() does take kvm->mmu_lock, but it can be
      temporarily dropped by kvm_zap_obsolete_pages(), so it is not enough
      to enforce exclusivity).
      
      Concurrent fast zap instances cause obsolete shadow pages to be
      incorrectly identified as valid due to the single bit generation number
      wrapping, which results in stale shadow pages being left in KVM's MMU
      and leads to all sorts of undesirable behavior.
      The bug is easily confirmed by running with CONFIG_PROVE_LOCKING and
      toggling nx_huge_pages via its module param.
      
      Note, until commit 4ae5acbc4936 ("KVM: x86/mmu: Take slots_lock when
      using kvm_mmu_zap_all_fast()", 2019-11-13) the fast zap algorithm used
      an ulong-sized generation instead of relying on exclusivity for
      correctness, but all callers except the recently added set_nx_huge_pages()
      needed to hold slots_lock anyways.  Therefore, this patch does not have
      to be backported to stable kernels.
      
      Given that toggling nx_huge_pages is by no means a fast path, force it
      to conform to the current approach instead of reintroducing the previous
      generation count.
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation", but NOT FOR STABLE)
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
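
      A minimal sketch of the locking pattern the commit enforces (a pthread mutex stands in for the per-VM slots_lock; the generation handling is reduced to its single-bit essence):

        #include <pthread.h>

        struct kvm {
                pthread_mutex_t slots_lock;     /* stand-in for kvm->slots_lock */
                unsigned long   mmu_valid_gen;  /* single-bit generation */
        };

        static void kvm_mmu_zap_all_fast(struct kvm *kvm)
        {
                /* Flip the generation and walk shadow pages; correctness relies on
                 * no other fast zap running concurrently, which is exactly what
                 * holding slots_lock guarantees. */
                kvm->mmu_valid_gen ^= 1;
        }

        static void set_nx_huge_pages(struct kvm *kvm)
        {
                pthread_mutex_lock(&kvm->slots_lock);
                kvm_mmu_zap_all_fast(kvm);
                pthread_mutex_unlock(&kvm->slots_lock);
        }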
  8. 13 Nov 2019: 1 commit
  9. 12 Nov 2019: 1 commit
    • KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved · a78986aa
      Sean Christopherson committed
      Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
      instead manually handle ZONE_DEVICE on a case-by-case basis.  For things
      like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
      pages, e.g. put pages grabbed via gup().  But for flows such as setting
      A/D bits or shifting refcounts for transparent huge pages, KVM needs to
      avoid processing ZONE_DEVICE pages as the flows in question lack the
      underlying machinery for proper handling of ZONE_DEVICE pages.
      
      This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
      when running a KVM guest backed with /dev/dax memory, as KVM straight up
      doesn't put any references to ZONE_DEVICE pages acquired by gup().
      
      Note, Dan Williams proposed an alternative solution of doing put_page()
      on ZONE_DEVICE pages immediately after gup() in order to simplify the
      auditing needed to ensure is_zone_device_page() is called if and only if
      the backing device is pinned (via gup()).  But that approach would break
      kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til
      unmap() when accessing guest memory, unlike KVM's secondary MMU, which
      coordinates with mmu_notifier invalidations to avoid creating stale
      page references, i.e. doesn't rely on pages being pinned.
      
      [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl

      Reported-by: Adam Borowski <kilobyte@angband.pl>
      Analyzed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: stable@vger.kernel.org
      Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
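
      A rough model of the case-by-case treatment described above (the struct and predicates are stand-ins for the mm/KVM helpers, not their real signatures):

        #include <stdbool.h>

        struct page {
                long refcount;
                bool reserved;      /* PG_reserved stand-in */
                bool zone_device;   /* is_zone_device_page() stand-in */
        };

        /* ZONE_DEVICE pages are no longer lumped in with reserved pages, so a
         * reference taken via gup() gets dropped like for any normal page. */
        static bool pfn_is_reserved(const struct page *p)
        {
                return p->reserved && !p->zone_device;
        }

        static void release_pfn(struct page *p)
        {
                if (!pfn_is_reserved(p))
                        p->refcount--;          /* put_page() in the real code */
        }

        /* A/D-bit tracking and THP refcount shifting still skip ZONE_DEVICE
         * pages, since those flows lack the machinery to handle them. */
        static bool pfn_supports_huge_adjust(const struct page *p)
        {
                return !pfn_is_reserved(p) && !p->zone_device;
        }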
  10. 05 Nov 2019: 1 commit
  11. 04 Nov 2019: 1 commit
    • kvm: mmu: ITLB_MULTIHIT mitigation · b8e8c830
      Paolo Bonzini committed
      With some Intel processors, putting the same virtual address in the TLB
      as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
      and cause the processor to issue a machine check resulting in a CPU lockup.
      
      Unfortunately when EPT page tables use huge pages, it is possible for a
      malicious guest to cause this situation.
      
      Add a knob to mark huge pages as non-executable. When the nx_huge_pages
      parameter is enabled (and we are using EPT), all huge pages are marked as
      NX. If the guest attempts to execute in one of those pages, the page is
      broken down into 4K pages, which are then marked executable.
      
      This is not an issue for shadow paging (except nested EPT), because then
      the host is in control of TLB flushes and the problematic situation cannot
      happen.  With nested EPT, the nested guest can again cause problems, so
      shadow and direct EPT are treated in the same way.
      
      [ tglx: Fixup default to auto and massage wording a bit ]
      Originally-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
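
      A sketch of the mitigation's core idea (the flag and bit positions are illustrative, not KVM's encoding): huge SPTEs are installed without execute permission when nx_huge_pages is on, and an instruction-fetch fault on such a page triggers a split into executable 4 KiB mappings:

        #include <stdbool.h>
        #include <stdint.h>

        #define SPTE_HUGE  (1ULL << 7)   /* PS bit: 2 MiB mapping */
        #define SPTE_NX    (1ULL << 63)  /* execute-disable */

        static bool nx_huge_pages = true;   /* module-parameter stand-in */

        /* Install a huge mapping; strip execute permission when mitigating. */
        static uint64_t make_huge_spte(uint64_t pfn_bits)
        {
                uint64_t spte = pfn_bits | SPTE_HUGE;
                if (nx_huge_pages)
                        spte |= SPTE_NX;
                return spte;
        }

        /* On an instruction fetch hitting an NX huge page, the mapping is
         * rebuilt from executable 4 KiB SPTEs instead. */
        static bool should_split_for_exec(uint64_t spte, bool fetch_fault)
        {
                return fetch_fault && (spte & SPTE_HUGE) && (spte & SPTE_NX);
        }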
  12. 27 Sep 2019: 2 commits
    • KVM: x86: fix nested guest live migration with PML · 1f4e5fc8
      Paolo Bonzini committed
      Shadow paging is fundamentally incompatible with the page-modification
      log, because the GPAs in the log come from the wrong memory map.
      In particular, for the EPT page-modification log, the GPAs in the log come
      from L2 rather than L1.  (If there was a non-EPT page-modification log,
      we couldn't use it for shadow paging because it would log GVAs rather
      than GPAs).
      
      Therefore, we need to rely on write protection to record dirty pages.
      This has the side effect of bypassing PML, since writes now result in an
      EPT violation vmexit.
      
      This is relatively easy to add to KVM, because pretty much the only place
      that needs changing is spte_clear_dirty.  The first access to the page
      already goes through the page fault path and records the correct GPA;
      it's only subsequent accesses that are wrong.  Therefore, we can equip
      set_spte (where the first access happens) to record that the SPTE will
      have to be write protected, and then spte_clear_dirty will use this
      information to do the right thing.

      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
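
      A rough sketch of the mechanism (the marker bit and helpers are illustrative, not KVM's actual masks): set_spte() records that the entry must be write-protected for dirty tracking, and spte_clear_dirty() then removes write access instead of only clearing the dirty bit, so the next write takes an EPT violation and is logged with the correct L1 GPA:

        #include <stdbool.h>
        #include <stdint.h>

        #define SPTE_WRITABLE        (1ULL << 1)
        #define SPTE_DIRTY           (1ULL << 6)
        #define SPTE_WRPROT_FOR_PML  (1ULL << 53)  /* "must write-protect" marker */

        /* set_spte(): with shadow paging on top of PML, mark the entry so dirty
         * logging falls back to write protection instead of the PML buffer. */
        static uint64_t set_spte(uint64_t spte, bool shadow_paging_with_pml)
        {
                if (shadow_paging_with_pml)
                        spte |= SPTE_WRPROT_FOR_PML;
                return spte;
        }

        /* spte_clear_dirty(): normally just clears the dirty bit; for marked
         * entries it also drops write access. */
        static uint64_t spte_clear_dirty(uint64_t spte)
        {
                spte &= ~SPTE_DIRTY;
                if (spte & SPTE_WRPROT_FOR_PML)
                        spte &= ~SPTE_WRITABLE;
                return spte;
        }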
    • KVM: x86: assign two bits to track SPTE kinds · 6eeb4ef0
      Paolo Bonzini committed
      Currently, we are overloading SPTE_SPECIAL_MASK to mean both
      "A/D bits unavailable" and MMIO, where the difference between the
      two is determined by mmio_mask and mmio_value.
      
      However, the next patch will need two bits to distinguish
      availability of A/D bits from write protection.  So, while at
      it give MMIO its own bit pattern, and move the two bits from
      bit 62 to bits 52..53 since Intel is allocating EPT page table
      bits from the top.
      Reviewed-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
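
      A sketch of the resulting bit layout (the two kind-tracking bits at 52..53 follow the commit text; the per-kind encodings are illustrative):

        #include <stdint.h>

        /* Two software-available bits, moved from bit 62 down to bits 52..53
         * because Intel allocates new EPT page-table bits from the top. */
        #define SPTE_SPECIAL_SHIFT   52
        #define SPTE_SPECIAL_MASK    (3ULL << SPTE_SPECIAL_SHIFT)

        /* Illustrative encodings of the SPTE kinds. */
        #define SPTE_AD_ENABLED      (0ULL << SPTE_SPECIAL_SHIFT)  /* A/D bits usable */
        #define SPTE_AD_DISABLED     (1ULL << SPTE_SPECIAL_SHIFT)  /* A/D unavailable */
        #define SPTE_MMIO            (3ULL << SPTE_SPECIAL_SHIFT)  /* cached MMIO access */

        static inline uint64_t spte_kind(uint64_t spte)
        {
                return spte & SPTE_SPECIAL_MASK;
        }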
  13. 24 Sep 2019: 9 commits