1. 09 Jan 2020, 6 commits
  2. 21 Nov 2019, 1 commit
  3. 15 Nov 2019, 1 commit
    • KVM: x86: Optimization: Requst TLB flush in fast_cr3_switch() instead of do it directly · 1924242b
      Authored by Liran Alon
      When KVM emulates a nested VMEntry (L1->L2 VMEntry), it switches the mmu
      root page. If nEPT is used, this happens from
      kvm_init_shadow_ept_mmu()->__kvm_mmu_new_cr3(); otherwise it happens
      from nested_vmx_load_cr3()->kvm_mmu_new_cr3(). In either case,
      __kvm_mmu_new_cr3() uses fast_cr3_switch() in an attempt to switch to a
      previously cached root page.
      
      If fast_cr3_switch() finds a matching cached root page, it sets it
      in mmu->root_hpa and requests KVM_REQ_LOAD_CR3 so that, on the next
      entry to the guest, KVM sets the root HPA in the appropriate hardware
      fields (e.g. vmcs->eptp). In addition, fast_cr3_switch() calls
      kvm_x86_ops->tlb_flush() in order to flush the TLB, as the MMU root
      page was replaced.
      
      This works because mmu->root_hpa, which vmx_flush_tlb() uses, was
      already replaced in cached_root_available(). However, it may result
      in an unnecessary INVEPT execution, because a KVM_REQ_TLB_FLUSH may
      have already been requested, for example by prepare_vmcs02() when
      L1 doesn't use VPID.
      
      Therefore, change fast_cr3_switch() to just request a TLB flush on
      the next entry to the guest (see the sketch below).
      Reviewed-by: Bhavesh Davda <bhavesh.davda@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1924242b
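      A minimal sketch of the resulting flow, assuming the fast_cr3_switch() of
      that era (cached_root_available() and skip_tlb_flush belong to the
      surrounding code, abbreviated here; an illustration, not the verbatim diff):

          if (cached_root_available(vcpu, new_cr3, new_role)) {
                  kvm_make_request(KVM_REQ_LOAD_CR3, vcpu);
                  if (!skip_tlb_flush)
                          /* was: kvm_x86_ops->tlb_flush(vcpu, true) */
                          kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
                  return true;
          }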
  4. 14 Nov 2019, 1 commit
    • KVM: x86/mmu: Take slots_lock when using kvm_mmu_zap_all_fast() · ed69a6cb
      Authored by Sean Christopherson
      Acquire the per-VM slots_lock when zapping all shadow pages as part of
      toggling nx_huge_pages.  The fast zap algorithm relies on exclusivity
      (via slots_lock) to identify obsolete vs. valid shadow pages, because it
      uses a single bit for its generation number. Holding slots_lock also
      obviates the need to acquire a read lock on the VM's srcu.
      
      Failing to take slots_lock when toggling nx_huge_pages allows multiple
      instances of kvm_mmu_zap_all_fast() to run concurrently, as the other
      user, KVM_SET_USER_MEMORY_REGION, does not take the global kvm_lock.
      (kvm_mmu_zap_all_fast() does take kvm->mmu_lock, but it can be
      temporarily dropped by kvm_zap_obsolete_pages(), so it is not enough
      to enforce exclusivity).
      
      Concurrent fast zap instances cause obsolete shadow pages to be
      incorrectly identified as valid due to the single-bit generation number
      wrapping, which results in stale shadow pages being left in KVM's MMU
      and leads to all sorts of undesirable behavior.
      The bug is easily confirmed by running with CONFIG_PROVE_LOCKING and
      toggling nx_huge_pages via its module param.
      
      Note, until commit 4ae5acbc4936 ("KVM: x86/mmu: Take slots_lock when
      using kvm_mmu_zap_all_fast()", 2019-11-13) the fast zap algorithm used
      an ulong-sized generation instead of relying on exclusivity for
      correctness, but all callers except the recently added set_nx_huge_pages()
      needed to hold slots_lock anyways.  Therefore, this patch does not have
      to be backported to stable kernels.
      
      Given that toggling nx_huge_pages is by no means a fast path, force it
      to conform to the current approach (take slots_lock, as sketched below)
      instead of reintroducing the previous generation count.
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation", but NOT FOR STABLE)
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ed69a6cb
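      A hedged sketch of the fix in the nx_huge_pages parameter handler
      (structure follows the description above; not a verbatim copy of the patch):

          /* Toggling the knob is a slow path: zap each VM's shadow pages
           * under that VM's slots_lock, the same lock memslot updates hold. */
          mutex_lock(&kvm_lock);
          list_for_each_entry(kvm, &vm_list, vm_list) {
                  mutex_lock(&kvm->slots_lock);
                  kvm_mmu_zap_all_fast(kvm);
                  mutex_unlock(&kvm->slots_lock);
          }
          mutex_unlock(&kvm_lock);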
  5. 13 Nov 2019, 1 commit
  6. 12 Nov 2019, 1 commit
    • KVM: MMU: Do not treat ZONE_DEVICE pages as being reserved · a78986aa
      Authored by Sean Christopherson
      Explicitly exempt ZONE_DEVICE pages from kvm_is_reserved_pfn() and
      instead manually handle ZONE_DEVICE on a case-by-case basis.  For things
      like page refcounts, KVM needs to treat ZONE_DEVICE pages like normal
      pages, e.g. put pages grabbed via gup().  But for flows such as setting
      A/D bits or shifting refcounts for transparent huge pages, KVM needs to
      avoid processing ZONE_DEVICE pages, as the flows in question lack the
      underlying machinery for proper handling of ZONE_DEVICE pages.
      
      This fixes a hang reported by Adam Borowski[*] in dev_pagemap_cleanup()
      when running a KVM guest backed with /dev/dax memory, as KVM straight up
      doesn't put any references to ZONE_DEVICE pages acquired by gup().
      
      Note, Dan Williams proposed an alternative solution of doing put_page()
      on ZONE_DEVICE pages immediately after gup() in order to simplify the
      auditing needed to ensure is_zone_device_page() is called if and only if
      the backing device is pinned (via gup()).  But that approach would break
      kvm_vcpu_{un}map() as KVM requires the page to be pinned from map() 'til
      unmap() when accessing guest memory, unlike KVM's secondary MMU, which
      coordinates with mmu_notifier invalidations to avoid creating stale
      page references, i.e. doesn't rely on pages being pinned.
      
      [*] http://lkml.kernel.org/r/20190919115547.GA17963@angband.pl
      Reported-by: Adam Borowski <kilobyte@angband.pl>
      Analyzed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: stable@vger.kernel.org
      Fixes: 3565fce3 ("mm, x86: get_user_pages() for dax mappings")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a78986aa
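      Roughly what the case-by-case check looks like (a sketch based on the
      description above; the exact helper in the patch may differ slightly):

          static bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
          {
                  /*
                   * ZONE_DEVICE pages are not treated as "reserved"; flows
                   * that cannot handle them (A/D updates, THP refcount
                   * shifting) call this instead of kvm_is_reserved_pfn().
                   */
                  if (!pfn_valid(pfn))
                          return false;

                  return is_zone_device_page(pfn_to_page(pfn));
          }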
  7. 05 Nov 2019, 1 commit
  8. 04 Nov 2019, 1 commit
    • kvm: mmu: ITLB_MULTIHIT mitigation · b8e8c830
      Authored by Paolo Bonzini
      With some Intel processors, putting the same virtual address in the TLB
      as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
      and cause the processor to issue a machine check resulting in a CPU lockup.
      
      Unfortunately when EPT page tables use huge pages, it is possible for a
      malicious guest to cause this situation.
      
      Add a knob to mark huge pages as non-executable. When the nx_huge_pages
      parameter is enabled (and we are using EPT), all huge pages are marked as
      NX. If the guest attempts to execute in one of those pages, the page is
      broken down into 4K pages, which are then marked executable.
      
      This is not an issue for shadow paging (except nested EPT), because there
      the host is in control of TLB flushes and the problematic situation cannot
      happen.  With nested EPT, however, the nested guest can again cause the
      problem, so shadow and direct EPT are treated in the same way (see the
      sketch below).
      
      [ tglx: Fixup default to auto and massage wording a bit ]
      Originally-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      b8e8c830
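      A simplified sketch of the knob's effect when installing an SPTE (names
      follow the KVM MMU of that era; an illustration of the mechanism, not the
      full patch):

          /* Strip execute permission from large-page SPTEs while the
           * mitigation is enabled; a later execute fault remaps the range
           * with executable 4K SPTEs instead. */
          if (level > PT_PAGE_TABLE_LEVEL && (pte_access & ACC_EXEC_MASK) &&
              is_nx_huge_page_enabled())
                  pte_access &= ~ACC_EXEC_MASK;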
  9. 27 Sep 2019, 2 commits
    • KVM: x86: fix nested guest live migration with PML · 1f4e5fc8
      Authored by Paolo Bonzini
      Shadow paging is fundamentally incompatible with the page-modification
      log, because the GPAs in the log come from the wrong memory map.
      In particular, for the EPT page-modification log, the GPAs in the log come
      from L2 rather than L1.  (If there was a non-EPT page-modification log,
      we couldn't use it for shadow paging because it would log GVAs rather
      than GPAs).
      
      Therefore, we need to rely on write protection to record dirty pages.
      This has the side effect of bypassing PML, since writes now result in an
      EPT violation vmexit.
      
      This is relatively easy to add to KVM, because pretty much the only place
      that needs changing is spte_clear_dirty.  The first access to the page
      already goes through the page fault path and records the correct GPA;
      it's only subsequent accesses that are wrong.  Therefore, we can equip
      set_spte (where the first access happens) to record that the SPTE will
      have to be write protected, and then spte_clear_dirty will use this
      information to do the right thing.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1f4e5fc8
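      One way to picture the mechanism (a hedged sketch; spte_ad_need_write_protect()
      is the predicate this series introduces, the rest abbreviates the dirty-log
      clearing path):

          u64 spte = *sptep;

          if (spte_ad_need_write_protect(spte))
                  /* Shadow paging / nested EPT: write-protect so the next
                   * write faults and is logged with the correct (L1) GPA. */
                  spte &= ~PT_WRITABLE_MASK;
          else
                  /* Normal case: hardware D bit / PML works as before. */
                  spte &= ~shadow_dirty_mask;

          mmu_spte_update(sptep, spte);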
    • KVM: x86: assign two bits to track SPTE kinds · 6eeb4ef0
      Authored by Paolo Bonzini
      Currently, we are overloading SPTE_SPECIAL_MASK to mean both
      "A/D bits unavailable" and MMIO, where the difference between the
      two is determined by mmio_mask and mmio_value.
      
      However, the next patch will need two bits to distinguish
      availability of A/D bits from write protection.  So, while at
      it, give MMIO its own bit pattern, and move the two bits from
      bit 62 to bits 52..53, since Intel is allocating EPT page table
      bits from the top (see the bit-layout sketch below).
      Reviewed-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6eeb4ef0
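      The resulting bit assignment, roughly (values per this commit, shown here
      as an illustrative sketch):

          /* Two dedicated bits at 52..53, below the EPT bits Intel
           * allocates from the top of the page-table entry. */
          #define SPTE_SPECIAL_MASK       (3ULL << 52)
          #define SPTE_AD_ENABLED_MASK    (0ULL << 52)  /* hardware A/D bits in use */
          #define SPTE_AD_DISABLED_MASK   (1ULL << 52)  /* A/D bits unavailable     */
          #define SPTE_MMIO_MASK          (3ULL << 52)  /* MMIO SPTE                */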
  10. 24 Sep 2019, 10 commits
  11. 14 Sep 2019, 1 commit
    • KVM: x86/mmu: Reintroduce fast invalidate/zap for flushing memslot · 002c5f73
      Authored by Sean Christopherson
      James Harvey reported a livelock that was introduced by commit
      d012a06a ("Revert "KVM: x86/mmu: Zap only the relevant pages when
      removing a memslot"").
      
      The livelock occurs because kvm_mmu_zap_all() as it exists today will
      voluntarily reschedule and drop KVM's mmu_lock, which allows other vCPUs
      to add shadow pages.  With enough vCPUs, kvm_mmu_zap_all() can get stuck
      in an infinite loop as it can never zap all pages before observing lock
      contention or the need to reschedule.  The equivalent of kvm_mmu_zap_all()
      that was in use at the time of the reverted commit (4e103134, "KVM:
      x86/mmu: Zap only the relevant pages when removing a memslot") employed
      a fast invalidate mechanism and was not susceptible to the above livelock.
      
      There are three ways to fix the livelock:
      
      - Reverting the revert (commit d012a06a) is not a viable option as
        the revert is needed to fix a regression that occurs when the guest has
        one or more assigned devices.  It's unlikely we'll root cause the device
        assignment regression soon enough to fix the regression timely.
      
      - Remove the conditional reschedule from kvm_mmu_zap_all().  However, although
        removing the reschedule would be a smaller code change, it's less safe
        in the sense that the resulting kvm_mmu_zap_all() hasn't been used in
        the wild for flushing memslots since the fast invalidate mechanism was
        introduced by commit 6ca18b69 ("KVM: x86: use the fast way to
        invalidate all pages"), back in 2013.
      
      - Reintroduce the fast invalidate mechanism and use it when zapping shadow
        pages in response to a memslot being deleted/moved, which is what this
        patch does (see the sketch below).
      
      For all intents and purposes, this is a revert of commit ea145aac
      ("Revert "KVM: MMU: fast invalidate all pages"") and a partial revert of
      commit 7390de1e ("Revert "KVM: x86: use the fast way to invalidate
      all pages""), i.e. restores the behavior of commit 5304b8d3 ("KVM:
      MMU: fast invalidate all pages") and commit 6ca18b69 ("KVM: x86:
      use the fast way to invalidate all pages") respectively.
      
      Fixes: d012a06a ("Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot"")
      Reported-by: James Harvey <jamespharvey20@gmail.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      002c5f73
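      A condensed sketch of the reintroduced mechanism (heavily abbreviated; the
      obsolete-page bookkeeping lives in kvm_zap_obsolete_pages()):

          static void kvm_mmu_zap_all_fast(struct kvm *kvm)
          {
                  spin_lock(&kvm->mmu_lock);

                  /* Bump the generation: every existing shadow page is now
                   * "obsolete" and ignored by new faults. */
                  kvm->arch.mmu_valid_gen++;

                  /* Zap obsolete pages; this may cond_resched_lock(mmu_lock),
                   * which is safe because new pages carry the new generation. */
                  kvm_zap_obsolete_pages(kvm);

                  spin_unlock(&kvm->mmu_lock);
          }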
  12. 12 Sep 2019, 1 commit
    • kvm: Nested KVM MMUs need PAE root too · 1cfff4d9
      Authored by Jiří Paleček
      On AMD processors, in PAE 32-bit mode, nested KVM instances don't
      work. The L0 host gets a kernel oops, which is related to
      arch.mmu->pae_root being NULL.
      
      The reason for this is that when setting up a nested KVM instance,
      arch.mmu is set to &arch.guest_mmu (while normally it would be
      &arch.root_mmu). However, the initialization and allocation of
      pae_root only creates it in root_mmu. KVM code (i.e. in
      mmu_alloc_shadow_roots) then accesses arch.mmu->pae_root, which is the
      unallocated arch.guest_mmu->pae_root.
      
      This fix just allocates (and frees) pae_root in both guest_mmu and
      root_mmu (and also lm_root if it was allocated), as sketched below. The
      allocation is subject to the previous restrictions, i.e. it won't
      allocate anything on 64-bit and, AFAIK, not on Intel.
      
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=203923
      Fixes: 14c07ad8 ("x86/kvm/mmu: introduce guest_mmu")
      Signed-off-by: Jiri Palecek <jpalecek@web.de>
      Tested-by: Jiri Palecek <jpalecek@web.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1cfff4d9
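      A hedged sketch of the shape of the fix (the allocation helper is given a
      struct kvm_mmu parameter so it can be called for both MMUs; details only
      approximate the code of that era):

          static int alloc_mmu_pages(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
          {
                  struct page *page;
                  int i;

                  /* Only needed for 32-bit (PAE) shadow roots; 64-bit hosts
                   * return early in the real code. */
                  page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_DMA32);
                  if (!page)
                          return -ENOMEM;

                  mmu->pae_root = page_address(page);
                  for (i = 0; i < 4; ++i)
                          mmu->pae_root[i] = INVALID_PAGE;

                  return 0;
          }

          /* At vCPU creation, call it for both vcpu->arch.root_mmu and
           * vcpu->arch.guest_mmu (and free both on teardown). */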
  13. 22 Aug 2019, 2 commits
    • KVM: x86/mmu: Consolidate "is MMIO SPTE" code · 26c44a63
      Authored by Sean Christopherson
      Replace the open-coded "is MMIO SPTE" checks in the MMU warnings
      related to software-based access/dirty tracking to make the code
      slightly more self-documenting.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      26c44a63
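      For reference, the consolidated check amounts to something like this
      (a sketch; the warnings then use the helper instead of open-coding the
      mask test):

          static bool is_mmio_spte(u64 spte)
          {
                  /* An SPTE is an MMIO SPTE when its masked bits match the
                   * globally configured MMIO value. */
                  return (spte & shadow_mmio_mask) == shadow_mmio_value;
          }

          /* e.g.: MMU_WARN_ON(is_mmio_spte(spte)); */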
    • KVM: x86/mmu: Add explicit access mask for MMIO SPTEs · 4af77151
      Authored by Sean Christopherson
      When shadow paging is enabled, KVM tracks the allowed access type for
      MMIO SPTEs so that it can do a permission check on a MMIO GVA cache hit
      without having to walk the guest's page tables.  The tracking is done
      by retaining the WRITE and USER bits of the access when inserting the
      MMIO SPTE (read access is implicitly allowed), which allows the MMIO
      page fault handler to retrieve and cache the WRITE/USER bits from the
      SPTE.
      
      Unfortunately for EPT, the mask used to retain the WRITE/USER bits is
      hardcoded using the x86 paging versions of the bits.  This funkiness
      happens to work because KVM uses a completely different mask/value for
      MMIO SPTEs when EPT is enabled, and the EPT mask/value just happens to
      overlap exactly with the x86 WRITE/USER bits[*].
      
      Explicitly define the access mask for MMIO SPTEs to accurately reflect
      that EPT does not want to incorporate any access bits into the SPTE, and
      so that KVM isn't subtly relying on EPT's WX bits always being set in
      MMIO SPTEs, e.g. attempting to use other bits for experimentation breaks
      horribly.
      
      Note, vcpu_match_mmio_gva() explicitly prevents matching GVA==0, and all
      TDP flows explicitly set mmio_gva to 0, i.e. zeroing vcpu->arch.access for
      EPT has no (known) functional impact.
      
      [*] Using WX to generate EPT misconfigurations (equivalent to reserved
          bit page faults) ensures KVM can employ its MMIO page fault tricks
          even on platforms without reserved address bits.
      
      Fixes: ce88decf ("KVM: MMU: mmio page fault support")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4af77151
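      A sketch of the explicit mask (signature and names approximate the commit
      and omit the special-mask details; EPT would pass 0 for access_mask,
      legacy shadow paging the WRITE|USER bits):

          u64 shadow_mmio_access_mask;

          void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask, u64 mmio_value,
                                          u64 access_mask)
          {
                  shadow_mmio_mask = mmio_mask;
                  shadow_mmio_value = mmio_value;
                  shadow_mmio_access_mask = access_mask;
          }

          /* In mark_mmio_spte():  access &= shadow_mmio_access_mask;
           * previously the code hardcoded ACC_WRITE_MASK | ACC_USER_MASK. */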
  14. 21 Aug 2019, 1 commit
  15. 24 Jul 2019, 1 commit
  16. 15 Jul 2019, 1 commit
    • x86: kvm: avoid constant-conversion warning · a6a6d3b1
      Authored by Arnd Bergmann
      clang flags as suspicious a construct that converts an unsigned
      character to a signed integer and back, causing an overflow:
      
      arch/x86/kvm/mmu.c:4605:39: error: implicit conversion from 'int' to 'u8' (aka 'unsigned char') changes value from -205 to 51 [-Werror,-Wconstant-conversion]
                      u8 wf = (pfec & PFERR_WRITE_MASK) ? ~w : 0;
                         ~~                               ^~
      arch/x86/kvm/mmu.c:4607:38: error: implicit conversion from 'int' to 'u8' (aka 'unsigned char') changes value from -241 to 15 [-Werror,-Wconstant-conversion]
                      u8 uf = (pfec & PFERR_USER_MASK) ? ~u : 0;
                         ~~                              ^~
      arch/x86/kvm/mmu.c:4609:39: error: implicit conversion from 'int' to 'u8' (aka 'unsigned char') changes value from -171 to 85 [-Werror,-Wconstant-conversion]
                      u8 ff = (pfec & PFERR_FETCH_MASK) ? ~x : 0;
                         ~~                               ^~
      
      Add an explicit cast to tell clang that everything works as
      intended here (see the sketch below).
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Link: https://github.com/ClangBuiltLinux/linux/issues/95
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a6a6d3b1
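      The shape of the fix, as a sketch (the three lines from the diagnostics
      above, with the truncation made explicit so -Wconstant-conversion sees an
      intentional u8-sized complement rather than an int overflow):

          u8 wf = (pfec & PFERR_WRITE_MASK) ? (u8)~w : 0;
          u8 uf = (pfec & PFERR_USER_MASK)  ? (u8)~u : 0;
          u8 ff = (pfec & PFERR_FETCH_MASK) ? (u8)~x : 0;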
  17. 13 Jul 2019, 1 commit
  18. 05 Jul 2019, 5 commits
  19. 19 Jun 2019, 2 commits