1. 28 Sep, 2020 (14 commits)
    • KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested · 5bcaf3e1
      Authored by Sean Christopherson
      Condition the accounting of a disallowed huge NX page on the original
      requested level of the page being greater than the current iterator
      level.  This does two things: accounts the page if and only if a huge
      page was actually disallowed, and accounts the shadow page if and only
      if it was the level at which the huge page was disallowed.  For the
      latter case, the previous logic would account all shadow pages used to
      create the translation for the forced small page, e.g. even PML4, which
      can't be a huge page on current hardware, would be accounted as having
      been a disallowed huge page when using 5-level EPT.
      
      The overzealous accounting is purely a performance issue, i.e. the
      recovery thread will spuriously zap shadow pages, but otherwise the bad
      behavior is harmless.
      
      Cc: Junaid Shahid <junaids@google.com>
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923183735.584-6-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
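      A minimal C sketch of the resulting accounting condition in __direct_map(), assuming the
      series' identifiers (huge_page_disallowed, req_level, account_huge_nx_page()); placement is
      approximate, not a verbatim diff:

          if (!is_shadow_present_pte(*it.sptep)) {
                  sp = kvm_mmu_get_page(vcpu, base_gfn, it.addr,
                                        it.level - 1, true, ACC_ALL);

                  link_shadow_page(vcpu, it.sptep, sp);
                  /* Account only if a huge page was requested at or above this level. */
                  if (huge_page_disallowed && req_level >= it.level)
                          account_huge_nx_page(vcpu->kvm, sp);
          }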
    • KVM: x86/mmu: Capture requested page level before NX huge page workaround · 3cf06612
      Authored by Sean Christopherson
      Apply the "huge page disallowed" adjustment of the max level only after
      capturing the original requested level.  The requested level will be
      used in a future patch to skip adding pages to the list of disallowed
      huge pages if a huge page wasn't possible anyway, e.g. if the page
      isn't mapped as a huge page in the host.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923183735.584-5-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
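      A rough sketch of the idea in kvm_mmu_hugepage_adjust(), with the new req_level
      out-parameter; the surrounding context is illustrative, not the exact hunk:

          *req_level = level = min(level, max_level);

          /* Enforce the NX huge page workaround only after capturing the
           * requested level, so later accounting can compare against it. */
          if (huge_page_disallowed)
                  return PG_LEVEL_4K;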
    • KVM: x86/mmu: Move "huge page disallowed" calculation into mapping helpers · 6c2fd34f
      Authored by Sean Christopherson
      Calculate huge_page_disallowed in __direct_map() and FNAME(fetch) in
      preparation for reworking the calculation so that it preserves the
      requested map level and eventually to avoid flagging a shadow page as
      being disallowed for being used as a large/huge page when it couldn't
      have been huge in the first place, e.g. because the backing page in the
      host is not large.
      
      Pass the error code into the helpers and use it to recalculate exec and
      write_fault instead of adding yet more booleans to the parameters.
      
      Opportunistically use huge_page_disallowed instead of lpage_disallowed
      to match the nomenclature used within the mapping helpers (though even
      they have existing inconsistencies).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923183735.584-4-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
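      A sketch of the calculation as it lands in __direct_map() (FNAME(fetch) is analogous);
      is_nx_huge_page_enabled() is the existing module-param helper:

          bool write_fault = error_code & PFERR_WRITE_MASK;
          bool exec = error_code & PFERR_FETCH_MASK;
          /* NX huge pages are only a concern for executable mappings. */
          bool huge_page_disallowed = exec && is_nx_huge_page_enabled();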
    • KVM: x86/mmu: Refactor the zap loop for recovering NX lpages · 7d919c7a
      Authored by Sean Christopherson
      Refactor the zap loop in kvm_recover_nx_lpages() to be a for loop that
      iterates on to_zap and drop the !to_zap check that leads to the in-loop
      calling of kvm_mmu_commit_zap_page().  The in-loop commit when to_zap
      hits zero is superfluous now that there's an unconditional commit after
      the loop to handle the case where lpage_disallowed_mmu_pages is emptied.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923183735.584-3-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
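      A sketch of the resulting loop in kvm_recover_nx_lpages(), including the unconditional
      commit after the loop (added by the commit in the next entry); identifiers follow the
      2020 code but are not guaranteed verbatim:

          for ( ; to_zap; --to_zap) {
                  if (list_empty(&kvm->arch.lpage_disallowed_mmu_pages))
                          break;
                  sp = list_first_entry(&kvm->arch.lpage_disallowed_mmu_pages,
                                        struct kvm_mmu_page,
                                        lpage_disallowed_link);
                  kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);

                  if (need_resched() || spin_needbreak(&kvm->mmu_lock)) {
                          kvm_mmu_commit_zap_page(kvm, &invalid_list);
                          cond_resched_lock(&kvm->mmu_lock);
                  }
          }
          /* Finish any zaps prepared before the list ran dry. */
          kvm_mmu_commit_zap_page(kvm, &invalid_list);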
    • KVM: x86/mmu: Commit zap of remaining invalid pages when recovering lpages · e8950569
      Authored by Sean Christopherson
      Call kvm_mmu_commit_zap_page() after exiting the "prepare zap" loop in
      kvm_recover_nx_lpages() to finish zapping pages in the unlikely event
      that the loop exited due to lpage_disallowed_mmu_pages being empty.
      Because the recovery thread drops mmu_lock when rescheduling, it's
      possible that lpage_disallowed_mmu_pages could be emptied by a different
      thread without to_zap reaching zero despite to_zap being derived from
      the number of disallowed lpages.
      
      Fixes: 1aa9b957 ("kvm: x86: mmu: Recovery of shattered NX large pages")
      Cc: Junaid Shahid <junaids@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923183735.584-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Bail early from final #PF handling on spurious faults · 12703759
      Authored by Sean Christopherson
      Detect spurious page faults, e.g. page faults that occur when multiple
      vCPUs simultaneously access a not-present page, and skip the SPTE write,
      prefetch, and stats update for spurious faults.
      
      Note, the performance benefits of skipping the write and prefetch are
      likely negligible, and the false positive stats adjustment is probably
      lost in the noise.  The primary motivation is to play nice with TDX's
      SEPT in the long term.  SEAMCALLs (to program SEPT entries) are quite
      costly, e.g. thousands of cycles, and a spurious SEPT update will result
      in a SEAMCALL error (which KVM will ideally treat as fatal).
      Reported-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923220425.18402-5-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
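      The core idea, as a simplified sketch (the real patch plumbs the "spurious" result up
      from set_spte() rather than open-coding the comparison like this):

          /* Another vCPU already installed the exact SPTE we wanted: skip the
           * write, the prefetch and the stats update. */
          if (is_shadow_present_pte(*sptep) && *sptep == new_spte)
                  return RET_PF_SPURIOUS;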
    • KVM: x86/mmu: Return unique RET_PF_* values if the fault was fixed · c4371c2a
      Authored by Sean Christopherson
      Introduce RET_PF_FIXED and RET_PF_SPURIOUS to provide unique return
      values instead of overloading RET_PF_RETRY.  In the short term, the
      unique values add clarity to the code and RET_PF_SPURIOUS will be used
      by set_spte() to avoid unnecessary work for spurious faults.
      
      In the long term, TDX will use RET_PF_FIXED to deterministically map
      memory during pre-boot.  The page fault flow may bail early for benign
      reasons, e.g. if the mmu_notifier fires for an unrelated address.  With
      only RET_PF_RETRY, it's impossible for the caller to distinguish between
      "cool, page is mapped" and "darn, need to try again", and thus cannot
      handle benign cases like the mmu_notifier retry.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923220425.18402-4-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
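      The resulting set of return codes, sketched as it appears in the MMU-internal header
      after this change (ordering and comments are illustrative):

          enum {
                  RET_PF_RETRY = 0,  /* let the CPU fault again on the address */
                  RET_PF_EMULATE,    /* MMIO page fault, emulate the instruction directly */
                  RET_PF_INVALID,    /* the SPTE is invalid, let the real page fault path update it */
                  RET_PF_FIXED,      /* the faulting entry has been fixed */
                  RET_PF_SPURIOUS,   /* the faulting entry was already fixed, e.g. by another vCPU */
          };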
    • KVM: x86/mmu: Invert RET_PF_* check when falling through to emulation · 83a2ba4c
      Authored by Sean Christopherson
      Explicitly check for RET_PF_EMULATE instead of implicitly doing the same
      by checking for !RET_PF_RETRY (RET_PF_INVALID is handled earlier).  This
      will allow adding new RET_PF_* types in future patches without breaking the
      emulation path.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923220425.18402-3-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
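      A sketch of the inverted check at the tail of kvm_mmu_page_fault(); RET_PF_INVALID has
      already been handled by this point:

          /* Emulate only when explicitly requested; previously this read
           * "if (r == RET_PF_RETRY) return 1;", i.e. emulate on anything else. */
          if (r != RET_PF_EMULATE)
                  return 1;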
    • KVM: x86/mmu: Return -EIO if page fault returns RET_PF_INVALID · 7b367bc9
      Authored by Sean Christopherson
      Exit to userspace with an error if the MMU is buggy and returns
      RET_PF_INVALID when servicing a page fault.  This will allow a future
      patch to invert the emulation path, i.e. emulate only on RET_PF_EMULATE
      instead of emulating on anything but RET_PF_RETRY.  This technically
      means that KVM will exit to userspace instead of emulating on
      RET_PF_INVALID, but practically speaking it's a nop as the MMU never
      returns RET_PF_INVALID.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923220425.18402-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
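      A sketch of the guard in kvm_mmu_page_fault(), assuming the existing
      kvm_mmu_do_page_fault() entry point:

          if (r == RET_PF_INVALID) {
                  r = kvm_mmu_do_page_fault(vcpu, cr2_or_gpa,
                                            lower_32_bits(error_code), false);
                  /* A buggy MMU returning RET_PF_INVALID is fatal: punt to
                   * userspace rather than quietly falling through to emulation. */
                  if (WARN_ON_ONCE(r == RET_PF_INVALID))
                          return -EIO;
          }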
    • KVM: x86/MMU: Recursively zap nested TDP SPs when zapping last/only parent · 2de4085c
      Authored by Ben Gardon
      Recursively zap all to-be-orphaned children, unsynced or otherwise, when
      zapping a shadow page for a nested TDP MMU.  KVM currently only zaps the
      unsynced child pages, but not the synced ones.  This can create problems
      over time when running many nested guests because it leaves unlinked
      pages which will not be freed until the page quota is hit. With the
      default page quota of 20 shadow pages per 1000 guest pages, this looks
      like a memory leak and can degrade MMU performance.
      
      In a recent benchmark, substantial performance degradation was observed:
      an L1 guest was booted with 64G memory, and 2G nested Windows guests were
      booted 10 at a time for 20 iterations (200 boots in total). Windows was
      used in this benchmark because it touches all of its memory on startup.
      By the end of the benchmark, the nested guests were taking ~10% longer
      to boot. With this patch there is no degradation in boot time.
      Without this patch the benchmark ends with hundreds of thousands of
      stale EPT02 pages cluttering up rmaps and the page hash map. As a
      result, VM shutdown is also much slower: deleting memslot 0 was
      observed to take over a minute. With this patch it takes just a
      few milliseconds.
      
      Cc: Peter Shier <pshier@google.com>
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923221406.16297-3-sean.j.christopherson@intel.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
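      A sketch of the recursion added to mmu_page_zap_pte(); the helper names
      (to_shadow_page(), drop_parent_pte()) follow the 2020 code but are approximate:

          child = to_shadow_page(pte & PT64_BASE_ADDR_MASK);
          drop_parent_pte(child, spte);

          /* Recursively zap a now-parentless nested (guest_mode) TDP SP instead of
           * leaving it unlinked until the page quota forces it out. */
          if (tdp_enabled && invalid_list &&
              child->role.guest_mode && !child->parent_ptes.val)
                  return kvm_mmu_prepare_zap_page(kvm, child, invalid_list);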
    • KVM: x86/mmu: Move flush logic from mmu_page_zap_pte() to FNAME(invlpg) · ace569e0
      Authored by Sean Christopherson
      Move the logic that controls whether or not FNAME(invlpg) needs to flush
      fully into FNAME(invlpg) so that mmu_page_zap_pte() doesn't return a
      value.  This allows a future patch to redefine the return semantics for
      mmu_page_zap_pte() so that it can recursively zap orphaned child shadow
      pages for nested TDP MMUs.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923221406.16297-2-sean.j.christopherson@intel.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
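      Roughly, FNAME(invlpg) now makes the flush decision itself; the arguments to the flush
      helper below are illustrative:

          u64 old_spte = *sptep;

          mmu_page_zap_pte(vcpu->kvm, sp, sptep);
          if (is_shadow_present_pte(old_spte))
                  kvm_flush_remote_tlbs_with_address(vcpu->kvm, sp->gfn,
                          KVM_PAGES_PER_HPAGE(sp->role.level));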
    • KVM: x86/mmu: Stash 'kvm' in a local variable in kvm_mmu_free_roots() · 4d710de9
      Authored by Sean Christopherson
      To make kvm_mmu_free_roots() a bit more readable, capture 'kvm' in a
      local variable instead of doing vcpu->kvm over and over (and over).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923191204.8410-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
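      The change is purely cosmetic, roughly:

          struct kvm *kvm = vcpu->kvm;

          spin_lock(&kvm->mmu_lock);      /* instead of vcpu->kvm->mmu_lock, etc. */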
    • KVM: x86: Move illegal GPA helper out of the MMU code · dc46515c
      Authored by Sean Christopherson
      Rename kvm_mmu_is_illegal_gpa() to kvm_vcpu_is_illegal_gpa() and move it
      to cpuid.h so that it's colocated with cpuid_maxphyaddr().  The helper
      is not MMU specific and will gain a user that is completely unrelated to
      the MMU in a future patch.
      
      No functional change intended.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200924194250.19137-5-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
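      A sketch of the relocated helper as it sits next to cpuid_maxphyaddr() in cpuid.h:

          static inline bool kvm_vcpu_is_illegal_gpa(struct kvm_vcpu *vcpu, gpa_t gpa)
          {
                  return (gpa >= BIT_ULL(cpuid_maxphyaddr(vcpu)));
          }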
    • KVM: x86: Add kvm_x86_ops hook to short circuit emulation · 09e3e2a1
      Authored by Sean Christopherson
      Replace the existing kvm_x86_ops.need_emulation_on_page_fault() with a
      more generic is_emulatable(), and unconditionally call the new function
      in x86_emulate_instruction().
      
      KVM will use the generic hook to support multiple security related
      technologies that prevent emulation in one way or another.  Similar to
      the existing AMD #NPF case where emulation of the current instruction is
      not possible due to lack of information, AMD's SEV-ES and Intel's SGX
      and TDX will introduce scenarios where emulation is impossible due to
      the guest's register state being inaccessible.  And again similar to the
      existing #NPF case, emulation can be initiated by kvm_mmu_page_fault(),
      i.e. outside of the control of vendor-specific code.
      
      While the cause and architecturally visible behavior of the various
      cases are different, e.g. SGX will inject a #UD, AMD #NPF is a clean
      resume or complete shutdown, and SEV-ES and TDX "return" an error, the
      impact on the common emulation code is identical: KVM must stop
      emulation immediately and resume the guest.
      
      Query is_emulatable() in handle_ud() as well so that the
      force_emulation_prefix code doesn't incorrectly modify RIP before
      calling emulate_instruction() in the absurdly unlikely scenario that
      KVM encounters forced emulation in conjunction with "do not emulate".
      
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200915232702.15945-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
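      A sketch of the call site in x86_emulate_instruction(), using the is_emulatable() name
      from the message above (the hook name in the merged code may differ):

          /* Vendor code says the instruction cannot be emulated: resume the guest. */
          if (unlikely(!kvm_x86_ops.is_emulatable(vcpu, insn, insn_len)))
                  return 1;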
  2. 12 Sep, 2020 (1 commit)
    • kvm x86/mmu: use KVM_REQ_MMU_SYNC to sync when needed · f6f6195b
      Authored by Lai Jiangshan
      When kvm_mmu_get_page() gets a page with unsynced children, the spt
      pagetable is unsynchronized with the guest pagetable. But the
      guest might not issue a "flush" operation on it, e.g. when a pagetable
      entry is changed from zero (non-present). The hypervisor therefore has the
      responsibility to synchronize the pagetables.
      
      KVM behaved as above for many years, but commit 8c8560b8
      ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes")
      inadvertently included a line of code to change it without giving any
      reason in the changelog. It is clear that the commit's intention was to
      change KVM_REQ_TLB_FLUSH -> KVM_REQ_TLB_FLUSH_CURRENT, so we don't
      needlessly flush other contexts; however, one of the hunks changed
      a nearby KVM_REQ_MMU_SYNC instead.  This patch changes it back.
      
      Link: https://lore.kernel.org/lkml/20200320212833.3507-26-sean.j.christopherson@intel.com/
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20200902135421.31158-1-jiangshanlai@gmail.com>
      Fixes: 8c8560b8 ("KVM: x86/mmu: Use KVM_REQ_TLB_FLUSH_CURRENT for MMU specific flushes")
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
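      The one-hunk revert, sketched (kvm_mmu_get_page(), on finding a page with unsynced
      children):

          if (sp->unsync_children)
                  kvm_make_request(KVM_REQ_MMU_SYNC, vcpu);  /* not KVM_REQ_TLB_FLUSH_CURRENT */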
  3. 24 Aug, 2020 (1 commit)
  4. 22 Aug, 2020 (1 commit)
    • KVM: Pass MMU notifier range flags to kvm_unmap_hva_range() · fdfe7cbd
      Authored by Will Deacon
      The 'flags' field of 'struct mmu_notifier_range' is used to indicate
      whether invalidate_range_{start,end}() are permitted to block. In the
      case of kvm_mmu_notifier_invalidate_range_start(), this field is not
      forwarded on to the architecture-specific implementation of
      kvm_unmap_hva_range() and therefore the backend cannot sensibly decide
      whether or not to block.
      
      Add an extra 'flags' parameter to kvm_unmap_hva_range() so that
      architectures are aware as to whether or not they are permitted to block.
      
      Cc: <stable@vger.kernel.org>
      Cc: Marc Zyngier <maz@kernel.org>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      Message-Id: <20200811102725.7121-2-will@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
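      The widened hook, sketched for the x86 flavor (other architectures gain the same extra
      parameter):

          int kvm_unmap_hva_range(struct kvm *kvm, unsigned long start,
                                  unsigned long end, unsigned flags);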
  5. 31 Jul, 2020 (5 commits)
  6. 17 Jul, 2020 (1 commit)
  7. 11 Jul, 2020 (6 commits)
    • KVM: x86: mmu: Add guest physical address check in translate_gpa() · ec7771ab
      Authored by Mohammed Gamal
      Intel processors of various generations have supported 36, 39, 46 or 52
      bits for physical addresses.  Until IceLake introduced MAXPHYADDR==52,
      running on a machine with higher MAXPHYADDR than the guest more or less
      worked, because software that relied on reserved address bits (like KVM)
      generally used bit 51 as a marker and therefore the page faults were
      generated anyway.
      
      Unfortunately this is not true anymore if the host MAXPHYADDR is 52,
      and this can cause problems when migrating from a MAXPHYADDR<52
      machine to one with MAXPHYADDR==52.  Typically, the latter are machines
      that support 5-level page tables, so they can be identified easily from
      the LA57 CPUID bit.
      
      When that happens, the guest might have a physical address with reserved
      bits set, but the host won't see that and trap it.  Hence, we need
      to check page faults' physical addresses against the guest's maximum
      physical memory and if it's exceeded, we need to add the PFERR_RSVD_MASK
      bits to the page fault error code.
      
      This patch does this for the MMU's page walks.  The next patches will
      ensure that the correct exception and error code is produced whenever
      no host-reserved bits are set in page table entries.
      Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20200710154811.418214-4-mgamal@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
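      A sketch of the mmu.c walk-level check described above, assuming the
      kvm_vcpu_is_illegal_gpa() helper from the companion patches:

          static gpa_t translate_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
                                     struct x86_exception *exception)
          {
                  /* A GPA beyond the guest's MAXPHYADDR is reported as a reserved-bit fault. */
                  if (kvm_vcpu_is_illegal_gpa(vcpu, gpa)) {
                          exception->error_code |= PFERR_RSVD_MASK;
                          return UNMAPPED_GVA;
                  }
                  return gpa;
          }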
    • KVM: x86: mmu: Move translate_gpa() to mmu.c · cd313569
      Authored by Mohammed Gamal
      There's also no point in it being inline, since it's always called through
      function pointers, so drop the inline as well.
      Signed-off-by: Mohammed Gamal <mgamal@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20200710154811.418214-3-mgamal@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: drop superfluous mmu_check_root() from fast_pgd_switch() · fe9304d3
      Authored by Vitaly Kuznetsov
      The mmu_check_root() check in fast_pgd_switch() seems to be
      superfluous: when GPA is outside of the visible range
      cached_root_available() will fail for non-direct roots
      (as we can't have a matching one on the list) and we don't
      seem to care for direct ones.
      
      Also, raising #TF immediately when a non-existent GFN is written to CR3
      doesn't seem to match architectural behavior. Drop the check.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200710141157.1640173-10-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: implement nested_svm_load_cr3() and use it for host->guest switch · a506fdd2
      Authored by Vitaly Kuznetsov
      Undesired triple fault gets injected to L1 guest on SVM when L2 is
      launched with certain CR3 values. #TF is raised by mmu_check_root()
      check in fast_pgd_switch() and the root cause is that when
      kvm_set_cr3() is called from nested_prepare_vmcb_save() with NPT
      enabled CR3 points to a nGPA so we can't check it with
      kvm_is_visible_gfn().
      
      Using generic kvm_set_cr3() when switching to nested guest is not
      a great idea as we'll have to distinguish between 'real' CR3s and
      'nested' CR3s so as to, e.g., not call kvm_mmu_new_pgd() with an nGPA.
      Following nVMX, implement a nested-specific nested_svm_load_cr3() to do the job.
      
      To support the change, nested_svm_load_cr3() needs to be re-ordered
      with nested_svm_init_mmu_context().
      
      Note: the current implementation is sub-optimal as we always do TLB
      flush/MMU sync but this is still an improvement as we at least stop doing
      kvm_mmu_reset_context().
      
      Fixes: 7c390d35 ("kvm: x86: Add fast CR3 switch code path")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200710141157.1640173-8-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
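      The shape of the new helper, modeled on its nVMX counterpart nested_vmx_load_cr3()
      (signature assumed, body elided):

          static int nested_svm_load_cr3(struct kvm_vcpu *vcpu, unsigned long cr3,
                                         bool nested_npt);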
    • KVM: MMU: stop dereferencing vcpu->arch.mmu to get the context for MMU init · 8c008659
      Authored by Paolo Bonzini
      kvm_init_shadow_mmu() was actually the only function that could be called
      with different vcpu->arch.mmu values.  Now that kvm_init_shadow_npt_mmu()
      is separated from kvm_init_shadow_mmu(), we always know the MMU context
      we need to use and there is no need to dereference vcpu->arch.mmu pointer.
      
      Based on a patch by Vitaly Kuznetsov <vkuznets@redhat.com>.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200710141157.1640173-3-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: split kvm_init_shadow_npt_mmu() from kvm_init_shadow_mmu() · 0f04a2ac
      Authored by Vitaly Kuznetsov
      As a preparatory change for moving kvm_mmu_new_pgd() from
      nested_prepare_vmcb_save() to nested_svm_init_mmu_context() split
      kvm_init_shadow_npt_mmu() from kvm_init_shadow_mmu(). This also makes
      the code look more like nVMX (kvm_init_shadow_ept_mmu()).
      
      No functional change intended.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200710141157.1640173-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
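      The split entry point, sketched (parameter list assumed from the shadow-MMU init path):

          void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, u32 cr0, u32 cr4,
                                       u32 efer, gpa_t nested_cr3);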
  8. 10 Jul, 2020 (11 commits)