1. 23 Feb, 2019 1 commit
  2. 22 Feb, 2019 4 commits
    • P
      Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-next · 0a0c50f7
      Paul Mackerras committed
      This merges in the "ppc-kvm" topic branch of the powerpc tree to get a
      series of commits that touch both general arch/powerpc code and KVM
      code.  These commits will be merged both via the KVM tree and the
      powerpc tree.
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      0a0c50f7
    • M
      powerpc/kvm: Save and restore host AMR/IAMR/UAMOR · c3c7470c
      Michael Ellerman committed
      When the hash MMU is active the AMR, IAMR and UAMOR are used for
      pkeys. The AMR is directly writable by user space, and the UAMOR masks
      those writes, meaning both registers are effectively user register
      state. The IAMR is used to create an execute only key.
      
      Also we must maintain the value of at least the AMR when running in
      process context, so that any memory accesses done by the kernel on
      behalf of the process are correctly controlled by the AMR.
      
      Although we are correctly switching all registers when going into a
      guest, on returning to the host we just write 0 into all regs, except
      on Power9 where we restore the IAMR correctly.
      
      This could be observed by a user process if it writes the AMR, then
      runs a guest and we then return immediately to it without
      rescheduling. Because we have written 0 to the AMR that would have the
      effect of granting read/write permission to pages that the process was
      trying to protect.
      
      In addition, when using the Radix MMU, the AMR can prevent inadvertent
      kernel access to userspace data; writing 0 to the AMR disables that
      protection.
      
      So save and restore AMR, IAMR and UAMOR.
      
      Fixes: cf43d3b2 ("powerpc: Enable pkey subsystem")
      Cc: stable@vger.kernel.org # v4.16+
      Signed-off-by: Russell Currey <ruscur@russell.cc>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      c3c7470c
    • A
      KVM: PPC: Book3S: Improve KVM reference counting · 716cb116
      Alexey Kardashevskiy committed
      The anon fd's ops release the KVM reference in the release hook.
      However, we take the reference to the KVM object after we create the fd,
      so there is a small window in which the release function can be called
      and drop its reference to the KVM object, potentially freeing it.
      
      It is not a problem at the moment as the file is created and KVM is
      referenced under the KVM lock and the release function obtains the same
      lock before dereferencing the KVM (although the lock is not held when
      calling kvm_put_kvm()) but it is potentially fragile against future changes.
      
      This references the KVM object before creating a file.
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      716cb116
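The ordering described in the commit above (take the reference first, then create the file, and drop the reference again if creation fails) can be sketched in plain C. This is only an illustration under stated assumptions: `struct vm`, `vm_get()`, `vm_put()`, `create_anon_fd()`, `fd_release()` and `create_device_fd()` are hypothetical stand-ins, not the kernel's actual functions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted VM object. */
struct vm {
	int refcount;
	bool freed;
};

static void vm_get(struct vm *vm)
{
	vm->refcount++;
}

static void vm_put(struct vm *vm)
{
	if (--vm->refcount == 0)
		vm->freed = true;	/* a real implementation would free the object here */
}

/* Hypothetical fd creation; 'fail' simulates the error path. Returns an fd or -1. */
static int create_anon_fd(struct vm *vm, bool fail)
{
	(void)vm;
	return fail ? -1 : 3;
}

/* The fd's release hook drops the reference that was taken on its behalf. */
static void fd_release(struct vm *vm)
{
	vm_put(vm);
}

static int create_device_fd(struct vm *vm, bool fail)
{
	int fd;

	vm_get(vm);			/* take the reference before the fd exists */
	fd = create_anon_fd(vm, fail);	/* from here on the release hook may run */
	if (fd < 0)
		vm_put(vm);		/* creation failed: undo the reference */
	return fd;
}
```

With this ordering the release hook can never drop a reference that has not been taken yet, so correctness no longer depends on which locks happen to be held around file creation.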
    • J
      KVM: PPC: Book3S HV: Fix build failure without IOMMU support · e40542af
      Jordan Niethe committed
      Currently trying to build without IOMMU support will fail:
      
        (.text+0x1380): undefined reference to `kvmppc_h_get_tce'
        (.text+0x1384): undefined reference to `kvmppc_rm_h_put_tce'
        (.text+0x149c): undefined reference to `kvmppc_rm_h_stuff_tce'
        (.text+0x14a0): undefined reference to `kvmppc_rm_h_put_tce_indirect'
      
      This happens because turning off IOMMU support will prevent
      book3s_64_vio_hv.c from being built because it is only built when
      SPAPR_TCE_IOMMU is set, which depends on IOMMU support.
      
      Fix it using ifdefs for the undefined references.
      
      Fixes: 76d837a4 ("KVM: PPC: Book3S PR: Don't include SPAPR TCE code on non-pseries platforms")
      Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      e40542af
  3. 21 Feb, 2019 35 commits
    • P
      powerpc/64s: Better printing of machine check info for guest MCEs · c0577201
      Paul Mackerras committed
      This adds an "in_guest" parameter to machine_check_print_event_info()
      so that we can avoid trying to translate guest NIP values into
      symbolic form using the host kernel's symbol table.
      Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c0577201
    • P
      KVM: PPC: Book3S HV: Simplify machine check handling · 884dfb72
      Paul Mackerras committed
      This makes the handling of machine check interrupts that occur inside
      a guest simpler and more robust, with less done in assembler code and
      in real mode.
      
      Now, when a machine check occurs inside a guest, we always get the
      machine check event struct and put a copy in the vcpu struct for the
      vcpu where the machine check occurred.  We no longer call
      machine_check_queue_event() from kvmppc_realmode_mc_power7(), because
      on POWER8, when a vcpu is running on an offline secondary thread and
      we call machine_check_queue_event(), that calls irq_work_queue(),
      which doesn't work because the CPU is offline, but instead triggers
      the WARN_ON(lazy_irq_pending()) in pnv_smp_cpu_kill_self() (which
      fires again and again because nothing clears the condition).
      
      All that machine_check_queue_event() actually does is to cause the
      event to be printed to the console.  For a machine check occurring in
      the guest, we now print the event in kvmppc_handle_exit_hv()
      instead.
      
      The assembly code at label machine_check_realmode now just calls C
      code and then continues exiting the guest.  We no longer either
      synthesize a machine check for the guest in assembly code or return
      to the guest without a machine check.
      
      The code in kvmppc_handle_exit_hv() is extended to handle the case
      where the guest is not FWNMI-capable.  In that case we now always
      synthesize a machine check interrupt for the guest.  Previously, if
      the host thinks it has recovered the machine check fully, it would
      return to the guest without any notification that the machine check
      had occurred.  If the machine check was caused by some action of the
      guest (such as creating duplicate SLB entries), it is much better to
      tell the guest that it has caused a problem.  Therefore we now always
      generate a machine check interrupt for guests that are not
      FWNMI-capable.
      Reviewed-by: Aravinda Prasad <aravinda@linux.vnet.ibm.com>
      Reviewed-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      884dfb72
    • M
      KVM: PPC: Book3S HV: Context switch AMR on Power9 · d976f680
      Michael Ellerman committed
      kvmhv_p9_guest_entry() implements a fast-path guest entry for Power9
      when guest and host are both running with the Radix MMU.
      
      Currently in that path we don't save the host AMR (Authority Mask
      Register) value, and we always restore 0 on return to the host. That
      is OK at the moment because the AMR is not used for storage keys with
      the Radix MMU.
      
      However we plan to start using the AMR on Radix to prevent the kernel
      from reading/writing to userspace outside of copy_to/from_user(). In
      order to make that work we need to save/restore the AMR value.
      
      We only restore the value if it is different from the guest value,
      which is already in the register when we exit to the host. This should
      mean we rarely need to actually restore the value when running a
      modern Linux as a guest, because it will be using the same value as
      us.
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Tested-by: Russell Currey <ruscur@russell.cc>
      d976f680
    • L
      Revert "KVM: Eliminate extra function calls in kvm_get_dirty_log_protect()" · a67794ca
      Lan Tianyu committed
      The value of "dirty_bitmap[i]" is already checked before it is assigned
      to mask, so the subsequent check of "mask" is redundant. That check was
      introduced by commit 58d2930f ("KVM: Eliminate extra function calls in
      kvm_get_dirty_log_protect()"); revert it.
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a67794ca
    • M
      x86: kvmguest: use TSC clocksource if invariant TSC is exposed · 7539b174
      Marcelo Tosatti committed
      The invariant TSC bit has the following meaning:
      
      "The time stamp counter in newer processors may support an enhancement,
      referred to as invariant TSC. Processor's support for invariant TSC
      is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run
      at a constant rate in all ACPI P-, C-, and T-states. This is the
      architectural behavior moving forward. On processors with invariant TSC
      support, the OS may use the TSC for wall clock timer services (instead
      of ACPI or HPET timers). TSC reads are much more efficient and do not
      incur the overhead associated with a ring transition or access to a
      platform resource."
      
      IOW, the TSC does not change frequency. In that case, and with
      TSC scaling hardware available to handle migration, it is possible
      to use the TSC clocksource directly, whose system calls are
      faster.
      
      Reduce the rating of kvmclock clocksource to allow TSC clocksource
      to be the default if invariant TSC is exposed.
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      
      v2: Use feature bits and tsc_unstable() check (Sean Christopherson)
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7539b174
    • N
      KVM: Never start grow vCPU halt_poll_ns from value below halt_poll_ns_grow_start · dee339b5
      Nir Weiner committed
      grow_halt_poll_ns() behaves strangely when
      (vcpu->halt_poll_ns != 0) &&
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start).
      
      In this case, vcpu->halt_poll_ns is multiplied by the grow factor
      (halt_poll_ns_grow), so several grow iterations are required to
      reach a value bigger than halt_poll_ns_grow_start.
      This means that growing vcpu->halt_poll_ns from a positive value less
      than halt_poll_ns_grow_start is slower than growing it from 0,
      which is misleading and inconsistent.
      
      Fix the issue by changing grow_halt_poll_ns() to set vcpu->halt_poll_ns
      to halt_poll_ns_grow_start whenever
      (vcpu->halt_poll_ns < halt_poll_ns_grow_start),
      regardless of whether vcpu->halt_poll_ns is 0.
      
      Use READ_ONCE to get a consistent number for all cases.
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      dee339b5
    • N
      KVM: Expose the initial start value in grow_halt_poll_ns() as a module parameter · 49113d36
      Nir Weiner committed
      The hard-coded value 10000 in grow_halt_poll_ns() stands for the initial
      start value when raising vcpu->halt_poll_ns.
      It effectively sets the timeout of the first polling session.
      This value has a significant effect on how tolerant we are to outliers.
      In the standard case, a higher value is better: we will spend more time
      in the polling busy loop, handle events/interrupts faster and get
      better performance.
      For outliers, however, it puts us in a busy loop that does nothing.
      Even if the shrink factor is zero, we will still waste time on the first
      iteration.
      The optimal value differs between workloads. It depends on
      the outlier rate and the length of the polling sessions.
      As this value has a significant effect on the dynamic halt-polling
      algorithm, it should be configurable and exposed.
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      49113d36
    • N
      KVM: grow_halt_poll_ns() should never shrink vCPU halt_poll_ns · 7fa08e71
      Nir Weiner committed
      grow_halt_poll_ns() behaves strangely when
      (halt_poll_ns_grow == 0) && (vcpu->halt_poll_ns != 0).
      
      In this case, vcpu->halt_poll_ns is set to zero,
      which shrinks the value instead of growing it.
      
      Fix the issue by changing grow_halt_poll_ns() to not modify
      vcpu->halt_poll_ns when halt_poll_ns_grow is zero.
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Nir Weiner <nir.weiner@oracle.com>
      Suggested-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7fa08e71
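Taken together, the three halt-polling commits above describe a simple grow rule. The following is a minimal user-space sketch of that rule as the commit messages state it, not the kernel function itself: `grow_poll_ns()` and its parameters are hypothetical names, the module parameters are passed as arguments, and any upper bound on the result is deliberately left out.

```c
/*
 * Sketch of the grow rule described by the commit messages:
 *   - a grow factor of 0 leaves the current value untouched,
 *   - any value below grow_start jumps straight to grow_start,
 *   - otherwise the value is multiplied by the grow factor.
 * All names are hypothetical; no upper bound is applied here.
 */
static unsigned int grow_poll_ns(unsigned int cur, unsigned int grow,
				 unsigned int grow_start)
{
	if (grow == 0)
		return cur;		/* growing must never shrink the value */
	if (cur < grow_start)
		return grow_start;	/* same result for 0 and small positive values */
	return cur * grow;
}
```

For example, with a grow factor of 2 and a start value of 10000, both 0 and 3000 grow to 10000 in one step, while 10000 grows to 20000.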
    • S
      KVM: x86/mmu: Consolidate kvm_mmu_zap_all() and kvm_mmu_zap_mmio_sptes() · 8ab3c471
      Sean Christopherson committed
      ...via a new helper, __kvm_mmu_zap_all().  An alternative to passing a
      'bool mmio_only' would be to pass a callback function to filter the
      shadow page, i.e. to make __kvm_mmu_zap_all() generic and reusable, but
      zapping all shadow pages is a last resort, i.e. making the helper less
      extensible is a feature of sorts.  And the explicit MMIO parameter makes
      it easy to preserve the WARN_ON_ONCE() if a restart is triggered when
      zapping MMIO sptes.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8ab3c471
    • S
      KVM: x86/mmu: WARN if zapping a MMIO spte results in zapping children · 24efe61f
      Sean Christopherson committed
      Paolo expressed a concern that kvm_mmu_zap_mmio_sptes() could have a
      quadratic runtime[1], i.e. restarting the spte walk while zapping only
      MMIO sptes could result in re-walking large portions of the list over
      and over due to the non-MMIO sptes encountered before the restart not
      being removed.
      
      At the time, the concern was legitimate as the walk was restarted when
      any spte was zapped.  But that is no longer the case as the walk is now
      restarted iff one or more children have been zapped, which is necessary
      because zapping children makes the active_mmu_pages list unstable.
      
      Furthermore, it should be impossible for an MMIO spte to have children,
      i.e. zapping an MMIO spte should never result in zapping children.  In
      other words, kvm_mmu_zap_mmio_sptes() should never restart its walk, and
      so should always execute in linear time.  WARN if this assertion fails.
      
      Although it should never be needed, leave the restart logic in place.
      In normal operation, the cost is at worst an extra CMP+Jcc, and if for
      some reason the list does become unstable, not restarting would likely
      crash KVM, or worse, the kernel.
      
      [1] https://patchwork.kernel.org/patch/10756589/#22452085
      
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      24efe61f
    • S
      KVM: x86/mmu: Differentiate between nr zapped and list unstable · 83cdb568
      Sean Christopherson committed
      The return value of kvm_mmu_prepare_zap_page() has evolved to become
      overloaded to convey two separate pieces of information.  1) was at
      least one page zapped and 2) has the list of MMU pages become unstable.
      
      In its original incarnation (as kvm_mmu_zap_page()), there was no
      return value at all.  Commit 07385413 ("KVM: MMU: awareness of new
      kvm_mmu_zap_page behaviour") added a return value in preparation for
      commit 4731d4c7 ("KVM: MMU: out of sync shadow core").  Although
      the return value was of type 'int', it was actually used as a boolean
      to indicate whether or not active_mmu_pages may have become unstable due
      to zapping children.  Walking a list with list_for_each_entry_safe()
      only protects against deleting/moving the current entry, i.e. zapping a
      child page would break iteration due to modifying any number of entries.
      
      Later, commit 60c8aec6 ("KVM: MMU: use page array in unsync walk")
      modified mmu_zap_unsync_children() to return an approximation of the
      number of children zapped.  This was not intentional, it was simply a
      side effect of how the code was written.
      
      The unintended side effect was then morphed into an actual feature by
      commit 77662e00 ("KVM: MMU: fix kvm_mmu_zap_page() and its calling
      path"), which modified kvm_mmu_change_mmu_pages() to use the number of
      zapped pages when determining the number of MMU pages in use by the VM.
      
      Finally, commit 54a4f023 ("KVM: MMU: make kvm_mmu_zap_page() return
      the number of pages it actually freed") added the initial page to the
      return value to make its behavior more consistent with what most users
      would expect.  Incorporating the initial parent page in the return value
      of kvm_mmu_zap_page() breaks the original usage of restarting a list
      walk on a non-zero return value to handle a potentially unstable list,
      i.e. walks will unnecessarily restart when any page is zapped.
      
      Fix this by restoring the original behavior of kvm_mmu_zap_page(), i.e.
      return a boolean to indicate that the list may be unstable and move the
      number of zapped children to a dedicated parameter.  Since the majority
      of callers to kvm_mmu_prepare_zap_page() don't care about either return
      value, preserve the current definition of kvm_mmu_prepare_zap_page() by
      making it a wrapper of a new helper, __kvm_mmu_prepare_zap_page().  This
      avoids having to update every call site and also provides cleaner code
      for functions that only care about the number of pages zapped.
      
      Fixes: 54a4f023 ("KVM: MMU: make kvm_mmu_zap_page() return
                            the number of pages it actually freed")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      83cdb568
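The split described in the commit above, a boolean return value for "the list may be unstable" plus a dedicated out-parameter for the page count, wrapped by a helper that keeps the old call signature, might look roughly like the sketch below. Everything here is a toy model: `struct page`, `zap_page_full()` and `zap_page()` are hypothetical names, and counting the zapped page itself in the total is an assumption of this sketch.

```c
#include <stdbool.h>

/* Hypothetical shadow page: only the number of children matters here. */
struct page {
	int nr_children;
};

/*
 * Zap 'p' and its children.
 * Returns true if the page list may have become unstable, which in this
 * sketch happens only when children were zapped.
 * *nr_zapped receives the total number of pages zapped, including 'p'
 * itself (an assumption of this sketch).
 */
static bool zap_page_full(const struct page *p, int *nr_zapped)
{
	*nr_zapped = p->nr_children + 1;
	return p->nr_children > 0;
}

/* Wrapper for the callers that only need to know about list stability. */
static bool zap_page(const struct page *p)
{
	int nr_zapped;

	return zap_page_full(p, &nr_zapped);
}
```

A list walk would then restart only when the boolean is true, so zapping a childless page no longer forces a restart even though one page was zapped.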
    • S
      Revert "KVM: MMU: fast invalidate all pages" · ea145aac
      Sean Christopherson committed
      Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
      from the original series[1], now that all users of the fast invalidate
      mechanism are gone.
      
      This reverts commit 5304b8d3.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ea145aac
    • S
      KVM: x86/mmu: Voluntarily reschedule as needed when zapping all sptes · 5d6317ca
      Sean Christopherson committed
      Call cond_resched_lock() when zapping all sptes to reschedule if needed
      or to release and reacquire mmu_lock in case of contention.  There is no
      need to flush or zap when temporarily dropping mmu_lock as zapping all
      sptes is done only when the owning userspace VMM has exited or when the
      VM is being destroyed, i.e. there is no interplay with memslots or MMIO
      generations to worry about.
      
      Be paranoid and restart the walk if mmu_lock is dropped to avoid any
      potential issues with consuming a stale iterator.  The overhead in doing
      so is negligible as at worst there will be a few root shadow pages at
      the head of the list, i.e. the iterator is essentially the head of the
      list already.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5d6317ca
    • S
      KVM: x86/mmu: skip over invalid root pages when zapping all sptes · 8a674adc
      Sean Christopherson committed
      ...to guarantee forward progress.  When zapped, root pages are marked
      invalid and moved to the head of the active pages list until they are
      explicitly freed.  Theoretically, having unzappable root pages at the
      head of the list could prevent kvm_mmu_zap_all() from making forward
      progress were a future patch to add a loop restart after processing a
      page, e.g. to drop mmu_lock on contention.
      
      Although kvm_mmu_prepare_zap_page() can theoretically take action on
      invalid pages, e.g. to zap unsync children, functionally it's not
      necessary (root pages will be re-zapped when freed) and practically
      speaking the odds of e.g. @unsync or @unsync_children becoming %true
      while zapping all pages are basically nil.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8a674adc
    • S
      Revert "KVM: x86: use the fast way to invalidate all pages" · 7390de1e
      Sean Christopherson committed
      Revert to a slow kvm_mmu_zap_all() for kvm_arch_flush_shadow_all().
      Flushing all shadow entries is only done during VM teardown, i.e.
      kvm_arch_flush_shadow_all() is only called when the associated MM struct
      is being released or when the VM instance is being freed.
      
      Although the performance of teardown itself isn't critical, KVM should
      still voluntarily schedule to play nice with the rest of the kernel;
      but that can be done without the fast invalidate mechanism in a future
      patch.
      
      This reverts commit 6ca18b69.
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7390de1e
    • S
      Revert "KVM: MMU: show mmu_valid_gen in shadow page related tracepoints" · b59c4830
      Sean Christopherson committed
      ...as part of removing x86 KVM's fast invalidate mechanism, i.e. this
      is one part of a revert of all patches from the series that introduced the
      mechanism[1].
      
      This reverts commit 2248b023.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b59c4830
    • S
      Revert "KVM: MMU: add tracepoint for kvm_mmu_invalidate_all_pages" · 42560fb1
      Sean Christopherson committed
      ...as part of removing x86 KVM's fast invalidate mechanism, i.e. this
      is one part of a revert of all patches from the series that introduced the
      mechanism[1].
      
      This reverts commit 35006126.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      42560fb1
    • S
      Revert "KVM: MMU: zap pages in batch" · 43d2b14b
      Sean Christopherson committed
      Unwinding optimizations related to obsolete pages is a step towards
      removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
      a revert of all patches from the series that introduced the mechanism[1].
      
      This reverts commit e7d11c7a.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      43d2b14b
    • S
      Revert "KVM: MMU: collapse TLB flushes when zap all pages" · 210f4942
      Sean Christopherson committed
      Unwinding optimizations related to obsolete pages is a step towards
      removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
      a revert of all patches from the series that introduced the mechanism[1].
      
      This reverts commit f34d251d.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      210f4942
    • S
      Revert "KVM: MMU: reclaim the zapped-obsolete page first" · 52d5dedc
      Sean Christopherson committed
      Unwinding optimizations related to obsolete pages is a step towards
      removing x86 KVM's fast invalidate mechanism, i.e. this is one part of
      a revert of all patches from the series that introduced the mechanism[1].
      
      This reverts commit 365c8868.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      52d5dedc
    • S
      KVM: x86/mmu: Remove is_obsolete() call · 5ff05683
      Sean Christopherson committed
      Unwinding usage of is_obsolete() is a step towards removing x86's fast
      invalidate mechanism, i.e. this is one part of a revert of all patches from
      the series that introduced the mechanism[1].
      
      This is a partial revert of commit 05988d72 ("KVM: MMU: reduce
      KVM_REQ_MMU_RELOAD when root page is zapped").
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5ff05683
    • S
      KVM: x86/mmu: Voluntarily reschedule as needed when zapping MMIO sptes · 571c5af0
      Sean Christopherson committed
      Call cond_resched_lock() when zapping MMIO to reschedule if needed or to
      release and reacquire mmu_lock in case of contention.  There is no need
      to flush or zap when temporarily dropping mmu_lock as zapping MMIO sptes
      is done when holding the memslots lock and with the "update in-progress"
      bit set in the memslots generation, which disables MMIO spte caching.
      The walk does need to be restarted if mmu_lock is dropped as the active
      pages list may be modified.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      571c5af0
    • S
      Revert "KVM: MMU: drop kvm_mmu_zap_mmio_sptes" · 4771450c
      Sean Christopherson committed
      Revert back to a dedicated (and slower) mechanism for handling the
      scenario where all MMIO shadow PTEs need to be zapped due to overflowing
      the MMIO generation number.  The MMIO generation scenario is almost
      literally a one-in-a-million occurrence, i.e. is not a performance
      sensitive scenario.
      
      Restoring kvm_mmu_zap_mmio_sptes() leaves VM teardown as the only user
      of kvm_mmu_invalidate_zap_all_pages() and paves the way for removing
      the fast invalidate mechanism altogether.
      
      This reverts commit a8eca9dc.
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4771450c
    • S
      Revert "KVM: MMU: document fast invalidate all pages" · a592a3b8
      Sean Christopherson committed
      Remove x86 KVM's fast invalidate mechanism, i.e. revert all patches
      from the original series[1].
      
      Though not explicitly stated, for all intents and purposes the fast
      invalidate mechanism was added to speed up the scenario where removing
      a memslot, e.g. as part of reading PCI ROM, caused KVM to
      flush all shadow entries[1].  Now that the memslot case flushes only
      shadow entries belonging to the memslot, i.e. doesn't use the fast
      invalidate mechanism, the only remaining uses of the mechanism are
      when the VM is being destroyed and when the MMIO generation rolls
      over.
      
      When a VM is being destroyed, either there are no active vcpus, i.e.
      there's no lock contention, or the VM has ungracefully terminated, in
      which case we want to reclaim its pages as quickly as possible, i.e.
      not release the MMU lock if there are still CPUs executing in the VM.
      
      The MMIO generation scenario is almost literally a one-in-a-million
      occurrence, i.e. is not a performance sensitive scenario.
      
      Given that lock-breaking is not desirable (VM teardown) or irrelevant
      (MMIO generation overflow), remove the fast invalidate mechanism to
      simplify the code (a small amount) and to discourage future code from
      zapping all pages as using such a big hammer should be a last resort.
      
      This reverts commit f6f8adee.
      
      [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a592a3b8
    • S
      KVM: x86/mmu: Zap only the relevant pages when removing a memslot · 4e103134
      Sean Christopherson committed
      Modify kvm_mmu_invalidate_zap_pages_in_memslot(), a.k.a. the x86 MMU's
      handler for kvm_arch_flush_shadow_memslot(), to zap only the pages/PTEs
      that actually belong to the memslot being removed.  This improves
      performance, especially when the deleted memslot has only a few shadow
      entries, or even no entries.  E.g. a microbenchmark to access regular
      memory while concurrently reading PCI ROM to trigger memslot deletion
      showed a 5% improvement in throughput.
      
      Cc: Xiao Guangrong <guangrong.xiao@gmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4e103134
    • S
      KVM: x86/mmu: Split remote_flush+zap case out of kvm_mmu_flush_or_zap() · a2113634
      Sean Christopherson committed
      ...and into a separate helper, kvm_mmu_remote_flush_or_zap(), that does
      not require a vcpu so that the code can be (re)used by
      kvm_mmu_invalidate_zap_pages_in_memslot().
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a2113634
    • S
      KVM: x86/mmu: Move slot_level_*() helper functions up a few lines · 85875a13
      Sean Christopherson committed
      ...so that kvm_mmu_invalidate_zap_pages_in_memslot() can utilize the
      helpers in future patches.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      85875a13
    • S
      KVM: Move the memslot update in-progress flag to bit 63 · 164bf7e5
      Sean Christopherson committed
      ...now that KVM won't explode by moving it out of bit 0.  Using bit 63
      eliminates the need to jump over bit 0, e.g. when calculating a new
      memslots generation or when propagating the memslots generation to an
      MMIO spte.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      164bf7e5
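As a rough illustration of why bit 63 is convenient, the following toy Python model keeps the flag out of the counter's way. The helper names and the step of 1 per update are assumptions made for the sketch, not the kernel's actual code.

```python
U64_MASK = (1 << 64) - 1
UPDATE_IN_PROGRESS = 1 << 63  # flag position taken from the commit title


def begin_update(gen):
    """Mark the generation as in flux without touching the counter bits."""
    return (gen | UPDATE_IN_PROGRESS) & U64_MASK


def end_update(gen, step=1):
    """Clear the flag, then bump the counter (step is an assumed increment)."""
    return ((gen & ~UPDATE_IN_PROGRESS) + step) & U64_MASK


def in_progress(gen):
    return bool(gen & UPDATE_IN_PROGRESS)


gen = 0
during = begin_update(gen)
gen = end_update(during)
```

Because the flag sits above every bit the counter will realistically reach, incrementing the generation is plain addition with no need to skip over bit 0.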
    • S
      KVM: Remove the hack to trigger memslot generation wraparound · 0e32958e
      Sean Christopherson committed
      x86 captures a subset of the memslot generation (19 bits) in its MMIO
      sptes so that it can expedite emulated MMIO handling by checking only
      the relevant spte, i.e. doesn't need to do a full page fault walk.
      
      Because the MMIO sptes capture only 19 bits (due to limited space in
      the sptes), there is a non-zero probability that the MMIO generation
      could wrap, e.g. after 500k memslot updates.  Since normal usage is
      extremely unlikely to result in 500k memslot updates, a hack was added
      by commit 69c9ea93 ("KVM: MMU: init kvm generation close to mmio
      wrap-around value") to offset the MMIO generation in order to trigger
      a wraparound, e.g. after 150 memslot updates.
      
      When separate memslot generation sequences were assigned to each
      address space, commit 00f034a1 ("KVM: do not bias the generation
      number in kvm_current_mmio_generation") moved the offset logic into the
      initialization of the memslot generation itself so that the per-address
      space bit(s) were not dropped/corrupted by the MMIO shenanigans.
      
      Remove the offset hack for three reasons:
      
        - While it does exercise x86's kvm_mmu_invalidate_mmio_sptes(), simply
          wrapping the generation doesn't actually test the interesting case
          of having stale MMIO sptes with the new generation number, e.g. old
          sptes with a generation number of 0.
      
        - Triggering kvm_mmu_invalidate_mmio_sptes() prematurely makes its
          performance rather important since the probability of invalidating
          MMIO sptes jumps from "effectively never" to "fairly likely".  This
          limits what can be done in future patches, e.g. to simplify the
          invalidation code, as doing so without proper caution could lead to
          a noticeable performance regression.
      
        - Forcing the memslots generation, which is a 64-bit number, to wrap
          prevents KVM from assuming the memslots generation will never wrap.
          This in turn prevents KVM from using an arbitrary bit for the
          "update in-progress" flag, e.g. using bit 63 would immediately
          collide with using a large value as the starting generation number.
          The "update in-progress" flag is effectively forced into bit 0 so
          that it's (subtly) taken into account when incrementing the
          generation.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0e32958e
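The "500k" and "150" figures in the commit above follow from simple modular arithmetic. The sketch below is a worked example in Python, assuming (for illustration only) that the generation advances by 1 per memslot update and that the MMIO generation is the low 19 bits of the 64-bit value; `updates_until_wrap` is a hypothetical helper.

```python
U64_MASK = (1 << 64) - 1
MMIO_GEN_BITS = 19  # number of bits captured in MMIO sptes, per the commit message
MMIO_GEN_MASK = (1 << MMIO_GEN_BITS) - 1


def updates_until_wrap(start_gen):
    """Updates needed before the low 19 bits return to zero (assumes +1 per update)."""
    low = start_gen & MMIO_GEN_MASK
    return (1 << MMIO_GEN_BITS) - low


# Starting from zero, the 19-bit value wraps after 2**19 updates.
from_zero = updates_until_wrap(0)

# Starting 150 below zero (as a u64) forces a wrap after only 150 updates.
biased_start = (-150) & U64_MASK
from_biased = updates_until_wrap(biased_start)
```

Here 2**19 is 524288, which is the "500k" in the message, and the biased start reproduces the 150-update wrap the hack was designed to trigger.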
    • S
      KVM: x86: Refactor the MMIO SPTE generation handling · cae7ed3c
      Sean Christopherson committed
      The code to propagate the memslots generation number into MMIO sptes is
      a bit convoluted.  The "what" is relatively straightforward, e.g. the
      comment explaining which bits go where is quite readable, but the "how"
      requires a lot of staring to understand what is happening.  For example,
      'MMIO_GEN_LOW_SHIFT' is actually used to calculate the high bits of the
      spte, while 'MMIO_SPTE_GEN_LOW_SHIFT' is used to calculate the low bits.
      
      Refactor the code to:
      
        - use #defines whose values align with the bits defined in the comment
        - use consistent code for both the high and low mask
        - explicitly highlight the handling of bit 0 (update in-progress flag)
        - explicitly call out that the defines are for MMIO sptes (to avoid
          confusion with the per-vCPU MMIO cache, which uses the full memslots
          generation)
      
      In addition to making the code a little less magical, this paves the way
      for moving the update in-progress flag to bit 63 without having to
      simultaneously rewrite all of the MMIO spte code.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cae7ed3c
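To make the "consistent code for both the high and low mask" point concrete, here is a small Python sketch that splits a 19-bit generation across two spte bit ranges using one mask helper for both. The bit positions (3..11 and 52..61) and all names are assumptions chosen for illustration; the commit message does not state the actual layout.

```python
def field_mask(start, end):
    """Mask with bits start..end (inclusive) set."""
    return ((1 << (end - start + 1)) - 1) << start


# Assumed, illustrative bit ranges inside the 64-bit spte.
LOW_START, LOW_END = 3, 11
HIGH_START, HIGH_END = 52, 61
LOW_BITS = LOW_END - LOW_START + 1
LOW_MASK = field_mask(LOW_START, LOW_END)
HIGH_MASK = field_mask(HIGH_START, HIGH_END)


def gen_to_spte_bits(gen):
    """Scatter the generation into the low and high spte fields."""
    low = (gen << LOW_START) & LOW_MASK
    high = ((gen >> LOW_BITS) << HIGH_START) & HIGH_MASK
    return low | high


def spte_bits_to_gen(spte):
    """Gather the generation back out of the two spte fields."""
    low = (spte & LOW_MASK) >> LOW_START
    high = (spte & HIGH_MASK) >> HIGH_START
    return low | (high << LOW_BITS)
```

Deriving every shift and mask from the same start/end constants is what keeps the two directions symmetric; a generation bit beyond the 19 that fit is simply dropped.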
    • S
      KVM: x86: Use a u64 when passing the MMIO gen around · 5192f9b9
      Sean Christopherson committed
      KVM currently uses an 'unsigned int' for the MMIO generation number
      despite it being derived from the 64-bit memslots generation and
      being propagated to (potentially) 64-bit sptes.  There is no hidden
      agenda behind using an 'unsigned int', it's done simply because the
      MMIO generation will never set bits above bit 19.
      
      Passing a u64 will allow the "update in-progress" flag to be relocated
      from bit 0 to bit 63 and removes the need to cast the generation back
      to a u64 when propagating it to a spte.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      5192f9b9
    • S
      KVM: Explicitly define the "memslot update in-progress" bit · 361209e0
      Sean Christopherson committed
      KVM uses bit 0 of the memslots generation as an "update in-progress"
      flag, which is used by x86 to prevent caching MMIO access while the
      memslots are changing.  Although the intended behavior is flag-like,
      e.g. MMIO sptes intentionally drop the in-progress bit so as to avoid
      caching data from in-flux memslots, the implementation oftentimes treats
      the bit as part of the generation number itself, e.g. incrementing the
      generation increments twice, once to set the flag and once to clear it.
      
      Prior to commit 4bd518f1 ("KVM: use separate generations for
      each address space"), incorporating the "update in-progress" bit into
      the generation number largely made sense, e.g. "real" generations are
      even, "bogus" generations are odd, most code doesn't need to be aware of
      the bit, etc...
      
      Now that unique memslots generation numbers are assigned to each address
      space, stealthing the in-progress status into the generation number
      results in a wide variety of subtle code, e.g. kvm_create_vm() jumps
      over bit 0 when initializing the memslots generation without any hint as
      to why.
      
      Explicitly define the flag and convert as much code as possible (which
      isn't much) to actually treat it like a flag.  This paves the way for
      eventually using a different bit for "update in-progress" so that it can
      be a flag in truth instead of an awkward extension to the generation
      number.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      361209e0
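The even/odd convention described in the commit above can be modelled in a few lines of Python. This is a simplified sketch with hypothetical helper names, not the kernel's definitions; it only shows how a named flag replaces implicit "odd means in flux" reasoning.

```python
UPDATE_IN_PROGRESS = 1 << 0  # bit 0, as described in the commit message


def is_update_in_progress(gen):
    return bool(gen & UPDATE_IN_PROGRESS)


def mmio_spte_gen(gen):
    """Generation as an MMIO spte would record it: the in-progress bit is dropped."""
    return gen & ~UPDATE_IN_PROGRESS


stable = 4                             # a "real" generation is even
in_flux = stable | UPDATE_IN_PROGRESS  # 5: odd while the update is running
after = in_flux + 1                    # 6: second increment clears the flag
```

In this model a spte created during the update records 4, which matches neither the in-flux value 5 nor the final value 6, so it can never be reused.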
    • S
      KVM: x86/mmu: Do not cache MMIO accesses while memslots are in flux · ddfd1730
      Sean Christopherson committed
      When installing new memslots, KVM sets bit 0 of the generation number to
      indicate that an update is in-progress.  Until the update is complete,
      there are no guarantees as to whether a vCPU will see the old or the new
      memslots.  Explicitly prevent caching MMIO accesses so as to avoid using
      an access cached from the old memslots after the new memslots have been
      installed.
      
      Note that it is unclear whether or not disabling caching during the
      update window is strictly necessary as there is no definitive
      documentation as to what ordering guarantees KVM provides with respect
      to updating memslots.  That being said, the MMIO spte code does not
      allow reusing sptes created while an update is in-progress, and the
      associated documentation explicitly states:
      
          We do not want to use an MMIO sptes created with an odd generation
          number, ...  If KVM is unlucky and creates an MMIO spte while the
          low bit is 1, the next access to the spte will always be a cache miss.
      
      At the very least, disabling the per-vCPU MMIO cache during updates will
      make its behavior consistent with the MMIO spte behavior and
      documentation.
      
      Fixes: 56f17dd3 ("kvm: x86: fix stale mmio cache bug")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ddfd1730
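A toy Python model of the behavior the commit above describes: a per-vCPU cache that refuses to record an MMIO access while the update flag is set. The `VcpuMmioCache` class and its single-entry layout are assumptions for illustration only.

```python
UPDATE_IN_PROGRESS = 1 << 0  # bit 0 at the time of this commit


class VcpuMmioCache:
    """Toy per-vCPU MMIO cache holding at most one (gva, gfn, gen) entry."""

    def __init__(self):
        self.entry = None

    def cache(self, gva, gfn, gen):
        # Memslots are in flux: do not cache, so nothing stale can be reused.
        if gen & UPDATE_IN_PROGRESS:
            return False
        self.entry = (gva, gfn, gen)
        return True

    def lookup(self, gva, current_gen):
        # A hit requires both the address and the generation to match.
        if self.entry is None:
            return None
        cached_gva, gfn, gen = self.entry
        if cached_gva == gva and gen == current_gen:
            return gfn
        return None


cache = VcpuMmioCache()
cached_during_update = cache.cache(0x1000, 0x10, gen=5)  # odd: update running
cached_after_update = cache.cache(0x1000, 0x10, gen=6)   # even: stable
```

Skipping the insert, rather than inserting and relying on a later generation mismatch, is what brings the cache in line with the MMIO spte behavior quoted in the message.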
    • S
      KVM: x86/mmu: Detect MMIO generation wrap in any address space · e1359e2b
      Sean Christopherson committed
      The check to detect a wrap of the MMIO generation explicitly looks for a
      generation number of zero.  Now that unique memslots generation numbers
      are assigned to each address space, only address space 0 will get a
      generation number of exactly zero when wrapping.  E.g. when address
      space 1 goes from 0x7fffe to 0x80002, the MMIO generation number will
      wrap to 0x2.  Adjust the MMIO generation to strip the address space
      modifier prior to checking for a wrap.
      
      Fixes: 4bd518f1 ("KVM: use separate generations for each address space")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e1359e2b
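The 0x7fffe to 0x80002 example above can be checked with a short Python sketch. The constants are assumptions inferred from the numbers in the message (a 19-bit mask, bit 0 as the in-progress flag, two address spaces); the function names are hypothetical.

```python
MMIO_GEN_MASK = (1 << 19) - 1  # assumed from the 0x80002 -> 0x2 example
FLAG_BITS = 1                  # bit 0 is the update in-progress flag
ADDRESS_SPACE_NUM = 2          # assumed number of address spaces


def wrapped_old(gen):
    """Original check: only an MMIO generation of exactly zero counts as a wrap."""
    return (gen & MMIO_GEN_MASK) == 0


def wrapped_new(gen):
    """Strip the flag and the address space modifier before comparing to zero."""
    mmio_gen = (gen & MMIO_GEN_MASK) >> FLAG_BITS
    mmio_gen &= ~(ADDRESS_SPACE_NUM - 1)
    return mmio_gen == 0


space0_after_wrap = 0x80000  # address space 0 lands on exactly zero
space1_after_wrap = 0x80002  # address space 1 lands on 0x2
```

With the modifier stripped, both address spaces report the wrap, while a generation just before or just after the wrap still does not.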
    • S
      KVM: Call kvm_arch_memslots_updated() before updating memslots · 15248258
      Sean Christopherson committed
      kvm_arch_memslots_updated() is at this point in time an x86-specific
      hook for handling MMIO generation wraparound.  x86 stashes 19 bits of
      the memslots generation number in its MMIO sptes in order to avoid
      full page fault walks for repeat faults on emulated MMIO addresses.
      Because only 19 bits are used, wrapping the MMIO generation number is
      possible, if unlikely.  kvm_arch_memslots_updated() alerts x86 that
      the generation has changed so that it can invalidate all MMIO sptes in
      case the effective MMIO generation has wrapped so as to avoid using a
      stale spte, e.g. a (very) old spte that was created with generation==0.
      
      Given that the purpose of kvm_arch_memslots_updated() is to prevent
      consuming stale entries, it needs to be called before the new generation
      is propagated to memslots.  Invalidating the MMIO sptes after updating
      memslots means that there is a window where a vCPU could dereference
      the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO
      spte that was created with (pre-wrap) generation==0.
      
      Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      15248258
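The ordering problem in the commit above can be reproduced in a single-threaded toy model. In the Python sketch below, `ToyVm`, its methods, and the `observe` callback (standing in for a vCPU that runs right after the new generation becomes visible) are all illustrative assumptions, not kernel interfaces.

```python
MMIO_GEN_MASK = (1 << 19) - 1  # 19 bits, per the commit message


class ToyVm:
    def __init__(self):
        self.generation = 0
        self.mmio_sptes = {}  # gfn -> MMIO generation captured at creation

    def create_mmio_spte(self, gfn):
        self.mmio_sptes[gfn] = self.generation & MMIO_GEN_MASK

    def spte_hit(self, gfn):
        return self.mmio_sptes.get(gfn) == (self.generation & MMIO_GEN_MASK)

    def arch_memslots_updated(self, new_gen):
        # On a wrap of the MMIO generation, drop every MMIO spte.
        if (new_gen & MMIO_GEN_MASK) == 0:
            self.mmio_sptes.clear()

    def install(self, new_gen, invalidate_first, observe):
        if invalidate_first:
            self.arch_memslots_updated(new_gen)
            self.generation = new_gen
            return observe()
        self.generation = new_gen
        seen = observe()  # window: new generation visible, stale sptes still present
        self.arch_memslots_updated(new_gen)
        return seen


def stale_hit(invalidate_first):
    vm = ToyVm()
    vm.create_mmio_spte(0x42)      # created with generation == 0
    vm.generation = MMIO_GEN_MASK  # much later, one update before the wrap
    return vm.install(MMIO_GEN_MASK + 1, invalidate_first,
                      lambda: vm.spte_hit(0x42))
```

Calling `stale_hit(False)` models the old order and reports a hit on the pre-wrap spte, while `stale_hit(True)` models invalidating first and reports a miss.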