1. 21 September 2018 (1 commit)
  2. 20 September 2018 (6 commits)
    • KVM: x86: Control guest reads of MSR_PLATFORM_INFO · 6fbbde9a
      Authored by Drew Schmitt
      Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
      to reads of MSR_PLATFORM_INFO.
      
      Disabling read access to this MSR gives userspace explicit control over
      whether this platform-dependent information is "exposed" to guests. As it
      exists today, guests that read this MSR get unpopulated information unless
      userspace has already set it (and prior to this patch series, only the
      CPUID faulting bit could have been populated). The existing interface can
      be confusing if guests don't handle potentially incorrect or incomplete
      information gracefully (e.g. a base frequency reported as zero); a usage
      sketch follows this entry.
      Signed-off-by: Drew Schmitt <dasch@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6fbbde9a
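      A minimal userspace sketch of using this capability is shown below. It
      assumes (based on the behavior described above, not verified against a
      specific kernel) that the capability is toggled per-VM via KVM_ENABLE_CAP
      and that args[0] = 0 makes guest reads of the MSR raise #GP:

        #include <linux/kvm.h>
        #include <sys/ioctl.h>

        /* Sketch: turn off guest reads of MSR_PLATFORM_INFO for one VM. */
        static int disable_platform_info_reads(int vm_fd)
        {
                struct kvm_enable_cap cap = {
                        .cap  = KVM_CAP_MSR_PLATFORM_INFO,
                        .args = { 0 },  /* 0: guest reads of the MSR #GP */
                };

                return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
        }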
    • KVM: x86: Turbo bits in MSR_PLATFORM_INFO · d84f1cff
      Authored by Drew Schmitt
      Allow userspace to set the turbo bits in MSR_PLATFORM_INFO. Previously,
      only the CPUID faulting bit was settable; now any bit in
      MSR_PLATFORM_INFO is settable. This can be used, for example, to
      convey frequency information about the platform on which the guest is
      running (see the sketch after this entry).
      Signed-off-by: Drew Schmitt <dasch@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d84f1cff
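      A hedged userspace sketch of a host-initiated write follows; the value
      written is a placeholder (consult the SDM for the real MSR_PLATFORM_INFO
      bit layout), only the ioctl plumbing is the point:

        #include <linux/kvm.h>
        #include <string.h>
        #include <sys/ioctl.h>

        #define MSR_PLATFORM_INFO 0xce

        /* Sketch: host-initiated write of MSR_PLATFORM_INFO for one vCPU. */
        static int set_platform_info(int vcpu_fd, __u64 value)
        {
                struct {
                        struct kvm_msrs hdr;
                        struct kvm_msr_entry entry;
                } msrs;

                memset(&msrs, 0, sizeof(msrs));
                msrs.hdr.nmsrs = 1;
                msrs.entry.index = MSR_PLATFORM_INFO;
                msrs.entry.data = value;        /* e.g. ratio/turbo bits */

                return ioctl(vcpu_fd, KVM_SET_MSRS, &msrs.hdr);
        }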
    • KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv · e6c67d8c
      Authored by Liran Alon
      If L1 does not intercept L2 HLT, or enters L2 in the HLT activity state,
      it is possible for a vCPU to be blocked while it is in guest mode.

      According to Intel SDM 26.6.5 Interrupt-Window Exiting and
      Virtual-Interrupt Delivery: "These events wake the logical processor
      if it just entered the HLT state because of a VM entry".
      Therefore, if L1 enters L2 in the HLT activity state and L2 has a pending
      deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU
      should be woken from the HLT state and injected with the interrupt.

      In addition, if the vCPU receives a nested posted interrupt while it is
      blocked (in guest mode), it should also be woken and injected with the
      posted interrupt.
      
      To handle these cases, this patch enhances kvm_vcpu_has_events() to also
      check if there is a pending interrupt in L2 virtual APICv provided by
      L1. That is, it evaluates if there is a pending virtual interrupt for L2
      by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1
      Evaluation of Pending Interrupts.
      
      Note that this also handles the case of nested posted-interrupt by the
      fact RVI is updated in vmx_complete_nested_posted_interrupt() which is
      called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() ->
      kvm_vcpu_running() -> vmx_check_nested_events() ->
      vmx_complete_nested_posted_interrupt().
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e6c67d8c
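      A short sketch of the evaluation rule cited above (Intel SDM 29.2.1): a
      virtual interrupt is pending and deliverable when the upper nibble of RVI
      is greater than the upper nibble of VPPR. The helper below is illustrative
      and not the exact KVM function added by this patch:

        /* Sketch: "recognize" a pending virtual interrupt for L2. */
        static bool apicv_has_pending_interrupt(u8 rvi, u8 vppr)
        {
                return (rvi >> 4) > (vppr >> 4);
        }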
    • kvm: x86: make kvm_{load|put}_guest_fpu() static · 822f312d
      Authored by Sebastian Andrzej Siewior
      The functions
      	kvm_load_guest_fpu()
      	kvm_put_guest_fpu()
      
      are only used locally; make them static. This also requires moving both
      functions, because they are used before their definition.
      Those functions were exported (via EXPORT_SYMBOL) before commit
      e5bb4025 ("KVM: Drop kvm_{load,put}_guest_fpu() exports").
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      822f312d
    • KVM: VMX: use preemption timer to force immediate VMExit · d264ee0c
      Authored by Sean Christopherson
      A VMX preemption timer value of '0' is guaranteed to cause a VMExit
      prior to the CPU executing any instructions in the guest.  Use the
      preemption timer (if it's supported) to trigger immediate VMExit
      in place of the current method of sending a self-IPI.  This ensures
      that pending VMExit injection to L1 occurs prior to executing any
      instructions in the guest (regardless of nesting level).
      
      When deferring VMExit injection, KVM generates an immediate VMExit
      from the (possibly nested) guest by sending itself an IPI.  Because
      hardware interrupts are blocked prior to VMEnter and are unblocked
      (in hardware) after VMEnter, this results in taking a VMExit(INTR)
      before any guest instruction is executed.  But, as this approach
      relies on the IPI being received before VMEnter executes, it only
      works as intended when KVM is running as L0.  Because there are no
      architectural guarantees regarding when IPIs are delivered, when
      running nested the INTR may "arrive" long after L2 is running e.g.
      L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
      
      For the most part, this unintended delay is not an issue since the
      events being injected to L1 also do not have architectural guarantees
      regarding their timing.  The notable exception is the VMX preemption
      timer[1], which is architecturally guaranteed to cause a VMExit prior
      to executing any instructions in the guest if the timer value is '0'
      at VMEnter.  Specifically, the delay in injecting the VMExit causes
      the preemption timer KVM unit test to fail when run in a nested guest.
      
      Note: this approach is viable even on CPUs with a broken preemption
      timer, as "broken" in this context only means the timer counts at the
      wrong rate.  There are no known errata affecting a timer value of '0'.
      
      [1] I/O SMIs also have guarantees on when they arrive, but I have
          no idea if/how those are emulated in KVM.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Use a hook for SVM instead of leaving the default in x86.c - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d264ee0c
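      A minimal sketch of the mechanism described above; the hook name and the
      surrounding setup are illustrative, the point is simply that a preemption
      timer value of '0' forces a VMExit before any guest instruction runs:

        /* Sketch: request an immediate VMExit via the preemption timer. */
        static void request_immediate_exit(void)
        {
                /* Assumes "activate VMX-preemption timer" is already set in
                 * the pin-based VM-execution controls. */
                vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
        }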
    • kvm: mmu: Don't read PDPTEs when paging is not enabled · d35b34a9
      Authored by Junaid Shahid
      KVM should not attempt to read guest PDPTEs when CR0.PG = 0 and
      CR4.PAE = 1 (see the sketch after this entry).
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d35b34a9
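      A hedged sketch of the condition implied above: PDPTEs only need to be
      loaded when PAE paging is actually active, i.e. CR0.PG = 1 and
      CR4.PAE = 1, outside long mode. Helper names below are illustrative,
      not the exact KVM ones:

        /* Sketch: should KVM read the guest PDPTEs at all? */
        static bool guest_pdptrs_needed(struct kvm_vcpu *vcpu)
        {
                return is_paging(vcpu) && is_pae(vcpu) && !is_long_mode(vcpu);
        }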
  3. 30 August 2018 (5 commits)
    • KVM: x86: Unexport x86_emulate_instruction() · c60658d1
      Authored by Sean Christopherson
      Allowing x86_emulate_instruction() to be called directly has led to
      subtle bugs being introduced, e.g. not setting EMULTYPE_NO_REEXECUTE
      in the emulation type.  While most of the blame lies on re-execute
      being opt-out, exporting x86_emulate_instruction() also exposes its
      cr2 parameter, which may have contributed to commit d391f120
      ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO
      when running nested") using x86_emulate_instruction() instead of
      emulate_instruction() because "hey, I have a cr2!", which in turn
      introduced its EMULTYPE_NO_REEXECUTE bug.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      c60658d1
    • KVM: x86: Rename emulate_instruction() to kvm_emulate_instruction() · 0ce97a2b
      Authored by Sean Christopherson
      Lack of the kvm_ prefix gives the impression that it's a VMX or SVM
      specific function, and there's no conflict that prevents adding the
      kvm_ prefix.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      0ce97a2b
    • KVM: x86: Do not re-{try,execute} after failed emulation in L2 · 6c3dfeb6
      Authored by Sean Christopherson
      Commit a6f177ef ("KVM: Reenter guest after emulation failure if
      due to access to non-mmio address") added reexecute_instruction() to
      handle the scenario where two (or more) vCPUS race to write a shadowed
      page, i.e. reexecute_instruction() is intended to return true if and
      only if the instruction being emulated was accessing a shadowed page.
      As L0 is only explicitly shadowing L1 tables, an emulation failure of
      a nested VM instruction cannot be due to a race to write a shadowed
      page and so should never be re-executed.
      
      This fixes an issue where an "MMIO" emulation failure[1] in L2 is all
      but guaranteed to result in an infinite loop when TDP is enabled.
      Because "cr2" is actually an L2 GPA when TDP is enabled, calling
      kvm_mmu_gva_to_gpa_write() to translate cr2 in the non-direct mapped
      case (L2 is never direct mapped) will almost always yield UNMAPPED_GVA
      and cause reexecute_instruction() to immediately return true.  The
      !mmio_info_in_cache() check in kvm_mmu_page_fault() doesn't catch this
      case because mmio_info_in_cache() returns false for a nested MMU (the
      MMIO caching currently handles L1 only, e.g. to cache nested guests'
      GPAs we'd have to manually flush the cache when switching between
      VMs and when L1 updated its page tables controlling the nested guest).
      
      Way back when, commit 68be0803 ("KVM: x86: never re-execute
      instruction with enabled tdp") changed reexecute_instruction() to
      always return false when using TDP under the assumption that KVM would
      only get into the emulator for MMIO.  Commit 95b3cf69 ("KVM: x86:
      let reexecute_instruction work for tdp") effectively reverted that
      behavior in order to handle the scenario where emulation failed due to
      an access from L1 to the shadow page tables for L2, but it didn't
      account for the case where emulation failed in L2 with TDP enabled.
      
      All of the above logic also applies to retry_instruction(), added by
      commit 1cb3f3ae ("KVM: x86: retry non-page-table writing
      instructions").  An indefinite loop in retry_instruction() should be
      impossible as it protects against retrying the same instruction over
      and over, but it's still correct to not retry an L2 instruction in
      the first place.
      
      Fix the immediate issue by adding a check for a nested guest when
      determining whether or not to allow retry in kvm_mmu_page_fault().
      In addition to fixing the immediate bug, add WARN_ON_ONCE in the
      retry functions since they are not designed to handle nested cases,
      i.e. they need to be modified even if there is some scenario in the
      future where we want to allow retrying a nested guest.
      
      [1] This issue was encountered after commit 3a2936de ("kvm: mmu:
          Don't expose private memslots to L2") changed the page fault path
          to return KVM_PFN_NOSLOT when translating an L2 access to a
          private memslot.  Returning KVM_PFN_NOSLOT is semantically correct
          when we want to hide a memslot from L2, i.e. there effectively is
          no defined memory region for L2, but it has the unfortunate side
          effect of making KVM think the GFN is an MMIO page, thus triggering
          emulation.  The failure occurred with in-development code that
          deliberately exposed a private memslot to L2, which L2 accessed
          with an instruction that is not emulated by KVM.
      
      Fixes: 95b3cf69 ("KVM: x86: let reexecute_instruction work for tdp")
      Fixes: 1cb3f3ae ("KVM: x86: retry non-page-table writing instructions")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      6c3dfeb6
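      A hedged sketch of the guard described above; the real change lives in
      kvm_mmu_page_fault(), this only illustrates the condition that disables
      retry/re-execute for L2:

        /* Sketch: never allow retry/re-execute for faults taken in L2. */
        static int sketch_emulation_type(struct kvm_vcpu *vcpu)
        {
                int emulation_type = EMULTYPE_ALLOW_RETRY;

                /* L0 only shadows L1 page tables, so an emulation failure
                 * while running a nested (L2) guest cannot be a shadow-page
                 * race and must not be retried or re-executed. */
                if (is_guest_mode(vcpu))
                        emulation_type = 0;

                return emulation_type;
        }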
    • KVM: x86: Merge EMULTYPE_RETRY and EMULTYPE_ALLOW_REEXECUTE · 384bf221
      Authored by Sean Christopherson
      retry_instruction() and reexecute_instruction() are a package deal,
      i.e. there is no scenario where one is allowed and the other is not.
      Merge their controlling emulation type flags to enforce this in code.
      Name the combined flag EMULTYPE_ALLOW_RETRY to make it abundantly
      clear that we are allowing re{try,execute} to occur, as opposed to
      explicitly requesting retry of a previously failed instruction.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      384bf221
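      For reference, a sketch of the resulting flag set; the actual bit
      positions live in arch/x86/include/asm/kvm_host.h and may differ:

        #define EMULTYPE_NO_DECODE      (1 << 0)
        #define EMULTYPE_TRAP_UD        (1 << 1)
        #define EMULTYPE_SKIP           (1 << 2)
        #define EMULTYPE_ALLOW_RETRY    (1 << 3)  /* retry + re-execute */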
    • KVM: x86: Invert emulation re-execute behavior to make it opt-in · 8065dbd1
      Authored by Sean Christopherson
      Re-execution of an instruction after emulation decode failure is
      intended to be used only when emulating shadow page accesses.  Invert
      the flag to make allowing re-execution opt-in since that behavior is
      by far in the minority.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      8065dbd1
  4. 23 August 2018 (1 commit)
    • mm, oom: distinguish blockable mode for mmu notifiers · 93065ac7
      Authored by Michal Hocko
      There are several blockable mmu notifiers which might sleep in
      mmu_notifier_invalidate_range_start, and that is a problem for the
      oom_reaper: it needs to guarantee forward progress, so it cannot
      depend on any sleepable locks.
      
      Currently we simply back off and mark an oom victim with blockable mmu
      notifiers as done after a short sleep.  That can result in selecting a new
      oom victim prematurely because the previous one still hasn't torn its
      memory down yet.
      
      We can do much better though.  Even if mmu notifiers use sleepable locks,
      there is no reason to automatically assume those locks are held.  Moreover,
      the majority of notifiers only care about a portion of the address space,
      and there is absolutely zero reason to fail when we are unmapping an
      unrelated range.  Some notifiers really do block and wait for HW, which is
      harder to handle, and in those cases we have to bail out.

      This patch handles the low-hanging fruit.
      __mmu_notifier_invalidate_range_start gets a blockable flag, and callbacks
      are not allowed to sleep if the flag is set to false.  This is achieved by
      using trylock instead of the sleepable lock for most callbacks and
      continuing as long as we do not block down the call chain (see the sketch
      after this entry).
      
      I think we can improve that even further because there is a common pattern
      to do a range lookup first and then do something about that.  The first
      part can be done without a sleeping lock in most cases AFAICS.
      
      The oom_reaper end then simply retries if there is at least one notifier
      which couldn't make any progress in !blockable mode.  A retry loop is
      already implemented to wait for the mmap_sem and this is basically the
      same thing.
      
      The simplest way for driver developers to test this code path is to wrap
      userspace code which uses these notifiers into a memcg and set the hard
      limit to hit the oom.  This can be done e.g. after the test faults in all
      the mmu-notifier-managed memory and then sets the hard limit to something
      really small.  The expected outcome is a proper process teardown.
      
      [akpm@linux-foundation.org: coding style fixes]
      [akpm@linux-foundation.org: minor code simplification]
      Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
      Reported-by: David Rientjes <rientjes@google.com>
      Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Cc: Sudeep Dutt <sudeep.dutt@intel.com>
      Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93065ac7
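      A hedged sketch of the callback pattern described above; the lock and the
      surrounding structure are illustrative, not any particular driver's:

        /* Sketch: an invalidate_range_start() that honors !blockable. */
        static int demo_invalidate_range_start(struct mmu_notifier *mn,
                                               struct mm_struct *mm,
                                               unsigned long start,
                                               unsigned long end,
                                               bool blockable)
        {
                struct demo_ctx *ctx = container_of(mn, struct demo_ctx, mn);

                if (blockable)
                        mutex_lock(&ctx->lock);
                else if (!mutex_trylock(&ctx->lock))
                        return -EAGAIN; /* let the oom_reaper retry later */

                /* ... tear down secondary mappings for [start, end) ... */

                mutex_unlock(&ctx->lock);
                return 0;
        }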
  5. 22 August 2018 (1 commit)
    • x86: kvm: avoid unused variable warning · 7288bde1
      Authored by Arnd Bergmann
      Removing one of the two accesses of the maxphyaddr variable led to
      a harmless warning:
      
      arch/x86/kvm/x86.c: In function 'kvm_set_mmio_spte_mask':
      arch/x86/kvm/x86.c:6563:6: error: unused variable 'maxphyaddr' [-Werror=unused-variable]
      
      Removing the #ifdef seems to be the nicest workaround, as it
      makes the code look cleaner than adding another #ifdef.
      
      Fixes: 28a1f3ac ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: stable@vger.kernel.org # L1TF
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7288bde1
  6. 15 August 2018 (1 commit)
  7. 06 August 2018 (10 commits)
    • KVM: X86: Implement "send IPI" hypercall · 4180bf1b
      Authored by Wanpeng Li
      Use a hypercall to send IPIs with a single vmexit, instead of one vmexit
      per destination in xAPIC/x2APIC physical mode and one vmexit per cluster
      in x2APIC cluster mode. Intel guests can enter x2APIC cluster mode when
      interrupt remapping is enabled in QEMU; the latest AMD EPYC, however,
      still supports only xAPIC mode, which benefits greatly from exit-less
      IPIs. This patchset lets a guest send multicast IPIs, with at most 128
      destinations per hypercall in 64-bit mode and 64 vCPUs per hypercall in
      32-bit mode (see the sketch after this entry).
      
      Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads; the VM
      has 80 vCPUs. IPI microbenchmark (https://lkml.org/lkml/2017/12/19/141):
      
      x2apic cluster mode, vanilla
      
       Dry-run:                         0,            2392199 ns
       Self-IPI:                  6907514,           15027589 ns
       Normal IPI:              223910476,          251301666 ns
       Broadcast IPI:                   0,         9282161150 ns
       Broadcast lock:                  0,         8812934104 ns
      
      x2apic cluster mode, pv-ipi
      
       Dry-run:                         0,            2449341 ns
       Self-IPI:                  6720360,           15028732 ns
       Normal IPI:              228643307,          255708477 ns
       Broadcast IPI:                   0,         7572293590 ns  => 22% performance boost
       Broadcast lock:                  0,         8316124651 ns
      
      x2apic physical mode, vanilla
      
       Dry-run:                         0,            3135933 ns
       Self-IPI:                  8572670,           17901757 ns
       Normal IPI:              226444334,          255421709 ns
       Broadcast IPI:                   0,        19845070887 ns
       Broadcast lock:                  0,        19827383656 ns
      
      x2apic physical mode, pv-ipi
      
       Dry-run:                         0,            2446381 ns
       Self-IPI:                  6788217,           15021056 ns
       Normal IPI:              219454441,          249583458 ns
       Broadcast IPI:                   0,         7806540019 ns  => 154% performance boost
       Broadcast lock:                  0,         9143618799 ns
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4180bf1b
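      A hedged guest-side sketch of the hypercall described above; the argument
      order and the exact meaning of the last operand should be checked against
      Documentation/virtual/kvm/hypercalls.txt, this only shows the shape of an
      exit-less multicast IPI:

        /* Sketch: one hypercall covers up to 128 APIC IDs in 64-bit mode. */
        static void demo_send_ipi(unsigned long bitmap_low,
                                  unsigned long bitmap_high,
                                  unsigned long min_apic_id,
                                  unsigned long vector)
        {
                kvm_hypercall4(KVM_HC_SEND_IPI, bitmap_low, bitmap_high,
                               min_apic_id, vector);
        }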
    • KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs() · 74fec5b9
      Authored by Tianyu Lan
      The X86_CR4_OSXSAVE check belongs with the other sregs checks, so move it
      into kvm_valid_sregs().
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      74fec5b9
    • kvm: x86: Remove CR3_PCID_INVD flag · 208320ba
      Authored by Junaid Shahid
      It is a duplicate of X86_CR3_PCID_NOFLUSH. So just use that instead.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      208320ba
    • kvm: x86: Skip shadow page resync on CR3 switch when indicated by guest · 956bf353
      Authored by Junaid Shahid
      When the guest indicates that the TLB doesn't need to be flushed in a
      CR3 switch, we can also skip resyncing the shadow page tables since an
      out-of-sync shadow page table is equivalent to an out-of-sync TLB.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      956bf353
    • kvm: x86: Skip TLB flush on fast CR3 switch when indicated by guest · ade61e28
      Authored by Junaid Shahid
      When PCIDs are enabled, the MSb of the source operand for a MOV-to-CR3
      instruction indicates that the TLB doesn't need to be flushed.
      
      This change enables this optimization for MOV-to-CR3s in the guest
      that have been intercepted by KVM for shadow paging and are handled
      within the fast CR3 switch path.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ade61e28
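      A short sketch of the hint described above: with PCIDs enabled, bit 63 of
      the new CR3 value (X86_CR3_PCID_NOFLUSH) tells the processor, and now the
      fast CR3 switch path, that no TLB flush is needed:

        /* Sketch: does this guest CR3 write ask to skip the TLB flush? */
        static bool cr3_skips_tlb_flush(unsigned long new_cr3, bool pcid_enabled)
        {
                return pcid_enabled && (new_cr3 & X86_CR3_PCID_NOFLUSH);
        }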
    • kvm: x86: Introduce KVM_REQ_LOAD_CR3 · 6e42782f
      Authored by Junaid Shahid
      The KVM_REQ_LOAD_CR3 request loads the hardware CR3 using the
      current root_hpa.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6e42782f
    • kvm: x86: Add fast CR3 switch code path · 7c390d35
      Authored by Junaid Shahid
      When using shadow paging, a CR3 switch in the guest results in a VM Exit.
      In the common case, that VM exit doesn't require much processing by KVM.
      However, it does acquire the MMU lock, which can start showing signs of
      contention under some workloads even on a 2 VCPU VM when the guest is
      using KPTI. Therefore, we add a fast path that avoids acquiring the MMU
      lock in the most common cases e.g. when switching back and forth between
      the kernel and user mode CR3s used by KPTI with no guest page table
      changes in between.
      
      For now, this fast path is implemented only for 64-bit guests and hosts
      to avoid the handling of PDPTEs, but it can be extended later to 32-bit
      guests and/or hosts as well.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7c390d35
    • kvm: nVMX: Introduce KVM_CAP_NESTED_STATE · 8fcc4b59
      Authored by Jim Mattson
      For nested virtualization, L0 KVM manages a bit of state for L2 guests
      that cannot be captured through the currently available IOCTLs. In fact,
      the state captured through all of these IOCTLs is usually a mix of L1
      and L2 state. It is also dependent on whether the L2 guest was running at
      the moment the process was interrupted to save its state.

      With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
      and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
      that is in VMX operation (see the sketch after this entry).
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Jim Mattson <jmattson@google.com>
      [karahmed@ - rename structs and functions and make them ready for AMD and
                   address previous comments.
                 - handle nested.smm state.
                 - rebase & a bit of refactoring.
                 - Merge 7/8 and 8/8 into one patch. ]
      Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8fcc4b59
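      A hedged userspace sketch of the new ioctls; the exact layout of
      struct kvm_nested_state has evolved over time, so treat the field usage
      below as illustrative and consult the KVM API documentation:

        #include <linux/kvm.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>

        /* Sketch: snapshot nested (VMX) state for a later KVM_SET_NESTED_STATE. */
        static struct kvm_nested_state *save_nested_state(int kvm_fd, int vcpu_fd)
        {
                int size = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_NESTED_STATE);
                struct kvm_nested_state *state;

                if (size <= 0)
                        return NULL;            /* nested state not supported */

                state = calloc(1, size);
                if (!state)
                        return NULL;
                state->size = size;             /* room available for KVM to fill */
                if (ioctl(vcpu_fd, KVM_GET_NESTED_STATE, state) < 0) {
                        free(state);
                        return NULL;
                }
                return state;
        }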
    • KVM: x86: do not load vmcs12 pages while still in SMM · 7f7f1ba3
      Authored by Paolo Bonzini
      If the vCPU enters system management mode while running a nested guest,
      RSM starts processing the vmentry while still in SMM.  In that case,
      however, the pages pointed to by the vmcs12 might be incorrectly
      loaded from SMRAM.  To avoid this, delay the handling of the pages
      until just before the next vmentry.  This is done with a new request
      and a new entry in kvm_x86_ops, which we will be able to reuse for
      nested VMX state migration.
      
      Extracted from a patch by Jim Mattson and KarimAllah Ahmed.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7f7f1ba3
    • KVM: x86: ensure all MSRs can always be KVM_GET/SET_MSR'd · 44883f01
      Authored by Paolo Bonzini
      Some of the MSRs returned by GET_MSR_INDEX_LIST currently cannot be sent back
      to KVM_GET_MSR and/or KVM_SET_MSR; either they can never be sent back, or
      they are only accepted under special conditions.  This makes the API a pain
      to use.
      
      To avoid this pain, this patch makes it so that the result of the get-list
      ioctl can always be used for host-initiated get and set.  Since we don't have
      a separate way to check for read-only MSRs, this means some Hyper-V MSRs are
      ignored when written.  Arguably they should not even be in the result of
      GET_MSR_INDEX_LIST, but I am leaving them there in case userspace is using
      the outcome of GET_MSR_INDEX_LIST to derive the support for the
      corresponding Hyper-V feature.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      44883f01
  8. 05 August 2018 (1 commit)
  9. 15 July 2018 (1 commit)
  10. 05 July 2018 (1 commit)
    • x86/KVM/VMX: Add L1D flush logic · c595ceee
      Authored by Paolo Bonzini
      Add the logic for flushing L1D on VMENTER. The flush depends on the static
      key being enabled and the new l1tf_flush_l1d flag being set (a sketch of
      the resulting check follows this entry).

      The flag is set:
       - Always, if the flush module parameter is 'always'
      
       - Conditionally at:
         - Entry to vcpu_run(), i.e. after executing user space
      
         - From the sched_in notifier, i.e. when switching to a vCPU thread.
      
         - From vmexit handlers which are considered unsafe, i.e. where
           sensitive data can be brought into L1D:
      
           - The emulator, which could be a good target for other speculative
             execution-based threats,
      
           - The MMU, which can bring host page tables into the L1D cache.

           - External interrupts

           - Nested operations that require the MMU (see above). That is
             vmptrld, vmptrst, vmclear, vmwrite, vmread.

           - When handling invept, invvpid
      
      [ tglx: Split out from combo patch and reduced to a single flag ]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      c595ceee
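      A hedged sketch of the resulting VMENTER-time decision; the symbol names
      come from the broader L1TF series and are only illustrative here:

        /* Sketch: decide whether to flush L1D just before VMENTER. */
        static void maybe_flush_l1d(struct kvm_vcpu *vcpu)
        {
                if (!static_branch_unlikely(&vmx_l1d_should_flush))
                        return;

                if (vmentry_l1d_flush != VMENTER_L1D_FLUSH_ALWAYS &&
                    !vcpu->arch.l1tf_flush_l1d)
                        return;

                vcpu->arch.l1tf_flush_l1d = false;
                wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);  /* or the SW fallback */
        }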
  11. 14 June 2018 (1 commit)
  12. 13 June 2018 (1 commit)
    • treewide: kvzalloc() -> kvcalloc() · 778e1cdd
      Authored by Kees Cook
      The kvzalloc() function has a 2-factor argument form, kvcalloc(). This
      patch replaces cases of:
      
              kvzalloc(a * b, gfp)
      
      with:
              kvcalloc(a, b, gfp)
      
      as well as handling cases of:
      
              kvzalloc(a * b * c, gfp)
      
      with:
      
              kvzalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kvcalloc(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kvzalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kvzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kvzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kvzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kvzalloc
      + kvcalloc
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kvzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kvzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kvzalloc(C1 * C2 * C3, ...)
      |
        kvzalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kvzalloc(sizeof(THING) * C2, ...)
      |
        kvzalloc(sizeof(TYPE) * C2, ...)
      |
        kvzalloc(C1 * C2 * C3, ...)
      |
        kvzalloc(C1 * C2, ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
      778e1cdd
  13. 12 June 2018 (3 commits)
  14. 04 June 2018 (1 commit)
    • KVM: VMX: Optimize tscdeadline timer latency · c5ce8235
      Authored by Wanpeng Li
      Commit d0659d94 ("KVM: x86: add option to advance tscdeadline
      hrtimer expiration") advances the tscdeadline expiration (when the timer
      is emulated by an hrtimer) so that the latency incurred by the hypervisor
      (apic_timer_fn -> vmentry) can be avoided. This patch adds the same
      advance-expiration support for the case where the tscdeadline timer is
      emulated by the VMX preemption timer, reducing the hypervisor latency
      (handle_preemption_timer -> vmentry). The guest can also set an expiration
      that is very small (for example, in Linux, if an hrtimer is fed an
      expiration in the past); in that case we set delta_tsc to 0, leading to an
      immediate vmexit when delta_tsc is not bigger than the advance time (see
      the sketch after this entry).
      
      This patch reduces latency by ~63% (from ~4450 to ~1660 cycles on a
      Haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
      busy waits.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c5ce8235
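      A small sketch of the clamping described above; names are illustrative
      rather than the exact ones used in the VMX timer code:

        /* Sketch: expire immediately if the (advanced) deadline has passed. */
        static u64 preemption_timer_delta(u64 deadline_tsc, u64 now_tsc,
                                          u64 advance_cycles)
        {
                if (deadline_tsc > now_tsc + advance_cycles)
                        return deadline_tsc - now_tsc - advance_cycles;
                return 0;       /* delta_tsc = 0 => immediate vmexit */
        }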
  15. 02 June 2018 (1 commit)
  16. 26 May 2018 (2 commits)
  17. 24 May 2018 (1 commit)
    • KVM: x86: Update cpuid properly when CR4.OSXSAVE or CR4.PKE is changed · c4d21882
      Authored by Wei Huang
      The CPUID bits OSXSAVE (function=0x1) and OSPKE (func=0x7, leaf=0x0)
      allow user apps to detect whether the OS has set CR4.OSXSAVE or CR4.PKE.
      KVM is supposed to update these CPUID bits when CR4 is updated, but the
      current KVM code doesn't handle some special cases when the update comes
      from the emulator. Here is one example:

        Step 1: guest boots
        Step 2: guest OS enables XSAVE ==> CR4.OSXSAVE=1 and CPUID.OSXSAVE=1
        Step 3: guest hot reboot ==> QEMU resets CR4 to 0, but CPUID.OSXSAVE==1
        Step 4: guest OS checks CPUID.OSXSAVE, sees 1, then executes xgetbv

      Step 4 above causes an #UD and a guest crash because the guest OS hasn't
      turned on OSXSAVE yet. This patch solves the problem by comparing old_cr4
      with cr4; if the related bits have changed, kvm_update_cpuid() needs to
      be called (see the sketch after this entry).
      Signed-off-by: Wei Huang <wei@redhat.com>
      Reviewed-by: Bandan Das <bsd@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      c4d21882
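      A hedged sketch of the check added by this fix; the surrounding code in
      kvm_set_cr4() and the emulator path is simplified away:

        /* Sketch: refresh guest CPUID when CR4 bits mirrored into CPUID flip. */
        static void post_cr4_change(struct kvm_vcpu *vcpu,
                                    unsigned long old_cr4, unsigned long cr4)
        {
                const unsigned long mirrored = X86_CR4_OSXSAVE | X86_CR4_PKE;

                if ((cr4 ^ old_cr4) & mirrored)
                        kvm_update_cpuid(vcpu);
        }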
  18. 23 May 2018 (1 commit)
  19. 17 May 2018 (1 commit)