1. 08 Sep, 2018 1 commit
  2. 07 Sep, 2018 1 commit
  3. 30 Aug, 2018 10 commits
    • KVM: x86: Unexport x86_emulate_instruction() · c60658d1
      Authored by Sean Christopherson
      Allowing x86_emulate_instruction() to be called directly has led to
      subtle bugs being introduced, e.g. not setting EMULTYPE_NO_REEXECUTE
      in the emulation type.  While most of the blame lies on re-execute
      being opt-out, exporting x86_emulate_instruction() also exposes its
      cr2 parameter, which may have contributed to commit d391f120
      ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO
      when running nested") using x86_emulate_instruction() instead of
      emulate_instruction() because "hey, I have a cr2!", which in turn
      introduced its EMULTYPE_NO_REEXECUTE bug.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
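      As a rough sketch of the resulting shape (using the kvm_ prefix
      introduced in the next entry; simplified, not the verbatim kernel
      diff), x86_emulate_instruction() loses its EXPORT_SYMBOL_GPL and
      vendor modules go through a wrapper that hides the cr2 parameter:

        int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type)
        {
                /* cr2 is forced to 0; only internal callers supply a real one. */
                return x86_emulate_instruction(vcpu, 0, emulation_type, NULL, 0);
        }
        EXPORT_SYMBOL_GPL(kvm_emulate_instruction);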
    • KVM: x86: Rename emulate_instruction() to kvm_emulate_instruction() · 0ce97a2b
      Authored by Sean Christopherson
      Lack of the kvm_ prefix gives the impression that it's a VMX or SVM
      specific function, and there's no conflict that prevents adding the
      kvm_ prefix.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: Do not re-{try,execute} after failed emulation in L2 · 6c3dfeb6
      Authored by Sean Christopherson
      Commit a6f177ef ("KVM: Reenter guest after emulation failure if
      due to access to non-mmio address") added reexecute_instruction() to
      handle the scenario where two (or more) vCPUs race to write a shadowed
      page, i.e. reexecute_instruction() is intended to return true if and
      only if the instruction being emulated was accessing a shadowed page.
      As L0 is only explicitly shadowing L1 tables, an emulation failure of
      a nested VM instruction cannot be due to a race to write a shadowed
      page and so should never be re-executed.
      
      This fixes an issue where an "MMIO" emulation failure[1] in L2 is all
      but guaranteed to result in an infinite loop when TDP is enabled.
      Because "cr2" is actually an L2 GPA when TDP is enabled, calling
      kvm_mmu_gva_to_gpa_write() to translate cr2 in the non-direct mapped
      case (L2 is never direct mapped) will almost always yield UNMAPPED_GVA
      and cause reexecute_instruction() to immediately return true.  The
      !mmio_info_in_cache() check in kvm_mmu_page_fault() doesn't catch this
      case because mmio_info_in_cache() returns false for a nested MMU (the
      MMIO caching currently handles L1 only, e.g. to cache nested guests'
      GPAs we'd have to manually flush the cache when switching between
      VMs and when L1 updated its page tables controlling the nested guest).
      
      Way back when, commit 68be0803 ("KVM: x86: never re-execute
      instruction with enabled tdp") changed reexecute_instruction() to
      always return false when using TDP under the assumption that KVM would
      only get into the emulator for MMIO.  Commit 95b3cf69 ("KVM: x86:
      let reexecute_instruction work for tdp") effectively reverted that
      behavior in order to handle the scenario where emulation failed due to
      an access from L1 to the shadow page tables for L2, but it didn't
      account for the case where emulation failed in L2 with TDP enabled.
      
      All of the above logic also applies to retry_instruction(), added by
      commit 1cb3f3ae ("KVM: x86: retry non-page-table writing
      instructions").  An indefinite loop in retry_instruction() should be
      impossible as it protects against retrying the same instruction over
      and over, but it's still correct to not retry an L2 instruction in
      the first place.
      
      Fix the immediate issue by adding a check for a nested guest when
      determining whether or not to allow retry in kvm_mmu_page_fault().
      In addition to fixing the immediate bug, add WARN_ON_ONCE in the
      retry functions since they are not designed to handle nested cases,
      i.e. they need to be modified even if there is some scenario in the
      future where we want to allow retrying a nested guest.
      
      [1] This issue was encountered after commit 3a2936de ("kvm: mmu:
          Don't expose private memslots to L2") changed the page fault path
          to return KVM_PFN_NOSLOT when translating an L2 access to a
          private memslot.  Returning KVM_PFN_NOSLOT is semantically correct
          when we want to hide a memslot from L2, i.e. there effectively is
          no defined memory region for L2, but it has the unfortunate side
          effect of making KVM think the GFN is an MMIO page, thus triggering
          emulation.  The failure occurred with in-development code that
          deliberately exposed a private memslot to L2, which L2 accessed
          with an instruction that is not emulated by KVM.
      
      Fixes: 95b3cf69 ("KVM: x86: let reexecute_instruction work for tdp")
      Fixes: 1cb3f3ae ("KVM: x86: retry non-page-table writing instructions")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Cc: Xiao Guangrong <xiaoguangrong@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
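      A hedged sketch of the fix described above (simplified from the commit
      message, not the verbatim upstream diff; it also reflects the opt-in
      default established by the next entry, 472faffa):

        /* In kvm_mmu_page_fault(): retry is opt-in, and only for
         * non-nested contexts, since L0 only shadows L1 tables. */
        int emulation_type = 0;

        if (!mmu_is_nested(vcpu) && VALID_PAGE(vcpu->arch.mmu.root_hpa))
                emulation_type = EMULTYPE_ALLOW_RETRY;

        /* In reexecute_instruction()/retry_instruction(): warn if a
         * nested case ever gets here, as these paths are not designed
         * to handle it. */
        if (WARN_ON_ONCE(is_guest_mode(vcpu)))
                return false;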
    • KVM: x86: Default to not allowing emulation retry in kvm_mmu_page_fault · 472faffa
      Authored by Sean Christopherson
      Effectively force kvm_mmu_page_fault() to opt in to allowing retry to
      make it more obvious when and why it allows emulation to be retried.
      Previously this approach was less convenient due to retry and
      re-execute behavior being controlled by separate flags that were also
      inverted in their implementations (opt-in versus opt-out).
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: Merge EMULTYPE_RETRY and EMULTYPE_ALLOW_REEXECUTE · 384bf221
      Authored by Sean Christopherson
      retry_instruction() and reexecute_instruction() are a package deal,
      i.e. there is no scenario where one is allowed and the other is not.
      Merge their controlling emulation type flags to enforce this in code.
      Name the combined flag EMULTYPE_ALLOW_RETRY to make it abundantly
      clear that we are allowing re{try,execute} to occur, as opposed to
      explicitly requesting retry of a previously failed instruction.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
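      As an illustration (the flag names come from this series, but the
      exact bit assignments here are assumed), the emulation-type space
      after the merge might look like:

        #define EMULTYPE_NO_DECODE      (1 << 0)
        #define EMULTYPE_TRAP_UD        (1 << 1)
        #define EMULTYPE_SKIP           (1 << 2)
        /* Single opt-in flag replacing EMULTYPE_RETRY and
         * EMULTYPE_ALLOW_REEXECUTE. */
        #define EMULTYPE_ALLOW_RETRY    (1 << 3)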
    • KVM: x86: Invert emulation re-execute behavior to make it opt-in · 8065dbd1
      Authored by Sean Christopherson
      Re-execution of an instruction after emulation decode failure is
      intended to be used only when emulating shadow page accesses.  Invert
      the flag to make allowing re-execution opt-in since that behavior is
      by far in the minority.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: SVM: Set EMULTYPE_NO_REEXECUTE for RSM emulation · 35be0ade
      Authored by Sean Christopherson
      Re-execution after an emulation decode failure is only intended to
      handle a case where two or more vCPUs race to write a shadowed page, i.e.
      we should never re-execute an instruction as part of RSM emulation.
      
      Add a new helper, kvm_emulate_instruction_from_buffer(), to support
      emulating from a pre-defined buffer.  This eliminates the last direct
      call to x86_emulate_instruction() outside of kvm_mmu_page_fault(),
      which means x86_emulate_instruction() can be unexported in a future
      patch.
      
      Fixes: 7607b717 ("KVM: SVM: install RSM intercept")
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
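      A minimal sketch of the new helper (simplified, shown with the
      post-series opt-in retry semantics rather than the original
      EMULTYPE_NO_REEXECUTE flag):

        int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu,
                                                void *insn, int insn_len)
        {
                /* No EMULTYPE_ALLOW_RETRY, so a decode failure in the
                 * buffered instruction is never re-executed. */
                return x86_emulate_instruction(vcpu, 0, 0, insn, insn_len);
        }

      The SVM RSM intercept can then hand the emulator a fixed buffer
      holding the RSM opcode bytes instead of calling
      x86_emulate_instruction() directly.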
    • KVM: VMX: Do not allow reexecute_instruction() when skipping MMIO instr · c4409905
      Authored by Sean Christopherson
      Re-execution after an emulation decode failure is only intended to
      handle a case where two or more vCPUs race to write a shadowed page, i.e.
      we should never re-execute an instruction as part of MMIO emulation.
      As handle_ept_misconfig() is only used for MMIO emulation, it should
      pass EMULTYPE_NO_REEXECUTE when using the emulator to skip an instr
      in the fast-MMIO case where VM_EXIT_INSTRUCTION_LEN is invalid.
      
      And because the cr2 value passed to x86_emulate_instruction() is only
      destined for use when retrying or reexecuting, we can simply call
      emulate_instruction().
      
      Fixes: d391f120 ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested")
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
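      Sketched with the post-series names (the patch itself predates the
      kvm_emulate_instruction() rename and passed EMULTYPE_NO_REEXECUTE
      explicitly), the fast-MMIO skip in handle_ept_misconfig() becomes
      roughly:

        if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) {
                /* Bare metal: the recorded instruction length is usable. */
                return kvm_skip_emulated_instruction(vcpu);
        }
        /* Running nested: VM_EXIT_INSTRUCTION_LEN is unreliable, so decode
         * to skip.  Not passing EMULTYPE_ALLOW_RETRY keeps re-execution
         * disallowed on this MMIO path. */
        return kvm_emulate_instruction(vcpu, EMULTYPE_SKIP) == EMULATE_DONE;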
    • KVM: SVM: remove unused variable dst_vaddr_end · 0186ec82
      Authored by Colin Ian King
      Variable dst_vaddr_end is being assigned but is never used hence it is
      redundant and can be removed.
      
      Cleans up clang warning:
      variable 'dst_vaddr_end' set but not used [-Wunused-but-set-variable]
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: avoid redundant double assignment of nested_run_pending · b871da4a
      Authored by Vitaly Kuznetsov
      nested_run_pending is set 20 lines above and check_vmentry_prereqs()/
      check_vmentry_postreqs() don't seem to reset it (the latter, however,
      checks it).
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Eduardo Valentin <eduval@amazon.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  4. 23 Aug, 2018 1 commit
    • mm, oom: distinguish blockable mode for mmu notifiers · 93065ac7
      Authored by Michal Hocko
      There are several blockable mmu notifiers which might sleep in
      mmu_notifier_invalidate_range_start and that is a problem for the
      oom_reaper because it needs to guarantee forward progress so it cannot
      depend on any sleepable locks.
      
      Currently we simply back off and mark an oom victim with blockable mmu
      notifiers as done after a short sleep.  That can result in selecting a new
      oom victim prematurely because the previous one still hasn't torn its
      memory down yet.
      
      We can do much better though.  Even if mmu notifiers use sleepable locks
      there is no reason to automatically assume those locks are held.  Moreover,
      the majority of notifiers only care about a portion of the address space,
      and there is absolutely zero reason to fail when we are unmapping an
      unrelated range.  Some notifiers, however, really do block and wait for
      HW, which is harder to handle, and in that case we have to bail out.
      
      This patch handles the low hanging fruit.
      __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
      are not allowed to sleep if the flag is set to false.  This is achieved by
      using trylock instead of the sleepable lock for most callbacks, continuing
      as long as we do not block down the call chain.
      
      I think we can improve that even further because there is a common pattern
      to do a range lookup first and then do something about that.  The first
      part can be done without a sleeping lock in most cases AFAICS.
      
      The oom_reaper end then simply retries if there is at least one notifier
      which couldn't make any progress in !blockable mode.  A retry loop is
      already implemented to wait for the mmap_sem and this is basically the
      same thing.
      
      The simplest way for driver developers to test this code path is to wrap
      userspace code which uses these notifiers into a memcg and set the hard
      limit to hit the oom.  This can be done e.g. by having the test fault in
      all the mmu-notifier-managed memory and then setting the hard limit to
      something really small.  Then we are looking for a proper process tear down.
      
      [akpm@linux-foundation.org: coding style fixes]
      [akpm@linux-foundation.org: minor code simplification]
      Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
      Reported-by: David Rientjes <rientjes@google.com>
      Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Cc: Sudeep Dutt <sudeep.dutt@intel.com>
      Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
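      A hedged sketch of the resulting callback contract (the blockable
      parameter and int return come from this patch; the my_ctx structure
      and its lock are hypothetical):

        static int my_invalidate_range_start(struct mmu_notifier *mn,
                                             struct mm_struct *mm,
                                             unsigned long start,
                                             unsigned long end,
                                             bool blockable)
        {
                struct my_ctx *ctx = container_of(mn, struct my_ctx, mn);

                if (blockable)
                        mutex_lock(&ctx->lock);
                else if (!mutex_trylock(&ctx->lock))
                        return -EAGAIN; /* oom_reaper backs off and retries */

                /* ... invalidate mappings overlapping [start, end) ... */

                mutex_unlock(&ctx->lock);
                return 0;
        }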
  5. 22 Aug, 2018 5 commits
    • KVM: VMX: fixes for vmentry_l1d_flush module parameter · 0027ff2a
      Authored by Paolo Bonzini
      Two bug fixes:
      
      1) missing entries in the l1d_param array; this can cause a host crash
      if an access attempts to reach the missing entry. Future-proof the get
      function against any overflows as well.  However, the two entries
      VMENTER_L1D_FLUSH_EPT_DISABLED and VMENTER_L1D_FLUSH_NOT_REQUIRED must
      not be accepted by the parse function, so disable them there.
      
      2) invalid values must be rejected even if the CPU does not have the
      bug, so test for them before checking boot_cpu_has(X86_BUG_L1TF)
      
      ... and a small refactoring, since the .cmd field is redundant with
      the index in the array.
      Reported-by: Bandan Das <bsd@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: a7b9020b
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
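      A sketch of the fixed table and bounds-checked getter, inferred from
      the description above (details assumed, not the verbatim diff):

        static const struct {
                const char *option;
                bool for_parse;  /* rejected by the parse function if false */
        } vmentry_l1d_param[] = {
                [VMENTER_L1D_FLUSH_AUTO]         = {"auto", true},
                [VMENTER_L1D_FLUSH_NEVER]        = {"never", true},
                [VMENTER_L1D_FLUSH_COND]         = {"cond", true},
                [VMENTER_L1D_FLUSH_ALWAYS]       = {"always", true},
                [VMENTER_L1D_FLUSH_EPT_DISABLED] = {"EPT disabled", false},
                [VMENTER_L1D_FLUSH_NOT_REQUIRED] = {"not required", false},
        };

        static int vmentry_l1d_flush_get(char *s, const struct kernel_param *kp)
        {
                /* Future-proof: never index past the array. */
                if (WARN_ON_ONCE(l1tf_vmx_mitigation >= ARRAY_SIZE(vmentry_l1d_param)))
                        return sprintf(s, "???\n");

                return sprintf(s, "%s\n",
                               vmentry_l1d_param[l1tf_vmx_mitigation].option);
        }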
    • KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled · 024d83ca
      Authored by Thomas Gleixner
      Mikhail reported the following lockdep splat:
      
      WARNING: possible irq lock inversion dependency detected
      CPU 0/KVM/10284 just changed the state of lock:
        000000000d538a88 (&st->lock){+...}, at:
        speculative_store_bypass_update+0x10b/0x170
      
      but this lock was taken by another, HARDIRQ-safe lock
      in the past:
      
      (&(&sighand->siglock)->rlock){-.-.}
      
         and interrupts could create inverse lock ordering between them.
      
      Possible interrupt unsafe locking scenario:
      
          CPU0                    CPU1
          ----                    ----
         lock(&st->lock);
                                 local_irq_disable();
                                 lock(&(&sighand->siglock)->rlock);
                                 lock(&st->lock);
          <Interrupt>
           lock(&(&sighand->siglock)->rlock);
           *** DEADLOCK ***
      
      The code path which connects those locks is:
      
         speculative_store_bypass_update()
         ssb_prctl_set()
         do_seccomp()
         do_syscall_64()
      
      In svm_vcpu_run() speculative_store_bypass_update() is called with
      interrupts enabled via x86_virt_spec_ctrl_set_guest/host().
      
      This is actually a false positive, because GIF=0 so interrupts are
      disabled even if IF=1; however, we can easily move the invocations of
      x86_virt_spec_ctrl_set_guest/host() into the interrupt disabled region to
      cure it, and it's a good idea to keep the GIF=0/IF=1 area as small
      and self-contained as possible.
      
      Fixes: 1f50ddb4 ("x86/speculation: Handle HT correctly on AMD")
      Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: kvm@vger.kernel.org
      Cc: x86@kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
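      A heavily condensed sketch of the reordering in svm_vcpu_run()
      (function names as in the svm code of that era; simplified flow):

        clgi();                 /* GIF=0: IRQs are blocked even while IF=1 */
        x86_spec_ctrl_set_guest(svm->spec_ctrl, svm->virt_spec_ctrl);
        local_irq_enable();

        /* ... enter and run the guest ... */

        local_irq_disable();
        x86_spec_ctrl_restore_host(svm->spec_ctrl, svm->virt_spec_ctrl);
        stgi();                 /* GIF=1 */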
    • KVM: vmx: Inject #UD for SGX ENCLS instruction in guest · 0b665d30
      Authored by Sean Christopherson
      Virtualization of Intel SGX depends on Enclave Page Cache (EPC)
      management that is not yet available in the kernel, i.e. KVM support
      for exposing SGX to a guest cannot be added until basic support
      for SGX is upstreamed, which is a WIP[1].
      
      Until SGX is properly supported in KVM, ensure a guest sees expected
      behavior for ENCLS, i.e. all ENCLS #UD.  Because SGX does not have a
      true software enable bit, e.g. there is no CR4.SGXE bit, the ENCLS
      instruction can be executed[2] by the guest if SGX is supported by the
      system.  Intercept all ENCLS leafs (via the ENCLS-exiting control and
      field) and unconditionally inject #UD.
      
      [1] https://www.spinics.net/lists/kvm/msg171333.html or
          https://lkml.org/lkml/2018/7/3/879
      
      [2] A guest can execute ENCLS in the sense that ENCLS will not take
          an immediate #UD, but no ENCLS will ever succeed in a guest
          without explicit support from KVM (map EPC memory into the guest),
          unless KVM has a *very* egregious bug, e.g. accidentally mapped
          EPC memory into the guest SPTEs.  In other words this patch is
          needed only to prevent the guest from seeing inconsistent behavior,
          e.g. #GP (SGX not enabled in Feature Control MSR) or #PF (leaf
          operand(s) does not point at EPC memory) instead of #UD on ENCLS.
          Intercepting ENCLS is not required to prevent the guest from truly
          utilizing SGX.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20180814163334.25724-3-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
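      A minimal sketch of the handler this implies (matching the behavior
      described above; assumed simplified form):

        static int handle_encls(struct kvm_vcpu *vcpu)
        {
                /*
                 * SGX is not yet virtualized and there is no software
                 * enable bit to clear, so trap every ENCLS leaf and make
                 * the guest see #UD.
                 */
                kvm_queue_exception(vcpu, UD_VECTOR);
                return 1;
        }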
    • x86/kvm/vmx: Fix coding style in vmx_setup_l1d_flush() · d806afa4
      Authored by Yi Wang
      Substitute spaces with tabs. No functional changes.
      Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
      Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Message-Id: <1534398159-48509-1-git-send-email-wang.yi59@zte.com.cn>
      Cc: stable@vger.kernel.org # L1TF
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86: kvm: avoid unused variable warning · 7288bde1
      Authored by Arnd Bergmann
      Removing one of the two accesses of the maxphyaddr variable led to
      a harmless warning:
      
      arch/x86/kvm/x86.c: In function 'kvm_set_mmio_spte_mask':
      arch/x86/kvm/x86.c:6563:6: error: unused variable 'maxphyaddr' [-Werror=unused-variable]
      
      Removing the #ifdef seems to be the nicest workaround, as it
      makes the code look cleaner than adding another #ifdef.
      
      Fixes: 28a1f3ac ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: stable@vger.kernel.org # L1TF
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 21 Aug, 2018 1 commit
  7. 15 Aug, 2018 1 commit
  8. 07 Aug, 2018 1 commit
  9. 06 Aug, 2018 19 commits