  1. 17 Apr 2021 (30 commits)
  2. 06 Apr 2021 (1 commit)
    • x86/sgx: Introduce virtual EPC for use by KVM guests · 540745dd
      Committed by Sean Christopherson
      Add a misc device /dev/sgx_vepc to allow userspace to allocate "raw"
      Enclave Page Cache (EPC) without an associated enclave. The intended
      and only known use case for raw EPC allocation is to expose EPC to a
      KVM guest, hence the 'vepc' moniker, virt.{c,h} files and X86_SGX_KVM
      Kconfig.
      
      The SGX driver uses the misc device /dev/sgx_enclave to support
      userspace in creating an enclave. Each file descriptor returned from
      opening /dev/sgx_enclave represents an enclave. Unlike the SGX driver,
      KVM doesn't control how the guest uses the EPC, therefore EPC allocated
      to a KVM guest is not associated with an enclave, and /dev/sgx_enclave
      is not suitable for allocating EPC for a KVM guest.
      
      Having separate device nodes for the SGX driver and KVM virtual EPC also
      allows separate permission control for running host SGX enclaves and KVM
      SGX guests.
      
      To use /dev/sgx_vepc to allocate a virtual EPC instance of a particular
      size, the hypervisor opens /dev/sgx_vepc and uses mmap() with the
      intended size to get an address range of virtual EPC. Then it may use
      the address range to create one KVM memory slot as virtual EPC for
      a guest.
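
      As a rough illustration, a minimal userspace sketch of that flow could
      look like the following; the guest-physical address, size and slot
      number are made up here, and the exact memslot wiring is up to the VMM:

          #include <stddef.h>
          #include <fcntl.h>
          #include <sys/mman.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>

          #define GUEST_EPC_GPA   0x100000000ULL     /* hypothetical placement */
          #define GUEST_EPC_SIZE  (64UL << 20)       /* 64 MiB of virtual EPC  */

          static int map_vepc_into_guest(int vm_fd)
          {
                  int vepc_fd = open("/dev/sgx_vepc", O_RDWR);
                  if (vepc_fd < 0)
                          return -1;

                  /* mmap() with the intended size yields a range of raw EPC. */
                  void *epc = mmap(NULL, GUEST_EPC_SIZE, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, vepc_fd, 0);
                  if (epc == MAP_FAILED)
                          return -1;

                  /* Expose the range to the guest as a normal KVM memory slot. */
                  struct kvm_userspace_memory_region region = {
                          .slot            = 10,     /* arbitrary free slot */
                          .guest_phys_addr = GUEST_EPC_GPA,
                          .memory_size     = GUEST_EPC_SIZE,
                          .userspace_addr  = (unsigned long)epc,
                  };
                  return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
          }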
      
      Implement the "raw" EPC allocation in the x86 core-SGX subsystem via
      /dev/sgx_vepc rather than in KVM. Doing so has two major advantages:
      
        - Does not require changes to KVM's uAPI, e.g. EPC gets handled as
          just another memory backend for guests.
      
        - EPC management is wholly contained in the SGX subsystem, e.g. SGX
          does not have to export any symbols, changes to reclaim flows don't
          need to be routed through KVM, SGX's dirty laundry doesn't have to
          get aired out for the world to see, and so on and so forth.
      
      The virtual EPC pages allocated to guests are currently not reclaimable.
      Reclaiming an EPC page used by an enclave requires a special reclaim
      mechanism separate from normal page reclaim, and that mechanism is not
      supported for virtual EPC pages. Due to the complications of handling
      reclaim conflicts between guest and host, reclaiming virtual EPC pages
      is significantly more complex than basic support for SGX virtualization.
      
       [ bp:
         - Massage commit message and comments
         - use cpu_feature_enabled()
         - vertically align struct members init
         - massage Virtual EPC clarification text
         - move Kconfig prompt to Virtualization ]
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Co-developed-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
      Link: https://lkml.kernel.org/r/0c38ced8c8e5a69872db4d6a1c0dabd01e07cad7.1616136308.git.kai.huang@intel.com
      540745dd
  3. 01 Apr 2021 (2 commits)
    • KVM: SVM: ensure that EFER.SVME is set when running nested guest or on nested vmexit · 3c346c0c
      Committed by Paolo Bonzini
      Fixing nested_vmcb_check_save to avoid all TOC/TOU races
      is a bit harder in released kernels, so do the bare minimum
      and simply prevent EFER.SVME from being cleared.  Letting it be
      cleared is problematic because svm_set_efer frees the data structures
      for nested virtualization whenever EFER.SVME is cleared.
      
      Also check that EFER.SVME remains set after a nested vmexit;
      clearing it could happen if the bit is zero in the save area
      that is passed to KVM_SET_NESTED_STATE (the save area of the
      nested state corresponds to the nested hypervisor's state
      and is restored on the next nested vmexit).
      
      Cc: stable@vger.kernel.org
      Fixes: 2fcf4876 ("KVM: nSVM: implement on demand allocation of the nested state")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3c346c0c
    • KVM: SVM: load control fields from VMCB12 before checking them · a58d9166
      Committed by Paolo Bonzini
      Avoid races between the check and the use of the nested VMCB controls.
      This ensures, for example, that the VMRUN intercept is always reflected
      to the nested hypervisor instead of being processed by the host.  Without this
      patch, it is possible to end up with svm->nested.hsave pointing to
      the MSR permission bitmap for nested guests.
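
      The copy-then-check pattern being applied is roughly the following
      (types, field layout and the specific check are made up for
      illustration; the real code caches the controls in host-owned nested
      state):

          #include <stdbool.h>
          #include <stdint.h>
          #include <string.h>

          /* Illustrative, trimmed-down VMCB12 controls that live in guest
           * memory and can be rewritten by L1 at any time. */
          struct vmcb12_ctl {
                  uint64_t intercepts;
                  uint64_t msrpm_base_pa;
          };

          #define INTERCEPT_VMRUN_BIT (1ULL << 32)   /* placeholder position */

          /* Snapshot first, then check and consume only the snapshot, so a
           * concurrent guest write can no longer pass the check with one
           * value and be used with another. */
          static bool cache_and_check_controls(struct vmcb12_ctl *cache,
                                               const struct vmcb12_ctl *guest)
          {
                  memcpy(cache, guest, sizeof(*cache));
                  return (cache->intercepts & INTERCEPT_VMRUN_BIT) != 0;
          }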
      
      This bug is CVE-2021-29657.
      Reported-by: Felix Wilhelm <fwilhelm@google.com>
      Cc: stable@vger.kernel.org
      Fixes: 2fcf4876 ("KVM: nSVM: implement on demand allocation of the nested state")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a58d9166
  4. 31 Mar 2021 (3 commits)
    • KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages · 33a31641
      Committed by Sean Christopherson
      Prevent the TDP MMU from yielding when zapping a gfn range during NX
      page recovery.  If a flush is pending from a previous invocation of the
      zapping helper, either in the TDP MMU or the legacy MMU, but the TDP MMU
      has not accumulated a flush for the current invocation, then yielding
      will release mmu_lock with stale TLB entries.
      
      That being said, this isn't technically a bug fix in the current code, as
      the TDP MMU will never yield in this case.  tdp_mmu_iter_cond_resched()
      will yield if and only if it has made forward progress, as defined by the
      current gfn vs. the last yielded (or starting) gfn.  Because zapping a
      single shadow page is guaranteed to (a) find that page and (b) step
      sideways at the level of the shadow page, the TDP iter will break its loop
      before getting a chance to yield.
      
      But that is all very, very subtle, and will break at the slightest sneeze,
      e.g. zapping while holding mmu_lock for read would break as the TDP MMU
      wouldn't be guaranteed to see the present shadow page, and thus could step
      sideways at a lower level.
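
      As a toy model of the yield condition described above (this is not the
      kernel's actual tdp_mmu_iter_cond_resched() signature):

          #include <stdbool.h>
          #include <stdint.h>

          typedef uint64_t gfn_t;

          /* Yield only if the walk has moved past the last yielded (or
           * starting) gfn.  Zapping a single shadow page finds that page and
           * then steps sideways at the page's own level, so the loop ends
           * before this can ever return true. */
          static bool may_yield(gfn_t next_gfn, gfn_t last_yielded_gfn)
          {
                  return next_gfn > last_yielded_gfn;
          }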
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210325200119.1359384-4-seanjc@google.com>
      [Add lockdep assertion. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      33a31641
    • KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping · 048f4980
      Committed by Sean Christopherson
      Honor the "flush needed" return from kvm_tdp_mmu_zap_gfn_range(), which
      does the flush itself if and only if it yields (which it will never do in
      this particular scenario), and otherwise expects the caller to do the
      flush.  If pages are zapped from the TDP MMU but not the legacy MMU, then
      no flush will occur.
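
      A sketch of the caller-side contract (the helper names are invented;
      only the shape of the flush handling matters):

          #include <stdbool.h>

          struct vm;                              /* illustrative stand-ins  */
          bool legacy_mmu_zap(struct vm *vm);     /* returns "flush needed"  */
          bool tdp_mmu_zap_range(struct vm *vm, bool flush);
          void remote_tlb_flush(struct vm *vm);

          /* The TDP MMU helper flushes on its own only if it yields; in every
           * other case it hands the pending flush back to the caller, which
           * must perform it, or pages zapped only in the TDP MMU would keep
           * stale TLB entries. */
          static void zap_and_flush(struct vm *vm)
          {
                  bool flush = legacy_mmu_zap(vm);

                  flush = tdp_mmu_zap_range(vm, flush);
                  if (flush)
                          remote_tlb_flush(vm);
          }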
      
      Fixes: 29cf0f50 ("kvm: x86/mmu: NX largepage recovery for TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210325200119.1359384-3-seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      048f4980
    • KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap · a835429c
      Committed by Sean Christopherson
      When flushing a range of GFNs across multiple roots, ensure any pending
      flush from a previous root is honored before yielding while walking the
      tables of the current root.
      
      Note, kvm_tdp_mmu_zap_gfn_range() now intentionally overwrites its local
      "flush" with the result to avoid redundant flushes.  zap_gfn_range()
      preserves and returns the incoming "flush", unless of course the flush was
      performed prior to yielding and no new flush was triggered.
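
      A sketch of the multi-root pattern (again with invented names): carry
      the pending-flush state from root to root, so a yield while walking a
      later root cannot drop mmu_lock with a flush still owed for an earlier
      one.

          #include <stdbool.h>
          #include <stdint.h>

          typedef uint64_t gfn_t;
          struct mmu_root;                        /* illustrative stand-in */
          bool zap_root_range(struct mmu_root *root, gfn_t start, gfn_t end,
                              bool flush);        /* flushes before yielding */

          /* Each per-root walk receives the flush state accumulated so far
           * and returns the updated state; the caller still owes a flush if
           * the final value is true. */
          static bool zap_all_roots(struct mmu_root **roots, int nr,
                                    gfn_t start, gfn_t end)
          {
                  bool flush = false;

                  for (int i = 0; i < nr; i++)
                          flush = zap_root_range(roots[i], start, end, flush);
                  return flush;
          }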
      
      Fixes: 1af4a960 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
      Cc: stable@vger.kernel.org
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210325200119.1359384-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a835429c
  5. 17 Mar 2021 (4 commits)