1. 22 June 2021 (1 commit)
    • x86/fpu: Make init_fpstate correct with optimized XSAVE · f9dfb5e3
      Authored by Thomas Gleixner
      The XSAVE init code initializes all enabled and supported components with
      XRSTOR(S) to init state. Then it XSAVEs the state of the components back
      into init_fpstate which is used in several places to fill in the init state
      of components.
      
      This works correctly with XSAVE, but not with XSAVEOPT and XSAVES because
      those use the init optimization and skip writing state of components which
      are in init state. So init_fpstate.xsave still contains all zeroes after
      this operation.
      
      There are two ways to solve that:
      
         1) Use XSAVE unconditionally, but that requires reshuffling the buffer when
            XSAVES is enabled because XSAVES uses compacted format.
      
         2) Save the components which are known to have a non-zero init state by other
            means.
      
      Looking deeper, #2 is the right thing to do because all components the
      kernel supports have all-zeroes init state except the legacy features (FP,
      SSE). Those cannot be hard coded because the states are not identical on all
      CPUs, but they can be saved with FXSAVE which avoids all conditionals.
      
      Use FXSAVE to save the legacy FP/SSE components in init_fpstate along with
      a BUILD_BUG_ON() which reminds developers to validate that a newly added
      component has an all-zeroes init state. As a bonus, remove the now unused
      copy_xregs_to_kernel_booting() crutch.
      
      The XSAVE and reshuffle method can still be implemented in the unlikely
      case that components are added which have a non-zero init state and no
      other means to save them. For now, FXSAVE is just simple and good enough.
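      
      A minimal sketch of the FXSAVE approach described above (the helper and
      field names follow the upstream FPU code, but treat the details as
      illustrative rather than the exact patch):
      
        /* Save only the legacy FP/SSE state; FXSAVE has no init
         * optimization, so init_fpstate gets real data. */
        static inline void fxsave(struct fxregs_state *fx)
        {
        	if (IS_ENABLED(CONFIG_X86_32))
        		asm volatile("fxsave %[fx]" : [fx] "=m" (*fx));
        	else
        		asm volatile("fxsaveq %[fx]" : [fx] "=m" (*fx));
        }
      
        /* During FPU init, after XRSTOR(S) to init state: */
        fxsave(&init_fpstate.fxsave);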
      
        [ bp: Fix a typo or two in the text. ]
      
      Fixes: 6bad06b7 ("x86, xsave: Use xsaveopt in context-switch path when supported")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20210618143444.587311343@linutronix.de
      f9dfb5e3
  2. 09 June 2021 (2 commits)
  3. 03 June 2021 (1 commit)
    • x86/cpufeatures: Force disable X86_FEATURE_ENQCMD and remove update_pasid() · 9bfecd05
      Authored by Thomas Gleixner
      While digesting the XSAVE-related horrors which got introduced with
      the supervisor/user split, the recent addition of ENQCMD-related
      functionality got on the radar and turned out to be similarly broken.
      
      update_pasid(), which is only required when X86_FEATURE_ENQCMD is
      available, is invoked from two places:
      
       1) From switch_to() for the incoming task
      
       2) Via an SMP function call from the IOMMU/SVM code
      
      #1 is halfway correct as it hacks around the brokenness of get_xsave_addr()
         by enforcing the state to be 'present', but all the conditionals in that
         code are completely pointless for that.
      
         Also the invocation is just useless overhead because at that point
         it's guaranteed that TIF_NEED_FPU_LOAD is set on the incoming task
         and all of this can be handled at return to user space.
      
      #2 is broken beyond repair. The comment in the code claims that it is safe
         to invoke this in an IPI, but that's just wishful thinking.
      
         FPU state of a running task is protected by fpregs_lock(), which is
         nothing else than a local_bh_disable() (see the sketch after this
         item). As BH-disabled regions usually run with interrupts enabled,
         the IPI can hit a code section which modifies FPU state, and there
         is absolutely no guarantee that any of the assumptions made for the
         IPI case holds.
      
         Also the IPI is sent to all CPUs in mm_cpumask(mm), but the IPI is
         invoked with a NULL pointer argument, so it can hit a completely
         unrelated task and unconditionally force an update for nothing.
         Worse, it can hit a kernel thread which operates on a user space
         address space and set a random PASID for it.
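      
      For context, a sketch of the locking primitive involved (modeled on the
      kernel's fpregs_lock()/fpregs_unlock(); the exact body varies by kernel
      version):
      
        /* Disables BH and preemption, but NOT hard interrupts, which is
         * why an IPI can land in the middle of an FPU state update. */
        static inline void fpregs_lock(void)
        {
        	preempt_disable();
        	local_bh_disable();
        }
      
        static inline void fpregs_unlock(void)
        {
        	local_bh_enable();
        	preempt_enable();
        }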
      
      The offending commit does not cleanly revert, but it's sufficient to
      force disable X86_FEATURE_ENQCMD and to remove the broken update_pasid()
      code to make this dysfunctional all over the place. Anything more
      complex would require more surgery and none of the related functions
      outside of the x86 core code are blatantly wrong, so removing those
      would be overkill.
      
      As nothing enables the PASID bit in the IA32_XSS MSR yet, which is
      required to make this actually work, this cannot result in a regression
      except for related out-of-tree train-wrecks, but they are broken already
      today.
      
      Fixes: 20f0afd1 ("x86/mmu: Allocate/free a PASID")
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/87mtsd6gr9.ffs@nanos.tec.linutronix.de
      9bfecd05
  4. 01 June 2021 (1 commit)
    • x86/thermal: Fix LVT thermal setup for SMI delivery mode · 9a90ed06
      Authored by Borislav Petkov
      There are machines out there with added value crap^WBIOS which provide an
      SMI handler for the local APIC thermal sensor interrupt. Out of reset,
      the BSP on those machines has something like 0x200 in that APIC register
      (timestamps left in because this whole issue is timing sensitive):
      
        [    0.033858] read lvtthmr: 0x330, val: 0x200
      
      which means:
      
       - bit 16, the interrupt mask bit, is clear and thus that interrupt is enabled
       - bits [10:8] contain 010b, which means SMI delivery mode.
      
      Now, later during boot, when the kernel programs the local APIC, it
      soft-disables it temporarily through the spurious vector register:
      
        setup_local_APIC:
      
        	...
      
        	/*
        	 * If this comes from kexec/kcrash the APIC might be enabled in
        	 * SPIV. Soft disable it before doing further initialization.
        	 */
        	value = apic_read(APIC_SPIV);
        	value &= ~APIC_SPIV_APIC_ENABLED;
        	apic_write(APIC_SPIV, value);
      
      which means (from the SDM):
      
      "10.4.7.2 Local APIC State After It Has Been Software Disabled
      
      ...
      
      * The mask bits for all the LVT entries are set. Attempts to reset these
      bits will be ignored."
      
      And this happens too:
      
        [    0.124111] APIC: Switch to symmetric I/O mode setup
        [    0.124117] lvtthmr 0x200 before write 0xf to APIC 0xf0
        [    0.124118] lvtthmr 0x10200 after write 0xf to APIC 0xf0
      
      This results in CPU 0 soft lockups depending on the placement in time
      when the APIC soft-disable happens. Those soft lockups are not 100%
      reproducible and the reason for that can only be speculated as no one
      tells you what SMM does. Likely, it confuses the SMM code that the APIC
      is disabled and the thermal interrupt doesn't fire at all,
      leading to CPU 0 stuck in SMM forever...
      
      Now, before
      
        4f432e8b ("x86/mce: Get rid of mcheck_intel_therm_init()")
      
      the APIC_LVTTHMR was read in mcheck_intel_therm_init() before APIC
      initialization, so the value was read with mask bit 16 clear, and
      intel_init_thermal() would then replicate it onto the APs and all would
      be peachy - the thermal interrupt would remain enabled.
      
      But that commit moved that reading to a later moment in
      intel_init_thermal(), resulting in reading APIC_LVTTHMR on the BSP too
      late and with its interrupt mask bit set.
      
      Thus, revert back to the old behavior of reading the thermal LVT
      register before the APIC gets initialized.
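      
      A sketch of that early read, modeled on the pre-4f432e8b
      mcheck_intel_therm_init() (names taken from the old code; the final
      fix may differ in detail):
      
        /* BIOS-programmed thermal LVT value, cached on the BSP before the
         * APIC is soft-disabled, i.e. while the mask bit is still clear. */
        static int lvtthmr_init __read_mostly;
      
        void __init mcheck_intel_therm_init(void)
        {
        	if (intel_thermal_supported(&boot_cpu_data))
        		lvtthmr_init = apic_read(APIC_LVTTHMR);
        }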
      
      Fixes: 4f432e8b ("x86/mce: Get rid of mcheck_intel_therm_init()")
      Reported-by: James Feeney <james@nurealm.net>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Link: https://lkml.kernel.org/r/YKIqDdFNaXYd39wz@zn.tnic
      9a90ed06
  5. 29 May 2021 (1 commit)
    • x86/apic: Mark _all_ legacy interrupts when IO/APIC is missing · 7d65f9e8
      Authored by Thomas Gleixner
      PIC interrupts do not support affinity setting and they can end up on
      any online CPU. Therefore, it's required to mark the associated vectors
      as system-wide reserved. Otherwise, the corresponding irq descriptors
      are copied to the secondary CPUs but the vectors are not marked as
      assigned or reserved. This works correctly for the IO/APIC case.
      
      When the IO/APIC is disabled via config, kernel command line or lack of
      enumeration then all legacy interrupts are routed through the PIC, but
      nothing marks them as system-wide reserved vectors.
      
      As a consequence, a subsequent allocation on a secondary CPU can result in
      allocating one of these vectors, which triggers the BUG() in
      apic_update_vector() because the interrupt descriptor slot is not empty.
      
      Imran tried to work around that by marking those interrupts as allocated
      when a CPU comes online. But that's wrong when the IO/APIC is available
      and one of the legacy interrupts, e.g. IRQ0, has been switched to PIC
      mode: marking it as allocated will then fail because it is already
      marked as a system vector.
      
      Stay consistent and update the legacy vectors after attempting IO/APIC
      initialization, and mark them as system vectors when no IO/APIC is
      available.
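      
      A sketch of that update step (lapic_assign_legacy_vector() exists in
      the vector-matrix code; the surrounding function and the exact
      conditions are illustrative):
      
        /* Called after IO/APIC initialization has been attempted. Without
         * an IO/APIC, all legacy interrupts stay routed through the PIC
         * and their vectors must be system-wide reserved. */
        void __init lapic_update_legacy_vectors(void)
        {
        	unsigned int i;
      
        	if (IS_ENABLED(CONFIG_X86_IO_APIC) && nr_ioapics > 0)
        		return;
      
        	for (i = 0; i < nr_legacy_irqs(); i++)
        		lapic_assign_legacy_vector(i, true);
        }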
      
      Fixes: 69cde000 ("x86/vector: Use matrix allocator for vector assignment")
      Reported-by: Imran Khan <imran.f.khan@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20210519233928.2157496-1-imran.f.khan@oracle.com
      7d65f9e8
  6. 27 May 2021 (1 commit)
  7. 14 May 2021 (1 commit)
  8. 13 May 2021 (1 commit)
  9. 10 May 2021 (3 commits)
  10. 07 May 2021 (9 commits)
  11. 06 May 2021 (3 commits)
  12. 05 May 2021 (1 commit)
  13. 03 May 2021 (1 commit)
    • KVM: nSVM: fix few bugs in the vmcb02 caching logic · c74ad08f
      Authored by Maxim Levitsky
      * Define and use an invalid GPA (all ones) as the init value of the last
        and current nested vmcb physical addresses (see the sketch below).
      
      * Reset the current vmcb12 GPA to the invalid value when leaving
        nested mode, similar to what is done on nested vmexit.
      
      * Reset the last seen vmcb12 address when disabling nested SVM,
        as it relies on vmcb02 fields which are freed at that point.
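      
      A sketch of the invalid-GPA convention from the first bullet (the
      constant and field names here are illustrative):
      
        /* All ones is never a valid guest physical address, so it makes a
         * recognizable "no vmcb12 cached" marker. */
        #define SVM_INVALID_GPA		((u64)-1)
      
        static void nested_svm_invalidate_vmcb12_cache(struct vcpu_svm *svm)
        {
        	svm->nested.last_vmcb12_gpa = SVM_INVALID_GPA;
        }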
      
      Fixes: 4995a368 ("KVM: SVM: Use a separate vmcb for the nested L2 guest")
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210503125446.1353307-3-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c74ad08f
  14. 01 May 2021 (3 commits)
  15. 26 April 2021 (1 commit)
  16. 22 April 2021 (1 commit)
    • KVM: x86: Support KVM VMs sharing SEV context · 54526d1f
      Authored by Nathan Tempelman
      Add a capability for userspace to mirror SEV encryption context from
      one vm to another. On our side, this is intended to support a
      Migration Helper vCPU, but it can also be used generically to support
      other in-guest workloads scheduled by the host. The intention is for
      the primary guest and the mirror to have nearly identical memslots.
      
      The primary benefits of this are that:
      1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
      can't accidentally clobber each other.
      2) The VMs can have different memory views, which is necessary for post-copy
      migration (the migration vCPUs on the target need to read and write to
      pages when the primary guest would VMEXIT).
      
      This does not change the threat model for AMD SEV. Any memory involved
      is still owned by the primary guest and its initial state is still
      attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
      to circumvent SEV, they could achieve the same effect by simply attaching
      a vCPU to the primary VM.
      This patch deliberately leaves userspace in charge of the memslots for the
      mirror, as it already has the power to mess with them in the primary guest.
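      
      A hedged sketch of how userspace might enable the mirroring capability
      (based on the KVM_CAP_VM_COPY_ENC_CONTEXT_FROM interface this patch
      adds; treat the details as illustrative):
      
        /* Point the mirror VM at the primary VM's fd. Afterwards the mirror
         * shares the primary's SEV encryption context, but keeps its own
         * KVM context and memslots. */
        static int mirror_sev_context(int mirror_vm_fd, int primary_vm_fd)
        {
        	struct kvm_enable_cap cap = {
        		.cap = KVM_CAP_VM_COPY_ENC_CONTEXT_FROM,
        		.args[0] = (unsigned long)primary_vm_fd,
        	};
      
        	return ioctl(mirror_vm_fd, KVM_ENABLE_CAP, &cap);
        }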
      
      This patch does not support SEV-ES (much less SNP), as it does not
      handle handing off attested VMSAs to the mirror.
      
      For additional context, we need a Migration Helper because SEV PSP
      migration is far too slow for our live migration on its own. Using
      an in-guest migrator lets us speed this up significantly.
      Signed-off-by: Nathan Tempelman <natet@google.com>
      Message-Id: <20210408223214.2582277-1-natet@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      54526d1f
  17. 21 April 2021 (2 commits)
  18. 20 April 2021 (7 commits)
    • floppy: remove redundant assignment to variable st · b53002e0
      Authored by Colin Ian King
      The variable st is being assigned a value that is never read and
      it is being updated later with a new value. The initialization is
      redundant and can be removed.
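      
      The pattern in question, illustrated (this is not the actual floppy
      code, just the shape Coverity flags):
      
        int st = 0;		/* dead store: never read ... */
      
        /* ... */
      
        st = do_command();	/* ... because st is overwritten here first */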
      
      Addresses-Coverity: ("Unused value")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Reviewed-by: Denis Efremov <efremov@linux.com>
      Acked-by: Willy Tarreau <w@1wt.eu>
      Link: https://lore.kernel.org/r/20210415130020.1959951-1-colin.king@canonical.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b53002e0
    • KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions · 70210c04
      Authored by Sean Christopherson
      Add an ECREATE handler that will be used to intercept ECREATE for the
      purpose of enforcing an enclave's MISCSELECT, ATTRIBUTES and XFRM, i.e.
      to allow userspace to restrict SGX features via CPUID.  ECREATE will be
      intercepted when any of the aforementioned masks diverges from hardware
      in order to enforce the desired CPUID model, i.e. inject #GP if the
      guest attempts to set a bit that hasn't been enumerated as allowed-1 in
      CPUID.
      
      Note, access to the PROVISIONKEY is not yet supported.
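      
      A sketch of the enforcement idea (the guest-CPUID mask helpers are
      illustrative assumptions, not the patch's actual function names):
      
        static int handle_encls_ecreate(struct kvm_vcpu *vcpu,
        				struct sgx_secs *secs)
        {
        	/* Inject #GP if the SECS requests any MISCSELECT/ATTRIBUTES/
        	 * XFRM bit not enumerated as allowed-1 in the guest's CPUID. */
        	if ((secs->miscselect & ~guest_allowed_miscselect(vcpu)) ||
        	    (secs->attributes & ~guest_allowed_attributes(vcpu)) ||
        	    (secs->xfrm & ~guest_allowed_xfrm(vcpu))) {
        		kvm_inject_gp(vcpu, 0);
        		return 1;
        	}
        	/* Otherwise execute ECREATE on the guest's behalf. */
        	return 0;
        }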
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Co-developed-by: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <c3a97684f1b71b4f4626a1fc3879472a95651725.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      70210c04
    • KVM: VMX: Add basic handling of VM-Exit from SGX enclave · 3c0c2ad1
      Authored by Sean Christopherson
      Add support for handling VM-Exits that originate from a guest SGX
      enclave.  In SGX, an "enclave" is a new CPL3-only execution environment,
      wherein the CPU and memory state is protected by hardware to make the
      state inaccesible to code running outside of the enclave.  When exiting
      an enclave due to an asynchronous event (from the perspective of the
      enclave), e.g. exceptions, interrupts, and VM-Exits, the enclave's state
      is automatically saved and scrubbed (the CPU loads synthetic state), and
      then reloaded when re-entering the enclave.  E.g. after an instruction
      based VM-Exit from an enclave, vmcs.GUEST_RIP will not contain the RIP
      of the enclave instruction that triggered the VM-Exit, but will instead point
      to a RIP in the enclave's untrusted runtime (the guest userspace code
      that coordinates entry/exit to/from the enclave).
      
      To help a VMM recognize and handle exits from enclaves, SGX adds bits to
      existing VMCS fields, VM_EXIT_REASON.VMX_EXIT_REASON_FROM_ENCLAVE and
      GUEST_INTERRUPTIBILITY_INFO.GUEST_INTR_STATE_ENCLAVE_INTR.  Define the
      new architectural bits, and add a boolean to struct vcpu_vmx to cache
      VMX_EXIT_REASON_FROM_ENCLAVE.  Clear the bit in exit_reason so that
      checks against exit_reason do not need to account for SGX, e.g.
      "if (exit_reason == EXIT_REASON_EXCEPTION_NMI)" continues to work.
      
      KVM is largely a passive observer of the new bits, e.g. KVM needs to
      account for the bits when propagating information to a nested VMM, but
      otherwise doesn't need to act differently for the majority of VM-Exits
      from enclaves.
      
      The one scenario that is directly impacted is emulation, which is for
      all intents and purposes impossible[1] since KVM does not have access to
      the RIP or instruction stream that triggered the VM-Exit.  The inability
      to emulate is a non-issue for KVM, as most instructions that might
      trigger VM-Exit unconditionally #UD in an enclave (before the VM-Exit
      check.  For the few instruction that conditionally #UD, KVM either never
      sets the exiting control, e.g. PAUSE_EXITING[2], or sets it if and only
      if the feature is not exposed to the guest in order to inject a #UD,
      e.g. RDRAND_EXITING.
      
      But, because it is still possible for a guest to trigger emulation,
      e.g. MMIO, inject a #UD if KVM ever attempts emulation after a VM-Exit
      from an enclave.  This is architecturally accurate for instruction
      VM-Exits, and for MMIO it's the least bad choice, e.g. it's preferable
      to killing the VM.  In practice, only broken or particularly stupid
      guests should ever encounter this behavior.
      
      Add a WARN in skip_emulated_instruction to detect any attempt to
      modify the guest's RIP during an SGX enclave VM-Exit as all such flows
      should either be unreachable or must handle exits from enclaves before
      getting to skip_emulated_instruction.
      
      [1] Impossible for all practical purposes.  Not truly impossible
          since KVM could implement some form of para-virtualization scheme.
      
      [2] PAUSE_LOOP_EXITING only affects CPL0 and enclaves exist only at
          CPL3, so we also don't need to worry about that interaction.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <315f54a8507d09c292463ef29104e1d4c62e9090.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3c0c2ad1
    • KVM: x86: Define new #PF SGX error code bit · 00e7646c
      Authored by Sean Christopherson
      Page faults that are signaled by the SGX Enclave Page Cache Map (EPCM),
      as opposed to the traditional IA32/EPT page tables, set an SGX bit in
      the error code to indicate that the #PF was induced by SGX.  KVM will
      need to emulate this behavior as part of its trap-and-execute scheme for
      virtualizing SGX Launch Control, e.g. to inject SGX-induced #PFs if
      EINIT faults in the host, and to support live migration.
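      
      For reference, the new bit in KVM's PFERR_* convention (bit position
      per the SDM's #PF error-code layout):
      
        /* #PF was signaled by the SGX Enclave Page Cache Map rather than
         * the IA32/EPT page tables. */
        #define PFERR_SGX_BIT	15
        #define PFERR_SGX_MASK	BIT(PFERR_SGX_BIT)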
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <e170c5175cb9f35f53218a7512c9e3db972b97a2.1618196135.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      00e7646c
    • KVM: x86: Remove unused function declaration · d90b15ed
      Authored by Keqian Zhu
      kvm_mmu_slot_largepage_remove_write_access() is declared but not used,
      so just remove it.
      Signed-off-by: Keqian Zhu <zhukeqian1@huawei.com>
      Message-Id: <20210406063504.17552-1-zhukeqian1@huawei.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d90b15ed
    • KVM: X86: Count attempted/successful directed yield · 4a7132ef
      Authored by Wanpeng Li
      To analyze some performance issues with lock contention and scheduling,
      it is nice to know when directed yields succeed or fail.
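      
      A sketch of the counting, with stat names taken from the subject line
      (placement in the directed-yield path is illustrative):
      
        static void record_directed_yield(struct kvm_vcpu *vcpu,
        				  struct kvm_vcpu *target)
        {
        	vcpu->stat.directed_yield_attempted++;
        	if (kvm_vcpu_yield_to(target) > 0)
        		vcpu->stat.directed_yield_successful++;
        }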
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1617941911-5338-2-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4a7132ef
    • perf/x86/intel: Hybrid PMU support for perf capabilities · d0946a88
      Authored by Kan Liang
      Some platforms, e.g. Alder Lake, have hybrid architecture. Although most
      PMU capabilities are the same, there are still some unique PMU
      capabilities for different hybrid PMUs. Perf should register a dedicated
      pmu for each hybrid PMU.
      
      Add a new struct x86_hybrid_pmu, which saves the dedicated pmu and
      capabilities for each hybrid PMU.
      
      The architecture MSR, MSR_IA32_PERF_CAPABILITIES, only indicates the
      architecture features which are available on all hybrid PMUs. The
      architecture features are stored in the global x86_pmu.intel_cap.
      
      For Alder Lake, the model-specific features are perf metrics and
      PEBS-via-PT. The corresponding bits of the global x86_pmu.intel_cap
      should be 0 for these two features. Perf should not use the global
      intel_cap to check the features on a hybrid system.
      Add a dedicated intel_cap in the x86_hybrid_pmu to store the
      model-specific capabilities. Use the dedicated intel_cap to replace
      the global intel_cap for these two features. The dedicated intel_cap
      will be set in the following "Add Alder Lake Hybrid support" patch.
      
      Add is_hybrid() to distinguish a hybrid system. ADL may have an
      alternative configuration. With that configuration, X86_FEATURE_HYBRID_CPU
      is not set, so perf cannot rely on the feature bit.
      Add a new static_key_false, perf_is_hybrid, to indicate a hybrid system.
      It will be assigned in the following "Add Alder Lake Hybrid support"
      patch as well.
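      
      A sketch of the shape this describes (layout per the commit text; the
      exact members are assumptions):
      
        /* One dedicated pmu plus model-specific capabilities per hybrid PMU. */
        struct x86_hybrid_pmu {
        	struct pmu			pmu;
        	union perf_capabilities		intel_cap;
        };
      
        /* Hybrid cannot be keyed off X86_FEATURE_HYBRID_CPU alone, so a
         * static key is set explicitly during PMU init. */
        DEFINE_STATIC_KEY_FALSE(perf_is_hybrid);
        #define is_hybrid()	static_branch_unlikely(&perf_is_hybrid)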
      Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1618237865-33448-5-git-send-email-kan.liang@linux.intel.com
      d0946a88