1. 08 Aug, 2018 2 commits
  2. 07 Aug, 2018 1 commit
    • cpu/hotplug: Fix SMT supported evaluation · bc2d8d26
      Thomas Gleixner committed
      Josh reported that the late SMT evaluation in cpu_smt_state_init() sets
      cpu_smt_control to CPU_SMT_NOT_SUPPORTED in case that 'nosmt' was supplied
      on the kernel command line as it cannot differentiate between SMT disabled
      by BIOS and SMT soft disable via 'nosmt'. That wrecks the state and
      makes the sysfs interface unusable.
      
      Rework this so that during bringup of the non boot CPUs the availability of
      SMT is determined in cpu_smt_allowed(). If a newly booted CPU is not a
      'primary' thread then set the local cpu_smt_available marker and evaluate
      this explicitly right after the initial SMP bringup has finished.
      
      SMT evaluation on x86 is a trainwreck as the firmware has all the
      information _before_ booting the kernel, but there is no interface to query
      it.
      
      Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      Reported-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  3. 05 Aug, 2018 11 commits
    • KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry · 5b76a3cf
      Paolo Bonzini committed
      When nested virtualization is in use, VMENTER operations from the nested
      hypervisor into the nested guest will always be processed by the bare metal
      hypervisor, and KVM's "conditional cache flushes" mode in particular does a
      flush on nested vmentry.  Therefore, include the "skip L1D flush on
      vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.
      
      Add the relevant Documentation.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry · 8e0b2b91
      Paolo Bonzini committed
      Bit 3 of ARCH_CAPABILITIES tells a hypervisor that L1D flush on vmentry is
      not needed.  Add a new value to enum vmx_l1d_flush_state, which is used
      either if there is no L1TF bug at all, or if bit 3 is set in ARCH_CAPABILITIES.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
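The decision described in this commit can be sketched in a few lines of C. This is a simplified model, not the kernel's code: the enum below is a hypothetical two-value reduction of vmx_l1d_flush_state, and pick_flush_state() is an illustrative helper name. Only the meaning of bit 3 is taken from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 3 of ARCH_CAPABILITIES: L1D flush on vmentry is not needed. */
#define ARCH_CAP_SKIP_VMENTRY_L1DFLUSH (UINT64_C(1) << 3)

/* Hypothetical, reduced stand-in for enum vmx_l1d_flush_state. */
enum l1d_flush_state {
    L1D_FLUSH_COND,         /* flush on vmentry when required */
    L1D_FLUSH_NOT_REQUIRED, /* the new "no flush needed" value */
};

/*
 * Pick the flush state: no flush is needed either when the CPU has no
 * L1TF bug at all, or when the (possibly virtualized) ARCH_CAPABILITIES
 * value has bit 3 set.
 */
enum l1d_flush_state pick_flush_state(bool cpu_has_l1tf_bug, uint64_t arch_caps)
{
    if (!cpu_has_l1tf_bug)
        return L1D_FLUSH_NOT_REQUIRED;
    if (arch_caps & ARCH_CAP_SKIP_VMENTRY_L1DFLUSH)
        return L1D_FLUSH_NOT_REQUIRED;
    return L1D_FLUSH_COND;
}
```

A nested hypervisor that sees bit 3 set by its host would take the second branch and leave the flushing to the bare metal hypervisor.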
    • x86/speculation: Simplify sysfs report of VMX L1TF vulnerability · ea156d19
      Paolo Bonzini committed
      Three changes to the content of the sysfs file:
      
       - If EPT is disabled, L1TF cannot be exploited even across threads on the
         same core, and SMT is irrelevant.
      
       - If mitigation is completely disabled, and SMT is enabled, print "vulnerable"
         instead of "vulnerable, SMT vulnerable"
      
       - Reorder the two parts so that the main vulnerability state comes first
         and the detail on SMT is second.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr() · 18b57ce2
      Nicolai Stange committed
      For VMEXITs caused by external interrupts, vmx_handle_external_intr()
      indirectly calls into the interrupt handlers through the host's IDT.
      
      It follows that these interrupts get accounted for in the
      kvm_cpu_l1tf_flush_l1d per-cpu flag.
      
      The subsequently executed vmx_l1d_flush() will thus be aware that some
      interrupts have happened and conduct an L1d flush anyway.
      
      Setting l1tf_flush_l1d from vmx_handle_external_intr() isn't needed
      anymore. Drop it.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d · ffcba43f
      Nicolai Stange committed
      The last missing piece to having vmx_l1d_flush() take interrupts after
      VMEXIT into account is to set the kvm_cpu_l1tf_flush_l1d per-cpu flag on
      irq entry.
      
      Issue calls to kvm_set_cpu_l1tf_flush_l1d() from entering_irq(),
      ipi_entering_ack_irq(), smp_reschedule_interrupt() and
      uv_bau_message_interrupt().
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86: Don't include linux/irq.h from asm/hardirq.h · 447ae316
      Nicolai Stange committed
      The next patch in this series will have to make the definition of
      irq_cpustat_t available to entering_irq().
      
      Inclusion of asm/hardirq.h into asm/apic.h would cause circular header
      dependencies like
      
        asm/smp.h
          asm/apic.h
            asm/hardirq.h
              linux/irq.h
                linux/topology.h
                  linux/smp.h
                    asm/smp.h
      
      or
      
        linux/gfp.h
          linux/mmzone.h
            asm/mmzone.h
              asm/mmzone_64.h
                asm/smp.h
                  asm/apic.h
                    asm/hardirq.h
                      linux/irq.h
                        linux/irqdesc.h
                          linux/kobject.h
                            linux/sysfs.h
                              linux/kernfs.h
                                linux/idr.h
                                  linux/gfp.h
      
      and others.
      
      This causes compilation errors because of the header guards becoming
      effective in the second inclusion: symbols/macros that had been defined
      before wouldn't be available to intermediate headers in the #include chain
      anymore.
      
      A possible workaround would be to move the definition of irq_cpustat_t
      into its own header and include that from both, asm/hardirq.h and
      asm/apic.h.
      
      However, this wouldn't solve the real problem, namely asm/hardirq.h
      unnecessarily pulling in all the linux/irq.h cruft: nothing in
      asm/hardirq.h itself requires it. Also, note that there are some other
      archs, like e.g. arm64, which don't have that #include in their
      asm/hardirq.h.
      
      Remove the linux/irq.h #include from x86' asm/hardirq.h.
      
      Fix resulting compilation errors by adding appropriate #includes to *.c
      files as needed.
      
      Note that some of these *.c files could be cleaned up a bit with respect
      to their set of #includes, but that is better done in separate patches, if
      at all.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d · 45b575c0
      Nicolai Stange committed
      Part of the L1TF mitigation for vmx includes flushing the L1D cache upon
      VMENTRY.
      
      L1D flushes are costly and two modes of operation are provided to users:
      "always" and the more selective "conditional" mode.
      
      If operating in the latter, the cache would get flushed only if a host side
      code path considered unconfined had been traversed. "Unconfined" in this
      context means that it might have pulled in sensitive data like user data
      or kernel crypto keys.
      
      The need for L1D flushes is tracked by means of the per-vcpu flag
      l1tf_flush_l1d. KVM exit handlers considered unconfined set it. A
      vmx_l1d_flush() subsequently invoked before the next VMENTER will conduct an
      L1d flush based on its value and reset that flag again.
      
      Currently, interrupts delivered "normally" while in root operation between
      VMEXIT and VMENTER are not taken into account. Part of the reason is that
      these don't leave any traces and thus, the vmx code is unable to tell if
      any such has happened.
      
      As proposed by Paolo Bonzini, prepare for tracking all interrupts by
      introducing a new per-cpu flag, "kvm_cpu_l1tf_flush_l1d". It will be in
      strong analogy to the per-vcpu ->l1tf_flush_l1d.
      
      A later patch will make interrupt handlers set it.
      
      For the sake of cache locality, group kvm_cpu_l1tf_flush_l1d into x86'
      per-cpu irq_cpustat_t as suggested by Peter Zijlstra.
      
      Provide the helpers kvm_set_cpu_l1tf_flush_l1d(),
      kvm_clear_cpu_l1tf_flush_l1d() and kvm_get_cpu_l1tf_flush_l1d(). Make them
      trivial or non-existent, respectively, for !CONFIG_KVM_INTEL as appropriate.
      
      Let vmx_l1d_flush() handle kvm_cpu_l1tf_flush_l1d in the same way as
      l1tf_flush_l1d.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
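The interplay of the two flags in "conditional" mode can be modelled in plain C. The sketch below is an illustration under stated assumptions, not the kernel implementation: struct vcpu_model, struct cpu_model and l1d_flush_needed() are hypothetical names, and ordinary structs stand in for the real per-vcpu and per-cpu storage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the per-vcpu and per-host-cpu state. */
struct vcpu_model {
    bool l1tf_flush_l1d;          /* set by unconfined KVM exit handlers */
};

struct cpu_model {
    bool kvm_cpu_l1tf_flush_l1d;  /* meant to be set by interrupt handlers */
};

/*
 * Decide whether an L1D flush is needed before the next VMENTER.
 * In "always" mode the answer is unconditionally yes. In "conditional"
 * mode both flags are consulted in the same way and then reset.
 */
bool l1d_flush_needed(struct vcpu_model *vcpu, struct cpu_model *cpu,
                      bool cond_mode)
{
    bool flush;

    assert(vcpu != NULL && cpu != NULL);

    if (!cond_mode)
        return true;

    flush = vcpu->l1tf_flush_l1d || cpu->kvm_cpu_l1tf_flush_l1d;
    vcpu->l1tf_flush_l1d = false;
    cpu->kvm_cpu_l1tf_flush_l1d = false;
    return flush;
}
```

With this model, an interrupt that sets only the per-cpu flag between VMEXIT and VMENTER still triggers one flush, after which both flags are clear again.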
    • x86/irq: Demote irq_cpustat_t::__softirq_pending to u16 · 9aee5f8a
      Nicolai Stange committed
      An upcoming patch will extend KVM's L1TF mitigation in conditional mode
      to also cover interrupts after VMEXITs. For tracking those, stores to a
      new per-cpu flag from interrupt handlers will become necessary.
      
      In order to improve cache locality, this new flag will be added to x86's
      irq_cpustat_t.
      
      Make some space available there by shrinking the ->softirq_pending bitfield
      from 32 to 16 bits: the number of bits actually used is only NR_SOFTIRQS,
      i.e. 10.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
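The space argument can be checked with a small stand-alone model. The struct below is a hypothetical reduction, not the real irq_cpustat_t layout; the value 10 for the number of softirqs is the one quoted in the commit message, and raise_softirq_model() is an illustrative helper.

```c
#include <assert.h>
#include <stdint.h>

/* Number of softirq bits actually used, as quoted in the commit message. */
#define NR_SOFTIRQS_MODEL 10

/* Hypothetical reduced model of the start of the per-cpu structure. */
struct irq_cpustat_model {
    uint16_t softirq_pending;         /* shrunk from 32 to 16 bits */
    uint8_t  kvm_cpu_l1tf_flush_l1d;  /* new flag placed in the freed space */
};

/* All softirq bits must still fit into the narrower field. */
_Static_assert(NR_SOFTIRQS_MODEL <= 16, "softirq bits must fit in 16 bits");

/* Mark softirq number nr as pending in a 16-bit mask. */
uint16_t raise_softirq_model(uint16_t pending, unsigned int nr)
{
    assert(nr < NR_SOFTIRQS_MODEL);
    return (uint16_t)(pending | (1u << nr));
}
```

On common ABIs the two fields together occupy no more than the 32 bits the old field used on its own, which is the point of the change.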
    • x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush() · 5b6ccc6c
      Nicolai Stange committed
      Currently, vmx_vcpu_run() checks if l1tf_flush_l1d is set and invokes
      vmx_l1d_flush() if so.
      
      This test is unnecessary for the "always flush L1D" mode.
      
      Move the check to vmx_l1d_flush()'s conditional mode code path.
      
      Notes:
      - vmx_l1d_flush() is likely to get inlined anyway and thus, there's no
        extra function call.
        
      - This inverts the (static) branch prediction, but there hadn't been any
        explicit likely()/unlikely() annotations before and so it stays as is.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond' · 427362a1
      Nicolai Stange committed
      The vmx_l1d_flush_always static key is only ever evaluated if
      vmx_l1d_should_flush is enabled. In that case however, there are only two
      L1d flushing modes possible: "always" and "conditional".
      
      The "conditional" mode's implementation tends to require more sophisticated
      logic than the "always" mode.
      
      Avoid inverted logic by replacing the 'vmx_l1d_flush_always' static key
      with a 'vmx_l1d_flush_cond' one.
      
      There is no change in functionality.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush() · 379fd0c7
      Nicolai Stange committed
      vmx_l1d_flush() gets invoked only if l1tf_flush_l1d is true. There's no
      point in setting l1tf_flush_l1d to true from there again.
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  4. 27 Jul, 2018 2 commits
  5. 19 Jul, 2018 1 commit
    • x86/KVM/VMX: Initialize the vmx_l1d_flush_pages' content · 288d152c
      Nicolai Stange committed
      The slow path in vmx_l1d_flush() reads from vmx_l1d_flush_pages in order
      to evict the L1d cache.
      
      However, these pages are never cleared and, in theory, their data could be
      leaked.
      
      More importantly, KSM could merge a nested hypervisor's vmx_l1d_flush_pages
      to fewer than 1 << L1D_CACHE_ORDER host physical pages and this would break
      the L1d flushing algorithm: L1D on x86_64 is tagged by physical addresses.
      
      Fix this by initializing the individual vmx_l1d_flush_pages with a
      different pattern each.
      
      Rename the "empty_zp" asm constraint identifier in vmx_l1d_flush() to
      "flush_pages" to reflect this change.
      
      Fixes: a47dd5f0 ("x86/KVM/VMX: Add L1D flush algorithm")
      Signed-off-by: Nicolai Stange <nstange@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
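The fix amounts to filling each flush page with its own byte pattern, so that no two pages have identical content and page merging cannot collapse them. A user-space sketch of that idea follows. It assumes a 4 KiB page size and a cache order of 4, uses malloc() in place of the kernel's page allocator, and the names PAGE_SIZE_MODEL, L1D_CACHE_ORDER_MODEL and alloc_flush_pages() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE_MODEL       4096u  /* assumed page size */
#define L1D_CACHE_ORDER_MODEL 4      /* assumed value of L1D_CACHE_ORDER */

/*
 * Allocate 1 << L1D_CACHE_ORDER_MODEL pages and give every page a
 * different fill byte (1, 2, 3, ...). Returns NULL on allocation
 * failure; the caller releases the buffer with free().
 */
unsigned char *alloc_flush_pages(void)
{
    size_t npages = (size_t)1 << L1D_CACHE_ORDER_MODEL;
    unsigned char *buf = malloc(npages * PAGE_SIZE_MODEL);
    size_t i;

    if (buf == NULL)
        return NULL;

    for (i = 0; i < npages; i++)
        memset(buf + i * PAGE_SIZE_MODEL, (int)(i + 1), PAGE_SIZE_MODEL);

    return buf;
}
```

Because the pages are explicitly written, they also no longer expose whatever stale data the allocator handed out.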
  6. 18 Jul, 2018 2 commits
    • kvmclock: fix TSC calibration for nested guests · e10f7805
      Peng Hao committed
      Inside a nested guest, access to hardware can be slow enough that
      tsc_read_refs always returns ULLONG_MAX, causing tsc_refine_calibration_work
      to be called periodically and the nested guest to spend a lot of time
      reading the ACPI timer.
      
      However, if the TSC frequency is available from the pvclock page,
      we can just set X86_FEATURE_TSC_KNOWN_FREQ and avoid the recalibration
      ('refine') operation.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
      [Commit message rewritten. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Mark VMXArea with revision_id of physical CPU even when eVMCS enabled · 2307af1c
      Liran Alon committed
      When eVMCS is enabled, all VMCS allocated to be used by KVM are marked
      with revision_id of KVM_EVMCS_VERSION instead of revision_id reported
      by MSR_IA32_VMX_BASIC.
      
      However, even though not explicitly documented by TLFS, VMXArea passed
      as VMXON argument should still be marked with revision_id reported by
      physical CPU.
      
      This issue was found by the following setup:
      * L0 = KVM which exposes eVMCS to its L1 guest.
      * L1 = KVM which consumes eVMCS reported by L0.
      This setup caused the following to occur:
      1) L1 executes hardware_enable().
      2) hardware_enable() calls kvm_cpu_vmxon() to execute VMXON.
      3) L0 intercepts L1's VMXON and executes handle_vmon(), which notes
      vmxarea->revision_id != VMCS12_REVISION and therefore fails with
      nested_vmx_failInvalid(), which sets RFLAGS.CF.
      4) L1's kvm_cpu_vmxon() doesn't check RFLAGS.CF for failure and therefore
      hardware_enable() continues as usual.
      5) L1's hardware_enable() then calls ept_sync_global(), which executes
      INVEPT.
      6) L0 intercepts INVEPT and executes handle_invept(), which notes
      !vmx->nested.vmxon and thus raises a #UD to L1.
      7) The raised #UD causes L1 to panic.
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Cc: stable@vger.kernel.org
      Fixes: 773e8a04
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 17 Jul, 2018 1 commit
  8. 16 Jul, 2018 2 commits
    • x86/apm: Don't access __preempt_count with zeroed fs · 6f6060a5
      Ville Syrjälä committed
      APM_DO_POP_SEGS does not restore fs/gs which were zeroed by
      APM_DO_ZERO_SEGS. Trying to access __preempt_count with
      zeroed fs doesn't really work.
      
      Move the ibrs call outside the APM_DO_SAVE_SEGS/APM_DO_RESTORE_SEGS
      invocations so that fs is actually restored before calling
      preempt_enable().
      
      Fixes the following sort of oopses:
      [    0.313581] general protection fault: 0000 [#1] PREEMPT SMP
      [    0.313803] Modules linked in:
      [    0.314040] CPU: 0 PID: 268 Comm: kapmd Not tainted 4.16.0-rc1-triton-bisect-00090-gdd84441a #19
      [    0.316161] EIP: __apm_bios_call_simple+0xc8/0x170
      [    0.316161] EFLAGS: 00210016 CPU: 0
      [    0.316161] EAX: 00000102 EBX: 00000000 ECX: 00000102 EDX: 00000000
      [    0.316161] ESI: 0000530e EDI: dea95f64 EBP: dea95f18 ESP: dea95ef0
      [    0.316161]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
      [    0.316161] CR0: 80050033 CR2: 00000000 CR3: 015d3000 CR4: 000006d0
      [    0.316161] Call Trace:
      [    0.316161]  ? cpumask_weight.constprop.15+0x20/0x20
      [    0.316161]  on_cpu0+0x44/0x70
      [    0.316161]  apm+0x54e/0x720
      [    0.316161]  ? __switch_to_asm+0x26/0x40
      [    0.316161]  ? __schedule+0x17d/0x590
      [    0.316161]  kthread+0xc0/0xf0
      [    0.316161]  ? proc_apm_show+0x150/0x150
      [    0.316161]  ? kthread_create_worker_on_cpu+0x20/0x20
      [    0.316161]  ret_from_fork+0x2e/0x38
      [    0.316161] Code: da 8e c2 8e e2 8e ea 57 55 2e ff 1d e0 bb 5d b1 0f 92 c3 5d 5f 07 1f 89 47 0c 90 8d b4 26 00 00 00 00 90 8d b4 26 00 00 00 00 90 <64> ff 0d 84 16 5c b1 74 7f 8b 45 dc 8e e0 8b 45 d8 8e e8 8b 45
      [    0.316161] EIP: __apm_bios_call_simple+0xc8/0x170 SS:ESP: 0068:dea95ef0
      [    0.316161] ---[ end trace 656253db2deaa12c ]---
      
      Fixes: dd84441a ("x86/speculation: Use IBRS if available before calling into firmware")
      Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: x86@kernel.org
      Link: https://lkml.kernel.org/r/20180709133534.5963-1-ville.syrjala@linux.intel.com
    • x86/asm/memcpy_mcsafe: Fix copy_to_user_mcsafe() exception handling · 092b31aa
      Dan Williams committed
      All copy_to_user() implementations need to be prepared to handle faults
      accessing userspace. The __memcpy_mcsafe() implementation handles both
      mmu-faults on the user destination and machine-check-exceptions on the
      source buffer. However, the memcpy_mcsafe() wrapper may silently
      fall back to memcpy() depending on build options and cpu-capabilities.
      
      Force copy_to_user_mcsafe() to always use __memcpy_mcsafe() when
      available, and otherwise disable all of the copy_to_user_mcsafe()
      infrastructure when __memcpy_mcsafe() is not available, i.e.
      CONFIG_X86_MCE=n.
      
      This fixes crashes of the form:
          run fstests generic/323 at 2018-07-02 12:46:23
          BUG: unable to handle kernel paging request at 00007f0d50001000
          RIP: 0010:__memcpy+0x12/0x20
          [..]
          Call Trace:
           copyout_mcsafe+0x3a/0x50
           _copy_to_iter_mcsafe+0xa1/0x4a0
           ? dax_alive+0x30/0x50
           dax_iomap_actor+0x1f9/0x280
           ? dax_iomap_rw+0x100/0x100
           iomap_apply+0xba/0x130
           ? dax_iomap_rw+0x100/0x100
           dax_iomap_rw+0x95/0x100
           ? dax_iomap_rw+0x100/0x100
           xfs_file_dax_read+0x7b/0x1d0 [xfs]
           xfs_file_read_iter+0xa7/0xc0 [xfs]
           aio_read+0x11c/0x1a0
      Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Fixes: 8780356e ("x86/asm/memcpy_mcsafe: Define copy_to_iter_mcsafe()")
      Link: http://lkml.kernel.org/r/153108277790.37979.1486841789275803399.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 15 Jul, 2018 7 commits
  10. 13 Jul, 2018 10 commits
  11. 12 Jul, 2018 1 commit