提交 · 0768f91530ff46683e0b372df14fd79fe8d156e5 · openanolis / cloud-kernel

08 8月, 2018 2 次提交

x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert · 0768f915

由 Andi Kleen 提交于 8月 07, 2018

Some cases in THP like:
  - MADV_FREE
  - mprotect
  - split

mark the PMD non present for temporarily to prevent races. The window for
an L1TF attack in these contexts is very small, but it wants to be fixed
for correctness sake.

Use the proper low level functions for pmd/pud_mknotpresent() to address
this.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

0768f915

x86/speculation/l1tf: Invert all not present mappings · f22cc87f

由 Andi Kleen 提交于 8月 07, 2018

For kernel mappings PAGE_PROTNONE is not necessarily set for a non present
mapping, but the inversion logic explicitely checks for !PRESENT and
PROT_NONE.

Remove the PROT_NONE check and make the inversion unconditional for all not
present mappings.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

f22cc87f

07 8月, 2018 1 次提交

cpu/hotplug: Fix SMT supported evaluation · bc2d8d26

由 Thomas Gleixner 提交于 8月 07, 2018

Josh reported that the late SMT evaluation in cpu_smt_state_init() sets
cpu_smt_control to CPU_SMT_NOT_SUPPORTED in case that 'nosmt' was supplied
on the kernel command line as it cannot differentiate between SMT disabled
by BIOS and SMT soft disable via 'nosmt'. That wreckages the state and
makes the sysfs interface unusable.

Rework this so that during bringup of the non boot CPUs the availability of
SMT is determined in cpu_smt_allowed(). If a newly booted CPU is not a
'primary' thread then set the local cpu_smt_available marker and evaluate
this explicitely right after the initial SMP bringup has finished.

SMT evaulation on x86 is a trainwreck as the firmware has all the
information _before_ booting the kernel, but there is no interface to query
it.

Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
Reported-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

bc2d8d26

05 8月, 2018 11 次提交

KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry · 5b76a3cf

由 Paolo Bonzini 提交于 8月 05, 2018

When nested virtualization is in use, VMENTER operations from the nested
hypervisor into the nested guest will always be processed by the bare metal
hypervisor, and KVM's "conditional cache flushes" mode in particular does a
flush on nested vmentry.  Therefore, include the "skip L1D flush on
vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.

Add the relevant Documentation.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

5b76a3cf

x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry · 8e0b2b91

由 Paolo Bonzini 提交于 8月 05, 2018

Bit 3 of ARCH_CAPABILITIES tells a hypervisor that L1D flush on vmentry is
not needed. Add a new value to enum vmx_l1d_flush_state, which is used
either if there is no L1TF bug at all, or if bit 3 is set in ARCH_CAPABILITIES.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

8e0b2b91

x86/speculation: Simplify sysfs report of VMX L1TF vulnerability · ea156d19

由 Paolo Bonzini 提交于 8月 05, 2018

Three changes to the content of the sysfs file:

 - If EPT is disabled, L1TF cannot be exploited even across threads on the
   same core, and SMT is irrelevant.

 - If mitigation is completely disabled, and SMT is enabled, print "vulnerable"
   instead of "vulnerable, SMT vulnerable"

 - Reorder the two parts so that the main vulnerability state comes first
   and the detail on SMT is second.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

ea156d19

x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr() · 18b57ce2

由 Nicolai Stange 提交于 7月 22, 2018

For VMEXITs caused by external interrupts, vmx_handle_external_intr()
indirectly calls into the interrupt handlers through the host's IDT.

It follows that these interrupts get accounted for in the
kvm_cpu_l1tf_flush_l1d per-cpu flag.

The subsequently executed vmx_l1d_flush() will thus be aware that some
interrupts have happened and conduct a L1d flush anyway.

Setting l1tf_flush_l1d from vmx_handle_external_intr() isn't needed
anymore. Drop it.
Signed-off-by: NNicolai Stange <nstange@suse.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

18b57ce2

x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d · ffcba43f

由 Nicolai Stange 提交于 7月 29, 2018

The last missing piece to having vmx_l1d_flush() take interrupts after
VMEXIT into account is to set the kvm_cpu_l1tf_flush_l1d per-cpu flag on
irq entry.

Issue calls to kvm_set_cpu_l1tf_flush_l1d() from entering_irq(),
ipi_entering_ack_irq(), smp_reschedule_interrupt() and
uv_bau_message_interrupt().
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NNicolai Stange <nstange@suse.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

ffcba43f

x86: Don't include linux/irq.h from asm/hardirq.h · 447ae316

由 Nicolai Stange 提交于 7月 29, 2018

The next patch in this series will have to make the definition of
irq_cpustat_t available to entering_irq().

Inclusion of asm/hardirq.h into asm/apic.h would cause circular header
dependencies like

  asm/smp.h
    asm/apic.h
      asm/hardirq.h
        linux/irq.h
          linux/topology.h
            linux/smp.h
              asm/smp.h

or

  linux/gfp.h
    linux/mmzone.h
      asm/mmzone.h
        asm/mmzone_64.h
          asm/smp.h
            asm/apic.h
              asm/hardirq.h
                linux/irq.h
                  linux/irqdesc.h
                    linux/kobject.h
                      linux/sysfs.h
                        linux/kernfs.h
                          linux/idr.h
                            linux/gfp.h

and others.

This causes compilation errors because of the header guards becoming
effective in the second inclusion: symbols/macros that had been defined
before wouldn't be available to intermediate headers in the #include chain
anymore.

A possible workaround would be to move the definition of irq_cpustat_t
into its own header and include that from both, asm/hardirq.h and
asm/apic.h.

However, this wouldn't solve the real problem, namely asm/harirq.h
unnecessarily pulling in all the linux/irq.h cruft: nothing in
asm/hardirq.h itself requires it. Also, note that there are some other
archs, like e.g. arm64, which don't have that #include in their
asm/hardirq.h.

Remove the linux/irq.h #include from x86' asm/hardirq.h.

Fix resulting compilation errors by adding appropriate #includes to *.c
files as needed.

Note that some of these *.c files could be cleaned up a bit wrt. to their
set of #includes, but that should better be done from separate patches, if
at all.
Signed-off-by: NNicolai Stange <nstange@suse.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

447ae316

x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d · 45b575c0

由 Nicolai Stange 提交于 7月 27, 2018

Part of the L1TF mitigation for vmx includes flushing the L1D cache upon
VMENTRY.

L1D flushes are costly and two modes of operations are provided to users:
"always" and the more selective "conditional" mode.

If operating in the latter, the cache would get flushed only if a host side
code path considered unconfined had been traversed. "Unconfined" in this
context means that it might have pulled in sensitive data like user data
or kernel crypto keys.

The need for L1D flushes is tracked by means of the per-vcpu flag
l1tf_flush_l1d. KVM exit handlers considered unconfined set it. A
vmx_l1d_flush() subsequently invoked before the next VMENTER will conduct a
L1d flush based on its value and reset that flag again.

Currently, interrupts delivered "normally" while in root operation between
VMEXIT and VMENTER are not taken into account. Part of the reason is that
these don't leave any traces and thus, the vmx code is unable to tell if
any such has happened.

As proposed by Paolo Bonzini, prepare for tracking all interrupts by
introducing a new per-cpu flag, "kvm_cpu_l1tf_flush_l1d". It will be in
strong analogy to the per-vcpu ->l1tf_flush_l1d.

A later patch will make interrupt handlers set it.

For the sake of cache locality, group kvm_cpu_l1tf_flush_l1d into x86'
per-cpu irq_cpustat_t as suggested by Peter Zijlstra.

Provide the helpers kvm_set_cpu_l1tf_flush_l1d(),
kvm_clear_cpu_l1tf_flush_l1d() and kvm_get_cpu_l1tf_flush_l1d(). Make them
trivial resp. non-existent for !CONFIG_KVM_INTEL as appropriate.

Let vmx_l1d_flush() handle kvm_cpu_l1tf_flush_l1d in the same way as
l1tf_flush_l1d.
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Suggested-by: NPeter Zijlstra <peterz@infradead.org>
Signed-off-by: NNicolai Stange <nstange@suse.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>

45b575c0

x86/irq: Demote irq_cpustat_t::__softirq_pending to u16 · 9aee5f8a

由 Nicolai Stange 提交于 7月 27, 2018

An upcoming patch will extend KVM's L1TF mitigation in conditional mode
to also cover interrupts after VMEXITs. For tracking those, stores to a
new per-cpu flag from interrupt handlers will become necessary.

In order to improve cache locality, this new flag will be added to x86's
irq_cpustat_t.

Make some space available there by shrinking the ->softirq_pending bitfield
from 32 to 16 bits: the number of bits actually used is only NR_SOFTIRQS,
i.e. 10.
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NNicolai Stange <nstange@suse.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>

9aee5f8a

x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush() · 5b6ccc6c

由 Nicolai Stange 提交于 7月 21, 2018

Currently, vmx_vcpu_run() checks if l1tf_flush_l1d is set and invokes
vmx_l1d_flush() if so.

This test is unncessary for the "always flush L1D" mode.

Move the check to vmx_l1d_flush()'s conditional mode code path.

Notes:
- vmx_l1d_flush() is likely to get inlined anyway and thus, there's no
  extra function call.
  
- This inverts the (static) branch prediction, but there hadn't been any
  explicit likely()/unlikely() annotations before and so it stays as is.
Signed-off-by: NNicolai Stange <nstange@suse.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

5b6ccc6c

x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond' · 427362a1

由 Nicolai Stange 提交于 7月 21, 2018

The vmx_l1d_flush_always static key is only ever evaluated if
vmx_l1d_should_flush is enabled. In that case however, there are only two
L1d flushing modes possible: "always" and "conditional".

The "conditional" mode's implementation tends to require more sophisticated
logic than the "always" mode.

Avoid inverted logic by replacing the 'vmx_l1d_flush_always' static key
with a 'vmx_l1d_flush_cond' one.

There is no change in functionality.
Signed-off-by: NNicolai Stange <nstange@suse.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

427362a1

x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush() · 379fd0c7

由 Nicolai Stange 提交于 7月 21, 2018

vmx_l1d_flush() gets invoked only if l1tf_flush_l1d is true. There's no
point in setting l1tf_flush_l1d to true from there again.
Signed-off-by: NNicolai Stange <nstange@suse.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

379fd0c7

27 7月, 2018 2 次提交

kvm, mm: account shadow page tables to kmemcg · d97e5e61

由 Shakeel Butt 提交于 7月 26, 2018

The size of kvm's shadow page tables corresponds to the size of the
guest virtual machines on the system.  Large VMs can spend a significant
amount of memory as shadow page tables which can not be left as system
memory overhead.  So, account shadow page tables to the kmemcg.

[shakeelb@google.com: replace (GFP_KERNEL|__GFP_ACCOUNT) with GFP_KERNEL_ACCOUNT]
  Link: http://lkml.kernel.org/r/20180629140224.205849-1-shakeelb@google.com
Link: http://lkml.kernel.org/r/20180627181349.149778-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Feiner <pfeiner@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d97e5e61

mm: use vma_init() to initialize VMAs on stack and data segments · 2c4541e2

由 Kirill A. Shutemov 提交于 7月 26, 2018

Make sure to initialize all VMAs properly, not only those which come
from vm_area_cachep.

Link: http://lkml.kernel.org/r/20180724121139.62570-3-kirill.shutemov@linux.intel.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2c4541e2

19 7月, 2018 1 次提交

x86/KVM/VMX: Initialize the vmx_l1d_flush_pages' content · 288d152c

由 Nicolai Stange 提交于 7月 18, 2018

The slow path in vmx_l1d_flush() reads from vmx_l1d_flush_pages in order
to evict the L1d cache.

However, these pages are never cleared and, in theory, their data could be
leaked.

More importantly, KSM could merge a nested hypervisor's vmx_l1d_flush_pages
to fewer than 1 << L1D_CACHE_ORDER host physical pages and this would break
the L1d flushing algorithm: L1D on x86_64 is tagged by physical addresses.

Fix this by initializing the individual vmx_l1d_flush_pages with a
different pattern each.

Rename the "empty_zp" asm constraint identifier in vmx_l1d_flush() to
"flush_pages" to reflect this change.

Fixes: a47dd5f0 ("x86/KVM/VMX: Add L1D flush algorithm")
Signed-off-by: NNicolai Stange <nstange@suse.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

288d152c

18 7月, 2018 2 次提交

kvmclock: fix TSC calibration for nested guests · e10f7805

由 Peng Hao 提交于 7月 14, 2018

Inside a nested guest, access to hardware can be slow enough that
tsc_read_refs always return ULLONG_MAX, causing tsc_refine_calibration_work
to be called periodically and the nested guest to spend a lot of time
reading the ACPI timer.

However, if the TSC frequency is available from the pvclock page,
we can just set X86_FEATURE_TSC_KNOWN_FREQ and avoid the recalibration.
'refine' operation.
Suggested-by: NPeter Zijlstra <peterz@infradead.org>
Signed-off-by: NPeng Hao <peng.hao2@zte.com.cn>
[Commit message rewritten. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e10f7805

KVM: VMX: Mark VMXArea with revision_id of physical CPU even when eVMCS enabled · 2307af1c

由 Liran Alon 提交于 6月 29, 2018

When eVMCS is enabled, all VMCS allocated to be used by KVM are marked
with revision_id of KVM_EVMCS_VERSION instead of revision_id reported
by MSR_IA32_VMX_BASIC.

However, even though not explictly documented by TLFS, VMXArea passed
as VMXON argument should still be marked with revision_id reported by
physical CPU.

This issue was found by the following setup:
* L0 = KVM which expose eVMCS to it's L1 guest.
* L1 = KVM which consume eVMCS reported by L0.
This setup caused the following to occur:
1) L1 execute hardware_enable().
2) hardware_enable() calls kvm_cpu_vmxon() to execute VMXON.
3) L0 intercept L1 VMXON and execute handle_vmon() which notes
vmxarea->revision_id != VMCS12_REVISION and therefore fails with
nested_vmx_failInvalid() which sets RFLAGS.CF.
4) L1 kvm_cpu_vmxon() don't check RFLAGS.CF for failure and therefore
hardware_enable() continues as usual.
5) L1 hardware_enable() then calls ept_sync_global() which executes
INVEPT.
6) L0 intercept INVEPT and execute handle_invept() which notes
!vmx->nested.vmxon and thus raise a #UD to L1.
7) Raised #UD caused L1 to panic.
Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: stable@vger.kernel.org
Fixes: 773e8a04Signed-off-by: NLiran Alon <liran.alon@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2307af1c

17 7月, 2018 1 次提交

x86/MCE: Remove min interval polling limitation · fbdb328c

由 Dewet Thibaut 提交于 7月 16, 2018

commit b3b7c479 ("x86/MCE: Serialize sysfs changes") introduced a min
interval limitation when setting the check interval for polled MCEs.
However, the logic is that 0 disables polling for corrected MCEs, see
Documentation/x86/x86_64/machinecheck. The limitation prevents disabling.

Remove this limitation and allow the value 0 to disable polling again.

Fixes: b3b7c479 ("x86/MCE: Serialize sysfs changes")
Signed-off-by: NDewet Thibaut <thibaut.dewet@nokia.com>
Signed-off-by: NAlexander Sverdlin <alexander.sverdlin@nokia.com>
[ Massage commit message. ]
Signed-off-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20180716084927.24869-1-alexander.sverdlin@nokia.com

fbdb328c

16 7月, 2018 2 次提交

x86/apm: Don't access __preempt_count with zeroed fs · 6f6060a5

由 Ville Syrjälä 提交于 7月 09, 2018

APM_DO_POP_SEGS does not restore fs/gs which were zeroed by
APM_DO_ZERO_SEGS. Trying to access __preempt_count with
zeroed fs doesn't really work.

Move the ibrs call outside the APM_DO_SAVE_SEGS/APM_DO_RESTORE_SEGS
invocations so that fs is actually restored before calling
preempt_enable().

Fixes the following sort of oopses:
[    0.313581] general protection fault: 0000 [#1] PREEMPT SMP
[    0.313803] Modules linked in:
[    0.314040] CPU: 0 PID: 268 Comm: kapmd Not tainted 4.16.0-rc1-triton-bisect-00090-gdd84441a #19
[    0.316161] EIP: __apm_bios_call_simple+0xc8/0x170
[    0.316161] EFLAGS: 00210016 CPU: 0
[    0.316161] EAX: 00000102 EBX: 00000000 ECX: 00000102 EDX: 00000000
[    0.316161] ESI: 0000530e EDI: dea95f64 EBP: dea95f18 ESP: dea95ef0
[    0.316161]  DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068
[    0.316161] CR0: 80050033 CR2: 00000000 CR3: 015d3000 CR4: 000006d0
[    0.316161] Call Trace:
[    0.316161]  ? cpumask_weight.constprop.15+0x20/0x20
[    0.316161]  on_cpu0+0x44/0x70
[    0.316161]  apm+0x54e/0x720
[    0.316161]  ? __switch_to_asm+0x26/0x40
[    0.316161]  ? __schedule+0x17d/0x590
[    0.316161]  kthread+0xc0/0xf0
[    0.316161]  ? proc_apm_show+0x150/0x150
[    0.316161]  ? kthread_create_worker_on_cpu+0x20/0x20
[    0.316161]  ret_from_fork+0x2e/0x38
[    0.316161] Code: da 8e c2 8e e2 8e ea 57 55 2e ff 1d e0 bb 5d b1 0f 92 c3 5d 5f 07 1f 89 47 0c 90 8d b4 26 00 00 00 00 90 8d b4 26 00 00 00 00 90 <64> ff 0d 84 16 5c b1 74 7f 8b 45 dc 8e e0 8b 45 d8 8e e8 8b 45
[    0.316161] EIP: __apm_bios_call_simple+0xc8/0x170 SS:ESP: 0068:dea95ef0
[    0.316161] ---[ end trace 656253db2deaa12c ]---

Fixes: dd84441a ("x86/speculation: Use IBRS if available before calling into firmware")
Signed-off-by: NVille Syrjälä <ville.syrjala@linux.intel.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc:  David Woodhouse <dwmw@amazon.co.uk>
Cc:  "H. Peter Anvin" <hpa@zytor.com>
Cc:  x86@kernel.org
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180709133534.5963-1-ville.syrjala@linux.intel.com

6f6060a5

x86/asm/memcpy_mcsafe: Fix copy_to_user_mcsafe() exception handling · 092b31aa

由 Dan Williams 提交于 7月 08, 2018

All copy_to_user() implementations need to be prepared to handle faults
accessing userspace. The __memcpy_mcsafe() implementation handles both
mmu-faults on the user destination and machine-check-exceptions on the
source buffer. However, the memcpy_mcsafe() wrapper may silently
fallback to memcpy() depending on build options and cpu-capabilities.

Force copy_to_user_mcsafe() to always use __memcpy_mcsafe() when
available, and otherwise disable all of the copy_to_user_mcsafe()
infrastructure when __memcpy_mcsafe() is not available, i.e.
CONFIG_X86_MCE=n.

This fixes crashes of the form:
    run fstests generic/323 at 2018-07-02 12:46:23
    BUG: unable to handle kernel paging request at 00007f0d50001000
    RIP: 0010:__memcpy+0x12/0x20
    [..]
    Call Trace:
     copyout_mcsafe+0x3a/0x50
     _copy_to_iter_mcsafe+0xa1/0x4a0
     ? dax_alive+0x30/0x50
     dax_iomap_actor+0x1f9/0x280
     ? dax_iomap_rw+0x100/0x100
     iomap_apply+0xba/0x130
     ? dax_iomap_rw+0x100/0x100
     dax_iomap_rw+0x95/0x100
     ? dax_iomap_rw+0x100/0x100
     xfs_file_dax_read+0x7b/0x1d0 [xfs]
     xfs_file_read_iter+0xa7/0xc0 [xfs]
     aio_read+0x11c/0x1a0
Reported-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Fixes: 8780356e ("x86/asm/memcpy_mcsafe: Define copy_to_iter_mcsafe()")
Link: http://lkml.kernel.org/r/153108277790.37979.1486841789275803399.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

092b31aa

15 7月, 2018 7 次提交

x86/kvmclock: set pvti_cpu0_va after enabling kvmclock · 94ffba48

由 Radim Krčmář 提交于 7月 15, 2018

pvti_cpu0_va is the address of shared kvmclock data structure.

pvti_cpu0_va is currently kept unset (1) on 32 bit systems, (2) when
kvmclock vsyscall is disabled, and (3) if kvmclock is not stable.
This poses a problem, because kvm_ptp needs pvti_cpu0_va, but (1) can
work on 32 bit, (2) has little relation to the vsyscall, and (3) does
not need stable kvmclock (although kvmclock won't be used for system
clock if it's not stable, so kvm_ptp is pointless in that case).

Expose pvti_cpu0_va whenever kvmclock is enabled to allow all users to
work with it.

This fixes a regression found on Gentoo: https://bugs.gentoo.org/658544.

Fixes: 9f08890a ("x86/pvclock: add setter for pvclock_pvti_cpu0_va")
Cc: stable@vger.kernel.org
Reported-by: NAndreas Steinmetz <ast@domdv.de>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

94ffba48

x86/kvm/Kconfig: Ensure CRYPTO_DEV_CCP_DD state at minimum matches KVM_AMD · d30f370d

由 Janakarajan Natarajan 提交于 6月 27, 2018

Prevent a config where KVM_AMD=y and CRYPTO_DEV_CCP_DD=m thereby ensuring
that AMD Secure Processor device driver will be built-in when KVM_AMD is
also built-in.

v1->v2:
* Removed usage of 'imply' Kconfig option.
* Change patch commit message.

Fixes: 505c9e94 ("KVM: x86: prefer "depends on" to "select" for SEV")

Cc: <stable@vger.kernel.org> # 4.16.x
Signed-off-by: NJanakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Reviewed-by: NBrijesh Singh <brijesh.singh@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d30f370d

kvm: nVMX: Restore exit qual for VM-entry failure due to MSR loading · 0b88abdc

由 Jim Mattson 提交于 5月 30, 2018

This exit qualification was inadvertently dropped when the two
VM-entry failure blocks were coalesced.

Fixes: e79f245d ("X86/KVM: Properly update 'tsc_offset' to represent the running guest")
Signed-off-by: NJim Mattson <jmattson@google.com>
Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0b88abdc

x86/kvm/vmx: don't read current->thread.{fs,gs}base of legacy tasks · b062b794

由 Vitaly Kuznetsov 提交于 7月 11, 2018

When we switched from doing rdmsr() to reading FS/GS base values from
current->thread we completely forgot about legacy 32-bit userspaces which
we still support in KVM (why?). task->thread.{fsbase,gsbase} are only
synced for 64-bit processes, calling save_fsgs_for_kvm() and using
its result from current is illegal for legacy processes.

There's no ARCH_SET_FS/GS prctls for legacy applications. Base MSRs are,
however, not always equal to zero. Intel's manual says (3.4.4 Segment
Loading Instructions in IA-32e Mode):

"In order to set up compatibility mode for an application, segment-load
instructions (MOV to Sreg, POP Sreg) work normally in 64-bit mode. An
entry is read from the system descriptor table (GDT or LDT) and is loaded
in the hidden portion of the segment register.
...
The hidden descriptor register fields for FS.base and GS.base are
physically mapped to MSRs in order to load all address bits supported by
a 64-bit implementation.
"

The issue was found by strace test suite where 32-bit ioctl_kvm_run test
started segfaulting.
Reported-by: NDmitry V. Levin <ldv@altlinux.org>
Bisected-by: NMasatake YAMATO <yamato@redhat.com>
Fixes: 42b933b5 ("x86/kvm/vmx: read MSR_{FS,KERNEL_GS}_BASE from current->thread")
Cc: stable@vger.kernel.org
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b062b794

KVM: VMX: support MSR_IA32_ARCH_CAPABILITIES as a feature MSR · cd283252

由 Paolo Bonzini 提交于 6月 25, 2018

This lets userspace read the MSR_IA32_ARCH_CAPABILITIES and check that all
requested features are available on the host.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

cd283252

x86/events/intel/ds: Fix bts_interrupt_threshold alignment · 2c991e40

由 Hugh Dickins 提交于 7月 14, 2018

Markus reported that BTS is sporadically missing the tail of the trace
in the perf_event data buffer: [decode error (1): instruction overflow]
shown in GDB; and bisected it to the conversion of debug_store to PTI.

A little "optimization" crept into alloc_bts_buffer(), which mistakenly
placed bts_interrupt_threshold away from the 24-byte record boundary.
Intel SDM Vol 3B 17.4.9 says "This address must point to an offset from
the BTS buffer base that is a multiple of the BTS record size."

Revert "max" from a byte count to a record count, to calculate the
bts_interrupt_threshold correctly: which turns out to fix problem seen.

Fixes: c1961a46 ("x86/events/intel/ds: Map debug buffers in cpu_entry_area")
Reported-and-tested-by: NMarkus T Metzger <markus.t.metzger@intel.com>
Signed-off-by: NHugh Dickins <hughd@google.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@intel.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: stable@vger.kernel.org # v4.14+
Link: https://lkml.kernel.org/r/alpine.LSU.2.11.1807141248290.1614@eggly.anvils

2c991e40

x86/purgatory: add missing FORCE to Makefile target · fa8cbda8

由 Philipp Rudo 提交于 7月 13, 2018

- Build the kernel without the fix
- Add some flag to the purgatories KBUILD_CFLAGS,I used
  -fno-asynchronous-unwind-tables
- Re-build the kernel

When you look at makes output you see that sha256.o is not re-build in the
last step.  Also readelf -S still shows the .eh_frame section for
sha256.o.

With the fix sha256.o is rebuilt in the last step.

Without FORCE make does not detect changes only made to the command line
options.  So object files might not be re-built even when they should be.
Fix this by adding FORCE where it is missing.

Link: http://lkml.kernel.org/r/20180704110044.29279-2-prudo@linux.ibm.com
Fixes: df6f2801 ("kernel/kexec_file.c: move purgatories sha256 to common code")
Signed-off-by: NPhilipp Rudo <prudo@linux.ibm.com>
Acked-by: NDave Young <dyoung@redhat.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>	[4.17+]
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

fa8cbda8

13 7月, 2018 10 次提交

x86/bugs, kvm: Introduce boot-time control of L1TF mitigations · d90a7a0e

由 Jiri Kosina 提交于 7月 13, 2018

Introduce the 'l1tf=' kernel command line option to allow for boot-time
switching of mitigation that is used on processors affected by L1TF.

The possible values are:

  full
	Provides all available mitigations for the L1TF vulnerability. Disables
	SMT and enables all mitigations in the hypervisors. SMT control via
	/sys/devices/system/cpu/smt/control is still possible after boot.
	Hypervisors will issue a warning when the first VM is started in
	a potentially insecure configuration, i.e. SMT enabled or L1D flush
	disabled.

  full,force
	Same as 'full', but disables SMT control. Implies the 'nosmt=force'
	command line option. sysfs control of SMT and the hypervisor flush
	control is disabled.

  flush
	Leaves SMT enabled and enables the conditional hypervisor mitigation.
	Hypervisors will issue a warning when the first VM is started in a
	potentially insecure configuration, i.e. SMT enabled or L1D flush
	disabled.

  flush,nosmt
	Disables SMT and enables the conditional hypervisor mitigation. SMT
	control via /sys/devices/system/cpu/smt/control is still possible
	after boot. If SMT is reenabled or flushing disabled at runtime
	hypervisors will issue a warning.

  flush,nowarn
	Same as 'flush', but hypervisors will not warn when
	a VM is started in a potentially insecure configuration.

  off
	Disables hypervisor mitigations and doesn't emit any warnings.

Default is 'flush'.

Let KVM adhere to these semantics, which means:

  - 'lt1f=full,force'	: Performe L1D flushes. No runtime control
    			  possible.

  - 'l1tf=full'
  - 'l1tf-flush'
  - 'l1tf=flush,nosmt'	: Perform L1D flushes and warn on VM start if
			  SMT has been runtime enabled or L1D flushing
			  has been run-time enabled
			  
  - 'l1tf=flush,nowarn'	: Perform L1D flushes and no warnings are emitted.
  
  - 'l1tf=off'		: L1D flushes are not performed and no warnings
			  are emitted.

KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush'
module parameter except when lt1f=full,force is set.

This makes KVM's private 'nosmt' option redundant, and as it is a bit
non-systematic anyway (this is something to control globally, not on
hypervisor level), remove that option.

Add the missing Documentation entry for the l1tf vulnerability sysfs file
while at it.
Signed-off-by: NJiri Kosina <jkosina@suse.cz>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NJiri Kosina <jkosina@suse.cz>
Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de

d90a7a0e

cpu/hotplug: Set CPU_SMT_NOT_SUPPORTED early · fee0aede

由 Thomas Gleixner 提交于 7月 13, 2018

The CPU_SMT_NOT_SUPPORTED state is set (if the processor does not support
SMT) when the sysfs SMT control file is initialized.

That was fine so far as this was only required to make the output of the
control file correct and to prevent writes in that case.

With the upcoming l1tf command line parameter, this needs to be set up
before the L1TF mitigation selection and command line parsing happens.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NJiri Kosina <jkosina@suse.cz>
Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.121795971@linutronix.de

fee0aede

x86/kvm: Allow runtime control of L1D flush · 895ae47f

由 Thomas Gleixner 提交于 7月 13, 2018

All mitigation modes can be switched at run time with a static key now:

 - Use sysfs_streq() instead of strcmp() to handle the trailing new line
   from sysfs writes correctly.
 - Make the static key management handle multiple invocations properly.
 - Set the module parameter file to RW
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NJiri Kosina <jkosina@suse.cz>
Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.954525119@linutronix.de

895ae47f

x86/kvm: Serialize L1D flush parameter setter · dd4bfa73

由 Thomas Gleixner 提交于 7月 13, 2018

Writes to the parameter files are not serialized at the sysfs core
level, so local serialization is required.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NJiri Kosina <jkosina@suse.cz>
Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.873642605@linutronix.de

dd4bfa73

x86/kvm: Add static key for flush always · 4c6523ec

由 Thomas Gleixner 提交于 7月 13, 2018

Avoid the conditional in the L1D flush control path.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NJiri Kosina <jkosina@suse.cz>
Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.790914912@linutronix.de

4c6523ec

x86/kvm: Move l1tf setup function · 7db92e16

由 Thomas Gleixner 提交于 7月 13, 2018

In preparation of allowing run time control for L1D flushing, move the
setup code to the module parameter handler.

In case of pre module init parsing, just store the value and let vmx_init()
do the actual setup after running kvm_init() so that enable_ept is having
the correct state.

During run-time invoke it directly from the parameter setter to prepare for
run-time control.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NJiri Kosina <jkosina@suse.cz>
Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.694063239@linutronix.de

7db92e16

x86/l1tf: Handle EPT disabled state proper · a7b9020b

由 Thomas Gleixner 提交于 7月 13, 2018

If Extended Page Tables (EPT) are disabled or not supported, no L1D
flushing is required. The setup function can just avoid setting up the L1D
flush for the EPT=n case.

Invoke it after the hardware setup has be done and enable_ept has the
correct state and expose the EPT disabled state in the mitigation status as
well.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NJiri Kosina <jkosina@suse.cz>
Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.612160168@linutronix.de

a7b9020b

x86/kvm: Drop L1TF MSR list approach · 2f055947

由 Thomas Gleixner 提交于 7月 13, 2018

The VMX module parameter to control the L1D flush should become
writeable.

The MSR list is set up at VM init per guest VCPU, but the run time
switching is based on a static key which is global. Toggling the MSR list
at run time might be feasible, but for now drop this optimization and use
the regular MSR write to make run-time switching possible.

The default mitigation is the conditional flush anyway, so for extra
paranoid setups this will add some small overhead, but the extra code
executed is in the noise compared to the flush itself.

Aside of that the EPT disabled case is not handled correctly at the moment
and the MSR list magic is in the way for fixing that as well.

If it's really providing a significant advantage, then this needs to be
revisited after the code is correct and the control is writable.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NJiri Kosina <jkosina@suse.cz>
Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.516940445@linutronix.de

2f055947

x86/litf: Introduce vmx status variable · 72c6d2db

由 Thomas Gleixner 提交于 7月 13, 2018

Store the effective mitigation of VMX in a status variable and use it to
report the VMX state in the l1tf sysfs file.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NJiri Kosina <jkosina@suse.cz>
Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.433098358@linutronix.de

72c6d2db

xen: setup pv irq ops vector earlier · 0ce0bba4

由 Juergen Gross 提交于 7月 12, 2018

Setting pv_irq_ops for Xen PV domains should be done as early as
possible in order to support e.g. very early printk() usage.

The same applies to xen_vcpu_info_reset(0), as it is needed for the
pv irq ops.

Move the call of xen_setup_machphys_mapping() after initializing the
pv functions as it contains a WARN_ON(), too.

Remove the no longer necessary conditional in xen_init_irq_ops()
from PVH V1 times to make clear this is a PV only function.

Cc: <stable@vger.kernel.org> # 4.14
Signed-off-by: NJuergen Gross <jgross@suse.com>
Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: NJuergen Gross <jgross@suse.com>

0ce0bba4

12 7月, 2018 1 次提交

xen: remove global bit from __default_kernel_pte_mask for pv guests · e69b5d30

由 Juergen Gross 提交于 7月 02, 2018

When removing the global bit from __supported_pte_mask do the same for
__default_kernel_pte_mask in order to avoid the WARN_ONCE() in
check_pgprot() when setting a kernel pte before having called
init_mem_mapping().

Cc: <stable@vger.kernel.org> # 4.17
Reported-by: NMichael Young <m.a.young@durham.ac.uk>
Signed-off-by: NJuergen Gross <jgross@suse.com>
Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: NJuergen Gross <jgross@suse.com>

e69b5d30

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功