1. 01 July 2017 (2 commits)
  2. 30 June 2017 (8 commits)
    • x86/boot/KASLR: Fix kexec crash due to 'virt_addr' calculation bug · 8eabf42a
      Committed by Baoquan He
      Kernel text KASLR is separated into physical address and virtual
      address randomization. For virtual address randomization, we only
      randomize to get an offset between 16M and KERNEL_IMAGE_SIZE. So the
      initial value of 'virt_addr' should be LOAD_PHYSICAL_ADDR, not the
      original kernel loading address 'output'.
      
      The bug will cause a kernel boot failure if the kernel is loaded at a
      position other than the 16M address decided at compile time.
      Kexec/kdump is one such practical case.
      
      To fix it, just assign LOAD_PHYSICAL_ADDR to 'virt_addr' as its initial
      value.
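      
      As a hedged illustration only (not the actual arch/x86/boot/compressed
      code; the load address below is hypothetical), the point is that the
      virtual offset starts from the compile-time link address rather than
      from wherever kexec happened to place the kernel:
      
        /* Standalone sketch of the initialisation the fix describes. */
        #include <stdio.h>
        
        #define LOAD_PHYSICAL_ADDR 0x1000000UL            /* 16M, decided at build time */
        
        int main(void)
        {
                unsigned long output = 0x20000000UL;           /* hypothetical kexec load address */
                unsigned long virt_addr = LOAD_PHYSICAL_ADDR;  /* the fix: not 'output' */
        
                printf("physical load address: %#lx\n", output);
                printf("initial virtual offset: %#lx\n", virt_addr);
                return 0;
        }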
      Tested-by: Dave Young <dyoung@redhat.com>
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 8391c73c ("x86/KASLR: Randomize virtual address separately")
      Link: http://lkml.kernel.org/r/1498567146-11990-3-git-send-email-bhe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8eabf42a
    • x86/boot/KASLR: Add checking for the offset of kernel virtual address randomization · b892cb87
      Committed by Baoquan He
      For kernel text KASLR, the virtual address is confined to a 1G area,
      [0xffffffff80000000, 0xffffffffc0000000). For the implementation of
      virtual address randomization, we only randomize to get an offset
      between 16M and 1G, then add this offset to the starting address,
      0xffffffff80000000. Here 16M is the offset decided at linking stage.
      So the local variable 'virt_addr', which represents the offset, plus
      the kernel output size cannot exceed KERNEL_IMAGE_SIZE.
      
      Add a debug check for the offset. If it is out of bounds, print an
      error message and hang.
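      
      A minimal sketch of such a check, assuming simplified names (the real
      boot code uses its own error reporting and hangs rather than exiting):
      
        #include <stdio.h>
        #include <stdlib.h>
        
        #define KERNEL_IMAGE_SIZE (1024UL * 1024 * 1024)      /* 1G window */
        
        /* Hang (here: exit) if the randomized virtual offset plus the kernel
         * output size would exceed the KERNEL_IMAGE_SIZE window. */
        static void check_virt_addr(unsigned long virt_addr, unsigned long output_size)
        {
                if (virt_addr + output_size > KERNEL_IMAGE_SIZE) {
                        fprintf(stderr, "Wrong destination virtual address\n");
                        exit(1);
                }
        }
        
        int main(void)
        {
                check_virt_addr(0x1000000UL, 64UL << 20);     /* 16M offset, fits */
                check_virt_addr(0x3f000000UL, 64UL << 20);    /* would overflow the window */
                return 0;
        }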
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1498567146-11990-2-git-send-email-bhe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b892cb87
    • MIPS: Avoid accidental raw backtrace · 85423636
      Committed by James Hogan
      Since commit 81a76d71 ("MIPS: Avoid using unwind_stack() with
      usermode"), show_backtrace() invokes the raw backtracer when
      cp0_status & ST0_KSU indicates user mode, in order to fix issues on
      EVA kernels where user and kernel address spaces overlap.
      
      However, this is used by show_stack(), which creates its own pt_regs on
      the stack and leaves cp0_status uninitialised in most code paths. This
      results in non-deterministic use of the raw backtracer depending on the
      previous stack content.
      
      show_stack() deals exclusively with kernel mode stacks anyway, so
      explicitly initialise regs.cp0_status to KSU_KERNEL (i.e. 0) to ensure
      we get a useful backtrace.
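      
      A user-space analogue of the problem, with hypothetical names, just to
      show why the explicit field initialisation matters:
      
        /* A local struct on the stack has indeterminate contents unless the
         * fields read later are explicitly initialised, as the fix does for
         * regs.cp0_status. */
        #include <stdio.h>
        
        struct fake_pt_regs {
                unsigned long cp0_status;
                unsigned long gpr[32];
        };
        
        int main(void)
        {
                struct fake_pt_regs regs = { .cp0_status = 0 };  /* KSU_KERNEL == 0 */
        
                printf("cp0_status = %lu\n", regs.cp0_status);
                return 0;
        }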
      
      Fixes: 81a76d71 ("MIPS: Avoid using unwind_stack() with usermode")
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Cc: <stable@vger.kernel.org> # 3.15+
      Patchwork: https://patchwork.linux-mips.org/patch/16656/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      85423636
    • MIPS: Perform post-DMA cache flushes on systems with MAARs · cad482c1
      Committed by Paul Burton
      Recent CPUs from Imagination Technologies such as the I6400 or P6600 are
      able to speculatively fetch data from memory into caches. This means
      that if used in a system with non-coherent DMA they require that caches
      be invalidated after a device performs DMA, and before the CPU reads the
      DMA'd data, in order to ensure that stale values weren't speculatively
      prefetched.
      
      Such CPUs also introduced Memory Accessibility Attribute Registers
      (MAARs) in order to control the regions in which they are allowed to
      speculate. Thus we can use the presence of MAARs as a good indication
      that the CPU requires the above cache maintenance. Use the presence of
      MAARs to determine the result of cpu_needs_post_dma_flush() in the
      default case, in order to handle these recent CPUs correctly.
      
      Note that the return type of cpu_needs_post_dma_flush() is changed to
      bool, such that it's clearer what's happening when cpu_has_maar is cast
      to bool for the return value. If this patch were backported to a
      pre-v4.7 kernel then MIPS_CPU_MAAR was 1ull<<34, so when cast to an int
      we would incorrectly return 0. It so happens that MIPS_CPU_MAAR is
      currently 1ull<<30, which still gives a non-zero value when truncated
      to an int, but even so the implicit conversion from long long int to
      bool makes what will happen clearer than the implicit conversion from
      long long int to int would. The bool return type also fits this usage
      better semantically, so it seems like an all-round win.
      
      Thanks to Ed for spotting the issue for pre-v4.7 kernels & suggesting
      the return type change.
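      
      The int-versus-bool point can be seen in isolation with this standalone
      sketch (not kernel code):
      
        /* 1ull << 34 truncated to int is 0, but converted to bool it is true,
         * which is why bool is the safer return type for a flag held in the
         * upper bits of a long long. */
        #include <stdbool.h>
        #include <stdio.h>
        
        int main(void)
        {
                unsigned long long maar_flag = 1ULL << 34;   /* pre-v4.7 style bit */
        
                int as_int = maar_flag;      /* low 32 bits only: 0 */
                bool as_bool = maar_flag;    /* any non-zero value becomes true */
        
                printf("as int:  %d\n", as_int);
                printf("as bool: %d\n", as_bool);
                return 0;
        }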
      Signed-off-by: Paul Burton <paul.burton@imgtec.com>
      Reviewed-by: Bryan O'Donoghue <pure.logic@nexus-software.ie>
      Tested-by: Bryan O'Donoghue <pure.logic@nexus-software.ie>
      Cc: Ed Blake <ed.blake@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/16363/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      cad482c1
    • MIPS: Fix IRQ tracing & lockdep when rescheduling · d8550860
      Committed by Paul Burton
      When the scheduler sets TIF_NEED_RESCHED & we call into the scheduler
      from arch/mips/kernel/entry.S we disable interrupts. This is true
      regardless of whether we reach work_resched from syscall_exit_work,
      resume_userspace or by looping after calling schedule(). Although we
      disable interrupts in these paths we don't call trace_hardirqs_off()
      before calling into C code which may acquire locks, and we therefore
      leave lockdep with an inconsistent view of whether interrupts are
      disabled or not when CONFIG_PROVE_LOCKING & CONFIG_DEBUG_LOCKDEP are
      both enabled.
      
      Without tracing this interrupt state lockdep will print warnings such
      as the following once a task returns from a syscall via
      syscall_exit_partial with TIF_NEED_RESCHED set:
      
      [   49.927678] ------------[ cut here ]------------
      [   49.934445] WARNING: CPU: 0 PID: 1 at kernel/locking/lockdep.c:3687 check_flags.part.41+0x1dc/0x1e8
      [   49.946031] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
      [   49.946355] CPU: 0 PID: 1 Comm: init Not tainted 4.10.0-00439-gc9fd5d362289-dirty #197
      [   49.963505] Stack : 0000000000000000 ffffffff81bb5d6a 0000000000000006 ffffffff801ce9c4
      [   49.974431]         0000000000000000 0000000000000000 0000000000000000 000000000000004a
      [   49.985300]         ffffffff80b7e487 ffffffff80a24498 a8000000ff160000 ffffffff80ede8b8
      [   49.996194]         0000000000000001 0000000000000000 0000000000000000 0000000077c8030c
      [   50.007063]         000000007fd8a510 ffffffff801cd45c 0000000000000000 a8000000ff127c88
      [   50.017945]         0000000000000000 ffffffff801cf928 0000000000000001 ffffffff80a24498
      [   50.028827]         0000000000000000 0000000000000001 0000000000000000 0000000000000000
      [   50.039688]         0000000000000000 a8000000ff127bd0 0000000000000000 ffffffff805509bc
      [   50.050575]         00000000140084e0 0000000000000000 0000000000000000 0000000000040a00
      [   50.061448]         0000000000000000 ffffffff8010e1b0 0000000000000000 ffffffff805509bc
      [   50.072327]         ...
      [   50.076087] Call Trace:
      [   50.079869] [<ffffffff8010e1b0>] show_stack+0x80/0xa8
      [   50.086577] [<ffffffff805509bc>] dump_stack+0x10c/0x190
      [   50.093498] [<ffffffff8015dde0>] __warn+0xf0/0x108
      [   50.099889] [<ffffffff8015de34>] warn_slowpath_fmt+0x3c/0x48
      [   50.107241] [<ffffffff801c15b4>] check_flags.part.41+0x1dc/0x1e8
      [   50.114961] [<ffffffff801c239c>] lock_is_held_type+0x8c/0xb0
      [   50.122291] [<ffffffff809461b8>] __schedule+0x8c0/0x10f8
      [   50.129221] [<ffffffff80946a60>] schedule+0x30/0x98
      [   50.135659] [<ffffffff80106278>] work_resched+0x8/0x34
      [   50.142397] ---[ end trace 0cb4f6ef5b99fe21 ]---
      [   50.148405] possible reason: unannotated irqs-off.
      [   50.154600] irq event stamp: 400463
      [   50.159566] hardirqs last  enabled at (400463): [<ffffffff8094edc8>] _raw_spin_unlock_irqrestore+0x40/0xa8
      [   50.171981] hardirqs last disabled at (400462): [<ffffffff8094eb98>] _raw_spin_lock_irqsave+0x30/0xb0
      [   50.183897] softirqs last  enabled at (400450): [<ffffffff8016580c>] __do_softirq+0x4ac/0x6a8
      [   50.195015] softirqs last disabled at (400425): [<ffffffff80165e78>] irq_exit+0x110/0x128
      
      Fix this by using the TRACE_IRQS_OFF macro to call trace_hardirqs_off()
      when CONFIG_TRACE_IRQFLAGS is enabled. This is done before invoking
      schedule() following the work_resched label because:
      
       1) Interrupts are disabled regardless of the path we take to reach
          work_resched() & schedule().
      
       2) Performing the tracing here avoids the need to do it in paths which
          disable interrupts but don't call out to C code before hitting a
          path which uses the RESTORE_SOME macro that will call
          trace_hardirqs_on() or trace_hardirqs_off() as appropriate.
      
      We call trace_hardirqs_on() using the TRACE_IRQS_ON macro before calling
      syscall_trace_leave() for similar reasons, ensuring that lockdep has a
      consistent view of state after we re-enable interrupts.
      Signed-off-by: Paul Burton <paul.burton@imgtec.com>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Cc: linux-mips@linux-mips.org
      Cc: stable <stable@vger.kernel.org>
      Patchwork: https://patchwork.linux-mips.org/patch/15385/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      d8550860
    • MIPS: pm-cps: Drop manual cache-line alignment of ready_count · 161c51cc
      Committed by Paul Burton
      We allocate memory for a ready_count variable per-CPU, which is accessed
      via a cached non-coherent TLB mapping to perform synchronisation between
      threads within the core using LL/SC instructions. In order to ensure
      that the variable is contained within its own data cache line we
      allocate 2 lines worth of memory & align the resulting pointer to a line
      boundary. This is however unnecessary, since kmalloc is guaranteed to
      return memory which is at least cache-line aligned (see
      ARCH_DMA_MINALIGN). Stop the redundant manual alignment.
      
      Besides cleaning up the code & avoiding needless work, this has the side
      effect of avoiding an arithmetic error found by Bryan on 64-bit systems
      due to the 32-bit size of the former dlinesz. This led to the
      ready_count variable having its upper 32 bits erroneously cleared on
      MIPS64 kernels, causing problems when ready_count was later used on
      MIPS64 via cpuidle.
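      
      The kind of 64-bit arithmetic error described above can be reproduced in
      a standalone sketch (values are hypothetical, not the original pm-cps
      code):
      
        /* Aligning a 64-bit address with a mask derived from a 32-bit line
         * size clears the upper 32 bits of the address. */
        #include <stdint.h>
        #include <stdio.h>
        
        int main(void)
        {
                uint32_t dlinesz = 32;                  /* 32-bit cache line size */
                uint64_t addr = 0xffffffff80123456ULL;  /* 64-bit kernel pointer */
        
                uint64_t bad  = (addr + dlinesz - 1) & ~(dlinesz - 1);           /* 32-bit mask */
                uint64_t good = (addr + dlinesz - 1) & ~(uint64_t)(dlinesz - 1); /* 64-bit mask */
        
                printf("bad:  %#llx\n", (unsigned long long)bad);
                printf("good: %#llx\n", (unsigned long long)good);
                return 0;
        }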
      Signed-off-by: Paul Burton <paul.burton@imgtec.com>
      Fixes: 3179d37e ("MIPS: pm-cps: add PM state entry code for CPS systems")
      Reported-by: Bryan O'Donoghue <bryan.odonoghue@imgtec.com>
      Reviewed-by: Bryan O'Donoghue <bryan.odonoghue@imgtec.com>
      Tested-by: Bryan O'Donoghue <bryan.odonoghue@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Cc: stable <stable@vger.kernel.org> # v3.16+
      Patchwork: https://patchwork.linux-mips.org/patch/15383/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      161c51cc
    • ARM: 8685/1: ensure memblock-limit is pmd-aligned · 9e25ebfe
      Committed by Doug Berger
      The pmd containing memblock_limit is cleared by prepare_page_table(),
      which creates the opportunity for early_alloc() to allocate unmapped
      memory if memblock_limit is not pmd-aligned, causing a boot-time hang.
      
      Commit 965278dc ("ARM: 8356/1: mm: handle non-pmd-aligned end of RAM")
      attempted to resolve this problem, but there is a path through the
      adjust_lowmem_bounds() routine where if all memory regions start and
      end on pmd-aligned addresses the memblock_limit will be set to
      arm_lowmem_limit.
      
      Since arm_lowmem_limit can be affected by the vmalloc early parameter,
      the value of arm_lowmem_limit may not be pmd-aligned. This commit
      corrects this oversight such that memblock_limit is always rounded
      down to pmd-alignment.
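      
      A hedged sketch of the rounding; the 2MB PMD_SIZE and the limit value
      are assumptions for illustration, and the kernel uses its own
      round_down() helper:
      
        #include <stdio.h>
        
        #define PMD_SIZE (2UL * 1024 * 1024)
        #define round_down(x, y) ((x) & ~((y) - 1))
        
        int main(void)
        {
                unsigned long arm_lowmem_limit = 0x2f7fe000UL;  /* not pmd-aligned */
                unsigned long memblock_limit = round_down(arm_lowmem_limit, PMD_SIZE);
        
                printf("memblock_limit = %#lx\n", memblock_limit);
                return 0;
        }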
      
      Fixes: 965278dc ("ARM: 8356/1: mm: handle non-pmd-aligned end of RAM")
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Suggested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      9e25ebfe
    • perf/x86/intel/uncore: Fix wrong box pointer check · 80c65fdb
      Committed by Kan Liang
      We should not init a NULL box; it will cause a system crash. The issue
      looks like it was caused by a typo.
      
      This was not noticed because there is currently no NULL box. Also, most
      boxes are enabled by default, so the init code is not critical.
      
      Fixes: fff4b87e ("perf/x86/intel/uncore: Make package handling more robust")
      Signed-off-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170629190926.2456-1-kan.liang@intel.com
      80c65fdb
  3. 29 June 2017 (1 commit)
  4. 28 June 2017 (2 commits)
  5. 26 June 2017 (2 commits)
    • powerpc/32: Avoid miscompilation w/GCC 4.6.3 - don't inline copy_to/from_user() · d6bd8194
      Committed by Michael Ellerman
      Larry Finger reported that his Powerbook G4 was no longer booting with
      v4.12-rc; userspace was up but gave weird errors such as:
      
        udevd[64]: starting version 175
        udevd[64]: Unable to receive ctrl message: Bad address.
        modprobe: chdir(4.12-rc1): No such file or directory
      
      He bisected the problem to commit 3448890c ("powerpc: get rid of zeroing,
      switch to RAW_COPY_USER").
      
      Al identified that the problem is actually a miscompilation by GCC 4.6.3, which
      is exposed by the above commit.
      
      Al also pointed out that inlining copy_to/from_user() is probably of little or
      no benefit, which is correct. Using Anton's copy_to_user benchmark, with a
      pathological single byte copy, we see a small increase in performance
      by *removing* inlining:
      
        Before (inlined):
        # time ./copy_to_user -w -l 1 -i 10000000	( x 3 )
        real	0m22.063s
        real	0m22.059s
        real	0m22.076s
      
        After:
        # time ./copy_to_user -w -l 1 -i 10000000	( x 3 )
        real	0m21.325s
        real	0m21.299s
        real	0m21.364s
      
      So as a small performance improvement and to avoid the miscompilation, drop
      inlining copy_to/from_user() on 32-bit.
      
      Fixes: 3448890c ("powerpc: get rid of zeroing, switch to RAW_COPY_USER")
      Reported-by: Larry Finger <Larry.Finger@lwfinger.net>
      Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      d6bd8194
    • x86/mm/hotplug: Fix BUG_ON() after hot-remove by not freeing PUD · 98fe3633
      Committed by Jérôme Glisse
      Since commit:
      
        af2cf278 ("x86/mm/hotplug: Don't remove PGD entries in remove_pagetable()")
      
      we no longer free PUDs so that we do not have to synchronize
      all PGDs on hot-remove/vfree().
      
      But the new 5-level page table patchset reverted that for 4-level
      page tables, in the following commit:
      
        f2a6a705 ("x86: Convert the rest of the code to support p4d_t")
      
      This patch undoes that damage by disabling free_pud() when we are in the
      4-level page table case, thus avoiding the BUG_ON() after hot-remove.
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      [ Clarified the changelog and the code comments. ]
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20170624180514.3821-1-jglisse@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      98fe3633
  6. 24 June 2017 (1 commit)
  7. 23 June 2017 (2 commits)
    • powerpc/64: Initialise thread_info for emergency stacks · 34f19ff1
      Committed by Nicholas Piggin
      Emergency stacks have their thread_info mostly uninitialised, which in
      particular means garbage preempt_count values.
      
      Emergency stack code runs with interrupts disabled entirely, and is
      used very rarely, so this has gone unnoticed so far. It was found by a
      proposed new powerpc watchdog that takes a soft-NMI directly from the
      masked_interrupt handler and uses the emergency stack. That crashed
      at BUG_ON(in_nmi()) in nmi_enter(), and the preempt_count()s were found
      to be garbage.
      
      To fix this, zero the entire THREAD_SIZE allocation, and initialize
      the thread_info.
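      
      A hedged user-space analogue of the approach (the real change lives in
      arch/powerpc/kernel/setup_64.c; names here are illustrative):
      
        /* Zero the whole stack allocation so thread_info fields such as
         * preempt_count start at 0 instead of stack garbage. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        
        #define THREAD_SIZE (16 * 1024)
        
        struct fake_thread_info {
                int preempt_count;
                int cpu;
        };
        
        int main(void)
        {
                void *stack = malloc(THREAD_SIZE);
                if (!stack)
                        return 1;
                memset(stack, 0, THREAD_SIZE);         /* zero the entire allocation */
        
                struct fake_thread_info *ti = stack;   /* thread_info sits at the base */
                ti->preempt_count = 0;                 /* 0, not HARDIRQ_OFFSET */
                ti->cpu = 0;
        
                printf("preempt_count = %d\n", ti->preempt_count);
                free(stack);
                return 0;
        }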
      
      Cc: stable@vger.kernel.org
      Reported-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      [mpe: Move it all into setup_64.c, use a function not a macro. Fix
            crashes on Cell by setting preempt_count to 0 not HARDIRQ_OFFSET]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      34f19ff1
    • gcc-plugins: Add the randstruct plugin · 313dd1b6
      Committed by Kees Cook
      This randstruct plugin is modified from Brad Spengler/PaX Team's code
      in the last public patch of grsecurity/PaX based on my understanding
      of the code. Changes or omissions from the original code are mine and
      don't reflect the original grsecurity/PaX code.
      
      The randstruct GCC plugin randomizes the layout of selected structures
      at compile time, as a probabilistic defense against attacks that need to
      know the layout of structures within the kernel. This is most useful for
      "in-house" kernel builds where neither the randomization seed nor other
      build artifacts are made available to an attacker. While less useful for
      distribution kernels (where the randomization seed must be exposed for
      third party kernel module builds), it still has some value there since now
      all kernel builds would need to be tracked by an attacker.
      
      In more performance sensitive scenarios, GCC_PLUGIN_RANDSTRUCT_PERFORMANCE
      can be selected to make a best effort to restrict randomization to
      cacheline-sized groups of elements, and will not randomize bitfields. This
      comes at the cost of reduced randomization.
      
      Two annotations are defined, __randomize_layout and __no_randomize_layout,
      which respectively tell the plugin to either randomize or not to
      randomize instances of the struct in question. Follow-on patches enable
      the auto-detection logic for selecting structures for randomization
      that contain only function pointers. It is disabled here to assist with
      bisection.
      
      Since any randomized structs must be initialized using designated
      initializers, __randomize_layout includes the __designated_init annotation
      even when the plugin is disabled so that all builds will require
      the needed initialization. (With the plugin enabled, annotations for
      automatically chosen structures are marked as well.)
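      
      Designated initializers are plain C; for a struct annotated with
      __randomize_layout (the struct below is hypothetical), initialization
      must name each field so it stays correct whatever layout is chosen:
      
        /* Positional initializers would silently break if fields were
         * reordered; designated initializers work with any layout. */
        #include <stdio.h>
        
        struct ops {            /* imagine: struct ops { ... } __randomize_layout; */
                int (*open)(void);
                int (*close)(void);
                int flags;
        };
        
        static int my_open(void)  { return 0; }
        static int my_close(void) { return 0; }
        
        static const struct ops my_ops = {
                .open  = my_open,       /* named, so independent of field order */
                .close = my_close,
                .flags = 1,
        };
        
        int main(void)
        {
                printf("flags = %d\n", my_ops.flags);
                return my_ops.open() + my_ops.close();
        }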
      
      The main differences between this implementation and grsecurity are:
      - disable automatic struct selection (to be enabled in follow-up patch)
      - add designated_init attribute at runtime and for manual marking
      - clarify debugging output to differentiate bad cast warnings
      - add whitelisting infrastructure
      - support gcc 7's DECL_ALIGN and DECL_MODE changes (Laura Abbott)
      - raise minimum required GCC version to 4.7
      
      Earlier versions of this patch series were ported by Michael Leibowitz.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      313dd1b6
  8. 22 June 2017 (4 commits)
    • KVM: x86: fix singlestepping over syscall · c8401dda
      Committed by Paolo Bonzini
      TF is handled a bit differently for syscall and sysret, compared
      to the other instructions: TF is checked after the instruction completes,
      so that the OS can disable #DB at a syscall by adding TF to FMASK.
      When the sysret is executed the #DB is taken "as if" the syscall insn
      just completed.
      
      KVM emulates syscall so that it can trap 32-bit syscall on Intel processors.
      Fix the behavior, otherwise you could get #DB on a user stack which is not
      nice.  This does not affect Linux guests, as they use an IST or task gate
      for #DB.
      
      This fixes CVE-2017-7518.
      
      Cc: stable@vger.kernel.org
      Reported-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      c8401dda
    • powerpc/powernv/npu-dma: Add explicit flush when sending an ATSD · bbd5ff50
      Committed by Alistair Popple
      NPU2 requires an extra explicit flush to an active GPU PID when
      sending address translation shoot downs (ATSDs) to reliably flush the
      GPU TLB. This patch adds just such a flush at the end of each sequence
      of ATSDs.
      
      We can safely use PID 0 which is always reserved and active on the
      GPU. PID 0 is only used for init_mm which will never be a user mm on
      the GPU. To enforce this we add a check in pnv_npu2_init_context()
      just in case someone tries to use PID 0 on the GPU.
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      [mpe: Use true/false for bool literals]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bbd5ff50
    • KVM: s390: gaccess: fix real-space designation asce handling for gmap shadows · addb63c1
      Committed by Heiko Carstens
      For real-space designation asces the asce origin part is only a token.
      The asce token origin must not be used to generate an effective
      address for storage references. This however is erroneously done
      within kvm_s390_shadow_tables().
      
      Furthermore within the same function the wrong parts of virtual
      addresses are used to generate a corresponding real address
      (e.g. the region second index is used as region first index).
      
      Both of the above can result in incorrect address translations. The
      translation was only correct for real-space designations with a token
      origin of zero and addresses below one megabyte.
      
      Furthermore replace a "!asce.r" statement with a "!*fake" statement to
      make it more obvious that a specific condition has nothing to do with
      the architecture, but with the fake handling of real space designations.
      
      Fixes: 3218f709 ("s390/mm: support real-space for gmap shadows")
      Cc: David Hildenbrand <david@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reviewed-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      addb63c1
    • perf/x86/intel: Add 1G DTLB load/store miss support for SKL · fb3a5055
      Committed by Kan Liang
      The current DTLB load/store miss events (0x608/0x649) only count the
      4K, 2M and 4M page sizes.
      Extend the events to support any page size (4K/2M/4M/1G).
      
      The complete DTLB load/store miss events are:
      
        DTLB_LOAD_MISSES.WALK_COMPLETED		0xe08
        DTLB_STORE_MISSES.WALK_COMPLETED		0xe49
      Signed-off-by: Kan Liang <Kan.liang@intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: eranian@google.com
      Link: http://lkml.kernel.org/r/20170619142609.11058-1-kan.liang@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fb3a5055
  9. 20 June 2017 (3 commits)
    • KVM: MIPS: Fix maybe-uninitialized build failure · e27a9eca
      Committed by James Cowgill
      This commit fixes a "maybe-uninitialized" build failure in
      arch/mips/kvm/tlb.c when KVM, DYNAMIC_DEBUG and JUMP_LABEL are all
      enabled. The failure is:
      
      In file included from ./include/linux/printk.h:329:0,
                       from ./include/linux/kernel.h:13,
                       from ./include/asm-generic/bug.h:15,
                       from ./arch/mips/include/asm/bug.h:41,
                       from ./include/linux/bug.h:4,
                       from ./include/linux/thread_info.h:11,
                       from ./include/asm-generic/current.h:4,
                       from ./arch/mips/include/generated/asm/current.h:1,
                       from ./include/linux/sched.h:11,
                       from arch/mips/kvm/tlb.c:13:
      arch/mips/kvm/tlb.c: In function ‘kvm_mips_host_tlb_inv’:
      ./include/linux/dynamic_debug.h:126:3: error: ‘idx_kernel’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
         __dynamic_pr_debug(&descriptor, pr_fmt(fmt), \
         ^~~~~~~~~~~~~~~~~~
      arch/mips/kvm/tlb.c:169:16: note: ‘idx_kernel’ was declared here
        int idx_user, idx_kernel;
                      ^~~~~~~~~~
      
      There is a similar error relating to "idx_user". Both errors were
      observed with GCC 6.
      
      As far as I can tell, it is impossible for either idx_user or idx_kernel
      to be uninitialized when they are later read in the calls to kvm_debug,
      but to satisfy the compiler, add zero initializers to both variables.
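      
      The shape of the fix, sketched standalone with hypothetical names (the
      real variables are idx_user and idx_kernel in arch/mips/kvm/tlb.c):
      
        /* GCC cannot always prove that a conditionally-assigned variable is
         * set before use; a zero initializer silences -Wmaybe-uninitialized
         * without changing behaviour. */
        #include <stdio.h>
        
        static int lookup(int key, int enabled)
        {
                int idx = 0;            /* zero-init to satisfy the compiler */
        
                if (enabled)
                        idx = key * 2;  /* the only path that really assigns it */
        
                if (enabled)
                        printf("idx = %d\n", idx);
                return idx;
        }
        
        int main(void)
        {
                return lookup(21, 1) == 42 ? 0 : 1;
        }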
      Signed-off-by: James Cowgill <James.Cowgill@imgtec.com>
      Fixes: 57e3869c ("KVM: MIPS/TLB: Generalise host TLB invalidate to kernel ASID")
      Cc: <stable@vger.kernel.org> # 4.11+
      Acked-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      e27a9eca
    • arm64/vdso: Fix nsec handling for CLOCK_MONOTONIC_RAW · dbb236c1
      Committed by Will Deacon
      Recently vDSO support for CLOCK_MONOTONIC_RAW was added in
      49eea433 ("arm64: Add support for CLOCK_MONOTONIC_RAW in
      clock_gettime() vDSO"). Noticing that the core timekeeping code
      never set tkr_raw.xtime_nsec, the vDSO implementation didn't
      bother exposing it via the data page and instead took the
      unshifted tk->raw_time.tv_nsec value which was then immediately
      shifted left in the vDSO code.
      
      Unfortunately, by accelerating the MONOTONIC_RAW clockid, it
      uncovered potential 1ns time inconsistencies caused by the
      timekeeping core not handling sub-ns resolution.
      
      Now that the core code has been fixed and is actually setting
      tkr_raw.xtime_nsec, we need to take that into account in the
      vDSO by adding it to the shifted raw_time value, in order to
      fix the user-visible inconsistency. Rather than do that at each
      use (and expand the data page in the process), instead perform
      the shift/addition operation when populating the data page and
      remove the shift from the vDSO code entirely.
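      
      The shift handling can be illustrated generically (a simplified
      fixed-point sketch with made-up values, not the arm64 vDSO data
      structures):
      
        /* The timekeeping core keeps nanoseconds left-shifted by 'shift' for
         * sub-ns precision.  Doing the add and the right shift once, when the
         * data page is populated, means the vDSO reader no longer shifts. */
        #include <stdint.h>
        #include <stdio.h>
        
        int main(void)
        {
                unsigned int shift = 8;
                uint64_t raw_sec = 100;
                uint64_t xtime_nsec = 999999999ULL << shift;   /* shifted sub-ns value */
                uint64_t delta_nsec = 3ULL << shift;           /* also shifted */
        
                /* done once at update time, stored unshifted for readers: */
                uint64_t nsec = (xtime_nsec + delta_nsec) >> shift;
        
                printf("monotonic raw: %llu.%09llu\n",
                       (unsigned long long)(raw_sec + nsec / 1000000000ULL),
                       (unsigned long long)(nsec % 1000000000ULL));
                return 0;
        }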
      
      [jstultz: minor whitespace tweak, tried to improve commit
       message to make it more clear this fixes a regression]
      Reported-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Tested-by: Daniel Mentz <danielmentz@google.com>
      Acked-by: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Stephen Boyd <stephen.boyd@linaro.org>
      Cc: "stable #4 . 8+" <stable@vger.kernel.org>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Link: http://lkml.kernel.org/r/1496965462-20003-4-git-send-email-john.stultz@linaro.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      dbb236c1
    • Fix English in description of GCC_PLUGIN_STRUCTLEAK · f136e090
      Committed by Jean Delvare
      Signed-off-by: Jean Delvare <jdelvare@suse.de>
      Fixes: c61f13ea ("gcc-plugins: Add structleak for more stack initialization")
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      f136e090
  10. 19 June 2017 (1 commit)
    • mm: larger stack guard gap, between vmas · 1be7107f
      Committed by Hugh Dickins
      The stack guard page is a useful feature to reduce the risk of the stack
      smashing into a different mapping. We have been using a single-page gap,
      which is sufficient to prevent the stack being adjacent to a different
      mapping. But this seems to be insufficient in light of stack usage in
      userspace. E.g. glibc uses alloca() allocations as large as 64kB in many
      commonly used functions. Others use constructs like gid_t
      buffer[NGROUPS_MAX], which is 256kB, or stack strings with MAX_ARG_STRLEN.
      
      This is especially dangerous for suid binaries and the default unlimited
      stack size limit, because those applications can be tricked into
      consuming a large portion of the stack, and a single glibc call could
      then jump over the guard page. These attacks are not theoretical,
      unfortunately.
      
      Make those attacks less probable by increasing the stack guard gap
      to 1MB (on systems with 4k pages; but make it depend on the page size
      because systems with larger base pages might cap stack allocations in
      the PAGE_SIZE units) which should cover larger alloca() and VLA stack
      allocations. It is obviously not a full fix because the problem is
      somehow inherent, but it should reduce attack space a lot.
      
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
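      
      For example, since the option takes page units, booting a 4k-page system
      with the following keeps the 1MB default explicit (256 pages x 4kB =
      1MB; illustrative command-line fragment only):
      
        stack_guard_gap=256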
      
      Implementation wise, first delete all the old code for stack guard page:
      because although we could get away with accounting one extra page in a
      stack vma, accounting a larger gap can break userspace - case in point,
      a program run with "ulimit -S -v 20000" failed when the 1MB gap was
      counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
      and strict non-overcommit mode.
      
      Instead of keeping gap inside the stack vma, maintain the stack guard
      gap as a gap between vmas: using vm_start_gap() in place of vm_start
      (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
      places which need to respect the gap - mainly arch_get_unmapped_area(),
      and the vma tree's subtree_gap support for that.
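      
      A hedged sketch of the vm_start_gap() idea, using a toy struct rather
      than the real vm_area_struct:
      
        /* For a downward-growing stack vma, the start used by gap checks is
         * the vma start minus the guard gap, clamped at zero on underflow. */
        #include <stdio.h>
        
        #define VM_GROWSDOWN 0x1UL
        
        struct toy_vma {
                unsigned long vm_start;
                unsigned long vm_end;
                unsigned long vm_flags;
        };
        
        static unsigned long stack_guard_gap = 256UL << 12;   /* 1MB with 4k pages */
        
        static unsigned long vm_start_gap(const struct toy_vma *vma)
        {
                unsigned long start = vma->vm_start;
        
                if (vma->vm_flags & VM_GROWSDOWN) {
                        start -= stack_guard_gap;
                        if (start > vma->vm_start)     /* wrapped below zero */
                                start = 0;
                }
                return start;
        }
        
        int main(void)
        {
                struct toy_vma stack = { 0x7ffff0000000UL, 0x7ffff0021000UL, VM_GROWSDOWN };
        
                printf("gap-adjusted start: %#lx\n", vm_start_gap(&stack));
                return 0;
        }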
      Original-patch-by: Oleg Nesterov <oleg@redhat.com>
      Original-patch-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Helge Deller <deller@gmx.de> # parisc
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1be7107f
  11. 16 June 2017 (8 commits)
    • powerpc/perf: Fix oops when kthread execs user process · bf05fc25
      Committed by Ravi Bangoria
      When a kthread calls call_usermodehelper() the steps are:
        1. allocate current->mm
        2. load_elf_binary()
        3. populate current->thread.regs
      
      While doing this, interrupts are not disabled. If there is a perf
      interrupt in the middle of this process (i.e. step 1 has completed but
      step 3 has not yet been reached) and perf tries to read userspace regs,
      the kernel oopses with the following log:
      
        Unable to handle kernel paging request for data at address 0x00000000
        Faulting instruction address: 0xc0000000000da0fc
        ...
        Call Trace:
        perf_output_sample_regs+0x6c/0xd0
        perf_output_sample+0x4e4/0x830
        perf_event_output_forward+0x64/0x90
        __perf_event_overflow+0x8c/0x1e0
        record_and_restart+0x220/0x5c0
        perf_event_interrupt+0x2d8/0x4d0
        performance_monitor_exception+0x54/0x70
        performance_monitor_common+0x158/0x160
        --- interrupt: f01 at avtab_search_node+0x150/0x1a0
            LR = avtab_search_node+0x100/0x1a0
        ...
        load_elf_binary+0x6e8/0x15a0
        search_binary_handler+0xe8/0x290
        do_execveat_common.isra.14+0x5f4/0x840
        call_usermodehelper_exec_async+0x170/0x210
        ret_from_kernel_thread+0x5c/0x7c
      
      Fix it by setting abi to PERF_SAMPLE_REGS_ABI_NONE when userspace
      pt_regs are not set.
      
      Fixes: ed4a4ef8 ("powerpc/perf: Add support for sampling interrupt register state")
      Cc: stable@vger.kernel.org # v4.7+
      Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.vnet.ibm.com>
      Acked-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      bf05fc25
    • powerpc/64s: Handle data breakpoints in Radix mode · d89ba535
      Committed by Naveen N. Rao
      On Power9, trying to use data breakpoints throws the splat shown
      below. This is because the check for a data breakpoint in DSISR is in
      do_hash_page(), which is not called when in Radix mode.
      
        Unable to handle kernel paging request for data at address 0xc000000000e19218
        Faulting instruction address: 0xc0000000001155e8
        cpu 0x0: Vector: 300 (Data Access) at [c0000000ef1e7b20]
        pc: c0000000001155e8: find_pid_ns+0x48/0xe0
        lr: c000000000116ac4: find_task_by_vpid+0x44/0x90
        sp: c0000000ef1e7da0
        msr: 9000000000009033
        dar: c000000000e19218
        dsisr: 400000
      
      Move the check to handle_page_fault() so as to catch data breakpoints
      in both Hash and Radix MMU modes.
      
      We have to change the check in do_hash_page() against 0xa410 to use
      0xa450, so as to include the value of (DSISR_DABRMATCH << 16).
      
      There are two sites that call handle_page_fault() when in Radix, both
      already pass DSISR in r4.
      
      Fixes: caca285e ("powerpc/mm/radix: Use STD_MMU_64 to properly isolate hash related code")
      Cc: stable@vger.kernel.org # v4.7+
      Reported-by: Shriya R. Kulkarni <shriykul@in.ibm.com>
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      [mpe: Fix the fall-through case on hash, we need to reload DSISR]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      d89ba535
    • powerpc/kprobes: Skip livepatch_handler() for jprobes · c05b8c44
      Committed by Naveen N. Rao
      ftrace_caller() depends on a modified regs->nip to detect if a certain
      function has been livepatched. However, with KPROBES_ON_FTRACE, it is
      possible for regs->nip to have been modified by the kprobes pre_handler
      (jprobes, for instance). In this case, we do not want to invoke the
      livepatch_handler so as not to consume the livepatch stack.
      
      To distinguish between the two (kprobes and livepatch), we check if
      there is an active kprobe on the current function. If there is, then we
      know for sure that it must have modified the NIP as we don't support
      livepatching a kprobe'd function. In this case, we simply skip the
      livepatch_handler and branch to the new NIP. Otherwise, the
      livepatch_handler is invoked.
      
      Fixes: ead514d5 ("powerpc/kprobes: Add support for KPROBES_ON_FTRACE")
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c05b8c44
    • powerpc/ftrace: Pass the correct stack pointer for DYNAMIC_FTRACE_WITH_REGS · a4979a7e
      Committed by Naveen N. Rao
      For DYNAMIC_FTRACE_WITH_REGS, we should be passing in the original set
      of registers in pt_regs, to capture the state _before_ ftrace_caller.
      However, we are instead passing the stack pointer *after* allocating a
      stack frame in ftrace_caller. Fix this by saving the proper value of r1
      in pt_regs. Also, use SAVE_10GPRS() to simplify the code.
      
      Fixes: 15308664 ("powerpc/ftrace: Add support for -mprofile-kernel ftrace ABI")
      Cc: stable@vger.kernel.org # v4.6+
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a4979a7e
    • powerpc/kprobes: Pause function_graph tracing during jprobes handling · a9f8553e
      Committed by Naveen N. Rao
      This fixes a crash when function_graph and jprobes are used together.
      This is essentially commit 237d28db ("ftrace/jprobes/x86: Fix
      conflict between jprobes and function graph tracing"), but for powerpc.
      
      Jprobes breaks function_graph tracing since the jprobe hook needs to use
      jprobe_return(), which never returns back to the hook, but instead to
      the original jprobe'd function. The solution is to momentarily pause
      function_graph tracing before invoking the jprobe hook and re-enable it
      when returning back to the original jprobe'd function.
      
      Fixes: 6794c782 ("powerpc64: port of the function graph tracer")
      Cc: stable@vger.kernel.org # v2.6.30+
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a9f8553e
    • powerpc/debug: Add missing warn flag to WARN_ON's non-builtin path · a093c92d
      Committed by Alexey Kardashevskiy
      When trapped on WARN_ON(), report_bug() is expected to return
      BUG_TRAP_TYPE_WARN so the caller will increment NIP by 4 and continue.
      The __builtin_constant_p() path of PPC's WARN_ON() calls (indirectly)
      __WARN_FLAGS(), which has BUGFLAG_WARNING set; however, the other branch
      does not, which makes report_bug() report a bug rather than a warning.
      
      Fixes: f26dee15 ("debug: Avoid setting BUGFLAG_WARNING twice")
      Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      a093c92d
    • KVM: PPC: Book3S HV: Ignore timebase offset on POWER9 DD1 · 3d3efb68
      Committed by Paul Mackerras
      POWER9 DD1 has an erratum where writing to the TBU40 register, which
      is used to apply an offset to the timebase, can cause the timebase to
      lose counts.  This results in the timebase on some CPUs getting out of
      sync with other CPUs, which then results in misbehaviour of the
      timekeeping code.
      
      To work around the problem, we make KVM ignore the timebase offset for
      all guests on POWER9 DD1 machines.  This means that live migration
      cannot be supported on POWER9 DD1 machines.
      
      Cc: stable@vger.kernel.org # v4.10+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      3d3efb68
    • KVM: PPC: Book3S HV: Save/restore host values of debug registers · 7ceaa6dc
      Committed by Paul Mackerras
      At present, HV KVM on POWER8 and POWER9 machines loses any instruction
      or data breakpoint set in the host whenever a guest is run.
      Instruction breakpoints are currently only used by xmon, but ptrace
      and the perf_event subsystem can set data breakpoints as well as xmon.
      
      To fix this, we save the host values of the debug registers (CIABR,
      DAWR and DAWRX) before entering the guest and restore them on exit.
      To provide space to save them in the stack frame, we expand the stack
      frame allocated by kvmppc_hv_entry() from 112 to 144 bytes.
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      7ceaa6dc
  12. 15 June 2017 (6 commits)
    • iommu/vt-d: Correctly disable Intel IOMMU force on · 7304e8f2
      Committed by Shaohua Li
      I made a mistake in commit bfd20f1c. We should skip the force on with the
      option enabled instead of vice versa. Not sure why this passed our
      performance test, sorry.
      
      Fixes: bfd20f1c ('x86, iommu/vt-d: Add an option to disable Intel IOMMU force on')
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      7304e8f2
    • powerpc/xive: Fix offset for store EOI MMIOs · 25642705
      Committed by Benjamin Herrenschmidt
      Architecturally we should apply a 0x400 offset for these. Not doing
      it will break future HW implementations.
      
      The offset of 0 is supposed to remain for "triggers" though not all
      sources support both trigger and store EOI, and in P9 specifically,
      some sources will treat 0 as a store EOI. But future chips will not.
      So this makes us use the properly architected offset which should work
      always.
      
      Fixes: 243e2511 ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      25642705
    • MIPS: .its targets depend on vmlinux · bcd7c45e
      Committed by Paul Burton
      The .its targets require information about the kernel binary, such as
      its entry point, which is extracted from the vmlinux ELF. We therefore
      require that the ELF is built before the .its files are generated.
      Declare this requirement in the Makefile such that make will ensure this
      is always the case, otherwise in corner cases we can hit issues as the
      .its is generated with an incorrect (either invalid or stale) entry
      point.
      Signed-off-by: Paul Burton <paul.burton@imgtec.com>
      Fixes: cf2a5e0b ("MIPS: Support generating Flattened Image Trees (.itb)")
      Cc: linux-mips@linux-mips.org
      Cc: stable <stable@vger.kernel.org> # v4.9+
      Patchwork: https://patchwork.linux-mips.org/patch/16179/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      bcd7c45e
    • MIPS: Fix bnezc/jialc return address calculation · 1a73d931
      Committed by Paul Burton
      The code handling the pop76 opcode (ie. bnezc & jialc instructions) in
      __compute_return_epc_for_insn() needs to set the value of $31 in the
      jialc case, which is encoded with rs = 0. However its check to
      differentiate bnezc (rs != 0) from jialc (rs = 0) was unfortunately
      backwards, meaning that if we emulate a bnezc instruction we clobber $31
      & if we emulate a jialc instruction it actually behaves like a jic
      instruction.
      
      Fix this by inverting the check of rs to match the way the instructions
      are actually encoded.
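      
      A standalone sketch of the corrected check (the rs field sits in bits
      25:21 of the instruction word; the instruction encoding below is a
      made-up example):
      
        /* For this opcode group, rs == 0 encodes jialc (which must set $31 to
         * the return address) and rs != 0 encodes bnezc (which must not). */
        #include <stdint.h>
        #include <stdio.h>
        
        int main(void)
        {
                uint32_t insn = 0xf8000010;              /* hypothetical pop76-format word */
                unsigned int rs = (insn >> 21) & 0x1f;
        
                if (rs == 0)
                        printf("jialc: set $31 to the return address\n");
                else
                        printf("bnezc: leave $31 alone\n");
                return 0;
        }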
      Signed-off-by: Paul Burton <paul.burton@imgtec.com>
      Fixes: 28d6f93d ("MIPS: Emulate the new MIPS R6 BNEZC and JIALC instructions")
      Cc: stable <stable@vger.kernel.org> # v4.0+
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/16178/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      1a73d931
    • KVM: PPC: Book3S HV: Preserve userspace HTM state properly · 46a704f8
      Committed by Paul Mackerras
      If userspace attempts to call the KVM_RUN ioctl when it has hardware
      transactional memory (HTM) enabled, the values that it has put in the
      HTM-related SPRs TFHAR, TFIAR and TEXASR will get overwritten by
      guest values.  To fix this, we detect this condition and save those
      SPR values in the thread struct, and disable HTM for the task.  If
      userspace goes to access those SPRs or the HTM facility in future,
      a TM-unavailable interrupt will occur and the handler will reload
      those SPRs and re-enable HTM.
      
      If userspace has started a transaction and suspended it, we would
      currently lose the transactional state in the guest entry path and
      would almost certainly get a "TM Bad Thing" interrupt, which would
      cause the host to crash.  To avoid this, we detect this case and
      return from the KVM_RUN ioctl with an EINVAL error, with the KVM
      exit reason set to KVM_EXIT_FAIL_ENTRY.
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      46a704f8
    • KVM: PPC: Book3S HV: Restore critical SPRs to host values on guest exit · 4c3bb4cc
      Committed by Paul Mackerras
      This restores several special-purpose registers (SPRs) to sane values
      on guest exit that were missed before.
      
      TAR and VRSAVE are readable and writable by userspace, and we need to
      save and restore them to prevent the guest from potentially affecting
      userspace execution (not that TAR or VRSAVE are used by any known
      program that uses the KVM_RUN ioctl).  We save/restore these
      in kvmppc_vcpu_run_hv() rather than on every guest entry/exit.
      
      FSCR affects userspace execution in that it can prohibit access to
      certain facilities by userspace.  We restore it to the normal value
      for the task on exit from the KVM_RUN ioctl.
      
      IAMR is normally 0, and is restored to 0 on guest exit.  However,
      with a radix host on POWER9, it is set to a value that prevents the
      kernel from executing user-accessible memory.  On POWER9, we save
      IAMR on guest entry and restore it on guest exit to the saved value
      rather than 0.  On POWER8 we continue to set it to 0 on guest exit.
      
      PSPB is normally 0.  We restore it to 0 on guest exit to prevent
      userspace taking advantage of the guest having set it non-zero
      (which would allow userspace to set its SMT priority to high).
      
      UAMOR is normally 0.  We restore it to 0 on guest exit to prevent
      the AMR from being used as a covert channel between userspace
      processes, since the AMR is not context-switched at present.
      
      Fixes: b005255e ("KVM: PPC: Book3S HV: Context-switch new POWER8 SPRs", 2014-01-08)
      Cc: stable@vger.kernel.org # v3.14+
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      4c3bb4cc