Commits · f9e169883164390a15b56d00cb7e22c2e72f4dba · openeuler / raspberrypi-kernel

19 Jun, 2017 1 commit

mm: larger stack guard gap, between vmas · 1be7107f

Committed by Hugh Dickins on Jun 19, 2017

Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs like gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.

This will become especially dangerous for suid binaries and the default
unlimited stack size limit, because those applications can be
tricked into consuming a large portion of the stack, and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunately.

Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.

One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).

Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.

Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and the vma tree's subtree_gap support for that.
Original-patch-by: Oleg Nesterov <oleg@redhat.com>
Original-patch-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Helge Deller <deller@gmx.de> # parisc
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1be7107f
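
To make the vm_start_gap() idea from commit 1be7107f concrete, here is a minimal userspace sketch. The struct, the flag value and the 4 KiB page size are simplified stand-ins chosen for illustration, not the kernel's definitions; only the default of 256 pages (1 MB with 4k pages) comes from the commit message.

```c
#include <assert.h>

#define PAGE_SHIFT   12        /* assumption: 4 KiB pages */
#define VM_GROWSDOWN 0x0100UL  /* illustrative flag value */

/* Default gap of 256 pages, i.e. 1 MiB with 4 KiB pages. */
static unsigned long stack_guard_gap = 256UL << PAGE_SHIFT;

/* Simplified stand-in for the kernel's vm_area_struct. */
struct vma_sketch {
	unsigned long vm_start;
	unsigned long vm_end;
	unsigned long vm_flags;
};

/* Lowest address that must stay unmapped below this vma. */
static unsigned long vm_start_gap(const struct vma_sketch *vma)
{
	unsigned long start = vma->vm_start;

	assert(vma->vm_start <= vma->vm_end);
	if (vma->vm_flags & VM_GROWSDOWN) {
		start -= stack_guard_gap;
		if (start > vma->vm_start)  /* subtraction wrapped below zero */
			start = 0;
	}
	return start;
}
```

Because the gap is computed on demand instead of being stored inside the vma, the stack vma's own size is unchanged, so limits such as RLIMIT_AS are not charged for the gap.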

16 Jun, 2017 1 commit

powerpc/debug: Add missing warn flag to WARN_ON's non-builtin path · a093c92d

Committed by Alexey Kardashevskiy on Jun 14, 2017

When trapped on WARN_ON(), report_bug() is expected to return
BUG_TRAP_TYPE_WARN so the caller will increment NIP by 4 and continue.
The __builtin_constant_p() path of the PPC's WARN_ON()
calls (indirectly) __WARN_FLAGS() which has BUGFLAG_WARNING set,
however the other branch does not which makes report_bug() report a
bug rather than a warning.

Fixes: f26dee15 ("debug: Avoid setting BUGFLAG_WARNING twice")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

a093c92d

15 Jun, 2017 3 commits

powerpc/xive: Fix offset for store EOI MMIOs · 25642705

Committed by Benjamin Herrenschmidt on Jun 14, 2017

Architecturally we should apply a 0x400 offset for these. Not doing
it will break future HW implementations.

The offset of 0 is supposed to remain for "triggers" though not all
sources support both trigger and store EOI, and in P9 specifically,
some sources will treat 0 as a store EOI. But future chips will not.
So this makes us use the properly architected offset which should work
always.

Fixes: 243e2511 ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

25642705

MIPS: .its targets depend on vmlinux · bcd7c45e

Committed by Paul Burton on Jun 02, 2017

The .its targets require information about the kernel binary, such as
its entry point, which is extracted from the vmlinux ELF. We therefore
require that the ELF is built before the .its files are generated.
Declare this requirement in the Makefile such that make will ensure this
is always the case, otherwise in corner cases we can hit issues as the
.its is generated with an incorrect (either invalid or stale) entry
point.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Fixes: cf2a5e0b ("MIPS: Support generating Flattened Image Trees (.itb)")
Cc: linux-mips@linux-mips.org
Cc: stable <stable@vger.kernel.org> # v4.9+
Patchwork: https://patchwork.linux-mips.org/patch/16179/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

bcd7c45e

MIPS: Fix bnezc/jialc return address calculation · 1a73d931

Committed by Paul Burton on Jun 02, 2017

The code handling the pop76 opcode (ie. bnezc & jialc instructions) in
__compute_return_epc_for_insn() needs to set the value of $31 in the
jialc case, which is encoded with rs = 0. However its check to
differentiate bnezc (rs != 0) from jialc (rs = 0) was unfortunately
backwards, meaning that if we emulate a bnezc instruction we clobber $31
& if we emulate a jialc instruction it actually behaves like a jic
instruction.

Fix this by inverting the check of rs to match the way the instructions
are actually encoded.
Signed-off-by: Paul Burton <paul.burton@imgtec.com>
Fixes: 28d6f93d ("MIPS: Emulate the new MIPS R6 BNEZC and JIALC instructions")
Cc: stable <stable@vger.kernel.org> # v4.0+
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/16178/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

1a73d931
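
The rs check described in commit 1a73d931 can be sketched as a small decoder. This is a userspace illustration, not the kernel's __compute_return_epc_for_insn(); the helper names are hypothetical, and the field positions assume the usual MIPS layout with the major opcode in bits 31..26 and rs in bits 25..21.

```c
#include <assert.h>
#include <stdint.h>

#define POP76_OP 0x3e  /* major opcode 076 (octal), shared by bnezc and jialc */

/* Extract the rs field, bits 25..21 of the instruction word. */
static unsigned int insn_rs(uint32_t insn)
{
	return (insn >> 21) & 0x1f;
}

/*
 * Return 1 when the pop76 instruction is jialc (rs == 0), which links
 * through $31, and 0 when it is bnezc (rs != 0), which must not touch $31.
 */
static int pop76_writes_ra(uint32_t insn)
{
	assert((insn >> 26) == POP76_OP);
	return insn_rs(insn) == 0;
}
```

With the test backwards, as in the bug, the two results would be swapped: bnezc would clobber $31 and jialc would never set it.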

14 Jun, 2017 1 commit

powerpc/npu-dma: Remove spurious WARN_ON when a PCI device has no of_node · 377aa6b0

Committed by Alistair Popple on Jun 14, 2017

Commit 4c3b89ef ("powerpc/powernv: Add sanity checks to
pnv_pci_get_{gpu|npu}_dev") introduced explicit warnings in
pnv_pci_get_npu_dev() when a PCIe device has no associated device-tree
node. However not all PCIe devices have an of_node and
pnv_pci_get_npu_dev() gets indirectly called at least once for every
PCIe device in the system. This results in spurious WARN_ON()'s so
remove it.

The same situation should not exist for pnv_pci_get_gpu_dev() as any
NPU based PCIe device requires a device-tree node.

Fixes: 4c3b89ef ("powerpc/powernv: Add sanity checks to pnv_pci_get_{gpu|npu}_dev")
Reported-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

377aa6b0

13 Jun, 2017 2 commits

x86/mm: Disable 1GB direct mappings when disabling 2MB mappings · d9ee35ac

Committed by Vlastimil Babka on Jun 12, 2017

The kmemleak and debug_pagealloc features both disable using huge pages for
direct mappings so they can do cpa() on page level granularity in any context.

However they only do that for 2MB pages, which means 1GB pages can still be
used if the CPU supports it, unless disabled by a boot param, which is
non-obvious. Disable also 1GB pages when disabling 2MB pages.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/2be70c78-6130-855d-3dfa-d87bd1dd4fda@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>

d9ee35ac

x86/debug: Handle early WARN_ONs proper · 8a524f80

Committed by Peter Zijlstra on Jun 12, 2017

Hans managed to trigger a WARN very early in the boot which killed his
(Virtual) box.

The reason is that the recent rework of WARN() to use UD0 forgot to add the
fixup_bug() call to early_fixup_exception(). As a result the kernel does
not handle the WARN_ON injected UD0 exception and panics.

Add the missing fixup call, so early UD's injected by WARN() get handled.

Fixes: 9a93848f ("x86/debug: Implement __WARN() using UD0")
Reported-and-tested-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Frank Mehnert <frank.mehnert@oracle.com>
Cc: Hans de Goede <hdegoede@redhat.com>
Cc: Michael Thayer <michael.thayer@oracle.com>
Link: http://lkml.kernel.org/r/20170612180108.w4vgu2ckucmllf3a@hirez.programming.kicks-ass.net

8a524f80

11 Jun, 2017 2 commits

KVM: async_pf: avoid async pf injection when in guest mode · 9bc1f09f

Committed by Wanpeng Li on Jun 08, 2017

 INFO: task gnome-terminal-:1734 blocked for more than 120 seconds.
       Not tainted 4.12.0-rc4+ #8
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
 gnome-terminal- D    0  1734   1015 0x00000000
 Call Trace:
  __schedule+0x3cd/0xb30
  schedule+0x40/0x90
  kvm_async_pf_task_wait+0x1cc/0x270
  ? __vfs_read+0x37/0x150
  ? prepare_to_swait+0x22/0x70
  do_async_page_fault+0x77/0xb0
  ? do_async_page_fault+0x77/0xb0
  async_page_fault+0x28/0x30

This is triggered by running both win7 and win2016 on L1 KVM simultaneously
and then putting memory pressure on L1. I can observe this hang on L1 when
at least ~70% of the swap area is occupied on L0.

The cause is that an async pf which should have been injected into L1 was
injected into L2. The L2 guest starts receiving page faults with a bogus %cr2
(actually the apf token from the host), and the L1 guest starts accumulating
tasks stuck in D state in kvm_async_pf_task_wait() because the PAGE_READY
async pfs never arrive.

This patch fixes the hang by doing async pf only when executing the L1 guest.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

9bc1f09f

hexagon: Use raw_copy_to_user · 4d801cca

Committed by Guenter Roeck on May 02, 2017

Commit ac4691fa ("hexagon: switch to RAW_COPY_USER") replaced
__copy_to_user_hexagon() with raw_copy_to_user(), but did not catch
all callers, resulting in the following build error.

arch/hexagon/mm/uaccess.c: In function '__clear_user_hexagon':
arch/hexagon/mm/uaccess.c:40:3: error:
	implicit declaration of function '__copy_to_user_hexagon'

Fixes: ac4691fa ("hexagon: switch to RAW_COPY_USER")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Richard Kuo <rkuo@codeaurora.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>

4d801cca

09 Jun, 2017 1 commit

security/keys: add CONFIG_KEYS_COMPAT to Kconfig · 47b2c3ff

Committed by Bilal Amarni on Jun 08, 2017

CONFIG_KEYS_COMPAT is defined in arch-specific Kconfigs and is missing for
several 64-bit architectures: mips, parisc, tile.

At the moment, on those architectures, calling the keyctl syscall from
32-bit userspace returns an ENOSYS error.

This patch moves the CONFIG_KEYS_COMPAT option to security/keys/Kconfig, to
make sure the compatibility wrapper is registered by default for any 64-bit
architecture as long as it is configured with CONFIG_COMPAT.

[DH: Modified to remove arm64 compat enablement also as requested by Eric
 Biggers]
Signed-off-by: Bilal Amarni <bilal.amarni@gmail.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
cc: Eric Biggers <ebiggers3@gmail.com>
Signed-off-by: James Morris <james.l.morris@oracle.com>

47b2c3ff

08 Jun, 2017 13 commits

s390: update defconfig · 16ddcc34

Committed by Martin Schwidefsky on Jan 17, 2017

Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>

16ddcc34

MIPS: kprobes: flush_insn_slot should flush only if probe initialised · 698b8510

Committed by Marcin Nowakowski on Jun 08, 2017

When ftrace is used with kprobes, it is possible for a kprobe to contain
an invalid location (ie. only initialised to 0 and not to a specific
location in the code). Trying to perform a cache flush on such location
leads to a crash in r4k_flush_icache_range().

Fixes: c1bf207d ("MIPS: kprobe: Add support.")
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/16296/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

698b8510

KVM: cpuid: Fix read/write out-of-bounds vulnerability in cpuid emulation · a3641631

Committed by Wanpeng Li on Jun 08, 2017

If "i" is the last element in the vcpu->arch.cpuid_entries[] array, the
search for the next entry can potentially be exploited: it reads and writes
out of bounds.  Luckily, the effect is small:

	/* when no next entry is found, the current entry[i] is reselected */
	for (j = i + 1; ; j = (j + 1) % nent) {
		struct kvm_cpuid_entry2 *ej = &vcpu->arch.cpuid_entries[j];
		if (ej->function == e->function) {

It reads ej->maxphyaddr, which is user controlled.  However...

			ej->flags |= KVM_CPUID_FLAG_STATE_READ_NEXT;

After cpuid_entries there is

	int maxphyaddr;
	struct x86_emulate_ctxt emulate_ctxt;  /* 16-byte aligned */

So we have:

- cpuid_entries at offset 1B50 (6992)
- maxphyaddr at offset 27D0 (6992 + 3200 = 10192)
- padding at 27D4...27DF
- emulate_ctxt at 27E0

And it writes in the padding.  Pfew, writing the ops field of emulate_ctxt
would have been much worse.

This patch fixes it by modding the index to avoid the out-of-bounds
access. Worst case, i == j and ej->function == e->function,
the loop can bail out.
Reported-by: Moguofang <moguofang@huawei.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Guofang Mo <moguofang@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a3641631
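
The fix in commit a3641631 comes down to never letting the index leave the array. The sketch below shows one way to write such a wrap-around search in plain C; the struct is a reduced stand-in for kvm_cpuid_entry2 and the function name is hypothetical, so it illustrates the idea and is not the patched kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-in for struct kvm_cpuid_entry2: only the compared field. */
struct cpuid_entry_sketch {
	unsigned int function;
};

/*
 * Starting after entry i, return the index of the next entry with the same
 * function number, wrapping at nent. If no other entry matches, the walk
 * arrives back at i, so the loop always ends inside the array.
 */
static size_t next_matching_entry(const struct cpuid_entry_sketch *e,
				  size_t nent, size_t i)
{
	size_t j = i;

	assert(nent > 0 && i < nent);
	do {
		j = (j + 1) % nent;  /* reduce before the first access */
	} while (e[j].function != e[i].function);
	return j;
}
```

The difference from the quoted loop is that the very first index is already reduced modulo nent, so starting from the last element wraps to index 0 instead of reading one slot past the array.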

MIPS: ftrace: fix init functions tracing · 87051ec1

Committed by Marcin Nowakowski on May 23, 2017

Since introduction of tracing for init functions the in_kernel_space()
check is no longer correct, as it ignores the init sections. As a
result, when probes are inserted (and disabled) in the init functions,
a branch instruction is inserted instead of a nop, which is likely to
result in random crashes during boot.

Remove the MIPS-specific in_kernel_space() method and replace it with a
generic core_kernel_text() that also checks for init sections during
system boot stage.

Fixes: 42c269c8 ("ftrace: Allow for function tracing to record init functions on boot up")
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Tested-by: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/16092/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

87051ec1

MIPS: mm: adjust PKMAP location · c56e7a4c

Committed by Marcin Nowakowski on Apr 11, 2017

Space reserved for PKMap should span from PKMAP_BASE to FIXADDR_START.
For large page sizes this is not the case as eg. for 64k pages the range
currently defined is from 0xfe000000 to 0x102000000(!!) which obviously
isn't right.
Remove the hardcoded location and set the BASE address as an offset from
FIXADDR_START.

Since all PKMAP ptes have to be placed in a contiguous memory, ensure
that this is the case by placing them all in a single page. This is
achieved by aligning the end address to pkmap pages count pages.
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/15950/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

c56e7a4c
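
The address arithmetic behind commit c56e7a4c can be checked with a few lines of C. The sketch assumes a window of 1024 PKMAP pages (the fixed value mentioned in the following commit); the helper names and the fixmap start address used in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Size in bytes of a PKMAP window of last_pkmap pages. */
static uint64_t pkmap_size(uint64_t last_pkmap, uint64_t page_size)
{
	return last_pkmap * page_size;
}

/* End address of a window that starts at a hardcoded base. */
static uint64_t pkmap_end_from_base(uint64_t base, uint64_t last_pkmap,
				    uint64_t page_size)
{
	return base + pkmap_size(last_pkmap, page_size);
}

/*
 * Base address derived from the start of the fixmap area: align the end
 * down to the window size, then step one window back.
 */
static uint64_t pkmap_base_from_fixaddr(uint64_t fixaddr_start,
					uint64_t last_pkmap, uint64_t page_size)
{
	uint64_t size = pkmap_size(last_pkmap, page_size);

	assert(size != 0 && (size & (size - 1)) == 0);  /* power of two */
	return (fixaddr_start & ~(size - 1)) - size;
}
```

With a hardcoded base of 0xfe000000 and 64k pages, pkmap_end_from_base() gives 0x102000000, the out-of-range value quoted in the message. Deriving the base from a fixmap start of, say, 0xfffe0000 gives 0xf8000000 and keeps the whole window below the fixmap.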

MIPS: highmem: ensure that we don't use more than one page for PTEs · 725a269b

Committed by Marcin Nowakowski on Apr 11, 2017

All PTEs used by PKMAP should be allocated in a contiguous memory area,
but we do not currently have a mechanism to enforce that, so ensure that
we don't try to allocate more entries than would fit in a single page.

Current fixed value of 1024 would not work with XPA enabled when
sizeof(pte_t)==8 and we need two pages to store pte tables.
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/15949/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

725a269b
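
The single-page constraint in commit 725a269b is simple arithmetic, sketched here with hypothetical helper names. The 4 KiB page size used in the usage note is an assumption; the 8-byte pte_t for XPA and the fixed value of 1024 come from the commit message.

```c
#include <assert.h>

/* Number of PTEs that fit in one page of page tables. */
static unsigned long ptes_per_page(unsigned long page_size,
				   unsigned long pte_size)
{
	assert(pte_size != 0);
	return page_size / pte_size;
}

/* Cap the requested PKMAP entry count at what one page of PTEs can hold. */
static unsigned long cap_last_pkmap(unsigned long wanted,
				    unsigned long page_size,
				    unsigned long pte_size)
{
	unsigned long max = ptes_per_page(page_size, pte_size);

	return wanted < max ? wanted : max;
}
```

With 4 KiB pages, a 4-byte pte_t leaves room for 1024 entries, while an 8-byte pte_t leaves room for only 512, so 1024 entries would spill into a second page.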

MIPS: mm: fixed mappings: correct initialisation · 71eb989a

Committed by Marcin Nowakowski on Apr 11, 2017

fixrange_init operates at PMD-granularity and expects the addresses to
be PMD-size aligned, but currently that might not be the case for
PKMAP_BASE unless it is defined properly, so ensure a correct alignment
is used before passing the address to fixrange_init.

fixed mappings: only align the start address that is passed to
fixrange_init rather than the value before adding the size, as we may
end up with uninitialised upper part of the range.
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/15948/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

71eb989a

MIPS: perf: Remove incorrect odd/even counter handling for I6400 · f7a31b5e

Committed by Marcin Nowakowski on Apr 19, 2017

All performance counters on I6400 (odd and even) are capable of counting
any of the available events, so drop current logic of using the extra
bit to determine which counter to use.
Signed-off-by: Marcin Nowakowski <marcin.nowakowski@imgtec.com>
Fixes: 4e88a862 ("MIPS: Add cases for CPU_I6400")
Fixes: fd716fca ("MIPS: perf: Fix I6400 event numbers")
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/15991/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>

f7a31b5e

powerpc/book3s64: Move PPC_DT_CPU_FTRs and enable it by default · c6ee9619

Committed by Michael Ellerman on Jun 08, 2017

The PPC_DT_CPU_FTRs is a bit misplaced in menuconfig, it shows up with
other general kernel options. It's really more at home in the "Platform
Support" section, so move it there.

Also enable it by default, for Book3s 64. It does mostly nothing unless
the device tree properties are found, and we will want it enabled
eventually in distro kernels, so turn it on to start getting more
testing.

Fixes: 5a61ef74 ("powerpc/64s: Support new device tree binding for discovering CPU features")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

c6ee9619

powerpc/mm/4k: Limit 4k page size config to 64TB virtual address space · 92d9dfda

Committed by Aneesh Kumar K.V on Jun 01, 2017

Supporting 512TB requires us to do an order-3 allocation for the level 1 page
table (pgd). This results in page allocation failures with certain workloads.
For now limit 4k linux page size config to 64TB.

Fixes: f6eedbba ("powerpc/mm/hash: Increase VA range to 128TB")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

92d9dfda

locking/x86: Remove the unused atomic_inc_short() methd · 31b35f6b

Committed by Dmitry Vyukov on May 26, 2017

It is completely unused and implemented only on x86.
Remove it.
Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170526172900.91058-1-dvyukov@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

31b35f6b

x86/microcode/intel: Clear patch pointer before jettisoning the initrd · 5b0bc9ac

Committed by Dominik Brodowski on Jun 07, 2017

During early boot, load_ucode_intel_ap() uses __load_ucode_intel()
to obtain a pointer to the relevant microcode patch (embedded in the
initrd), and stores this value in 'intel_ucode_patch' to speed up the
microcode patch application for subsequent CPUs.

On resuming from suspend-to-RAM, however, load_ucode_ap() calls
load_ucode_intel_ap() for each non-boot-CPU. By then the initramfs is
long gone so the pointer stored in 'intel_ucode_patch' no longer points to
a valid microcode patch.

Clear that pointer so that we effectively fall back to the CPU hotplug
notifier callbacks to update the microcode.
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
[ Edit and massage commit message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org> # 4.10..
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170607095819.9754-1-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>

5b0bc9ac

bpf, arm64: use separate register for state in stxr · 7005cade

Committed by Daniel Borkmann on Jun 07, 2017

Will reported that in BPF_XADD we must use a different register in stxr
instruction for the status flag due to otherwise CONSTRAINED UNPREDICTABLE
behavior per architecture. Reference manual says [1]:

  If s == t, then one of the following behaviors must occur:

   * The instruction is UNDEFINED.
   * The instruction executes as a NOP.
   * The instruction performs the store to the specified address, but
     the value stored is UNKNOWN.

Thus, use a different temporary register for the status flag to fix it.

Disassembly extract from test 226/STX_XADD_DW from test_bpf.ko:

  [...]
  0000003c:  c85f7d4b  ldxr x11, [x10]
  00000040:  8b07016b  add x11, x11, x7
  00000044:  c80c7d4b  stxr w12, x11, [x10]
  00000048:  35ffffac  cbnz w12, 0x0000003c
  [...]

  [1] https://static.docs.arm.com/ddi0487/b/DDI0487B_a_armv8_arm.pdf, p.6132

Fixes: 85f68fe8 ("bpf, arm64: implement jiting of BPF_XADD")
Reported-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7005cade
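
The register constraint from commit 7005cade can be expressed as a small encoder that refuses the s == t case. This is an illustrative sketch, not the arm64 BPF JIT; the function name is hypothetical, and the base opcode is derived from the disassembly quoted in the commit (c80c7d4b for stxr w12, x11, [x10]).

```c
#include <assert.h>
#include <stdint.h>

#define A64_STXR_X_BASE 0xc8007c00u  /* 64-bit stxr with Rs, Rn, Rt all zero */

/*
 * Encode "stxr w<rs>, x<rt>, [x<rn>]". Returns 0 on success, or -1 when the
 * status register equals the data register (the s == t case quoted above).
 */
static int encode_stxr(unsigned int rs, unsigned int rt, unsigned int rn,
		       uint32_t *out)
{
	assert(rs < 32 && rt < 32 && rn < 32);
	if (rs == rt)
		return -1;
	*out = A64_STXR_X_BASE | (rs << 16) | (rn << 5) | rt;
	return 0;
}
```

Encoding rs = 12, rt = 11, rn = 10 reproduces the word at offset 0x44 in the disassembly, where w12 is a temporary that differs from the data register x11.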

07 Jun, 2017 13 commits

sparc64: delete old wrap code · 0197e41c

Committed by Pavel Tatashin on May 31, 2017

The old method that is using xcall and softint to get new context id is
deleted, as it is replaced by a method of using per_cpu_secondary_mm
without xcall to perform the context wrap.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

0197e41c

sparc64: new context wrap · a0582f26

Committed by Pavel Tatashin on May 31, 2017

The current wrap implementation has a race issue: it is called outside of
the ctx_alloc_lock, and also does not wait for all CPUs to complete the
wrap. This means that a thread can get a new context with a new version
and another thread might still be running with the same context. The
problem is especially severe on CPUs with shared TLBs, like sun4v. I used
the following test to very quickly reproduce the problem:
- start over 8K processes (must be more than context IDs)
- write and read values at a memory location in every process.

Very quickly memory corruptions start happening, and what we read back
does not equal what we wrote.

Several approaches were explored before settling on this one:

Approach 1:
Move smp_new_mmu_context_version() inside ctx_alloc_lock, and wait for
every process to complete the wrap. (Note: every CPU must WAIT before
leaving smp_new_mmu_context_version_client() until every one arrives).

This approach ends up with deadlocks, as some threads own locks which other
threads are waiting for, and they never receive softint until these threads
exit smp_new_mmu_context_version_client(). Since we do not allow the exit,
deadlock happens.

Approach 2:
Handle wrap right during mondo interrupt. Use etrap/rtrap to enter
into C code, and issue new versions to every CPU.
This approach adds some overhead to runtime: in switch_mm() we must add
some checks to make sure that versions have not changed due to wrap while
we were loading the new secondary context. (This could be protected by
PSTATE_IE, but that degrades performance on M7 and older CPUs, as it takes
50 cycles for each access.) Also, we still need a global per-cpu array of MMs
to know where we need to load new contexts, otherwise we can change context
to a thread that is going away (if we received mondo between switch_mm() and
switch_to() time). Finally, there are some issues with window registers in
rtrap() when context IDs are changed during CPU mondo time.

The approach in this patch is the simplest and has almost no impact on
runtime. We use the array with mm's where last secondary contexts were
loaded onto CPUs and bump their versions to the new generation without
changing context IDs. If a new process comes in to get a context ID, it
will go through get_new_mmu_context() because of version mismatch. But the
running processes do not need to be interrupted. And wrap is quicker as we
do not need to xcall and wait for everyone to receive and complete wrap.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

a0582f26
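
The version-bump approach chosen in commit a0582f26 can be modelled in a few lines. This is a much simplified sketch: it tracks context values in a plain array instead of mm structures, ignores locking and the context ID bitmap, and assumes 13 ID bits (8K context IDs, consistent with the "over 8K processes" test). All names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define CTX_NR_BITS      13  /* assumption: 8K context IDs */
#define CTX_NR_MASK      ((1UL << CTX_NR_BITS) - 1)
#define CTX_VERSION_MASK (~CTX_NR_MASK)

/* A context is current when its version bits match the global version. */
static int ctx_is_current(unsigned long ctx, unsigned long version)
{
	return ((ctx ^ version) & CTX_VERSION_MASK) == 0;
}

/*
 * Wrap: advance the global version and re-stamp only the contexts that are
 * loaded on a CPU right now, keeping their IDs. Every other context keeps
 * its old version and fails ctx_is_current() the next time it is checked.
 */
static unsigned long ctx_wrap(unsigned long version,
			      unsigned long *loaded, size_t ncpus)
{
	unsigned long next = version + (1UL << CTX_NR_BITS);
	size_t cpu;

	for (cpu = 0; cpu < ncpus; cpu++)
		loaded[cpu] = next | (loaded[cpu] & CTX_NR_MASK);
	return next;
}
```

Running contexts stay valid without any cross-call, which is the property the message highlights; a context that was not loaded anywhere sees a version mismatch and has to request a new ID.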

sparc64: add per-cpu mm of secondary contexts · 7a5b4bbf

Committed by Pavel Tatashin on May 31, 2017

The new wrap is going to use information from this array to figure out
mm's that currently have valid secondary contexts setup.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

7a5b4bbf

sparc64: redefine first version · c4415235

Committed by Pavel Tatashin on May 31, 2017

CTX_FIRST_VERSION defines the first context version, but also it defines
first context. This patch redefines it to only include the first context
version.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c4415235

sparc64: combine activate_mm and switch_mm · 14d0334c

Committed by Pavel Tatashin on May 31, 2017

The only difference between these two functions is that in activate_mm we
unconditionally flush context. However, there is no need to keep this
difference after fixing a bug where cpumask was not reset on a wrap. So, in
this patch we combine these.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

14d0334c

sparc64: reset mm cpumask after wrap · 58897485

Committed by Pavel Tatashin on May 31, 2017

After a wrap (getting a new context version) a process must get a new
context id, which means that we would need to flush the context id from
the TLB before running for the first time with this ID on every CPU. But,
we use mm_cpumask to determine if this process has been running on this CPU
before, and this mask is not reset after a wrap. So, there are two possible
fixes for this issue:

1. Clear mm cpumask whenever mm gets a new context id
2. Unconditionally flush context every time process is running on a CPU

This patch implements the first solution.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

58897485

sparc/mm/hugepages: Fix setup_hugepagesz for invalid values. · f322980b

Committed by Liam R. Howlett on May 30, 2017

hugetlb_bad_size needs to be called on invalid values.  Also change the
pr_warn to a pr_err to better align with other platforms.
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

f322980b

sparc: Machine description indices can vary · c982aa9c

Committed by James Clarke on May 29, 2017

VIO devices were being looked up by their index in the machine
description node block, but this often varies over time as devices are
added and removed. Instead, store the ID and look up using the type,
config handle and ID.
Signed-off-by: James Clarke <jrtc27@jrtc27.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=112541
Signed-off-by: David S. Miller <davem@davemloft.net>

c982aa9c

sparc64: mm: fix copy_tsb to correctly copy huge page TSBs · 654f4807

Committed by Mike Kravetz on Jun 02, 2017

When a TSB grows beyond its current capacity, a new TSB is allocated
and copy_tsb is called to copy entries from the old TSB to the new.
A hash shift based on page size is used to calculate the index of an
entry in the TSB.  copy_tsb has hard coded PAGE_SHIFT in these
calculations.  However, for huge page TSBs the value REAL_HPAGE_SHIFT
should be used.  As a result, when copy_tsb is called for a huge page
TSB the entries are placed at the incorrect index in the newly
allocated TSB.  When doing hardware table walk, the MMU does not
match these entries and we end up in the TSB miss handling code.
This code will then create and write an entry to the correct index
in the TSB.  We take a performance hit for the table walk miss and
recreation of these entries.

Pass a new parameter to copy_tsb that is the page size shift to be
used when copying the TSB.
Suggested-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

654f4807
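
The effect of using the wrong shift in copy_tsb (commit 654f4807) shows up directly in the index calculation. The sketch below is a userspace illustration with a hypothetical helper; the shift values (13 for base pages, 22 for real huge pages) are assumptions about a typical sparc64 configuration and are not stated in the commit message.

```c
#include <assert.h>

#define PAGE_SHIFT       13  /* assumption: 8 KiB base pages */
#define REAL_HPAGE_SHIFT 22  /* assumption: 4 MiB real huge pages */

/* Index of vaddr in a TSB with nentries slots (a power of two). */
static unsigned long tsb_index(unsigned long vaddr, unsigned int shift,
			       unsigned long nentries)
{
	assert(nentries != 0 && (nentries & (nentries - 1)) == 0);
	return (vaddr >> shift) & (nentries - 1);
}
```

For a virtual address of 0xc00000 and a 512-entry TSB, the huge page shift selects slot 3 while the base page shift selects slot 0. An entry copied to slot 0 is never found by a lookup that hashes with the huge page shift, which leads to the TSB misses described above.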

arch/sparc: support NR_CPUS = 4096 · c79a1373

Committed by Jane Chu on Jun 06, 2017

Linux SPARC64 limits NR_CPUS to 4064 because init_cpu_send_mondo_info()
only allocates a single page for NR_CPUS mondo entries. Thus we cannot
use all 4096 CPUs on some SPARC platforms.

To fix, allocate (2^order) pages where order is set according to the size
of cpu_list for possible cpus. Since cpu_list_pa and cpu_mondo_block_pa
are not used in asm code, there are no imm13 offsets from the base PA
that will break because they can only reach one page.

Orabug: 25505750
Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Bob Picco <bob.picco@oracle.com>
Reviewed-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

c79a1373
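
One reading of the 4064 limit in commit c79a1373 is that a single page had to hold both a fixed-size mondo block and a per-CPU list. The sketch below works through that arithmetic. The three sizes are assumptions chosen because they reproduce the 4064 figure; the commit message does not state them, and the helper names are hypothetical.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE   8192UL  /* assumption: 8 KiB pages */
#define SKETCH_MONDO_BYTES 64UL    /* assumption: mondo block size */
#define SKETCH_ENTRY_BYTES 2UL     /* assumption: 16-bit CPU list entries */

/* CPU list entries that fit in one page next to the mondo block. */
static unsigned long cpus_in_one_page(void)
{
	return (SKETCH_PAGE_SIZE - SKETCH_MONDO_BYTES) / SKETCH_ENTRY_BYTES;
}

/* Smallest order with (page size << order) >= bytes. */
static unsigned int order_for_bytes(unsigned long bytes)
{
	unsigned int order = 0;

	assert(bytes > 0);
	while ((SKETCH_PAGE_SIZE << order) < bytes)
		order++;
	return order;
}
```

Under these assumptions, 4096 list entries plus the mondo block need 8256 bytes, which no longer fits in one page and requires an order-1 allocation.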

arm: KVM: Allow unaligned accesses at HYP · 33b5c388

Committed by Marc Zyngier on Jun 06, 2017

We currently have the HSCTLR.A bit set, trapping unaligned accesses
at HYP, but we're not really prepared to deal with it.

Since the rest of the kernel is pretty happy about that, let's follow
its example and set HSCTLR.A to zero. Modern CPUs don't really care.

Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <cdall@linaro.org>

33b5c388

arm64: KVM: Allow unaligned accesses at EL2 · 78fd6dcf

Committed by Marc Zyngier on Jun 06, 2017

We currently have the SCTLR_EL2.A bit set, trapping unaligned accesses
at EL2, but we're not really prepared to deal with it. So far, this
has been unnoticed, until GCC 7 started emitting those (in particular
64bit writes on a 32bit boundary).

Since the rest of the kernel is pretty happy about that, let's follow
its example and set SCTLR_EL2.A to zero. Modern CPUs don't really
care.

Cc: stable@vger.kernel.org
Reported-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <cdall@linaro.org>

78fd6dcf

arm64: KVM: Preserve RES1 bits in SCTLR_EL2 · d68c1f7f

Committed by Marc Zyngier on Jun 06, 2017

__do_hyp_init has the rather bad habit of ignoring RES1 bits and
writing them back as zero. On a v8.0-8.2 CPU, this doesn't do anything
bad, but may end-up being pretty nasty on future revisions of the
architecture.

Let's preserve those bits so that we don't have to fix this later on.

Cc: stable@vger.kernel.org
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <cdall@linaro.org>

d68c1f7f
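
Commits 78fd6dcf and d68c1f7f both concern the value written to SCTLR_EL2: the A (alignment check) bit should be clear, and RES1 bits should be written as one. A minimal sketch of composing such a value follows. The function name is hypothetical and the RES1 mask is passed in as a parameter so that no specific bit list is implied; the real code is assembly in __do_hyp_init.

```c
#include <assert.h>
#include <stdint.h>

#define SCTLR_A_BIT (UINT32_C(1) << 1)  /* alignment check enable */

/*
 * Build the value to write: always keep the RES1 bits set, add the wanted
 * feature flags, and leave the alignment check off.
 */
static uint32_t sctlr_el2_value(uint32_t res1_mask, uint32_t flags)
{
	return res1_mask | (flags & ~SCTLR_A_BIT);
}
```

Composing the value from an explicit RES1 mask, instead of writing the feature flags alone, is what keeps reserved bits from being zeroed on architecture revisions that give them a meaning.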

06 Jun, 2017 3 commits

KVM: nVMX: Fix exception injection · d4912215

Committed by Wanpeng Li on Jun 05, 2017

 WARNING: CPU: 3 PID: 2840 at arch/x86/kvm/vmx.c:10966 nested_vmx_vmexit+0xdcd/0xde0 [kvm_intel]
 CPU: 3 PID: 2840 Comm: qemu-system-x86 Tainted: G           OE   4.12.0-rc3+ #23
 RIP: 0010:nested_vmx_vmexit+0xdcd/0xde0 [kvm_intel]
 Call Trace:
  ? kvm_check_async_pf_completion+0xef/0x120 [kvm]
  ? rcu_read_lock_sched_held+0x79/0x80
  vmx_queue_exception+0x104/0x160 [kvm_intel]
  ? vmx_queue_exception+0x104/0x160 [kvm_intel]
  kvm_arch_vcpu_ioctl_run+0x1171/0x1ce0 [kvm]
  ? kvm_arch_vcpu_load+0x47/0x240 [kvm]
  ? kvm_arch_vcpu_load+0x62/0x240 [kvm]
  kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
  ? kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
  ? __fget+0xf3/0x210
  do_vfs_ioctl+0xa4/0x700
  ? __fget+0x114/0x210
  SyS_ioctl+0x79/0x90
  do_syscall_64+0x81/0x220
  entry_SYSCALL64_slow_path+0x25/0x25

This is triggered occasionally by running both win7 and win2016 in L2, in
addition, EPT is disabled on both L1 and L2. It can't be reproduced easily.

Commit 0b6ac343 (KVM: nVMX: Correct handling of exception injection) mentioned
that "KVM wants to inject page-faults which it got to the guest. This function
assumes it is called with the exit reason in vmcs02 being a #PF exception".
Commit e011c663 (KVM: nVMX: Check all exceptions for intercept during delivery to
L2) allows checking all exceptions for intercept during delivery to L2. However,
there is currently no guarantee that the exit reason is an exception. When an
external interrupt occurs on the host (maybe a timer interrupt for the host, which
should not be injected into the guest) and an exception is queued somewhere, the
function nested_vmx_check_exception() is called and the vmexit emulation code
tries to emulate the "Acknowledge interrupt on exit" behavior, which triggers
the warning.

Reusing the exit reason from the L2->L0 vmexit is wrong in this case,
the reason must always be EXCEPTION_NMI when injecting an exception into
L1 as a nested vmexit.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Fixes: e011c663 ("KVM: nVMX: Check all exceptions for intercept during delivery to L2")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>

d4912215

kvm: async_pf: fix rcu_irq_enter() with irqs enabled · bbaf0e2b

Committed by Paolo Bonzini on Apr 26, 2017

native_safe_halt enables interrupts, and you just shouldn't
call rcu_irq_enter() with interrupts enabled.  Reorder the
call with the following local_irq_disable() to respect the
invariant.
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>

bbaf0e2b

powerpc/perf: Fix Power9 test_adder fields · 8c218578

Committed by Madhavan Srinivasan on May 26, 2017

Commit 8d911904 ('powerpc/perf: Add restrictions to PMC5 in power9 DD1')
was added to restrict the use of PMC5 in Power9 DD1. Intention was to disable
the use of PMC5 using raw event code. But instead of updating the
power9_isa207_pmu structure (used on DD1), the commit incorrectly updated the
power9_pmu structure. Fix it.

Fixes: 8d911904 ("powerpc/perf: Add restrictions to PMC5 in power9 DD1")
Reported-by: Shriya <shriyak@linux.vnet.ibm.com>
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Tested-by: Shriya <shriyak@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

8c218578