提交 · d38153f9ccc9b6b6a27a91559999292c27b72b8c · openeuler / Kernel

19 6月, 2019 9 次提交

powerpc/64s/radix: ioremap use ioremap_page_range · d38153f9

由 Nicholas Piggin 提交于 6月 10, 2019

Radix can use ioremap_page_range for ioremap, after slab is available.
This makes it possible to enable huge ioremap mapping support.
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

d38153f9

powerpc/64: __ioremap_at clean up in the error case · a72808a7

由 Nicholas Piggin 提交于 6月 10, 2019

__ioremap_at error handling is wonky, it requires caller to clean up
after it. Implement a helper that does the map and error cleanup and
remove the requirement from the caller.
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

a72808a7

powerpc/perf: Use cpumask_last() to determine the designated cpu for nest/core units. · 9c9f8fb7

由 Anju T Sudhakar 提交于 6月 10, 2019

Nest and core IMC (In-Memory Collection counters) assigns a particular
cpu as the designated target for counter data collection. During
system boot, the first online cpu in a chip gets assigned as the
designated cpu for that chip(for nest-imc) and the first online cpu in
a core gets assigned as the designated cpu for that core(for
core-imc).

If the designated cpu goes offline, the next online cpu from the same
chip(for nest-imc)/core(for core-imc) is assigned as the next target,
and the event context is migrated to the target cpu. Currently,
cpumask_any_but() function is used to find the target cpu. Though this
function is expected to return a `random` cpu, this always returns the
next online cpu.

If all cpus in a chip/core is offlined in a sequential manner,
starting from the first cpu, the event migration has to happen for all
the cpus which goes offline. Since the migration process involves a
grace period, the total time taken to offline all the cpus will be
significantly high.

Example:
  In a system which has 2 sockets, with
  NUMA node0 CPU(s):     0-87
  NUMA node8 CPU(s):     88-175

  Time taken to offline cpu 88-175:
  real    2m56.099s
  user    0m0.191s
  sys     0m0.000s

Use cpumask_last() to choose the target cpu, when the designated cpu
goes online, so the migration will happen only when the last_cpu in
the mask goes offline. This way the time taken to offline all cpus in
a chip/core can be reduced.

With the patch:

  Time taken  to offline cpu 88-175:
  real    0m12.207s
  user    0m0.171s
  sys     0m0.000s

Offlining all cpus in reverse order is also taken care because,
cpumask_any_but() is used to find the designated cpu if the last cpu
in the mask goes offline. Since cpumask_any_but() always return the
first cpu in the mask, that becomes the designated cpu and migration
will happen only when the first_cpu in the mask goes offline.

Example: With the patch,

  Time taken to offline cpu from 175-88:
  real    0m9.330s
  user    0m0.110s
  sys     0m0.000s
Signed-off-by: NAnju T Sudhakar <anju@linux.vnet.ibm.com>
Reviewed-by: NMadhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

9c9f8fb7

powerpc/64s: Fix misleading SPR and timebase information · 87997471

由 Shaokun Zhang 提交于 5月 29, 2019

pr_info shows SPR and timebase as a decimal value with a '0x'
prefix, which is somewhat misleading.

Fix it to print hexadecimal, as was intended.

Fixes: 10d91611 ("powerpc/64s: Reimplement book3s idle code in C")
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: NShaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

87997471

powerpc/pseries: avoid blocking in irq when queuing hotplug events · 348ea30f

由 Nathan Lynch 提交于 5月 28, 2019

A couple of bugs in queue_hotplug_event():

1. Unchecked kmalloc result which could lead to an oops.
2. Use of GFP_KERNEL allocations in interrupt context (this code's
   only caller is ras_hotplug_interrupt()).

Use kmemdup to avoid open-coding the allocation+copy and check for
failure; use GFP_ATOMIC for both allocations.

Ultimately it probably would be better to avoid or reduce allocations
in this path if possible.
Signed-off-by: NNathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

348ea30f

powerpc/watchpoint: Restore NV GPRs while returning from exception · f474c28f

由 Ravi Bangoria 提交于 6月 13, 2019

powerpc hardware triggers watchpoint before executing the instruction.
To make trigger-after-execute behavior, kernel emulates the
instruction. If the instruction is 'load something into non-volatile
register', exception handler should restore emulated register state
while returning back, otherwise there will be register state
corruption. eg, adding a watchpoint on a list can corrput the list:

  # cat /proc/kallsyms | grep kthread_create_list
  c00000000121c8b8 d kthread_create_list

Add watchpoint on kthread_create_list->prev:

  # perf record -e mem:0xc00000000121c8c0

Run some workload such that new kthread gets invoked. eg, I just
logged out from console:

  list_add corruption. next->prev should be prev (c000000001214e00), \
	but was c00000000121c8b8. (next=c00000000121c8b8).
  WARNING: CPU: 59 PID: 309 at lib/list_debug.c:25 __list_add_valid+0xb4/0xc0
  CPU: 59 PID: 309 Comm: kworker/59:0 Kdump: loaded Not tainted 5.1.0-rc7+ #69
  ...
  NIP __list_add_valid+0xb4/0xc0
  LR __list_add_valid+0xb0/0xc0
  Call Trace:
  __list_add_valid+0xb0/0xc0 (unreliable)
  __kthread_create_on_node+0xe0/0x260
  kthread_create_on_node+0x34/0x50
  create_worker+0xe8/0x260
  worker_thread+0x444/0x560
  kthread+0x160/0x1a0
  ret_from_kernel_thread+0x5c/0x70

List corruption happened because it uses 'load into non-volatile
register' instruction:

Snippet from __kthread_create_on_node:

  c000000000136be8:     addis   r29,r2,-19
  c000000000136bec:     ld      r29,31424(r29)
        if (!__list_add_valid(new, prev, next))
  c000000000136bf0:     mr      r3,r30
  c000000000136bf4:     mr      r5,r28
  c000000000136bf8:     mr      r4,r29
  c000000000136bfc:     bl      c00000000059a2f8 <__list_add_valid+0x8>

Register state from WARN_ON():

  GPR00: c00000000059a3a0 c000007ff23afb50 c000000001344e00 0000000000000075
  GPR04: 0000000000000000 0000000000000000 0000001852af8bc1 0000000000000000
  GPR08: 0000000000000001 0000000000000007 0000000000000006 00000000000004aa
  GPR12: 0000000000000000 c000007ffffeb080 c000000000137038 c000005ff62aaa00
  GPR16: 0000000000000000 0000000000000000 c000007fffbe7600 c000007fffbe7370
  GPR20: c000007fffbe7320 c000007fffbe7300 c000000001373a00 0000000000000000
  GPR24: fffffffffffffef7 c00000000012e320 c000007ff23afcb0 c000000000cb8628
  GPR28: c00000000121c8b8 c000000001214e00 c000007fef5b17e8 c000007fef5b17c0

Watchpoint hit at 0xc000000000136bec.

  addis   r29,r2,-19
   => r29 = 0xc000000001344e00 + (-19 << 16)
   => r29 = 0xc000000001214e00

  ld      r29,31424(r29)
   => r29 = *(0xc000000001214e00 + 31424)
   => r29 = *(0xc00000000121c8c0)

0xc00000000121c8c0 is where we placed a watchpoint and thus this
instruction was emulated by emulate_step. But because handle_dabr_fault
did not restore emulated register state, r29 still contains stale
value in above register state.

Fixes: 5aae8a53 ("powerpc, hw_breakpoints: Implement hw_breakpoints for 64-bit server processors")
Signed-off-by: NRavi Bangoria <ravi.bangoria@linux.ibm.com>
Cc: stable@vger.kernel.org # 2.6.36+
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

f474c28f

powerpc/ps3: Use [] to denote a flexible array member · 0b1be03f

由 Geert Uytterhoeven 提交于 6月 17, 2019

Flexible array members should be denoted using [] instead of [0], else
gcc will not warn when they are no longer at the end of the structure.
Signed-off-by: NGeert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

0b1be03f

powerpc/mm/32s: fix condition that is always true · 46c2478a

由 Andreas Schwab 提交于 6月 17, 2019

Move a misplaced paren that makes the condition always true.

Fixes: 63b2bc61 ("powerpc/mm/32s: Use BATs for STRICT_KERNEL_RWX")
Cc: stable@vger.kernel.org # v5.1+
Signed-off-by: NAndreas Schwab <schwab@linux-m68k.org>
Reviewed-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

46c2478a

powerpc/32s: fix suspend/resume when IBATs 4-7 are used · 6ecb78ef

由 Christophe Leroy 提交于 6月 17, 2019

Previously, only IBAT1 and IBAT2 were used to map kernel linear mem.
Since commit 63b2bc61 ("powerpc/mm/32s: Use BATs for
STRICT_KERNEL_RWX"), we may have all 8 BATs used for mapping
kernel text. But the suspend/restore functions only save/restore
BATs 0 to 3, and clears BATs 4 to 7.

Make suspend and restore functions respectively save and reload
the 8 BATs on CPUs having MMU_FTR_USE_HIGH_BATS feature.
Reported-by: NAndreas Schwab <schwab@linux-m68k.org>
Cc: stable@vger.kernel.org
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

6ecb78ef

15 6月, 2019 4 次提交

powerpc/64: mark start_here_multiplatform as __ref · 9c4e4c90

由 Christophe Leroy 提交于 5月 10, 2019

Otherwise, the following warning is encountered:

WARNING: vmlinux.o(.text+0x3dc6): Section mismatch in reference from the variable start_here_multiplatform to the function .init.text:.early_setup()
The function start_here_multiplatform() references
the function __init .early_setup().
This is often because start_here_multiplatform lacks a __init
annotation or the annotation of .early_setup is wrong.

Fixes: 56c46bba ("powerpc/64: Fix booting large kernels with STRICT_KERNEL_RWX")
Cc: Russell Currey <ruscur@russell.cc>
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

9c4e4c90

powerpc/pseries/mobility: rebuild cacheinfo hierarchy post-migration · e610a466

由 Nathan Lynch 提交于 6月 11, 2019

It's common for the platform to replace the cache device nodes after a
migration. Since the cacheinfo code is never informed about this, it
never drops its references to the source system's cache nodes, causing
it to wind up in an inconsistent state resulting in warnings and oopses
as soon as CPU online/offline occurs after the migration, e.g.

  cache for /cpus/l3-cache@3113(Unified) refers to cache for /cpus/l2-cache@200d(Unified)
  WARNING: CPU: 15 PID: 86 at arch/powerpc/kernel/cacheinfo.c:176 release_cache+0x1bc/0x1d0
  [...]
  NIP release_cache+0x1bc/0x1d0
  LR  release_cache+0x1b8/0x1d0
  Call Trace:
    release_cache+0x1b8/0x1d0 (unreliable)
    cacheinfo_cpu_offline+0x1c4/0x2c0
    unregister_cpu_online+0x1b8/0x260
    cpuhp_invoke_callback+0x114/0xf40
    cpuhp_thread_fun+0x270/0x310
    smpboot_thread_fn+0x2c8/0x390
    kthread+0x1b8/0x1c0
    ret_from_kernel_thread+0x5c/0x68

Using device tree notifiers won't work since we want to rebuild the
hierarchy only after all the removals and additions have occurred and
the device tree is in a consistent state. Call cacheinfo_teardown()
before processing device tree updates, and rebuild the hierarchy
afterward.

Fixes: 410bccf9 ("powerpc/pseries: Partition migration in the kernel")
Signed-off-by: NNathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

e610a466

powerpc/pseries/mobility: prevent cpu hotplug during DT update · e59a175f

由 Nathan Lynch 提交于 6月 11, 2019

CPU online/offline code paths are sensitive to parts of the device
tree (various cpu node properties, cache nodes) that can be changed as
a result of a migration.

Prevent CPU hotplug while the device tree potentially is inconsistent.

Fixes: 410bccf9 ("powerpc/pseries: Partition migration in the kernel")
Signed-off-by: NNathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

e59a175f

powerpc/cacheinfo: add cacheinfo_teardown, cacheinfo_rebuild · d4aa219a

由 Nathan Lynch 提交于 6月 11, 2019

Allow external callers to force the cacheinfo code to release all its
references to cache nodes, e.g. before processing device tree updates
post-migration, and to rebuild the hierarchy afterward.

CPU online/offline must be blocked by callers; enforce this.

Fixes: 410bccf9 ("powerpc/pseries: Partition migration in the kernel")
Signed-off-by: NNathan Lynch <nathanl@linux.ibm.com>
Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

d4aa219a

14 6月, 2019 2 次提交

powerpc/pseries: Fix oops in hotplug memory notifier · 0aa82c48

由 Nathan Lynch 提交于 6月 07, 2019

During post-migration device tree updates, we can oops in
pseries_update_drconf_memory() if the source device tree has an
ibm,dynamic-memory-v2 property and the destination has a
ibm,dynamic_memory (v1) property. The notifier processes an "update"
for the ibm,dynamic-memory property but it's really an add in this
scenario. So make sure the old property object is there before
dereferencing it.

Fixes: 2b31e3ae ("powerpc/drmem: Add support for ibm, dynamic-memory-v2 property")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: NNathan Lynch <nathanl@linux.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

0aa82c48

powerpc/pseries/hvconsole: Fix stack overread via udbg · 934bda59

由 Daniel Axtens 提交于 6月 03, 2019

While developing KASAN for 64-bit book3s, I hit the following stack
over-read.

It occurs because the hypercall to put characters onto the terminal
takes 2 longs (128 bits/16 bytes) of characters at a time, and so
hvc_put_chars() would unconditionally copy 16 bytes from the argument
buffer, regardless of supplied length. However, udbg_hvc_putc() can
call hvc_put_chars() with a single-byte buffer, leading to the error.

  ==================================================================
  BUG: KASAN: stack-out-of-bounds in hvc_put_chars+0xdc/0x110
  Read of size 8 at addr c0000000023e7a90 by task swapper/0

  CPU: 0 PID: 0 Comm: swapper Not tainted 5.2.0-rc2-next-20190528-02824-g048a6ab4835b #113
  Call Trace:
    dump_stack+0x104/0x154 (unreliable)
    print_address_description+0xa0/0x30c
    __kasan_report+0x20c/0x224
    kasan_report+0x18/0x30
    __asan_report_load8_noabort+0x24/0x40
    hvc_put_chars+0xdc/0x110
    hvterm_raw_put_chars+0x9c/0x110
    udbg_hvc_putc+0x154/0x200
    udbg_write+0xf0/0x240
    console_unlock+0x868/0xd30
    register_console+0x970/0xe90
    register_early_udbg_console+0xf8/0x114
    setup_arch+0x108/0x790
    start_kernel+0x104/0x784
    start_here_common+0x1c/0x534

  Memory state around the buggy address:
   c0000000023e7980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   c0000000023e7a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1
  >c0000000023e7a80: f1 f1 01 f2 f2 f2 00 00 00 00 00 00 00 00 00 00
                           ^
   c0000000023e7b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   c0000000023e7b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ==================================================================

Document that a 16-byte buffer is requred, and provide it in udbg.
Signed-off-by: NDaniel Axtens <dja@axtens.net>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

934bda59

02 6月, 2019 6 次提交

powerpc/pseries: Fix xive=off command line · a3bf9fbd

由 Greg Kurz 提交于 5月 15, 2019

On POWER9, if the hypervisor supports XIVE exploitation mode, the
guest OS will unconditionally requests for the XIVE interrupt mode
even if XIVE was deactivated with the kernel command line xive=off.
Later on, when the spapr XIVE init code handles xive=off, it disables
XIVE and tries to fall back on the legacy mode XICS.

This discrepency causes a kernel panic because the hypervisor is
configured to provide the XIVE interrupt mode to the guest :

  kernel BUG at arch/powerpc/sysdev/xics/xics-common.c:135!
  ...
  NIP xics_smp_probe+0x38/0x98
  LR  xics_smp_probe+0x2c/0x98
  Call Trace:
    xics_smp_probe+0x2c/0x98 (unreliable)
    pSeries_smp_probe+0x40/0xa0
    smp_prepare_cpus+0x62c/0x6ec
    kernel_init_freeable+0x148/0x448
    kernel_init+0x2c/0x148
    ret_from_kernel_thread+0x5c/0x68

Look for xive=off during prom_init and don't ask for XIVE in this
case. One exception though: if the host only supports XIVE, we still
want to boot so we ignore xive=off.

Similarly, have the spapr XIVE init code to looking at the interrupt
mode negotiated during CAS, and ignore xive=off if the hypervisor only
supports XIVE.

Fixes: eac1e731 ("powerpc/xive: guest exploitation of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.20
Reported-by: NPavithra R. Prakash <pavrampu@in.ibm.com>
Signed-off-by: NGreg Kurz <groug@kaod.org>
Reviewed-by: NCédric Le Goater <clg@kaod.org>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

a3bf9fbd

powerpc/powernv/npu: Fix reference leak · 02c5f539

由 Greg Kurz 提交于 4月 19, 2019

Since 902bdc57, get_pci_dev() calls pci_get_domain_bus_and_slot(). This
has the effect of incrementing the reference count of the PCI device, as
explained in drivers/pci/search.c:

 * Given a PCI domain, bus, and slot/function number, the desired PCI
 * device is located in the list of PCI devices. If the device is
 * found, its reference count is increased and this function returns a
 * pointer to its data structure.  The caller must decrement the
 * reference count by calling pci_dev_put().  If no device is found,
 * %NULL is returned.

Nothing was done to call pci_dev_put() and the reference count of GPU and
NPU PCI devices rockets up.

A natural way to fix this would be to teach the callers about the change,
so that they call pci_dev_put() when done with the pointer. This turns
out to be quite intrusive, as it affects many paths in npu-dma.c,
pci-ioda.c and vfio_pci_nvlink2.c. Also, the issue appeared in 4.16 and
some affected code got moved around since then: it would be problematic
to backport the fix to stable releases.

All that code never cared for reference counting anyway. Call pci_dev_put()
from get_pci_dev() to revert to the previous behavior.

Fixes: 902bdc57 ("powerpc/powernv/idoa: Remove unnecessary pcidev from pci_dn")
Cc: stable@vger.kernel.org # v4.16
Signed-off-by: NGreg Kurz <groug@kaod.org>
Reviewed-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

02c5f539

powerpc: Remove variable ‘path’ since not used · c806a6fd

由 Mathieu Malaterre 提交于 5月 23, 2019

In commit eab00a20 ("powerpc: Move `path` variable inside
DEBUG_PROM") DEBUG_PROM sentinels were added to silence a warning
(treated as error with W=1):

  arch/powerpc/kernel/prom_init.c:1388:8: error: variable ‘path’ set but not used [-Werror=unused-but-set-variable]

Rework the original patch and simplify the code, by removing the
variable ‘path’ completely. Fix line over 90 characters.
Suggested-by: NMichael Ellerman <mpe@ellerman.id.au>
Signed-off-by: NMathieu Malaterre <malat@debian.org>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

c806a6fd

powerpc/powernv: Show checkstop reason for NPU2 HMIs · 89d87bcb

由 Frederic Barrat 提交于 5月 23, 2019

If the kernel is notified of an HMI caused by the NPU2, it's currently
not being recognized and it logs the default message:

    Unknown Malfunction Alert of type 3

The NPU on Power 9 has 3 Fault Isolation Registers, so that's a lot of
possible causes, but we should at least log that it's an NPU problem
and report which FIR and which bit were raised if opal gave us the
information.
Signed-off-by: NFrederic Barrat <fbarrat@linux.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

89d87bcb

powerpc/powernv: Update firmware archaeology around OPAL_HANDLE_HMI · 1549c42d

由 Stewart Smith 提交于 5月 24, 2019

The first machines to ship with OPAL firmware all got firmware updates
that have the new call, but just in case someone is foolish enough to
believe the first 4 months of firmware is the best, we keep this code
around.

Comment is updated to not refer to late 2014 as recent or the future.
Signed-off-by: NStewart Smith <stewart@linux.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

1549c42d

powerpc/pseries/dlpar: Fix a missing check in dlpar_parse_cc_property() · efa9ace6

由 Gen Zhang 提交于 5月 26, 2019

In dlpar_parse_cc_property(), 'prop->name' is allocated by kstrdup().
kstrdup() may return NULL, so it should be checked and handle error.
And prop should be freed if 'prop->name' is NULL.
Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

efa9ace6

28 5月, 2019 3 次提交

powerpc/lib: only build ldstfp.o when CONFIG_PPC_FPU is set · 3e3ebed3

由 Christophe Leroy 提交于 5月 13, 2019

The entire code in ldstfp.o is enclosed into #ifdef CONFIG_PPC_FPU,
so there is no point in building it when this config is not selected.

Fixes: cd64d169 ("powerpc: mtmsrd not defined")
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

3e3ebed3

powerpc/lib: fix redundant inclusion of quad.o · f8e0d0fd

由 Christophe Leroy 提交于 5月 13, 2019

quad.o is only for PPC64, and already included in obj64-y,
so it doesn't have to be in obj-y

Fixes: 31bfdb03 ("powerpc: Use instruction emulation infrastructure to handle alignment faults")
Signed-off-by: NChristophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

f8e0d0fd

powerpc/mm: Make some symbols static that can be · d667edc0

由 YueHaibing 提交于 5月 04, 2019

Noticed by sparse.
Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

d667edc0

25 5月, 2019 16 次提交

KVM: x86: fix return value for reserved EFER · 66f61c92

由 Paolo Bonzini 提交于 5月 24, 2019

Commit 11988499 ("KVM: x86: Skip EFER vs. guest CPUID checks for
host-initiated writes", 2019-04-02) introduced a "return false" in a
function returning int, and anyway set_efer has a "nonzero on error"
conventon so it should be returning 1.
Reported-by: NPavel Machek <pavel@denx.de>
Fixes: 11988499 ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes")
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

66f61c92

KVM: s390: fix memory slot handling for KVM_SET_USER_MEMORY_REGION · 19ec166c

由 Christian Borntraeger 提交于 5月 24, 2019

kselftests exposed a problem in the s390 handling for memory slots.
Right now we only do proper memory slot handling for creation of new
memory slots. Neither MOVE, nor DELETION are handled properly. Let us
implement those.
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

19ec166c

KVM: x86/pmu: do not mask the value that is written to fixed PMUs · 2924b521

由 Paolo Bonzini 提交于 5月 20, 2019

According to the SDM, for MSR_IA32_PERFCTR0/1 "the lower-order 32 bits of
each MSR may be written with any value, and the high-order 8 bits are
sign-extended according to the value of bit 31", but the fixed counters
in real hardware are limited to the width of the fixed counters ("bits
beyond the width of the fixed-function counter are reserved and must be
written as zeros"). Fix KVM to do the same.
Reported-by: NNadav Amit <nadav.amit@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2924b521

KVM: x86/pmu: mask the result of rdpmc according to the width of the counters · 0e6f467e

由 Paolo Bonzini 提交于 5月 20, 2019

This patch will simplify the changes in the next, by enforcing the
masking of the counters to RDPMC and RDMSR.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0e6f467e

x86/kvm/pmu: Set AMD's virt PMU version to 1 · a80c4ec1

由 Borislav Petkov 提交于 5月 08, 2019

After commit:

  672ff6cf ("KVM: x86: Raise #GP when guest vCPU do not support PMU")

my AMD guests started #GPing like this:

  general protection fault: 0000 [#1] PREEMPT SMP
  CPU: 1 PID: 4355 Comm: bash Not tainted 5.1.0-rc6+ #3
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
  RIP: 0010:x86_perf_event_update+0x3b/0xa0

with Code: pointing to RDPMC. It is RDPMC because the guest has the
hardware watchdog CONFIG_HARDLOCKUP_DETECTOR_PERF enabled which uses
perf. Instrumenting kvm_pmu_rdpmc() some, showed that it fails due to:

  if (!pmu->version)
  	return 1;

which the above commit added. Since AMD's PMU leaves the version at 0,
that causes the #GP injection into the guest.

Set pmu->version arbitrarily to 1 and move it above the non-applicable
struct kvm_pmu members.
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: kvm@vger.kernel.org
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Mihai Carabas <mihai.carabas@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: x86@kernel.org
Cc: stable@vger.kernel.org
Fixes: 672ff6cf ("KVM: x86: Raise #GP when guest vCPU do not support PMU")
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a80c4ec1

KVM: x86: do not spam dmesg with VMCS/VMCB dumps · 6f2f8453

由 Paolo Bonzini 提交于 5月 20, 2019

Userspace can easily set up invalid processor state in such a way that
dmesg will be filled with VMCS or VMCB dumps.  Disable this by default
using a module parameter.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

6f2f8453

kvm: Check irqchip mode before assign irqfd · 654f1f13

由 Peter Xu 提交于 5月 05, 2019

When assigning kvm irqfd we didn't check the irqchip mode but we allow
KVM_IRQFD to succeed with all the irqchip modes.  However it does not
make much sense to create irqfd even without the kernel chips.  Let's
provide a arch-dependent helper to check whether a specific irqfd is
allowed by the arch.  At least for x86, it should make sense to check:

- when irqchip mode is NONE, all irqfds should be disallowed, and,

- when irqchip mode is SPLIT, irqfds that are with resamplefd should
  be disallowed.

For either of the case, previously we'll silently ignore the irq or
the irq ack event if the irqchip mode is incorrect.  However that can
cause misterious guest behaviors and it can be hard to triage.  Let's
fail KVM_IRQFD even earlier to detect these incorrect configurations.

CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Radim Krčmář <rkrcmar@redhat.com>
CC: Alex Williamson <alex.williamson@redhat.com>
CC: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: NPeter Xu <peterx@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

654f1f13

kvm: svm/avic: fix off-by-one in checking host APIC ID · c9bcd3e3

由 Suthikulpanit, Suravee 提交于 5月 14, 2019

Current logic does not allow VCPU to be loaded onto CPU with
APIC ID 255. This should be allowed since the host physical APIC ID
field in the AVIC Physical APIC table entry is an 8-bit value,
and APIC ID 255 is valid in system with x2APIC enabled.
Instead, do not allow VCPU load if the host APIC ID cannot be
represented by an 8-bit value.

Also, use the more appropriate AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK
instead of AVIC_MAX_PHYSICAL_ID_COUNT.
Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c9bcd3e3

KVM: LAPIC: Expose per-vCPU timer_advance_ns to userspace · 16ba3ab4

由 Wanpeng Li 提交于 5月 20, 2019

Expose per-vCPU timer_advance_ns to userspace, so it is able to
query the auto-adjusted value.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

16ba3ab4

KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow · 0e6edceb

由 Wanpeng Li 提交于 5月 20, 2019

After commit c3941d9e (KVM: lapic: Allow user to disable adaptive tuning of
timer advancement), '-1' enables adaptive tuning starting from default
advancment of 1000ns. However, we should expose an int instead of an overflow
uint module parameter.

Before patch:

/sys/module/kvm/parameters/lapic_timer_advance_ns:4294967295

After patch:

/sys/module/kvm/parameters/lapic_timer_advance_ns:-1

Fixes: c3941d9e (KVM: lapic: Allow user to disable adaptive tuning of timer advancement)
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0e6edceb

kvm: vmx: Fix -Wmissing-prototypes warnings · 4d259965

由 Yi Wang 提交于 5月 20, 2019

We get a warning when build kernel W=1:
arch/x86/kvm/vmx/vmx.c:6365:6: warning: no previous prototype for ‘vmx_update_host_rsp’ [-Wmissing-prototypes]
 void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)

Add the missing declaration to fix this.
Signed-off-by: NYi Wang <wang.yi59@zte.com.cn>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4d259965

KVM: nVMX: Fix using __this_cpu_read() in preemptible context · 541e886f

由 Wanpeng Li 提交于 5月 17, 2019

 BUG: using __this_cpu_read() in preemptible [00000000] code: qemu-system-x86/4590
  caller is nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel]
  CPU: 4 PID: 4590 Comm: qemu-system-x86 Tainted: G           OE     5.1.0-rc4+ #1
  Call Trace:
   dump_stack+0x67/0x95
   __this_cpu_preempt_check+0xd2/0xe0
   nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel]
   nested_vmx_run+0xda/0x2b0 [kvm_intel]
   handle_vmlaunch+0x13/0x20 [kvm_intel]
   vmx_handle_exit+0xbd/0x660 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0xa2c/0x1e50 [kvm]
   kvm_vcpu_ioctl+0x3ad/0x6d0 [kvm]
   do_vfs_ioctl+0xa5/0x6e0
   ksys_ioctl+0x6d/0x80
   __x64_sys_ioctl+0x1a/0x20
   do_syscall_64+0x6f/0x6c0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Accessing per-cpu variable should disable preemption, this patch extends the
preemption disable region for __this_cpu_read().

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Fixes: 52017608 ("KVM: nVMX: add option to perform early consistency checks via H/W")
Cc: stable@vger.kernel.org
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

541e886f

kvm: x86: Include CPUID leaf 0x8000001e in kvm's supported CPUID · 382409b4

由 Jim Mattson 提交于 3月 27, 2019

Kvm now supports extended CPUID functions through 0x8000001f.  CPUID
leaf 0x8000001e is AMD's Processor Topology Information leaf. This
contains similar information to CPUID leaf 0xb (Intel's Extended
Topology Enumeration leaf), and should be included in the output of
KVM_GET_SUPPORTED_CPUID, even though userspace is likely to override
some of this information based upon the configuration of the
particular VM.

Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Borislav Petkov <bp@suse.de>
Fixes: 8765d753 ("KVM: X86: Extend CPUID range to include new leaf")
Signed-off-by: NJim Mattson <jmattson@google.com>
Reviewed-by: NMarc Orr <marcorr@google.com>
Reviewed-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

382409b4

kvm: x86: Include multiple indices with CPUID leaf 0x8000001d · 32a243df

由 Jim Mattson 提交于 3月 27, 2019

Per the APM, "CPUID Fn8000_001D_E[D,C,B,A]X reports cache topology
information for the cache enumerated by the value passed to the
instruction in ECX, referred to as Cache n in the following
description. To gather information for all cache levels, software must
repeatedly execute CPUID with 8000_001Dh in EAX and ECX set to
increasing values beginning with 0 until a value of 00h is returned in
the field CacheType (EAX[4:0]) indicating no more cache descriptions
are available for this processor."

The termination condition is the same as leaf 4, so we can reuse that
code block for leaf 0x8000001d.

Fixes: 8765d753 ("KVM: X86: Extend CPUID range to include new leaf")
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Borislav Petkov <bp@suse.de>
Signed-off-by: NJim Mattson <jmattson@google.com>
Reviewed-by: NMarc Orr <marcorr@google.com>
Reviewed-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

32a243df

KVM: nVMX: Clear nested_run_pending if setting nested state fails · 21be4ca1

由 Sean Christopherson 提交于 5月 08, 2019

VMX's nested_run_pending flag is subtly consumed when stuffing state to
enter guest mode, i.e. needs to be set according before KVM knows if
setting guest state is successful. If setting guest state fails, clear
the flag as a nested run is obviously not pending.
Reported-by: NAaron Lewis <aaronlewis@google.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

21be4ca1

KVM: nVMX: really fix the size checks on KVM_SET_NESTED_STATE · db80927e

由 Paolo Bonzini 提交于 5月 20, 2019

The offset for reading the shadow VMCS is sizeof(*kvm_state)+VMCS12_SIZE,
so the correct size must be that plus sizeof(*vmcs12).  This could lead
to KVM reading garbage data from userspace and not reporting an error,
but is otherwise not sensitive.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

db80927e

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功