Commits · d22deab6960a6cb015a36e74a2dcbab6ca9f5544 · openeuler / Kernel

23 Aug, 2019 2 commits

KVM: PPC: Book3S HV: Define usage types for rmap array in guest memslot · d22deab6

Committed by Suraj Jitindar Singh on Aug 20, 2019

The rmap array in the guest memslot has one entry per guest page and is
allocated at memslot creation time. Each rmap entry in this array
is used to store information about the guest page to which it
corresponds. For example, for a hpt guest it is used to store a lock bit,
rc bits, a present bit and the index of a hpt entry in the guest hpt
which maps this page. For a radix guest which is running nested guests
it is used to store a pointer to a linked list of nested rmap entries
which store the nested guest physical address which maps this guest
address and for which there is a pte in the shadow page table.

As there are currently two uses for the rmap array, and the potential
for this to expand to more in the future, define a type field (being the
top 8 bits of the rmap entry) to be used to define the type of the rmap
entry which is currently present and define two values for this field
for the two current uses of the rmap array.

Since the nested case uses the rmap entry to store a pointer, define
this type as having the two high bits set as is expected for a pointer.
Define the hpt entry type as having bit 56 set (bit 7 IBM bit ordering).

Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
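
The type field described in this commit can be sketched in C as follows. This is a minimal illustration only: the macro and function names are hypothetical, and the values are derived from the wording of the commit message (top 8 bits as the type, two high bits set for the nested case, bit 56 set for the hpt case), not copied from the patch.

```c
#include <stdint.h>

/* Illustrative names and values, derived from the commit message text. */
#define RMAP_TYPE_SHIFT   56
#define RMAP_TYPE_MASK    (0xffULL << RMAP_TYPE_SHIFT)  /* top 8 bits */
#define RMAP_TYPE_NESTED  (0xc0ULL << RMAP_TYPE_SHIFT)  /* two high bits set */
#define RMAP_TYPE_HPT     (0x01ULL << RMAP_TYPE_SHIFT)  /* bit 56 set */

/* Return only the type field of an rmap entry. */
static inline uint64_t rmap_type(uint64_t entry)
{
	return entry & RMAP_TYPE_MASK;
}

/* Non-zero when the entry carries the nested (pointer) type. */
static inline int rmap_is_nested(uint64_t entry)
{
	return rmap_type(entry) == RMAP_TYPE_NESTED;
}

/* Non-zero when the entry carries the hpt type. */
static inline int rmap_is_hpt(uint64_t entry)
{
	return rmap_type(entry) == RMAP_TYPE_HPT;
}
```

With these definitions, an entry such as 0x0100000000000123 would be classified as hpt, and a value whose top byte is 0xc0 as nested.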

KVM: PPC: Book3S: Mark expected switch fall-through · ff7240cc

Committed by Paul Menzel on Jul 30, 2019

Fix the error below triggered by `-Wimplicit-fallthrough`, by tagging
it as an expected fall-through.

    arch/powerpc/kvm/book3s_32_mmu.c: In function ‘kvmppc_mmu_book3s_32_xlate_pte’:
    arch/powerpc/kvm/book3s_32_mmu.c:241:21: error: this statement may fall through [-Werror=implicit-fallthrough=]
          pte->may_write = true;
          ~~~~~~~~~~~~~~~^~~~~~
    arch/powerpc/kvm/book3s_32_mmu.c:242:5: note: here
         case 3:
         ^~~~

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
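
For readers unfamiliar with the warning, the following C sketch shows what tagging an expected fall-through looks like. The function, struct, and case values are hypothetical and only loosely modelled on the quoted diagnostic; the comment form shown is the one GCC recognises at -Wimplicit-fallthrough=3.

```c
#include <stdbool.h>

/* Hypothetical permission decoder; the case values are illustrative only. */
struct perm {
	bool may_read;
	bool may_write;
};

static struct perm decode_perm(int pp)
{
	struct perm p = { false, false };

	switch (pp) {
	case 2:
		p.may_write = true;
		/* fall through */
	case 3:
		p.may_read = true;
		break;
	default:
		break;
	}
	return p;
}
```

Here case 2 deliberately continues into case 3, so a writable mapping is also readable; the comment tells the compiler this is intended.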

16 Aug, 2019 4 commits

powerpc/xive: Implement get_irqchip_state method for XIVE to fix shutdown race · da15c03b

Committed by Paul Mackerras on Aug 13, 2019

Testing has revealed the existence of a race condition where a XIVE
interrupt being shut down can be in one of the XIVE interrupt queues
(of which there are up to 8 per CPU, one for each priority) at the
point where free_irq() is called.  If this happens, reading the queue
can return an interrupt number which has been shut down.  This can lead
to various symptoms:

- irq_to_desc(irq) can be NULL.  In this case, no end-of-interrupt
  function gets called, resulting in the CPU's elevated interrupt
  priority (numerically lowered CPPR) never getting reset.  That then
  means that the CPU stops processing interrupts, causing device
  timeouts and other errors in various device drivers.

- The irq descriptor or related data structures can be in the process
  of being freed as the interrupt code is using them.  This typically
  leads to crashes due to bad pointer dereferences.

This race is basically what commit 62e04686 ("genirq: Add optional
hardware synchronization for shutdown", 2019-06-28) is intended to
fix, given a get_irqchip_state() method for the interrupt controller
being used.  It works by polling the interrupt controller when an
interrupt is being freed until the controller says it is not pending.

With XIVE, the PQ bits of the interrupt source indicate the state of
the interrupt source, and in particular the P bit goes from 0 to 1 at
the point where the hardware writes an entry into the interrupt queue
that this interrupt is directed towards.  Normally, the code will then
process the interrupt and do an end-of-interrupt (EOI) operation which
will reset PQ to 00 (assuming another interrupt hasn't been generated
in the meantime).  However, there are situations where the code resets
P even though a queue entry exists (for example, by setting PQ to 01,
which disables the interrupt source), and also situations where the
code leaves P at 1 after removing the queue entry (for example, this
is done for escalation interrupts so they cannot fire again until
they are explicitly re-enabled).

The code already has a 'saved_p' flag for the interrupt source which
indicates that a queue entry exists, although it isn't maintained
consistently.  This patch adds a 'stale_p' flag to indicate that
P has been left at 1 after processing a queue entry, and adds code
to set and clear saved_p and stale_p as necessary to maintain a
consistent indication of whether a queue entry may or may not exist.

With this, we can implement xive_get_irqchip_state() by looking at
stale_p, saved_p and the ESB PQ bits for the interrupt.

There is some additional code to handle escalation interrupts
properly, because they are enabled and disabled in KVM assembly code,
which does not have access to the xive_irq_data struct for the
escalation interrupt.  Hence, stale_p may be incorrect when the
escalation interrupt is freed in kvmppc_xive_{,native_}cleanup_vcpu().
Fortunately, we can fix it up by looking at vcpu->arch.xive_esc_on,
with some careful attention to barriers in order to ensure the correct
result if xive_esc_irq() races with kvmppc_xive_cleanup_vcpu().

Finally, this adds code to make noise on the console (pr_crit and
WARN_ON(1)) if we find an interrupt queue entry for an interrupt
which does not have a descriptor.  While this won't catch the race
reliably, if it does get triggered it will be an indication that
the race is occurring and needs to be debugged.

Fixes: 243e2511 ("powerpc/xive: Native exploitation of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190813100648.GE9567@blackberry
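
One plausible way to combine stale_p, saved_p and the P bit into a "may still be queued" answer is sketched below. This is an assumption based on the flag descriptions in the commit message, not the actual xive_get_irqchip_state() implementation; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define ESB_P 0x2u  /* P bit of the two-bit PQ value */
#define ESB_Q 0x1u  /* Q bit of the two-bit PQ value */

/* Hypothetical per-source bookkeeping, mirroring the flags named above. */
struct src_state {
	bool saved_p;  /* a queue entry is known to exist */
	bool stale_p;  /* P was left at 1 after the queue entry was consumed */
};

/* May a queue entry still exist for this source? */
static bool irq_may_be_queued(const struct src_state *s, uint8_t pq)
{
	if (s->stale_p)
		return false;  /* P is set, but the entry is already gone */
	return s->saved_p || (pq & ESB_P);
}
```

A caller freeing the interrupt would poll such a predicate until it returns false, which matches the polling behaviour the commit message attributes to the genirq change.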

KVM: PPC: Book3S HV: Don't push XIVE context when not using XIVE device · 8d4ba9c9

Committed by Paul Mackerras on Aug 13, 2019

At present, when running a guest on POWER9 using HV KVM but not using
an in-kernel interrupt controller (XICS or XIVE), for example if QEMU
is run with the kernel_irqchip=off option, the guest entry code goes
ahead and tries to load the guest context into the XIVE hardware, even
though no context has been set up.

To fix this, we check that the "CAM word" is non-zero before pushing
it to the hardware.  The CAM word is initialized to a non-zero value
in kvmppc_xive_connect_vcpu() and kvmppc_xive_native_connect_vcpu(),
and is now cleared in kvmppc_xive_{,native_}cleanup_vcpu.

Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Reported-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190813100100.GC9567@blackberry
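
The guard described in this commit amounts to a non-zero check before touching the hardware. The C sketch below models it with hypothetical structs standing in for the vCPU state and the XIVE register; the real check lives in the guest entry path and is not reproduced here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the vCPU state and the hardware register. */
struct vcpu_xive {
	uint32_t cam_word;  /* zero means no XIVE context was set up */
};

struct xive_hw {
	uint32_t pushed_cam;
	bool context_valid;
};

/* Push the guest context only when one has been set up. */
static bool push_guest_context(const struct vcpu_xive *v, struct xive_hw *hw)
{
	if (v->cam_word == 0)
		return false;  /* e.g. kernel_irqchip=off: leave hardware alone */
	hw->pushed_cam = v->cam_word;
	hw->context_valid = true;
	return true;
}
```

Clearing cam_word in the cleanup path, as the commit message describes, keeps this check valid after a vCPU is disconnected.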

KVM: PPC: Book3S HV: Fix race in re-enabling XIVE escalation interrupts · 959c5d51

Committed by Paul Mackerras on Aug 13, 2019

Escalation interrupts are interrupts sent to the host by the XIVE
hardware when it has an interrupt to deliver to a guest VCPU but that
VCPU is not running anywhere in the system. Hence we disable the
escalation interrupt for the VCPU being run when we enter the guest
and re-enable it when the guest does an H_CEDE hypercall indicating
it is idle.

It is possible that an escalation interrupt gets generated just as we
are entering the guest. In that case the escalation interrupt may be
using a queue entry in one of the interrupt queues, and that queue
entry may not have been processed when the guest exits with an H_CEDE.
The existing entry code detects this situation and does not clear the
vcpu->arch.xive_esc_on flag as an indication that there is a pending
queue entry (if the queue entry gets processed, xive_esc_irq() will
clear the flag). There is a comment in the code saying that if the
flag is still set on H_CEDE, we have to abort the cede rather than
re-enabling the escalation interrupt, lest we end up with two
occurrences of the escalation interrupt in the interrupt queue.

However, the exit code doesn't do that; it aborts the cede in the sense
that vcpu->arch.ceded gets cleared, but it still enables the escalation
interrupt by setting the source's PQ bits to 00. Instead we need to
set the PQ bits to 10, indicating that an interrupt has been triggered.
We also need to avoid setting vcpu->arch.xive_esc_on in this case
(i.e. vcpu->arch.xive_esc_on seen to be set on H_CEDE) because
xive_esc_irq() will run at some point and clear it, and if we race with
that we may end up with an incorrect result (i.e. xive_esc_on set when
the escalation interrupt has just been handled).

It is extremely unlikely that having two queue entries would cause
observable problems; theoretically it could cause queue overflow, but
the CPU would have to have thousands of interrupts targeted to it for
that to be possible. However, this fix will also make it possible to
determine accurately whether there is an unhandled escalation
interrupt in the queue, which will be needed by the following patch.

Fixes: 9b9b13a6 ("KVM: PPC: Book3S HV: Keep XIVE escalation interrupt masked unless ceded")
Cc: stable@vger.kernel.org # v4.16+
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190813100349.GD9567@blackberry
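
The corrected H_CEDE behaviour can be modelled as a small state update. The sketch below is an interpretation of the commit message, with a hypothetical struct standing in for the vCPU fields and the source's PQ bits; the real code is in the KVM exit path and differs in detail.

```c
#include <stdbool.h>
#include <stdint.h>

#define PQ_ENABLED 0x0u  /* PQ = 00: source enabled, nothing pending */
#define PQ_PENDING 0x2u  /* PQ = 10: an interrupt has been triggered */

/* Hypothetical model of the state touched on H_CEDE. */
struct esc_model {
	bool xive_esc_on;  /* stands in for vcpu->arch.xive_esc_on */
	bool ceded;        /* stands in for vcpu->arch.ceded */
	uint8_t pq;        /* PQ bits of the escalation source */
};

static void on_h_cede(struct esc_model *m)
{
	if (m->xive_esc_on) {
		/* A queue entry may still be pending: abort the cede. */
		m->ceded = false;
		m->pq = PQ_PENDING;
		/* Do not set xive_esc_on here; xive_esc_irq() will clear it. */
	} else {
		/* Normal case: re-enable the escalation interrupt. */
		m->pq = PQ_ENABLED;
		m->xive_esc_on = true;
	}
}
```

Setting PQ to 10 rather than 00 in the first branch is what prevents a second queue entry for the same escalation interrupt.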

KVM: PPC: Book3S HV: XIVE: Free escalation interrupts before disabling the VP · 237aed48

Committed by Cédric Le Goater on Aug 06, 2019

When a vCPU is brought down, the XIVE VP (Virtual Processor) is first
disabled and then the event notification queues are freed. When freeing
the queues, we check for possible escalation interrupts and free them
also.

But when a XIVE VP is disabled, the underlying XIVE ENDs are also
disabled in OPAL. When an END (Event Notification Descriptor) is
disabled, its ESB pages (ESn and ESe) are disabled and loads return all
1s, which means that any access to the ESB page of the escalation
interrupt will return invalid values.

When an interrupt is freed, the shutdown handler computes a 'saved_p'
field from the value returned by a load in xive_do_source_set_mask().
This value is incorrect for escalation interrupts for the reason
described above.

This has no impact on Linux/KVM today because we don't make use of it,
but future changes will introduce a xive_get_irqchip_state()
handler. This handler will use the 'saved_p' field to return the state
of an interrupt and, with 'saved_p' being incorrect, a softlockup will occur.

Fix the vCPU cleanup sequence by first freeing the escalation interrupts,
if any, then disabling the XIVE VP, and last freeing the queues.

Fixes: 90c73795 ("KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation mode")
Fixes: 5af50993 ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190806172538.5087-1-clg@kaod.org

26 Jul, 2019 3 commits

s390/mm: use shared variables for sysctl range check · ac7a0fce

Committed by Vasily Gorbik on Jun 26, 2019

Since commit eec4844f ("proc/sysctl: add shared variables for range
check") special shared variables are available for sysctl range check.
Reuse them for /proc/sys/vm/allocate_pgste proc handler.

Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>

s390/dma: provide proper ARCH_ZONE_DMA_BITS value · 1a2dcff8

Committed by Halil Pasic on Jul 24, 2019

On s390 ZONE_DMA is up to 2G, i.e. ARCH_ZONE_DMA_BITS should be 31 bits.
The current value is 24 and makes __dma_direct_alloc_pages() take a
wrong turn first (although __dma_direct_alloc_pages() then recovers).

Correct the ARCH_ZONE_DMA_BITS value to avoid the wrong turn.

Signed-off-by: Halil Pasic <pasic@linux.ibm.com>
Reported-by: Petr Tesarik <ptesarik@suse.cz>
Fixes: c61e9637 ("dma-direct: add support for allocation from ZONE_DMA and ZONE_DMA32")
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
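
The arithmetic behind the 31-bit value can be checked with a few lines of C. The helper name is hypothetical; it only illustrates that a 31-bit zone covers addresses below 2 GiB while a 24-bit zone covers only the first 16 MiB.

```c
#include <stdint.h>

/* Highest address covered by a zone that is `bits` wide. */
static uint64_t zone_limit(unsigned int bits)
{
	return (bits >= 64) ? UINT64_MAX : ((1ULL << bits) - 1);
}
```

zone_limit(31) is 0x7fffffff, one byte below 2 GiB, and zone_limit(24) is 0xffffff, one byte below 16 MiB, which is why the old value understated the s390 DMA zone.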

perf/x86/intel: Mark expected switch fall-throughs · 7b26b91d

Committed by Gustavo A. R. Silva on Jun 24, 2019

In preparation for enabling -Wimplicit-fallthrough, mark switch
cases where we are expecting to fall through.

This patch fixes the following warnings:

arch/x86/events/intel/core.c: In function ‘intel_pmu_init’:
arch/x86/events/intel/core.c:4959:8: warning: this statement may fall through [-Wimplicit-fallthrough=]
   pmem = true;
   ~~~~~^~~~~~
arch/x86/events/intel/core.c:4960:2: note: here
  case INTEL_FAM6_SKYLAKE_MOBILE:
  ^~~~
arch/x86/events/intel/core.c:5008:8: warning: this statement may fall through [-Wimplicit-fallthrough=]
   pmem = true;
   ~~~~~^~~~~~
arch/x86/events/intel/core.c:5009:2: note: here
  case INTEL_FAM6_ICELAKE_MOBILE:
  ^~~~

Warning level 3 was used: -Wimplicit-fallthrough=3

This patch is part of the ongoing efforts to enable
-Wimplicit-fallthrough.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>

25 Jul, 2019 7 commits

perf/x86/intel: Mark expected switch fall-throughs · 289a2d22

Committed by Gustavo A. R. Silva on Jun 24, 2019

In preparation for enabling -Wimplicit-fallthrough, mark switch
cases where we are expecting to fall through.

This patch fixes the following warnings:

  arch/x86/events/intel/core.c: In function ‘intel_pmu_init’:
  arch/x86/events/intel/core.c:4959:8: warning: this statement may fall through [-Wimplicit-fallthrough=]
  arch/x86/events/intel/core.c:5008:8: warning: this statement may fall through [-Wimplicit-fallthrough=]

Warning level 3 was used: -Wimplicit-fallthrough=3

This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190624161913.GA32270@embeddedor
Signed-off-by: Ingo Molnar <mingo@kernel.org>

perf/x86: Apply more accurate check on hypervisor platform · 5ea3f6fb

Committed by Zhenzhong Duan on Jul 25, 2019

check_msr is used to fix a bug report in a guest where KVM doesn't support
the LBR MSR and causes a #GP.

The MSR check is bypassed on real HW to work around a false failure,
see commit d0e1a507 ("perf/x86/intel: Disable check_msr for real HW").

When running a guest with CONFIG_HYPERVISOR_GUEST not set or "nopv"
enabled, the current check isn't enough and a #GP could trigger.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1564022366-18293-1-git-send-email-zhenzhong.duan@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

perf/x86/intel: Fix invalid Bit 13 for Icelake MSR_OFFCORE_RSP_x register · 3b238a64

Committed by Yunying Sun on Jul 24, 2019

The Intel SDM states that bit 13 of Icelake's MSR_OFFCORE_RSP_x
register is valid, and used for counting hardware generated prefetches
of L3 cache. Update the bitmask to allow bit 13.

Before:
$ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3
 Performance counter stats for 'sleep 3':
   <not supported>      cpu/event=0xb7,umask=0x1,config1=0x1bfff/u

After:
$ perf stat -e cpu/event=0xb7,umask=0x1,config1=0x1bfff/u sleep 3
 Performance counter stats for 'sleep 3':
             9,293      cpu/event=0xb7,umask=0x1,config1=0x1bfff/u

Signed-off-by: Yunying Sun <yunying.sun@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: alexander.shishkin@linux.intel.com
Cc: bp@alien8.de
Cc: hpa@zytor.com
Cc: jolsa@redhat.com
Cc: namhyung@kernel.org
Link: https://lkml.kernel.org/r/20190724082932.12833-1-yunying.sun@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
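
The effect of the bitmask change on the config1 value used in the test above (0x1bfff) can be sketched in C. The two mask constants are illustrative assumptions chosen so that one excludes bit 13 and the other allows it; they are not quoted from the patch, and the helper name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative masks: the first rejects bit 13, the second allows it. */
#define RSP_MASK_WITHOUT_BIT13 0xffffff9fffULL
#define RSP_MASK_WITH_BIT13    0xffffffbfffULL

/* A config value is accepted only if it sets no bit outside the mask. */
static bool config_allowed(uint64_t config, uint64_t valid_mask)
{
	return (config & ~valid_mask) == 0;
}
```

Because 0x1bfff has bit 13 set, it fails the check against the first mask and passes against the second, which mirrors the "<not supported>" and counted results in the before and after output.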

perf/x86/intel: Fix SLOTS PEBS event constraint · 3d0c3953

Committed by Kan Liang on Jul 23, 2019

Sampling the SLOTS event and the ref-cycles event in a group on Icelake
gives EINVAL.

The SLOTS event stands for fixed counter 3, not fixed counter 2. The
wrong mask was set for the SLOTS event in
intel_icl_pebs_event_constraints[].

Reported-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 60176089 ("perf/x86/intel: Add Icelake support")
Link: https://lkml.kernel.org/r/20190723200429.8180-1-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
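
Why a wrong fixed-counter index makes the group fail can be illustrated with counter bitmasks. The sketch assumes that fixed counter n is represented by bit 32 + n in a constraint mask; that base and the helper name are assumptions for illustration, not taken from the patch.

```c
#include <stdint.h>

#define FIXED_COUNTER_BASE 32  /* assumed bit position of fixed counter 0 */

/* Constraint mask selecting exactly fixed counter `n`. */
static uint64_t fixed_counter_mask(unsigned int n)
{
	return 1ULL << (FIXED_COUNTER_BASE + n);
}
```

Under that assumption, a SLOTS constraint built with n = 2 would claim the same single counter as another event bound to fixed counter 2, so the two could not be scheduled together, whereas n = 3 gives a mask that does not overlap.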

x86/speculation/mds: Apply more accurate check on hypervisor platform · 517c3ba0

Committed by Zhenzhong Duan on Jul 25, 2019

X86_HYPER_NATIVE isn't accurate for checking if running on native platform,
e.g. CONFIG_HYPERVISOR_GUEST isn't set or "nopv" is enabled.

Checking the CPU feature bit X86_FEATURE_HYPERVISOR to determine if it's
running on native platform is more accurate.

This still doesn't cover the platforms on which X86_FEATURE_HYPERVISOR is
unsupported, e.g. VMware, but there is nothing which can be done about this
scenario.

Fixes: 8a4b06d3 ("x86/speculation/mds: Add sysfs reporting for MDS")
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1564022349-17338-1-git-send-email-zhenzhong.duan@oracle.com

x86/hpet: Undo the early counter is counting check · 643d83f0

Committed by Thomas Gleixner on Jul 25, 2019

Rui reported that on a Pentium D machine which has HPET forced enabled
because it is not advertised by ACPI, the early counter is counting check
leads to a silent boot hang.

The reason is that the ordering of checking the counter first and then
reconfiguring the HPET fails to work on that machine. As the HPET is not
advertised and presumably not initialized by the BIOS, the early enable and
the following reconfiguration seem to bring it into a broken state. Adding
clocksource=jiffies to the command line results in the following
clocksource watchdog warning:

clocksource: timekeeping watchdog on CPU1:
Marking clocksource 'tsc-early' as unstable because the skew is too large:
clocksource: 'hpet' wd_now: 33 wd_last: 33 mask: ffffffff

That clearly shows that the HPET is not counting after it got reconfigured
and reenabled. If the counter is not working then the HPET timer is not
expiring either, which explains the boot hang.

Move the counter is counting check after the full configuration again to
unbreak these systems.

Reported-by: Rui Salvaterra <rsalvaterra@gmail.com>
Fixes: 3222daf9 ("x86/hpet: Separate counter check out of clocksource register code")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Rui Salvaterra <rsalvaterra@gmail.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907250810530.1791@nanos.tec.linutronix.de

treewide: add "WITH Linux-syscall-note" to SPDX tag of uapi headers · d9c52522

Committed by Masahiro Yamada on Jul 25, 2019

UAPI headers licensed under GPL are supposed to have exception
"WITH Linux-syscall-note" so that they can be included into non-GPL
user space application code.

The exception note is missing in some UAPI headers.

Some of them slipped in by the treewide conversion commit b2441318
("License cleanup: add SPDX GPL-2.0 license identifier to files with
no license"). Just run:

  $ git show --oneline b2441318 -- arch/x86/include/uapi/asm/

I believe they are not intentional, and should be fixed too.

This patch was generated by the following script:

  git grep -l --not -e Linux-syscall-note --and -e SPDX-License-Identifier \
    -- :arch/*/include/uapi/asm/*.h :include/uapi/ :^*/Kbuild |
  while read file
  do
          sed -i -e '/[[:space:]]OR[[:space:]]/s/\(GPL-[^[:space:]]*\)/(\1 WITH Linux-syscall-note)/g' \
          -e '/[[:space:]]or[[:space:]]/s/\(GPL-[^[:space:]]*\)/(\1 WITH Linux-syscall-note)/g' \
          -e '/[[:space:]]OR[[:space:]]/!{/[[:space:]]or[[:space:]]/!s/\(GPL-[^[:space:]]*\)/\1 WITH Linux-syscall-note/g}' $file
  done

After this patch is applied, there are 5 UAPI headers that do not contain
"WITH Linux-syscall-note". They are kept untouched since this exception
applies only to GPL variants.

  $ git grep --not -e Linux-syscall-note --and -e SPDX-License-Identifier \
    -- :arch/*/include/uapi/asm/*.h :include/uapi/ :^*/Kbuild
  include/uapi/drm/panfrost_drm.h:/* SPDX-License-Identifier: MIT */
  include/uapi/linux/batman_adv.h:/* SPDX-License-Identifier: MIT */
  include/uapi/linux/qemu_fw_cfg.h:/* SPDX-License-Identifier: BSD-3-Clause */
  include/uapi/linux/vbox_err.h:/* SPDX-License-Identifier: MIT */
  include/uapi/linux/virtio_iommu.h:/* SPDX-License-Identifier: BSD-3-Clause */

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

24 Jul, 2019 7 commits

KVM: X86: Boost queue head vCPU to mitigate lock waiter preemption · 266e85a5

Committed by Wanpeng Li on Jul 24, 2019

Commit 11752adb ("locking/pvqspinlock: Implement hybrid PV queued/unfair locks")
introduces hybrid PV queued/unfair locks:
 - queued mode (no starvation)
 - unfair mode (good performance on a lock that is not heavily contended)
The lock waiter goes into the unfair mode especially in VMs with over-committed
vCPUs, since increasing over-commitment increases the likelihood that the queue
head vCPU has been preempted and is not actively spinning.

However, rescheduling the queue head vCPU in a timely manner so that it can
acquire the lock still gives better performance than depending only on lock
stealing in over-subscribed scenarios.

Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
ebizzy -M
             vanilla     boosting    improved
 1VM          23520        25040         6%
 2VM           8000        13600        70%
 3VM           3100         5400        74%

The lock holder vCPU yields to the queue head vCPU on unlock, to boost a queue
head vCPU which was involuntarily preempted or which voluntarily halted
after failing to acquire the lock following a short spin in the guest.

Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

x86/entry/32: Pass cr2 to do_async_page_fault() · b8f70953

Committed by Matt Mullins on Jul 23, 2019

Commit a0d14b89 ("x86/mm, tracing: Fix CR2 corruption") added the
address parameter to do_async_page_fault(), but does not pass it from the
32-bit entry point.  To plumb it through, factor out
common_exception_read_cr2 in the same fashion as common_exception, and use
it from both page_fault and async_page_fault.

For a 32-bit KVM guest, this fixes:

  Run /sbin/init as init process
  Starting init: /sbin/init exists but couldn't execute it (error -14)

Fixes: a0d14b89 ("x86/mm, tracing: Fix CR2 corruption")
Signed-off-by: Matt Mullins <mmullins@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20190724042058.24506-1-mmullins@fb.com

Documentation: move Documentation/virtual to Documentation/virt · 2f5947df

Committed by Christoph Hellwig on Jul 24, 2019

Renaming docs seems to be en vogue at the moment, so fix one of the
grossly misnamed directories.  We usually never use "virtual" as
a shortcut for virtualization in the kernel, but always virt,
as seen in the virt/ top-level directory.  Fix up the documentation
to match that.

Fixes: ed16648e ("Move kvm, uml, and lguest subdirectories under a common "virtual" directory, I.E:")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

ARM: defconfig: u8500: Add new drivers · 02a02425

Committed by Linus Walleij on Jul 23, 2019

This enables the new or updated driver options for U8500
that got merged into v5.3-rc1:

- CMA, MCDE driver, LIMA driver and the Samsung S6D16D0 driver
  enabled by default bringing up the new graphics support.
  Include the LOGO so we can see when the graphics are live.
- We use the IIO hwmon bridge for reflecting temperature
  in the system.
- Set MUSB to PIO mode as this is the most stable mode
  for the time being.
- HWSPINLOCK needs to be set to get the hardware semaphore
  driver to compile and link properly.

Link: https://lore.kernel.org/r/20190723081523.13079-2-linus.walleij@linaro.org
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Olof Johansson <olof@lixom.net>

ARM: defconfig: u8500: Refresh defconfig · 14d017be

Committed by Linus Walleij on Jul 23, 2019

This refreshes the outdated U8500 defconfig: some options
moved around, PS/2 mouse is no longer default on, crypto
options moved around etc.

Link: https://lore.kernel.org/r/20190723081523.13079-1-linus.walleij@linaro.org
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Olof Johansson <olof@lixom.net>

ARM: dts: bcm: bcm47094: add missing #cells for mdio-bus-mux · 3a9d2569

Committed by Arnd Bergmann on Jul 22, 2019

The mdio-bus-mux has no #address-cells/#size-cells property,
which causes a few dtc warnings:

arch/arm/boot/dts/bcm47094-linksys-panamera.dts:129.4-18: Warning (reg_format): /mdio-bus-mux/mdio@200:reg: property has invalid length (4 bytes) (#address-cells == 2, #size-cells == 1)
arch/arm/boot/dts/bcm47094-linksys-panamera.dtb: Warning (pci_device_bus_num): Failed prerequisite 'reg_format'
arch/arm/boot/dts/bcm47094-linksys-panamera.dtb: Warning (i2c_bus_reg): Failed prerequisite 'reg_format'
arch/arm/boot/dts/bcm47094-linksys-panamera.dtb: Warning (spi_bus_reg): Failed prerequisite 'reg_format'
arch/arm/boot/dts/bcm47094-linksys-panamera.dts:128.22-132.5: Warning (avoid_default_addr_size): /mdio-bus-mux/mdio@200: Relying on default #address-cells value
arch/arm/boot/dts/bcm47094-linksys-panamera.dts:128.22-132.5: Warning (avoid_default_addr_size): /mdio-bus-mux/mdio@200: Relying on default #size-cells value

Add the normal cell numbers.

Link: https://lore.kernel.org/r/20190722145618.1155492-1-arnd@arndb.de
Fixes: 2bebdfcd ("ARM: dts: BCM5301X: Add support for Linksys EA9500")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Olof Johansson <olof@lixom.net>

ARM: davinci: fix sleep.S build error on ARMv4 · d64b212e

Committed by Arnd Bergmann on Jul 22, 2019

When building a multiplatform kernel that includes armv4 support,
the default target CPU does not support the blx instruction,
which leads to a build failure:

arch/arm/mach-davinci/sleep.S: Assembler messages:
arch/arm/mach-davinci/sleep.S:56: Error: selected processor does not support `blx ip' in ARM mode

Add a .arch statement in the sources to make this file build.

Link: https://lore.kernel.org/r/20190722145211.1154785-1-arnd@arndb.de
Acked-by: Sekhar Nori <nsekhar@ti.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Olof Johansson <olof@lixom.net>

23 Jul, 2019 10 commits

s390/kasan: add bitops instrumentation · 9779048d

Committed by Vasily Gorbik on Jul 14, 2019

Add KASAN instrumentation of architecture-specific asm implementation
of bitops. It also covers s390 specific *_inv functions.

Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>

s390/bitops: make test functions return bool · 0a5c3c2f

Committed by Vasily Gorbik on Jul 14, 2019

Make s390/bitops test functions return bool values. That restricts the return
value range to 0 and 1 and matches the asm-generic/bitops-instrumented.h
declarations as well as some other architectures' implementations.

Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
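
The effect of a bool return type on the value range can be seen in a small portable sketch. The function below is a hypothetical generic C version, not the s390 implementation: the masked word may be any non-zero value, but converting it to bool yields exactly 0 or 1.

```c
#include <limits.h>
#include <stdbool.h>

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Returning bool collapses any non-zero masked value to exactly 1. */
static bool sketch_test_bit(unsigned long nr, const unsigned long *addr)
{
	return addr[nr / SKETCH_BITS_PER_LONG] &
	       (1UL << (nr % SKETCH_BITS_PER_LONG));
}
```

With an int or unsigned long return type the same expression would return 0x80 for bit 7, which callers comparing against 1 would misread.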

s390: wire up clone3 system call · 5518aed8

Committed by Vasily Gorbik on Jul 14, 2019

Tested (64-bit and compat mode) using program from
http://lkml.kernel.org/r/20190604212930.jaaztvkent32b7d3@brauner.io
with the following:
       return syscall(__NR_clone, flags, 0, pidfd, 0, 0);
changed to:
       return syscall(__NR_clone, 0, flags, pidfd, 0, 0);
due to CLONE_BACKWARDS2.

Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>

s390: use __u{16,32,64} instead of uint{16,32,64}_t in uapi header · 061c9962

Committed by Masahiro Yamada on Jul 21, 2019

When CONFIG_UAPI_HEADER_TEST=y, exported headers are compile-tested to
make sure they can be included from user-space.

Currently, zcrypt.h is excluded from the test coverage. To make it
join the compile-test, we need to fix the build errors attached below.

For a case like this, we decided to use __u{8,16,32,64} variable types
in this discussion:

  https://lkml.org/lkml/2019/6/5/18

Build log:

  CC      usr/include/asm/zcrypt.h.s
In file included from <command-line>:32:0:
./usr/include/asm/zcrypt.h:163:2: error: unknown type name ‘uint16_t’
  uint16_t cprb_len;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:168:2: error: unknown type name ‘uint32_t’
  uint32_t source_id;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:169:2: error: unknown type name ‘uint32_t’
  uint32_t target_id;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:170:2: error: unknown type name ‘uint32_t’
  uint32_t ret_code;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:171:2: error: unknown type name ‘uint32_t’
  uint32_t reserved1;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:172:2: error: unknown type name ‘uint32_t’
  uint32_t reserved2;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:173:2: error: unknown type name ‘uint32_t’
  uint32_t payload_len;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:182:2: error: unknown type name ‘uint16_t’
  uint16_t ap_id;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:183:2: error: unknown type name ‘uint16_t’
  uint16_t dom_id;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:198:2: error: unknown type name ‘uint16_t’
  uint16_t  targets_num;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:199:2: error: unknown type name ‘uint64_t’
  uint64_t  targets;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:200:2: error: unknown type name ‘uint64_t’
  uint64_t  weight;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:201:2: error: unknown type name ‘uint64_t’
  uint64_t  req_no;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:202:2: error: unknown type name ‘uint64_t’
  uint64_t  req_len;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:203:2: error: unknown type name ‘uint64_t’
  uint64_t  req;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:204:2: error: unknown type name ‘uint64_t’
  uint64_t  resp_len;
  ^~~~~~~~
./usr/include/asm/zcrypt.h:205:2: error: unknown type name ‘uint64_t’
  uint64_t  resp;
  ^~~~~~~~

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
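
Every error in the build log above comes from a C99 `uintN_t` name that is unavailable when the exported header is compiled on its own. As a rough illustration only (not how the patch was produced), the substitution to the `__u{8,16,32,64}` types can be sketched as a mechanical rewrite. The helper name `to_uapi_types` is hypothetical, and the sketch assumes the header already gets the `__uN` definitions from `<linux/types.h>`.

```python
import re

# Matches the fixed-width C99 names that appear in the build log.
# The \b anchors keep identifiers such as "my_uint32_t" untouched.
_STDINT_NAME = re.compile(r"\buint(8|16|32|64)_t\b")


def to_uapi_types(source: str) -> str:
    """Rewrite uintN_t to the kernel UAPI spelling __uN (hypothetical helper)."""
    return _STDINT_NAME.sub(r"__u\1", source)
```

Applied to the line `uint16_t cprb_len;` from the log, this yields `__u16 cprb_len;`. Whitespace and every other token are left as they were.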

061c9962

s390/hypfs: fix a typo in the name of a function · 3f4b04e3

Committed by Christophe JAILLET on Jul 21, 2019

Everything is about hypfs_..., except 'hpyfs_vm_create_guest()'
s/hpy/hyp/

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>

3f4b04e3

s390: enable detection of kernel version from bzImage · 6abe2819

Committed by Vasily Gorbik on Jul 15, 2019

Extend "parmarea" to include the offset of the version string, which is
stored as an 8-byte big-endian value.

To retrieve version string from bzImage reliably, one should check the
presence of "S390EP" ascii string at 0x10008 (available since v3.2),
then read the version string offset from 0x10428 (which has been 0
since v3.2 up to now). The string is null terminated.

The version can be retrieved with the following "file" command magic
(requires file v5.34):

    8 string \x02\x00\x00\x18\x60\x00\x00\x50\x02\x00\x00\x68\x60\x00\x00\x50\x40\x40\x40\x40\x40\x40\x40\x40 Linux S390
    >0x10008       string          S390EP
    >>0x10428      bequad          >0
    >>>(0x10428.Q) string          >\0             \b, version %s

Reported-by: Petr Tesarik <ptesarik@suse.com>
Suggested-by: Petr Tesarik <ptesarik@suse.com>
Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
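
The retrieval steps described in this commit message can also be sketched as a small user-space reader. This is a minimal sketch, not an official tool: the function name `read_s390_kernel_version` is hypothetical, and it assumes the stored offset is counted from the start of the image file, which is how the indirect `(0x10428.Q)` entry in the "file" magic uses it.

```python
import struct

S390EP_OFFSET = 0x10008       # location of the "S390EP" marker
VERSION_PTR_OFFSET = 0x10428  # 8-byte big-endian offset of the version string


def read_s390_kernel_version(image: bytes):
    """Return the version string from an s390 bzImage, or None if unavailable."""
    if len(image) < VERSION_PTR_OFFSET + 8:
        return None
    # Step 1: the marker must be present, otherwise the layout is unknown.
    if image[S390EP_OFFSET:S390EP_OFFSET + 6] != b"S390EP":
        return None
    # Step 2: read the offset; 0 means the image predates this change.
    (offset,) = struct.unpack(">Q", image[VERSION_PTR_OFFSET:VERSION_PTR_OFFSET + 8])
    if offset == 0 or offset >= len(image):
        return None
    # Step 3: the string is null terminated.
    end = image.find(b"\0", offset)
    if end == -1:
        return None
    return image[offset:end].decode("ascii", errors="replace")
```

A caller would pass the raw file contents, for example `read_s390_kernel_version(open(path, "rb").read())`, and treat `None` as "version not recorded in this image".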

6abe2819

arm64: dts: imx8mq: fix SAI compatible · 8d014847

Committed by Lucas Stach on Jul 17, 2019

The i.MX8M SAI block is not compatible with the i.MX6SX one, as the
register layout has changed due to two version registers being added
at the beginning of the address map. Remove the bogus compatible.

Fixes: 8c61538d ("arm64: dts: imx8mq: Add SAI2 node")
Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
Reviewed-by: Daniel Baluta <daniel.baluta@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>

8d014847

arm64: dts: imx8mm: Correct SAI3 RXC/TXFS pin's mux option #1 · 52d09014

Committed by Anson Huang on Jul 16, 2019

According to i.MX8MM reference manual Rev.1, 03/2019:

SAI3_RXC pin's mux option #1 should be GPT1_CLK, NOT GPT1_CAPTURE2;
SAI3_TXFS pin's mux option #1 should be GPT1_CAPTURE2, NOT GPT1_CLK.

Fixes: c1c9d413 ("dt-bindings: imx: Add pinctrl binding doc for imx8mm")
Signed-off-by: Anson Huang <Anson.Huang@nxp.com>
Signed-off-by: Shawn Guo <shawnguo@kernel.org>

52d09014

riscv: dts: Add DT node for SiFive FU540 Ethernet controller driver · 26091eef

Committed by Yash Shah on Jul 19, 2019

Add a DT node for the SiFive FU540-C000 GEMGXL Ethernet controller driver.

Signed-off-by: Yash Shah <yash.shah@sifive.com>
Reviewed-by: Sagar Kadam <sagar.kadam@sifive.com>
Cc: Andrew Lunn <andrew@lunn.ch>
[paul.walmsley@sifive.com: changed "phy1" to "phy0" at Andrew Lunn's
 suggestion]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>

26091eef

riscv: include generic support for MSI irqdomains · 251a4488

Committed by Wesley Terpstra on May 20, 2019

Some RISC-V systems include PCIe host controllers that support PCIe
message-signaled interrupts.  For this to work on Linux, we need to
enable PCI_MSI_IRQ_DOMAIN and define struct msi_alloc_info.  Support
for the latter is enabled by including the architecture-generic msi.h
header.

Signed-off-by: Wesley Terpstra <wesley@sifive.com>
[paul.walmsley@sifive.com: split initial patch into one arch/riscv
 patch and one drivers/pci patch]
Signed-off-by: Paul Walmsley <paul.walmsley@sifive.com>

251a4488

22 Jul, 2019 7 commits

arm64: entry: SP Alignment Fault doesn't write to FAR_EL1 · 40ca0ce5

Committed by James Morse on Jul 22, 2019

Comparing the arm-arm's pseudocode for AArch64.PCAlignmentFault() with
AArch64.SPAlignmentFault() shows that SP faults don't copy the faulty-SP
to FAR_EL1, but this is where we read from, and the address we provide
to user-space with the BUS_ADRALN signal.

For user-space this value will be UNKNOWN due to the previous ERET to
user-space. If the last value is preserved, on systems with KASLR or KPTI
this will be the user-space link-register left in FAR_EL1 by tramp_exit().
Fix this to retrieve the original sp_el0 value, and pass this to
do_sp_pc_fault().

SP alignment faults from EL1 will cause us to take the fault again when
trying to store the pt_regs. This eventually takes us to the overflow
stack. Remove the ESR_ELx_EC_SP_ALIGN check as we will never make it
this far.

Fixes: 60ffc30d ("arm64: Exception handling")
Signed-off-by: James Morse <james.morse@arm.com>
[will: change label name and fleshed out comment]
Signed-off-by: Will Deacon <will@kernel.org>

40ca0ce5

arm64: Force SSBS on context switch · cbdf8a18

Committed by Marc Zyngier on Jul 22, 2019

On a CPU that doesn't support SSBS, PSTATE[12] is RES0.  In a system
where only some of the CPUs implement SSBS, we end up losing track of
the SSBS bit across task migration.

To address this issue, let's force the SSBS bit on context switch.

Fixes: 8f04e8e6 ("arm64: ssbd: Add support for PSTATE.SSBS rather than trapping to EL3")
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
[will: inverted logic and added comments]
Signed-off-by: Will Deacon <will@kernel.org>

cbdf8a18

powerpc/papr_scm: Force a scm-unbind if initial scm-bind fails · 3a855b7a

Committed by Vaibhav Jain on Jun 29, 2019

In some cases the initial bind of scm memory for an lpar can fail if
it was not previously released using a scm-unbind hcall. This situation
can arise due to a panic of the previous kernel or a forced lpar
fadump. In such cases the H_SCM_BIND_MEM hcall returns an H_OVERLAP error.

To mitigate such cases the patch updates papr_scm_probe() to force a
call to drc_pmem_unbind() in case the initial bind of scm memory fails
with an EBUSY error. If the scm-bind operation fails again after the
forced scm-unbind, we follow the existing error path. We also
update drc_pmem_bind() to handle the H_OVERLAP error returned by phyp
and report it as an EBUSY error back to the caller.

Suggested-by: "Oliver O'Halloran" <oohall@gmail.com>
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190629160610.23402-4-vaibhav@linux.ibm.com
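
The recovery path this commit describes (bind, forced unbind on EBUSY, one more bind) can be modelled in a few lines. This is a user-space model of the control flow only, not the kernel code: `probe_bind` and its two callable parameters are hypothetical stand-ins for the probe path, drc_pmem_bind() and drc_pmem_unbind().

```python
import errno


def probe_bind(bind, unbind):
    """Model of the probe path: retry the bind once after a forced unbind.

    `bind` and `unbind` are hypothetical callables returning 0 on success
    or a negative errno value, mirroring the kernel convention.
    """
    rc = bind()
    if rc == -errno.EBUSY:
        # A stale binding (for example left by a crashed kernel) is assumed
        # here; release it and try exactly one more time.
        unbind()
        rc = bind()
    # Any remaining error goes down the existing error path unchanged.
    return rc
```

The retry is deliberately bounded to a single attempt, matching the message's statement that a second failure follows the existing error path.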

3a855b7a

powerpc/papr_scm: Update drc_pmem_unbind() to use H_SCM_UNBIND_ALL · 0d7fc080

Committed by Vaibhav Jain on Jun 29, 2019

A new hcall named H_SCM_UNBIND_ALL has been introduced that can
unbind all or specific scm memory assigned to an lpar. This is
more efficient than using H_SCM_UNBIND_MEM, as currently we don't
support partial unbind of scm memory.

Hence this patch proposes the following changes to drc_pmem_unbind():

    * Update drc_pmem_unbind() to replace hcall H_SCM_UNBIND_MEM with
      H_SCM_UNBIND_ALL.

    * Update drc_pmem_unbind() to handle cases when PHYP asks the guest
      kernel to wait for a specific amount of time before retrying the
      hcall via the 'LONG_BUSY' return value.

    * Ensure an appropriate error code is returned from the function
      in case of an error.

Reviewed-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190629160610.23402-3-vaibhav@linux.ibm.com
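
The 'LONG_BUSY' handling mentioned in the list above amounts to a retry loop that sleeps for the interval the hypervisor asks for. The following is a hedged user-space model, not the kernel implementation. The numeric return codes (0 for success, 1 for busy, 9900 through 9905 for long-busy hints of 1 ms up to 100 s) are assumptions made for this sketch, and `unbind_all`, `hcall` and `sleep_ms` are hypothetical names.

```python
# Assumed return codes for this model; check hvcall.h before relying on them.
H_SUCCESS = 0
H_BUSY = 1
H_LONG_BUSY_START = 9900  # assumed: wait about 1 ms
H_LONG_BUSY_END = 9905    # assumed: wait about 100 s


def long_busy_msecs(rc):
    """Map an assumed long-busy code to milliseconds: 1, 10, 100, ..."""
    return 10 ** (rc - H_LONG_BUSY_START)


def unbind_all(hcall, sleep_ms):
    """Repeat the hypothetical `hcall` until it stops reporting busy."""
    while True:
        rc = hcall()
        if H_LONG_BUSY_START <= rc <= H_LONG_BUSY_END:
            # The hypervisor asked for a specific delay before the retry.
            sleep_ms(long_busy_msecs(rc))
        elif rc == H_BUSY:
            # Plain busy: retry immediately.
            continue
        else:
            # Success or a real error; the caller translates it.
            return rc
```

Passing the sleep function in as a parameter keeps the model testable without real delays. A kernel implementation would also yield the CPU between retries.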

0d7fc080

powerpc/pseries: Update SCM hcall op-codes in hvcall.h · 6d140e75

Committed by Vaibhav Jain on Jun 29, 2019

Update hvcall.h to include op-codes for the new hcalls introduced to
manage SCM memory. Also update existing hcall definitions to reflect
the current papr specification for SCM.

The removed hcall op-codes H_SCM_MEM_QUERY and H_SCM_BLOCK_CLEAR were
transient proposals; support for them was never implemented by
Power-VM, nor were they used anywhere in the Linux kernel. Hence we don't
expect anyone to be impacted by this change.

Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190629160610.23402-2-vaibhav@linux.ibm.com

6d140e75

KVM: nVMX: Set cached_vmcs12 and cached_shadow_vmcs12 NULL after free · c6bf2ae9

Committed by Jan Kiszka on Jul 21, 2019

This should help find use-after-free bugs earlier.

Suggested-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

c6bf2ae9

KVM: X86: Dynamically allocate user_fpu · d9a710e5

Committed by Wanpeng Li on Jul 22, 2019

After reverting commit 240c35a3 (kvm: x86: Use task structs fpu field
for user), struct kvm_vcpu is 19456 bytes on my server. PAGE_ALLOC_COSTLY_ORDER (3)
is the order at which allocations are deemed costly to service. In serverless
scenarios, one host can service hundreds or thousands of firecracker/kata-container
instances; however, new instances fail to launch once memory is too
fragmented to allocate the kvm_vcpu struct on the host. This was observed in some
cloud provider production environments.

This patch dynamically allocates user_fpu; kvm_vcpu is now 15168 bytes on my
Skylake server.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
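
The two sizes quoted in this commit fall on different sides of a power-of-two page boundary, which a short calculation makes concrete. This sketch assumes 4 KiB pages (typical for x86-64); `allocation_order` is a hypothetical helper that rounds up to the smallest power-of-two page count covering the request.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def allocation_order(size):
    """Smallest order such that (1 << order) pages can hold `size` bytes."""
    pages = -(-size // PAGE_SIZE)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


# Sizes quoted in the commit message.
before = allocation_order(19456)  # 5 pages needed, so an 8-page block
after = allocation_order(15168)   # 4 pages needed, so a 4-page block
```

Under the 4 KiB assumption, `before` is 3 and `after` is 2. The struct no longer needs a contiguous block at PAGE_ALLOC_COSTLY_ORDER, and a 4-page block is easier to find on a fragmented host than an 8-page one.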

d9a710e5
