1. 03 June 2020, 7 commits
  2. 29 May 2020, 2 commits
    • x86/ioperm: Prevent a memory leak when fork fails · 4bfe6cce
      Committed by Jay Lang
      In the copy_process() routine called by _do_fork(), a failure to
      allocate a PID (or a failure further along in the function) triggers
      an invocation of exit_thread(). This is done to clean up from an
      earlier call to copy_thread_tls(). Naturally, the child task is
      passed into exit_thread(); however, during that cleanup,
      io_bitmap_exit() nullifies the parent's io_bitmap rather than the
      child's.
      
      As copy_thread_tls() has been called ahead of the failure, the reference
      count on the calling thread's io_bitmap is incremented as we would expect.
      However, io_bitmap_exit() doesn't accept any arguments, and thus assumes
      it should trash the current thread's io_bitmap reference rather than the
      child's. This is pretty sneaky in practice, because in all instances but
      this one, exit_thread() is called with respect to the current task and
      everything works out.
      
      A determined attacker can issue an appropriate ioctl (e.g. KDENABIO) to
      get a bitmap allocated, and force a clone3() syscall to fail by passing
      in a zeroed clone_args structure. The kernel handles the erroneous struct
      and the buggy code path is followed, and even though the parent's reference
      to the io_bitmap is trashed, the child still holds a reference and thus
      the structure will never be freed.
      
      Fix this by tweaking io_bitmap_exit() and its subroutines to accept a
      task_struct argument specifying the task to operate on.
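
      A minimal sketch of the reworked helper; the refcounting details are
      assumed from the surrounding io_bitmap code rather than copied from
      the diff:

        /* Sketch: operate on the task handed in by exit_thread(), not on
         * current, so the child's reference is the one that gets dropped. */
        void io_bitmap_exit(struct task_struct *tsk)
        {
                struct io_bitmap *iobm = tsk->thread.io_bitmap;

                tsk->thread.io_bitmap = NULL;
                if (iobm && refcount_dec_and_test(&iobm->refcnt))
                        kfree(iobm);
        }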
      
      Fixes: ea5f1cd7 ("x86/ioperm: Remove bitmap if all permissions dropped")
      Signed-off-by: Jay Lang <jaytlang@mit.edu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20200524162742.253727-1-jaytlang@mit.edu
    • x86/dma: Fix max PFN arithmetic overflow on 32 bit systems · 88743470
      Committed by Alexander Dahl
      The intermediate result of the old term (4UL * 1024 * 1024 * 1024) is
      4 294 967 296 or 0x100000000 which is no problem on 64 bit systems.
      The patch does not change the later overall result of 0x100000 for
      MAX_DMA32_PFN (after it has been shifted by PAGE_SHIFT). The new
      calculation yields the same result, but does not require 64 bit
      arithmetic.
      
      On 32 bit systems the old calculation suffers from an arithmetic
      overflow in that intermediate term in parentheses: 4UL, i.e. unsigned
      long int, is 4 bytes wide, so the 0x100000000 does not fit and the
      parenthesized result is truncated to zero. The following right shift
      does not alter that, so MAX_DMA32_PFN evaluates to 0 on 32 bit
      systems.
      
      That wrong value breaks the comparison against MAX_DMA32_PFN in the
      swiotlb init code, in pci_swiotlb_detect_4gb(), which decides whether
      swiotlb should be active: when compiled on 32 bit systems, that
      comparison yields the opposite result.
      
      This was not possible before
      
        1b7e03ef ("x86, NUMA: Enable emulation on 32bit too")
      
      which first made MAX_DMA32_PFN visible to x86_32 (a change which
      landed in v3.0).
      
      In practice this wasn't a problem, unless CONFIG_SWIOTLB is active on
      x86-32.
      
      However if one has set CONFIG_IOMMU_INTEL, since
      
        c5a5dc4c ("iommu/vt-d: Don't switch off swiotlb if bounce page is used")
      
      there's a dependency on CONFIG_SWIOTLB, which was not necessarily
      active before. That landed in v5.4, where we noticed it in the fli4l
      Linux distribution. We have CONFIG_IOMMU_INTEL active on both 32 and 64
      bit kernel configs there (I could not find out why, so let's just say
      historical reasons).
      
      The effect is that at boot time 64 MiB (the default size) is now
      allocated for bounce buffers, which is a noticeable amount of memory
      on small systems like the pcengines ALIX 2D3 with 256 MiB of memory,
      boards that are still frequently used as home routers.
      
      We noticed this effect when migrating from kernel v4.19 (LTS) to
      v5.4 (LTS) in fli4l and got kernel messages like these:
      
        Linux version 5.4.22 (buildroot@buildroot) (gcc version 7.3.0 (Buildroot 2018.02.8)) #1 SMP Mon Nov 26 23:40:00 CET 2018
        …
        Memory: 183484K/261756K available (4594K kernel code, 393K rwdata, 1660K rodata, 536K init, 456K bss, 78272K reserved, 0K cma-reserved, 0K highmem)
        …
        PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
        software IO TLB: mapped [mem 0x0bb78000-0x0fb78000] (64MB)
      
      The initial analysis and the suggested fix were done by user
      'sourcejedi' on the Unix & Linux Stack Exchange and explicitly marked
      as GPLv2 for inclusion in the Linux kernel:
      
        https://unix.stackexchange.com/a/520525/50007
      
      The new calculation, which does not suffer from that overflow, is the
      same as for arch/mips now as suggested by Robin Murphy.
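
      A sketch of the before/after definitions (the new one mirrors the
      arch/mips definition, as noted above):

        /* Before: the parenthesized product is 0x100000000, which overflows
         * the 4-byte unsigned long on 32 bit and truncates to 0 before the
         * shift ever happens:
         *
         *   #define MAX_DMA32_PFN ((4UL * 1024 * 1024 * 1024) >> PAGE_SHIFT)
         *
         * After: shift first, so no intermediate value needs more than
         * 32 bits; with PAGE_SHIFT == 12 this still yields 0x100000. */
        #define MAX_DMA32_PFN (1UL << (32 - PAGE_SHIFT))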
      
      The fix was tested by fli4l users on around two dozen different
      systems, including both 32 and 64 bit archs, on bare metal as well
      as on virtualized machines.
      
       [ bp: Massage commit message. ]
      
      Fixes: 1b7e03ef ("x86, NUMA: Enable emulation on 32bit too")
      Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com>
      Suggested-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Alexander Dahl <post@lespocky.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: stable@vger.kernel.org
      Link: https://unix.stackexchange.com/q/520065/50007
      Link: https://web.nettworks.org/bugs/browse/FFL-2560
      Link: https://lkml.kernel.org/r/20200526175749.20742-1-post@lespocky.de
  3. 28 May 2020, 1 commit
  4. 27 May 2020, 1 commit
  5. 26 May 2020, 1 commit
  6. 24 May 2020, 1 commit
  7. 23 May 2020, 1 commit
    • x86/unwind/orc: Fix unwind_get_return_address_ptr() for inactive tasks · 187b96db
      Committed by Josh Poimboeuf
      Normally, show_trace_log_lvl() scans the stack, looking for text
      addresses to print.  In parallel, it unwinds the stack with
      unwind_next_frame().  If the stack address matches the pointer returned
      by unwind_get_return_address_ptr() for the current frame, the text
      address is printed normally without a question mark.  Otherwise it's
      considered a breadcrumb (potentially from a previous call path) and it's
      printed with a question mark to indicate that the address is unreliable
      and typically can be ignored.
      
      Since the following commit:
      
        f1d9a2ab ("x86/unwind/orc: Don't skip the first frame for inactive tasks")
      
      ... for inactive tasks, show_trace_log_lvl() prints *only* unreliable
      addresses (prepended with '?').
      
      That happens because, for the first frame of an inactive task,
      unwind_get_return_address_ptr() returns the wrong return address
      pointer: one word *below* the task stack pointer.  show_trace_log_lvl()
      starts scanning at the stack pointer itself, so it never finds the first
      'reliable' address, causing only guesses to be printed.
      
      The first frame of an inactive task isn't a normal stack frame.  It's
      actually just an instance of 'struct inactive_task_frame' which is left
      behind by __switch_to_asm().  Now that this inactive frame is actually
      exposed to callers, fix unwind_get_return_address_ptr() to interpret it
      properly.
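
      A sketch of the fixed lookup; the condition is an approximation based
      on the description above, with struct inactive_task_frame being the
      frame left behind by __switch_to_asm():

        unsigned long *unwind_get_return_address_ptr(struct unwind_state *state)
        {
                struct task_struct *task = state->task;

                if (unwind_done(state))
                        return NULL;

                if (state->regs)
                        return &state->regs->ip;

                /* First frame of an inactive task: not a normal stack
                 * frame, but a struct inactive_task_frame at the saved
                 * stack pointer. Point at its return address member. */
                if (task != current && state->sp == task->thread.sp) {
                        struct inactive_task_frame *frame = (void *)task->thread.sp;

                        return &frame->ret_addr;
                }

                return (unsigned long *)state->sp - 1;
        }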
      
      Fixes: f1d9a2ab ("x86/unwind/orc: Don't skip the first frame for inactive tasks")
      Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200522135435.vbxs7umku5pyrdbk@treble
  8. 20 May 2020, 1 commit
  9. 16 May 2020, 1 commit
  10. 15 May 2020, 3 commits
    • bpf: Restrict bpf_probe_read{, str}() only to archs where they work · 0ebeea8c
      Committed by Daniel Borkmann
      Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs
      with overlapping address ranges, we should really take the next step to
      disable them from BPF use there.
      
      To generally fix the situation, we've recently added new helper variants
      bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str().
      For details on them, see 6ae08ae3 ("bpf: Add probe_read_{user, kernel}
      and probe_read_{user,kernel}_str helpers").
      
      Given bpf_probe_read{,str}() have been around for ~5 years by now, there
      are plenty of users at least on x86 still relying on them today, so we
      cannot remove them entirely w/o breaking the BPF tracing ecosystem.
      
      However, their use should be restricted to archs with non-overlapping
      address ranges where they are working in their current form. Therefore,
      move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
      option and have x86, arm64 and arm select it (other archs supporting
      it can follow up as well).
      
      The remaining archs can work around this easily by relying on the
      feature probe from bpftool, which spills out defines that can be used
      from BPF C code to implement a drop-in replacement for old/new
      kernels via: bpftool feature probe macro
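
      A hypothetical drop-in from BPF C code; the HAVE_* guard below stands
      in for whatever define the bpftool feature probe emits on a given
      kernel, and its name here is an assumption for illustration:

        /* Assumption: fall back to the legacy helper on kernels where the
         * probe did not report bpf_probe_read_kernel(). */
        #ifndef HAVE_BPF_PROBE_READ_KERNEL
        #define bpf_probe_read_kernel(dst, size, src) bpf_probe_read(dst, size, src)
        #endif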

      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net
    • x86: Fix early boot crash on gcc-10, third try · a9a3ed1e
      Committed by Borislav Petkov
      ... or the odyssey of trying to disable the stack protector for the
      function which generates the stack canary value.
      
      The whole story started with Sergei reporting a boot crash with a kernel
      built with gcc-10:
      
        Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
        CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc5-00235-gfffb08b3 #139
        Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./H77M-D3H, BIOS F12 11/14/2013
        Call Trace:
          dump_stack
          panic
          ? start_secondary
          __stack_chk_fail
          start_secondary
          secondary_startup_64
        ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: start_secondary
      
      This happens because gcc-10 tail-call optimizes the last function call
      in start_secondary() - cpu_startup_entry() - and thus emits a stack
      canary check which fails because the canary value changes after the
      boot_init_stack_canary() call.
      
      To fix that, the initial attempt was to mark the one function which
      generates the stack canary with:
      
        __attribute__((optimize("-fno-stack-protector"))) ... start_secondary(void *unused)
      
      however, using the optimize attribute doesn't work cumulatively
      as the attribute does not add to but rather replaces previously
      supplied optimization options - roughly all -fxxx options.
      
      The key one among them is -fno-omit-frame-pointer, and dropping it
      leads to a missing frame pointer - a frame pointer which the kernel
      needs.
      
      The next attempt to prevent compilers from tail-call optimizing
      the last function call cpu_startup_entry(), shy of carving out
      start_secondary() into a separate compilation unit and building it with
      -fno-stack-protector, was to add an empty asm("").
      
      This solution was short and sweet and, reportedly, is supported by
      both compilers, but we didn't get very far this time: future (LTO?)
      optimization passes could potentially eliminate this, which leads us
      to the third attempt: having an actual memory barrier there which the
      compiler cannot ignore or move around etc.
      
      That should hold for a long time, but hey we said that about the other
      two solutions too so...
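
      A sketch of that third attempt, using the macro name from the commit
      and eliding the rest of start_secondary():

        /* include/linux/compiler.h: a full memory barrier the compiler
         * can neither ignore nor move around. */
        #define prevent_tail_call_optimization()        mb()

        static void notrace start_secondary(void *unused)
        {
                /* ... bring-up, including boot_init_stack_canary() ... */
                cpu_startup_entry(CPUHP_AP_ONLINE_IDLE);
                /* Keeps the call above from becoming a tail call; since
                 * cpu_startup_entry() never returns, the epilogue's stack
                 * canary check is never reached. */
                prevent_tail_call_optimization();
        }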

      Reported-by: Sergei Trofimovich <slyfox@gentoo.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Tested-by: Kalle Valo <kvalo@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200314164451.346497-1-slyfox@gentoo.org
    • x86/unwind/orc: Fix error handling in __unwind_start() · 71c95825
      Committed by Josh Poimboeuf
      The unwind_state 'error' field is used to inform the reliable unwinding
      code that the stack trace can't be trusted.  Set this field for all
      errors in __unwind_start().
      
      Also, move the zeroing out of the unwind_state struct to before the ORC
      table initialization check, to prevent the caller from reading
      uninitialized data if the ORC table is corrupted.
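
      A sketch of the reordered flow; the labels and fields approximate the
      actual function:

        void __unwind_start(struct unwind_state *state, struct task_struct *task,
                            struct pt_regs *regs, unsigned long *first_frame)
        {
                /* Zero everything first, so an early error return cannot
                 * leave the caller looking at uninitialized data. */
                memset(state, 0, sizeof(*state));
                state->task = task;

                if (!orc_init)
                        goto err;

                /* ... locate the first frame and return on success ... */
                return;

        err:
                /* Tell reliable unwinders not to trust this trace. */
                state->error = true;
                state->stack_info.type = STACK_TYPE_UNKNOWN;
        }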
      
      Fixes: af085d90 ("stacktrace/x86: add function for detecting reliable stack traces")
      Fixes: d3a09104 ("x86/unwinder/orc: Dont bail on stack overflow")
      Fixes: 98d0c8eb ("x86/unwind/orc: Prevent unwinding before ORC initialization")
      Reported-by: Pavel Machek <pavel@denx.de>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/d6ac7215a84ca92b895fdd2e1aa546729417e6e6.1589487277.git.jpoimboe@redhat.com
  11. 14 May 2020, 1 commit
    • x86/boot: Mark global variables as static · e78d334a
      Committed by Arvind Sankar
      Mike Lothian reports that after commit
        964124a9 ("efi/x86: Remove extra headroom for setup block")
      gcc 10.1.0 fails with
      
        HOSTCC  arch/x86/boot/tools/build
        /usr/lib/gcc/x86_64-pc-linux-gnu/10.1.0/../../../../x86_64-pc-linux-gnu/bin/ld:
        error: linker defined: multiple definition of '_end'
        /usr/lib/gcc/x86_64-pc-linux-gnu/10.1.0/../../../../x86_64-pc-linux-gnu/bin/ld:
        /tmp/ccEkW0jM.o: previous definition here
        collect2: error: ld returned 1 exit status
        make[1]: *** [scripts/Makefile.host:103: arch/x86/boot/tools/build] Error 1
        make: *** [arch/x86/Makefile:303: bzImage] Error 2
      
      The issue is with the _end variable that was added to hold the end of
      the compressed kernel from zoffsets.h (ZO__end). The name clashes with
      the linker-defined _end symbol that marks the end of the build program
      itself.
      
      Even when there is no compile-time error, this causes the build tool
      to use memory past the end of its .bss section.
      
      To solve this, mark _end as static, and for symmetry, mark the rest of
      the variables that keep track of symbols from the compressed kernel as
      static as well.
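
      A sketch of the change in arch/x86/boot/tools/build.c; the
      neighboring variable name is an assumption based on the description:

        /* "static" keeps these host-tool variables from colliding with
         * linker-defined symbols such as _end. */
        static unsigned long _ehead;
        static unsigned long _end;      /* ZO__end: end of the compressed kernel */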
      
      Fixes: 964124a9 ("efi/x86: Remove extra headroom for setup block")
      Reported-by: Mike Lothian <mike@fireburn.co.uk>
      Tested-by: Mike Lothian <mike@fireburn.co.uk>
      Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
      Link: https://lore.kernel.org/r/20200511225849.1311869-1-nivedita@alum.mit.edu
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
  12. 13 May 2020, 3 commits
    • KVM: x86: Fix pkru save/restore when guest CR4.PKE=0, move it to x86.c · 37486135
      Committed by Babu Moger
      Though rdpkru and wrpkru are contingent upon CR4.PKE, the PKRU
      resource isn't. It can be read with XSAVE and written with XRSTOR.
      So, if we don't set the guest PKRU value here (kvm_load_guest_xsave_state),
      the guest can read the host value.
      
      In the case of kvm_load_host_xsave_state, a guest with CR4.PKE clear
      could potentially use XRSTOR to change the host PKRU value.
      
      While at it, move pkru state save/restore to common code and the
      host_pkru field to kvm_vcpu_arch.  This will let SVM support protection keys.
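
      A sketch of the common load path described above; the exact guards
      are approximated from the commit description:

        void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu)
        {
                /* ... XCR0/XSS handling ... */

                /* PKRU is switched via XSAVE/XRSTOR even when CR4.PKE = 0,
                 * so load the guest value whenever the guest could observe
                 * or modify it. */
                if (static_cpu_has(X86_FEATURE_PKU) &&
                    (kvm_read_cr4_bits(vcpu, X86_CR4_PKE) ||
                     (vcpu->arch.xcr0 & XFEATURE_MASK_PKRU)) &&
                    vcpu->arch.pkru != vcpu->arch.host_pkru)
                        __write_pkru(vcpu->arch.pkru);
        }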
      
      Cc: stable@vger.kernel.org
      Reported-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Message-Id: <158932794619.44260.14508381096663848853.stgit@naples-babu.amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/hyperv: Properly suspend/resume reenlightenment notifications · 38dce419
      Committed by Vitaly Kuznetsov
      Errors during hibernation with reenlightenment notifications enabled were
      reported:
      
       [   51.730435] PM: hibernation entry
       [   51.737435] PM: Syncing filesystems ...
       ...
       [   54.102216] Disabling non-boot CPUs ...
       [   54.106633] smpboot: CPU 1 is now offline
       [   54.110006] unchecked MSR access error: WRMSR to 0x40000106 (tried to
           write 0x47c72780000100ee) at rIP: 0xffffffff90062f24
           (native_write_msr+0x4/0x20)
       [   54.110006] Call Trace:
       [   54.110006]  hv_cpu_die+0xd9/0xf0
       ...
      
      Normally, hv_cpu_die() just reassigns reenlightenment notifications to
      some other CPU when the CPU receiving them goes offline. Upon
      hibernation, there is no other CPU which is still online, so
      cpumask_any_but(cpu_online_mask) returns >= nr_cpu_ids, and using that
      as an index into hv_vp_index is incorrect. Disable the feature when
      cpumask_any_but() fails.
      
      Also, as we now disable reenlightenment notifications upon
      hibernation, we need to restore them on resume. Check if
      hv_reenlightenment_cb was previously set and, if so, restore it from
      hv_resume().
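
      A sketch of hv_cpu_die() with the fix applied; the structure and
      names follow the existing reenlightenment code as described above,
      with unrelated teardown elided:

        static int hv_cpu_die(unsigned int cpu)
        {
                struct hv_reenlightenment_control re_ctrl;
                unsigned int new_cpu;

                if (hv_reenlightenment_cb == NULL)
                        return 0;

                rdmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
                if (re_ctrl.target_vp == hv_vp_index[cpu]) {
                        /* Reassign to some other online CPU ... */
                        new_cpu = cpumask_any_but(cpu_online_mask, cpu);

                        if (new_cpu < nr_cpu_ids)
                                re_ctrl.target_vp = hv_vp_index[new_cpu];
                        else
                                re_ctrl.enabled = 0;    /* ... or disable */

                        wrmsrl(HV_X64_MSR_REENLIGHTENMENT_CONTROL, *((u64 *)&re_ctrl));
                }
                return 0;
        }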

      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Dexuan Cui <decui@microsoft.com>
      Reviewed-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
      Link: https://lore.kernel.org/r/20200512160153.134467-1-vkuznets@redhat.com
      Signed-off-by: Wei Liu <wei.liu@kernel.org>
    • x86/ftrace: Have ftrace trampolines turn read-only at the end of system boot up · 59566b0b
      Committed by Steven Rostedt (VMware)
      Booting one of my machines, it triggered the following crash:
      
       Kernel/User page tables isolation: enabled
       ftrace: allocating 36577 entries in 143 pages
       Starting tracer 'function'
       BUG: unable to handle page fault for address: ffffffffa000005c
       #PF: supervisor write access in kernel mode
       #PF: error_code(0x0003) - permissions violation
       PGD 2014067 P4D 2014067 PUD 2015063 PMD 7b253067 PTE 7b252061
       Oops: 0003 [#1] PREEMPT SMP PTI
       CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.0-test+ #24
       Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS SDBLI944.86P 05/08/2007
       RIP: 0010:text_poke_early+0x4a/0x58
        Code: 34 24 48 89 54 24 08 e8 bf 72 0b 00 48 8b 34 24 48 8b 4c 24 08 84 c0 74 0b 48 89 df f3 a4 48 83 c4 10 5b c3 9c 58 fa 48 89 df <f3> a4 50 9d 48 83 c4 10 5b e9 d6 f9 ff ff 0 41 57 49
       RSP: 0000:ffffffff82003d38 EFLAGS: 00010046
       RAX: 0000000000000046 RBX: ffffffffa000005c RCX: 0000000000000005
       RDX: 0000000000000005 RSI: ffffffff825b9a90 RDI: ffffffffa000005c
       RBP: ffffffffa000005c R08: 0000000000000000 R09: ffffffff8206e6e0
       R10: ffff88807b01f4c0 R11: ffffffff8176c106 R12: ffffffff8206e6e0
       R13: ffffffff824f2440 R14: 0000000000000000 R15: ffffffff8206eac0
       FS:  0000000000000000(0000) GS:ffff88807d400000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffffffa000005c CR3: 0000000002012000 CR4: 00000000000006b0
       Call Trace:
        text_poke_bp+0x27/0x64
        ? mutex_lock+0x36/0x5d
        arch_ftrace_update_trampoline+0x287/0x2d5
        ? ftrace_replace_code+0x14b/0x160
        ? ftrace_update_ftrace_func+0x65/0x6c
        __register_ftrace_function+0x6d/0x81
        ftrace_startup+0x23/0xc1
        register_ftrace_function+0x20/0x37
        func_set_flag+0x59/0x77
        __set_tracer_option.isra.19+0x20/0x3e
        trace_set_options+0xd6/0x13e
        apply_trace_boot_options+0x44/0x6d
        register_tracer+0x19e/0x1ac
        early_trace_init+0x21b/0x2c9
        start_kernel+0x241/0x518
        ? load_ucode_intel_bsp+0x21/0x52
        secondary_startup_64+0xa4/0xb0
      
      I was able to trigger it on other machines as well, by adding both
      "ftrace=function" and "trace_options=func_stack_trace" to the kernel
      command line.
      
      The cause is that "ftrace=function" registers the function tracer
      and creates a trampoline, which it sets executable and read-only.
      Then "trace_options=func_stack_trace" updates the same trampoline
      to the stack-tracing version of the function tracer. But since the
      trampoline already exists, it is updated with text_poke_bp(). The
      problem is that when text_poke_bp() is called while system_state ==
      SYSTEM_BOOTING, it simply does a memcpy() and does not use the
      special page mapping, as it assumes the text is still read-write.
      But in this case it is not, and we take a fault and crash.
      
      Instead, let's keep the ftrace trampolines read-write during boot up,
      and then, when the kernel executable text is set to read-only, set
      the ftrace trampolines read-only as well.
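
      A sketch of both halves; the helper and flag names follow the commit
      description, and the internals are elided:

        /* When creating/updating a trampoline: only seal it after boot,
         * since early text_poke_bp() writes are plain memcpy()s that
         * need the mapping to stay read-write. */
        if (likely(system_state != SYSTEM_BOOTING))
                set_memory_ro((unsigned long)trampoline, npages);
        set_memory_x((unsigned long)trampoline, npages);

        /* Called once kernel text goes read-only (mark_rodata_ro()) to
         * walk the registered ftrace_ops and seal their trampolines: */
        void set_ftrace_ops_ro(void);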
      
      Link: https://lkml.kernel.org/r/20200430202147.4dc6e2de@oasis.local.home
      
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: stable@vger.kernel.org
      Fixes: 768ae440 ("x86/ftrace: Use text_poke()")
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  13. 08 May 2020, 7 commits
    • KVM: SVM: Disable AVIC before setting V_IRQ · 7d611233
      Committed by Suravee Suthikulpanit
      Commit 64b5bd27 ("KVM: nSVM: ignore L1 interrupt window
      while running L2 with V_INTR_MASKING=1") introduced a WARN_ON
      which fires if AVIC is enabled when trying to set V_IRQ in the
      VMCB to enable the irq window.
      
      The following warning is triggered because the requesting vcpu (the
      one asking to deactivate AVIC) does not get to process the APICv
      update request for itself until the next #vmexit.
      
      WARNING: CPU: 0 PID: 118232 at arch/x86/kvm/svm/svm.c:1372 enable_irq_window+0x6a/0xa0 [kvm_amd]
       RIP: 0010:enable_irq_window+0x6a/0xa0 [kvm_amd]
       Call Trace:
        kvm_arch_vcpu_ioctl_run+0x6e3/0x1b50 [kvm]
        ? kvm_vm_ioctl_irq_line+0x27/0x40 [kvm]
        ? _copy_to_user+0x26/0x30
        ? kvm_vm_ioctl+0xb3e/0xd90 [kvm]
        ? set_next_entity+0x78/0xc0
        kvm_vcpu_ioctl+0x236/0x610 [kvm]
        ksys_ioctl+0x8a/0xc0
        __x64_sys_ioctl+0x1a/0x20
        do_syscall_64+0x58/0x210
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fix this by sending the APICv update request to all other vcpus, and
      immediately updating APICv for the requesting vcpu itself.
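
      A sketch of that flow; the request constant exists in the APICv code,
      while the surrounding helper is approximated:

        /* Ask every *other* vcpu to refresh its APICv state at the next
         * opportunity, then apply the update to this vcpu synchronously
         * so AVIC is really off before V_IRQ is written. */
        kvm_make_all_cpus_request_except(vcpu->kvm, KVM_REQ_APICV_UPDATE, vcpu);
        kvm_vcpu_update_apicv(vcpu);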

      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Link: https://lkml.org/lkml/2020/5/2/167
      Fixes: 64b5bd27 ("KVM: nSVM: ignore L1 interrupt window while running L2 with V_INTR_MASKING=1")
      Message-Id: <1588818939-54264-1-git-send-email-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Introduce kvm_make_all_cpus_request_except() · 54163a34
      Committed by Suravee Suthikulpanit
      This allows making a request to all vcpus except the one specified
      as a parameter.
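
      A sketch of the new API's shape; the kick/IPI machinery of the
      existing kvm_make_all_cpus_request() is elided:

        bool kvm_make_all_cpus_request_except(struct kvm *kvm, unsigned int req,
                                              struct kvm_vcpu *except)
        {
                struct kvm_vcpu *vcpu;
                int i;

                kvm_for_each_vcpu(i, vcpu, kvm) {
                        if (vcpu == except)
                                continue;       /* skip the excluded vcpu */
                        kvm_make_request(req, vcpu);
                        /* ... kick as kvm_make_all_cpus_request() does ... */
                }
                return true;
        }

      kvm_make_all_cpus_request() can then simply call this with a NULL
      "except" argument.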

      Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Message-Id: <1588771076-73790-2-git-send-email-suravee.suthikulpanit@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: pass correct DR6 for GD userspace exit · 45981ded
      Committed by Paolo Bonzini
      When KVM_EXIT_DEBUG is raised for the disabled-breakpoints case (DR7.GD),
      DR6 was incorrectly copied from the value in the VM.  Instead,
      DR6.BD should be set in order to catch this case.
      
      On AMD this does not need any special code because the processor triggers
      a #DB exception that is intercepted.  However, the testcase would fail
      without the previous patch because both DR6.BS and DR6.BD would be set.

      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86, SVM: isolate vcpu->arch.dr6 from vmcb->save.dr6 · d67668e9
      Committed by Paolo Bonzini
      There are two issues with KVM_EXIT_DEBUG on AMD, whose root cause is the
      different handling of DR6 on intercepted #DB exceptions on Intel and AMD.
      
      On Intel, #DB exceptions transmit the DR6 value via the exit qualification
      field of the VMCS, and the exit qualification only contains the description
      of the precise event that caused a vmexit.
      
      On AMD, instead the DR6 field of the VMCB is filled in as if the #DB exception
      was to be injected into the guest.  This has two effects when guest debugging
      is in use:
      
      * the guest DR6 is clobbered
      
      * the kvm_run->debug.arch.dr6 field can accumulate more debug events, rather
      than just the last one that happened (the testcase in the next patch covers
      this issue).
      
      This patch fixes both issues by emulating, so to speak, the Intel behavior
      on AMD processors.  The important observation is that (after the previous
      patches) the VMCB value of DR6 is only ever observable from the guest if
      KVM_DEBUGREG_WONT_EXIT is set.  Therefore we can actually set vmcb->save.dr6
      to any value we want as long as KVM_DEBUGREG_WONT_EXIT is clear, which it
      will be if guest debugging is enabled.
      
      Therefore it is possible to enter the guest with an all-zero DR6,
      reconstruct the #DB payload from the DR6 we get at exit time, and let
      kvm_deliver_exception_payload move the newly set bits into vcpu->arch.dr6.
      Some extra bits may be included in the payload if KVM_DEBUGREG_WONT_EXIT
      is set, but this is harmless.
      
      This may not be the most optimized way to deal with this, but it is
      simple and, being confined within SVM code, it gets rid of the set_dr6
      callback and kvm_update_dr6.

      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: keep DR6 synchronized with vcpu->arch.dr6 · 5679b803
      Committed by Paolo Bonzini
      kvm_x86_ops.set_dr6 is only ever called with vcpu->arch.dr6 as the
      second argument.  Ensure that the VMCB value is synchronized to
      vcpu->arch.dr6 on #DB (both "normal" and nested) and nested vmentry, so
      that the current value of DR6 is always available in vcpu->arch.dr6.
      The get_dr6 callback can just access vcpu->arch.dr6 and becomes redundant.
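
      A sketch of the synchronization point in the SVM #DB intercept; the
      surrounding handler is elided:

        static int db_interception(struct vcpu_svm *svm)
        {
                struct kvm_vcpu *vcpu = &svm->vcpu;

                /* Whatever the hardware wrote into the VMCB on this #DB
                 * becomes visible in vcpu->arch.dr6 right away. */
                vcpu->arch.dr6 = svm->vmcb->save.dr6;

                /* ... existing #DB handling ... */
                return 1;
        }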

      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • crypto: lib/sha1 - remove unnecessary includes of linux/cryptohash.h · 2aaba014
      Committed by Eric Biggers
      <linux/cryptohash.h> sounds very generic and important, like it's the
      header to include if you're doing cryptographic hashing in the kernel.
      But actually it only includes the library implementation of the SHA-1
      compression function (not even the full SHA-1).  This should basically
      never be used anymore; SHA-1 is no longer considered secure, and there
      are much better ways to do cryptographic hashing in the kernel.
      
      Most files that include this header don't actually need it.  So in
      preparation for removing it, remove all these unneeded includes of it.

      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
    • arch/x86/kvm/svm/sev.c: change flag passed to GUP fast in sev_pin_memory() · 996ed22c
      Committed by Janakarajan Natarajan
      When trying to lock read-only pages, sev_pin_memory() fails because
      FOLL_WRITE is used as the flag for get_user_pages_fast().
      
      Commit 73b0140b ("mm/gup: change GUP fast to use flags rather than a
      write 'bool'") updated the get_user_pages_fast() call sites to use
      flags, but incorrectly updated the call in sev_pin_memory().  As the
      original coding of this call was correct, revert the change made by that
      commit.
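
      A sketch of the corrected call, restoring the pre-73b0140b semantics:

        /* Only request write access when the caller asked for a writable
         * mapping; read-only pages can then be pinned again. */
        npinned = get_user_pages_fast(uaddr, npages,
                                      write ? FOLL_WRITE : 0, pages);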
      
      Fixes: 73b0140b ("mm/gup: change GUP fast to use flags rather than a write 'bool'")
      Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Mike Marshall <hubcap@omnibond.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Link: http://lkml.kernel.org/r/20200423152419.87202-1-Janakarajan.Natarajan@amd.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 07 May 2020, 8 commits
  15. 06 May 2020, 2 commits