提交 · 87d5986345219a7e4f204726d9085ea87f3e22d0 · openeuler / Kernel

23 11月, 2020 1 次提交

s390/mm: remove set_fs / rework address space handling · 87d59863

由 Heiko Carstens 提交于 11月 16, 2020

Remove set_fs support from s390. With doing this rework address space
handling and simplify it. As a result address spaces are now setup
like this:

CPU running in              | %cr1 ASCE | %cr7 ASCE | %cr13 ASCE
----------------------------|-----------|-----------|-----------
user space                  |  user     |  user     |  kernel
kernel, normal execution    |  kernel   |  user     |  kernel
kernel, kvm guest execution |  gmap     |  user     |  kernel

To achieve this the getcpu vdso syscall is removed in order to avoid
secondary address mode and a separate vdso address space in for user
space. The getcpu vdso syscall will be implemented differently with a
subsequent patch.

The kernel accesses user space always via secondary address space.
This happens in different ways:
- with mvcos in home space mode and directly read/write to secondary
  address space
- with mvcs/mvcp in primary space mode and copy from primary space to
  secondary space or vice versa
- with e.g. cs in secondary space mode and access secondary space

Switching translation modes happens with sacf before and after
instructions which access user space, like before.

Lazy handling of control register reloading is removed in the hope to
make everything simpler, but at the cost of making kernel entry and
exit a bit slower. That is: on kernel entry the primary asce is always
changed to contain the kernel asce, and on kernel exit the primary
asce is changed again so it contains the user asce.

In kernel mode there is only one exception to the primary asce: when
kvm guests are executed the primary asce contains the gmap asce (which
describes the guest address space). The primary asce is reset to
kernel asce whenever kvm guest execution is interrupted, so that this
doesn't has to be taken into account for any user space accesses.
Reviewed-by: NSven Schnelle <svens@linux.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

87d59863

21 11月, 2020 3 次提交

s390/vmem: make variable and function names consistent · 12bb4c68

由 Alexander Gordeev 提交于 11月 10, 2020

Rename some variable and functions to better clarify
what they are and what they do.
Signed-off-by: NAlexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

12bb4c68

s390/vmem: remove redundant check · af71657c

由 Alexander Gordeev 提交于 11月 10, 2020

Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Signed-off-by: NAlexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

af71657c

s390: unify identity mapping limits handling · 73045a08

由 Vasily Gorbik 提交于 10月 19, 2020

Currently we have to consider too many different values which
in the end only affect identity mapping size. These are:
1. max_physmem_end - end of physical memory online or standby.
   Always <= end of the last online memory block (get_mem_detect_end()).
2. CONFIG_MAX_PHYSMEM_BITS - the maximum size of physical memory the
   kernel is able to support.
3. "mem=" kernel command line option which limits physical memory usage.
4. OLDMEM_BASE which is a kdump memory limit when the kernel is executed as
   crash kernel.
5. "hsa" size which is a memory limit when the kernel is executed during
   zfcp/nvme dump.

Through out kernel startup and run we juggle all those values at once
but that does not bring any amusement, only confusion and complexity.

Unify all those values to a single one we should really care, that is
our identity mapping size.
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>
Reviewed-by: NAlexander Gordeev <agordeev@linux.ibm.com>
Acked-by: NHeiko Carstens <hca@linux.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

73045a08

09 11月, 2020 3 次提交

s390/kasan: remove obvious parameter with the only possible value · 54b52981

由 Vasily Gorbik 提交于 10月 05, 2020

Kasan early code is only working on init_mm, remove unneeded pgd
parameter from kasan_copy_shadow and rename it to
kasan_copy_shadow_mapping.
Reviewed-by: NAlexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

54b52981

s390/kasan: avoid confusing naming · 92bca2fe

由 Vasily Gorbik 提交于 10月 05, 2020

Kasan has nothing to do with vmemmap, strip vmemmap from function names
to avoid confusing people.
Reviewed-by: NAlexander Egorenkov <egorenar@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

92bca2fe

s390/kasan: remove 3-level paging support · a3453d92

由 Vasily Gorbik 提交于 10月 15, 2020

Compiling the kernel with Kasan disables automatic 3-level vs 4-level
kernel space paging selection, because the shadow memory offset has
to be known at compile time and there is no such offset which would be
acceptable for both 3 and 4-level paging. Instead S390_4_LEVEL_PAGING
option was introduced which allowed to pick how many paging levels to
use under Kasan.

With the introduction of protected virtualization, kernel memory layout
may be affected due to ultravisor secure storage limit. This adds
additional complexity into how memory layout would look like in
combination with Kasan predefined shadow memory offsets. To simplify
this make Kasan 4-level paging default and remove Kasan 3-level paging
support.
Suggested-by: NHeiko Carstens <hca@linux.ibm.com>
Reviewed-by: NAlexander Gordeev <agordeev@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

a3453d92

26 10月, 2020 1 次提交

treewide: Convert macro and uses of __section(foo) to __section("foo") · 33def849

由 Joe Perches 提交于 10月 21, 2020

Use a more generic form for __section that requires quotes to avoid
complications with clang and gcc differences.

Remove the quote operator # from compiler_attributes.h __section macro.

Convert all unquoted __section(foo) uses to quoted __section("foo").
Also convert __attribute__((section("foo"))) uses to __section("foo")
even if the __attribute__ has multiple list entry forms.

Conversion done using the script at:

https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.plSigned-off-by: NJoe Perches <joe@perches.com>
Reviewed-by: NNick Desaulniers <ndesaulniers@gooogle.com>
Reviewed-by: NMiguel Ojeda <ojeda@kernel.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

33def849

21 10月, 2020 1 次提交

s390: virtio: PV needs VIRTIO I/O device protection · 4ce1cf7b

由 Pierre Morel 提交于 9月 10, 2020

If protected virtualization is active on s390, VIRTIO has only retricted
access to the guest memory.
Define CONFIG_ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS and export
arch_has_restricted_virtio_memory_access to advertize VIRTIO if that's
the case.
Signed-off-by: NPierre Morel <pmorel@linux.ibm.com>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Reviewed-by: NHalil Pasic <pasic@linux.ibm.com>
Link: https://lore.kernel.org/r/1599728030-17085-3-git-send-email-pmorel@linux.ibm.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>

4ce1cf7b

14 10月, 2020 2 次提交

arch, drivers: replace for_each_membock() with for_each_mem_range() · b10d6bca

由 Mike Rapoport 提交于 10月 13, 2020

There are several occurrences of the following pattern:

	for_each_memblock(memory, reg) {
		start = __pfn_to_phys(memblock_region_memory_base_pfn(reg);
		end = __pfn_to_phys(memblock_region_memory_end_pfn(reg));

		/* do something with start and end */
	}

Using for_each_mem_range() iterator is more appropriate in such cases and
allows simpler and cleaner code.

[akpm@linux-foundation.org: fix arch/arm/mm/pmsa-v7.c build]
[rppt@linux.ibm.com: mips: fix cavium-octeon build caused by memblock refactoring]
  Link: http://lkml.kernel.org/r/20200827124549.GD167163@linux.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: https://lkml.kernel.org/r/20200818151634.14343-13-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b10d6bca

arch, mm: replace for_each_memblock() with for_each_mem_pfn_range() · c9118e6c

由 Mike Rapoport 提交于 10月 13, 2020

There are several occurrences of the following pattern:

	for_each_memblock(memory, reg) {
		start_pfn = memblock_region_memory_base_pfn(reg);
		end_pfn = memblock_region_memory_end_pfn(reg);

		/* do something with start_pfn and end_pfn */
	}

Rather than iterate over all memblock.memory regions and each time query
for their start and end PFNs, use for_each_mem_pfn_range() iterator to get
simpler and clearer code.
Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NBaoquan He <bhe@redhat.com>
Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>	[.clang-format]
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: https://lkml.kernel.org/r/20200818151634.14343-12-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c9118e6c

16 9月, 2020 3 次提交

s390/kasan: support protvirt with 4-level paging · c360c9a2

由 Vasily Gorbik 提交于 9月 11, 2020

Currently the kernel crashes in Kasan instrumentation code if
CONFIG_KASAN_S390_4_LEVEL_PAGING is used on protected virtualization
capable machine where the ultravisor imposes addressing limitations on
the host and those limitations are lower then KASAN_SHADOW_OFFSET.

The problem is that Kasan has to know in advance where vmalloc/modules
areas would be. With protected virtualization enabled vmalloc/modules
areas are moved down to the ultravisor secure storage limit while kasan
still expects them at the very end of 4-level paging address space.

To fix that make Kasan recognize when protected virtualization is enabled
and predefine vmalloc/modules areas position which are compliant with
ultravisor secure storage limit.

Kasan shadow itself stays in place and might reside above that ultravisor
secure storage limit.

One slight difference compaired to a kernel without Kasan enabled is that
vmalloc/modules areas position is not reverted to default if ultravisor
initialization fails. It would still be below the ultravisor secure
storage limit.

Kernel layout with kasan, 4-level paging and protected virtualization
enabled (ultravisor secure storage limit is at 0x0000800000000000):
---[ vmemmap Area Start ]---
0x0000400000000000-0x0000400080000000
---[ vmemmap Area End ]---
---[ vmalloc Area Start ]---
0x00007fe000000000-0x00007fff80000000
---[ vmalloc Area End ]---
---[ Modules Area Start ]---
0x00007fff80000000-0x0000800000000000
---[ Modules Area End ]---
---[ Kasan Shadow Start ]---
0x0018000000000000-0x001c000000000000
---[ Kasan Shadow End ]---
0x001c000000000000-0x0020000000000000         1P PGD I

Kernel layout with kasan, 4-level paging and protected virtualization
disabled/unsupported:
---[ vmemmap Area Start ]---
0x0000400000000000-0x0000400060000000
---[ vmemmap Area End ]---
---[ Kasan Shadow Start ]---
0x0018000000000000-0x001c000000000000
---[ Kasan Shadow End ]---
---[ vmalloc Area Start ]---
0x001fffe000000000-0x001fffff80000000
---[ vmalloc Area End ]---
---[ Modules Area Start ]---
0x001fffff80000000-0x0020000000000000
---[ Modules Area End ]---
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

c360c9a2

s390/mm,ptdump: sort markers · ee4b2ce6

由 Vasily Gorbik 提交于 9月 10, 2020

Kasan configuration options and size of physical memory present could
affect kernel memory layout. In particular vmemmap, vmalloc and modules
might come before kasan shadow or after it. To make ptdump correctly
output markers in the right order markers have to be sorted.

To preserve the original order of markers with the same start address
avoid using sort() from lib/sort.c (which is not stable sorting algorithm)
and sort markers in place.
Reviewed-by: NHeiko Carstens <hca@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

ee4b2ce6

s390/mm,ptdump: add proper ifdefs · 48111b48

由 Heiko Carstens 提交于 9月 15, 2020

Use ifdefs instead of IS_ENABLED() to avoid compile error
for !PTDUMP_DEBUGFS:

arch/s390/mm/dump_pagetables.c: In function ‘pt_dump_init’:
arch/s390/mm/dump_pagetables.c:248:64: error: ‘ptdump_fops’ undeclared (first use in this function); did you mean ‘pidfd_fops’?
debugfs_create_file("kernel_page_tables", 0400, NULL, NULL, &ptdump_fops);
Reported-by: NJulian Wiedmann <jwi@linux.ibm.com>
Fixes: 08c8e685 ("s390: add ARCH_HAS_DEBUG_WX support")
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

48111b48

14 9月, 2020 10 次提交

s390/uv: add destroy page call · 1a80b54d

由 Janosch Frank 提交于 9月 07, 2020

We don't need to export pages if we destroy the VM configuration
afterwards anyway. Instead we can destroy the page which will zero it
and then make it accessible to the host.

Destroying is about twice as fast as the export.
Signed-off-by: NJanosch Frank <frankja@linux.ibm.com>
Reviewed-by: NClaudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: NThomas Huth <thuth@redhat.com>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Link: https://lore.kernel.org/kvm/20200907124700.10374-2-frankja@linux.ibm.com/Signed-off-by: NJanosch Frank <frankja@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

1a80b54d

s390/mm,ptdump: add couple of additional markers · e670e64a

由 Vasily Gorbik 提交于 9月 11, 2020

Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>
[hca@linux.ibm.com: add more markers, rename some markers]
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

e670e64a

s390/kasan: make shadow memory noexec · d411e3c6

由 Vasily Gorbik 提交于 9月 10, 2020

ARCH_HAS_DEBUG_WX feature support brought attention to the fact that
currently initial kasan shadow memory mapped without noexec flag. So fix that.

Temporary initial identity mapping is still created without noexec, but
it is replaced by properly set up paging later.
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

d411e3c6

s390: add ARCH_HAS_DEBUG_WX support · 08c8e685

由 Heiko Carstens 提交于 9月 09, 2020

Checks the whole kernel address space for W+X mappings. Note that
currently the first lowcore page unfortunately has to be mapped
W+X. Therefore this not reported as an insecure mapping.

For the very same reason the wording is also different to other
architectures if the test passes:

On s390 it is "no unexpected W+X pages found" instead of
"no W+X pages found".
Tested-by: NVasily Gorbik <gor@linux.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

08c8e685

s390/mm,ptdump: make page table dumping seq_file optional · 6bf9a639

由 Heiko Carstens 提交于 9月 09, 2020

s390 version of ae5d1cf3 ("arm64: dump: Make the page table
dumping seq_file optional").
Tested-by: NVasily Gorbik <gor@linux.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

6bf9a639

s390/mm,ptdump: hold cpa mutex while walking for kernel page table dump · da1694ad

由 Heiko Carstens 提交于 9月 07, 2020

This is currently only preventing that outdated information is
provided to user space. A concurrent split of huge/large pages does
modify the kernel page tables, however either the huge/large mapping
is reported or the split area is being walked.

This "fixes" also only a potential future bug, since split pages could
also be merged again if page permissions are the same for larger
memory areas.
Reviewed-by: NVasily Gorbik <gor@linux.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

da1694ad

s390/mm,ptdump: hold memory hotplug lock while walking for kernel page table dump · 36c2733c

由 Heiko Carstens 提交于 9月 07, 2020

This is the s390 variant of commit bf2b59f6 ("arm64/mm: Hold
memory hotplug lock while walking for kernel page table dump").

Right now this doesn't fix any real bug, however as soon as kvm
patches get merged which make use of memory remove we might end up
dereferencing/accessing freed page tables.

Therefore fix this potential bug already now.
Reviewed-by: NVasily Gorbik <gor@linux.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

36c2733c

s390/mm,ptdump: convert to generic page table dumper · 9d719d39

由 Heiko Carstens 提交于 9月 04, 2020

Make use of generic ptdump infrastructure.
Reviewed-by: NVasily Gorbik <gor@linux.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

9d719d39

s390/pci: Implement ioremap_wc/prot() with MIO · b02002cc

由 Niklas Schnelle 提交于 7月 13, 2020

With our current support for the new MIO PCI instructions, write
combining/write back MMIO memory can be obtained via the pci_iomap_wc()
and pci_iomap_wc_range() functions.
This is achieved by using the write back address for a specific bar
as provided in clp_store_query_pci_fn()

These functions are however not widely used and instead drivers often
rely on ioremap_wc() and ioremap_prot(), which on other platforms enable
write combining using a PTE flag set through the pgrprot value.

While we do not have a write combining flag in the low order flag bits
of the PTE like x86_64 does, with MIO support, there is a write back bit
in the physical address (bit 1 on z15) and thus also the PTE.
Which bit is used to toggle write back and whether it is available at
all, is however not fixed in the architecture. Instead we get this
information from the CLP Store Logical Processor Characteristics for PCI
command. When the write back bit is not provided we fall back to the
existing behavior.
Signed-off-by: NNiklas Schnelle <schnelle@linux.ibm.com>
Reviewed-by: NPierre Morel <pmorel@linux.ibm.com>
Reviewed-by: NGerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

b02002cc

s390: add 3f program exception handler · cd4d3d5f

由 Janosch Frank 提交于 9月 08, 2020

Program exception 3f (secure storage violation) can only be detected
when the CPU is running in SIE with a format 4 state description,
e.g. running a protected guest. Because of this and because user
space partly controls the guest memory mapping and can trigger this
exception, we want to send a SIGSEGV to the process running the guest
and not panic the kernel.
Signed-off-by: NJanosch Frank <frankja@linux.ibm.com>
Cc: <stable@vger.kernel.org> # 5.7
Fixes: 084ea4d6 ("s390/mm: add (non)secure page access exceptions handlers")
Reviewed-by: NClaudio Imbrenda <imbrenda@linux.ibm.com>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

cd4d3d5f

27 8月, 2020 1 次提交

s390/vmem: fix vmem_add_range for 4-level paging · bffc2f7a

由 Vasily Gorbik 提交于 8月 21, 2020

The kernel currently crashes if 4-level paging is used. Add missing
p4d_populate for just allocated pud entry.

Fixes: 3e0d3e40 ("s390/vmem: consolidate vmem_add_range() and vmem_remove_range()")
Reviewed-by: NGerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: NVasily Gorbik <gor@linux.ibm.com>

bffc2f7a

13 8月, 2020 3 次提交

mm/gup: remove task_struct pointer for all gup code · 64019a2e

由 Peter Xu 提交于 8月 11, 2020

After the cleanup of page fault accounting, gup does not need to pass
task_struct around any more. Remove that parameter in the whole gup
stack.
Signed-off-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com>
Link: http://lkml.kernel.org/r/20200707225021.200906-26-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

64019a2e

mm/s390: use general page fault accounting · 35e45f3e

由 Peter Xu 提交于 8月 11, 2020

Use the general page fault accounting by passing regs into
handle_mm_fault().  It naturally solve the issue of multiple page fault
accounting when page fault retry happened.
Signed-off-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
Acked-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Link: http://lkml.kernel.org/r/20200707225021.200906-19-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

35e45f3e

mm: do page fault accounting in handle_mm_fault · bce617ed

由 Peter Xu 提交于 8月 11, 2020

Patch series "mm: Page fault accounting cleanups", v5.

This is v5 of the pf accounting cleanup series.  It originates from Gerald
Schaefer's report on an issue a week ago regarding to incorrect page fault
accountings for retried page fault after commit 4064b982 ("mm: allow
VM_FAULT_RETRY for multiple times"):

  https://lore.kernel.org/lkml/20200610174811.44b94525@thinkpad/

What this series did:

  - Correct page fault accounting: we do accounting for a page fault
    (no matter whether it's from #PF handling, or gup, or anything else)
    only with the one that completed the fault.  For example, page fault
    retries should not be counted in page fault counters.  Same to the
    perf events.

  - Unify definition of PERF_COUNT_SW_PAGE_FAULTS: currently this perf
    event is used in an adhoc way across different archs.

    Case (1): for many archs it's done at the entry of a page fault
    handler, so that it will also cover e.g.  errornous faults.

    Case (2): for some other archs, it is only accounted when the page
    fault is resolved successfully.

    Case (3): there're still quite some archs that have not enabled
    this perf event.

    Since this series will touch merely all the archs, we unify this
    perf event to always follow case (1), which is the one that makes most
    sense.  And since we moved the accounting into handle_mm_fault, the
    other two MAJ/MIN perf events are well taken care of naturally.

  - Unify definition of "major faults": the definition of "major
    fault" is slightly changed when used in accounting (not
    VM_FAULT_MAJOR).  More information in patch 1.

  - Always account the page fault onto the one that triggered the page
    fault.  This does not matter much for #PF handlings, but mostly for
    gup.  More information on this in patch 25.

Patchset layout:

Patch 1:     Introduced the accounting in handle_mm_fault(), not enabled.
Patch 2-23:  Enable the new accounting for arch #PF handlers one by one.
Patch 24:    Enable the new accounting for the rest outliers (gup, iommu, etc.)
Patch 25:    Cleanup GUP task_struct pointer since it's not needed any more

This patch (of 25):

This is a preparation patch to move page fault accountings into the
general code in handle_mm_fault().  This includes both the per task
flt_maj/flt_min counters, and the major/minor page fault perf events.  To
do this, the pt_regs pointer is passed into handle_mm_fault().

PERF_COUNT_SW_PAGE_FAULTS should still be kept in per-arch page fault
handlers.

So far, all the pt_regs pointer that passed into handle_mm_fault() is
NULL, which means this patch should have no intented functional change.
Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Cain <bcain@codeaurora.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Greentime Hu <green.hu@gmail.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Nick Hu <nickhu@andestech.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vincent Chen <deanbo422@gmail.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/20200707225021.200906-1-peterx@redhat.com
Link: http://lkml.kernel.org/r/20200707225021.200906-2-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

bce617ed

12 8月, 2020 1 次提交

s390/gmap: improve THP splitting · ba925fa3

由 Gerald Schaefer 提交于 7月 29, 2020

During s390_enable_sie(), we need to take care of splitting all qemu user
process THP mappings. This is currently done with follow_page(FOLL_SPLIT),
by simply iterating over all vma ranges, with PAGE_SIZE increment.

This logic is sub-optimal and can result in a lot of unnecessary overhead,
especially when using qemu and ASAN with large shadow map. Ilya reported
significant system slow-down with one CPU busy for a long time and overall
unresponsiveness.

Fix this by using walk_page_vma() and directly calling split_huge_pmd()
only for present pmds, which greatly reduces overhead.

Cc: <stable@vger.kernel.org> # v5.4+
Reported-by: NIlya Leoshkevich <iii@linux.ibm.com>
Tested-by: NIlya Leoshkevich <iii@linux.ibm.com>
Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: NGerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

ba925fa3

08 8月, 2020 2 次提交

mm/sparse: cleanup the code surrounding memory_present() · c89ab04f

由 Mike Rapoport 提交于 8月 06, 2020

After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent
functions that call memory_present() for each region in memblock.memory:
sparse_memory_present_with_active_regions() and membocks_present().

Moreover, all architectures have a call to either of these functions
preceding the call to sparse_init() and in the most cases they are called
one after the other.

Mark the regions from memblock.memory as present during sparce_init() by
making sparse_init() call memblocks_present(), make memblocks_present()
and memory_present() functions static and remove redundant
sparse_memory_present_with_active_regions() function.

Also remove no longer required HAVE_MEMORY_PRESENT configuration option.
Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c89ab04f

mm: remove unneeded includes of <asm/pgalloc.h> · ca15ca40

由 Mike Rapoport 提交于 8月 06, 2020

Patch series "mm: cleanup usage of <asm/pgalloc.h>"

Most architectures have very similar versions of pXd_alloc_one() and
pXd_free_one() for intermediate levels of page table.  These patches add
generic versions of these functions in <asm-generic/pgalloc.h> and enable
use of the generic functions where appropriate.

In addition, functions declared and defined in <asm/pgalloc.h> headers are
used mostly by core mm and early mm initialization in arch and there is no
actual reason to have the <asm/pgalloc.h> included all over the place.
The first patch in this series removes unneeded includes of
<asm/pgalloc.h>

In the end it didn't work out as neatly as I hoped and moving
pXd_alloc_track() definitions to <asm-generic/pgalloc.h> would require
unnecessary changes to arches that have custom page table allocations, so
I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
to mm/.

This patch (of 8):

In most cases <asm/pgalloc.h> header is required only for allocations of
page table memory.  Most of the .c files that include that header do not
use symbols declared in <asm/pgalloc.h> and do not require that header.

As for the other header files that used to include <asm/pgalloc.h>, it is
possible to move that include into the .c file that actually uses symbols
from <asm/pgalloc.h> and drop the include from the header file.

The process was somewhat automated using

	sed -i -E '/[<"]asm\/pgalloc\.h/d' \
                $(grep -L -w -f /tmp/xx \
                        $(git grep -E -l '[<"]asm/pgalloc\.h'))

where /tmp/xx contains all the symbols defined in
arch/*/include/asm/pgalloc.h.

[rppt@linux.ibm.com: fix powerpc warning]
Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NPekka Enberg <penberg@kernel.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>	[m68k]
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ca15ca40

27 7月, 2020 9 次提交

H
s390/vmemmap: coding style updates · 9a996c67
由 Heiko Carstens 提交于 7月 23, 2020
```
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>
```
9a996c67

s390/vmemmap: avoid memset(PAGE_UNUSED) when adding consecutive sections · 2c114df0

由 David Hildenbrand 提交于 7月 22, 2020

Let's avoid memset(PAGE_UNUSED) when adding consecutive sections,
whereby the vmemmap of a single section does not span full PMDs.

Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: NDavid Hildenbrand <david@redhat.com>
Message-Id: <20200722094558.9828-10-david@redhat.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

2c114df0

s390/vmemmap: remember unused sub-pmd ranges · cd5781d6

由 David Hildenbrand 提交于 7月 22, 2020

With a memmap size of 56 bytes or 72 bytes per page, the memmap for a
256 MB section won't span full PMDs. As we populate single sections and
depopulate single sections, the depopulation step would not be able to
free all vmemmap pmds anymore.

Do it similarly to x86, marking the unused memmap ranges in a special way
(pad it with 0xFD).

This allows us to add/remove sections, cleaning up all allocated
vmemmap pages even if the memmap size is not multiple of 16 bytes per page.

A 56 byte memmap can, for example, be created with !CONFIG_MEMCG and
!CONFIG_SLUB.

Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: NDavid Hildenbrand <david@redhat.com>
Message-Id: <20200722094558.9828-9-david@redhat.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

cd5781d6

s390/vmemmap: fallback to PTEs if mapping large PMD fails · f2057b42

由 David Hildenbrand 提交于 7月 22, 2020

Let's fallback to single pages if short on huge pages. No need to stop
memory hotplug.

Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: NDavid Hildenbrand <david@redhat.com>
Message-Id: <20200722094558.9828-8-david@redhat.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

f2057b42

s390/vmem: cleanup empty page tables · b9ff8100

由 David Hildenbrand 提交于 7月 22, 2020

Let's cleanup empty page tables. Consider only page tables that fully
fall into the idendity mapping and the vmemmap range.

As there are no valid accesses to vmem/vmemmap within non-populated ranges,
the single tlb flush at the end should be sufficient.

Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: NDavid Hildenbrand <david@redhat.com>
Message-Id: <20200722094558.9828-7-david@redhat.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

b9ff8100

s390/vmemmap: take the vmem_mutex when populating/freeing · aa18e0e6

由 David Hildenbrand 提交于 7月 22, 2020

Let's synchronize all accesses to the 1:1 and vmemmap mappings. This will
be especially relevant when wanting to cleanup empty page tables that could
be shared by both. Avoid races when removing tables that might be just
about to get reused.

Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: NDavid Hildenbrand <david@redhat.com>
Message-Id: <20200722094558.9828-6-david@redhat.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

aa18e0e6

s390/vmemmap: cleanup when vmemmap_populate() fails · c00f05a9

由 David Hildenbrand 提交于 7月 22, 2020

Cleanup what we partially added in case vmemmap_populate() fails. For
vmem, this is already handled by vmem_add_mapping().

Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: NDavid Hildenbrand <david@redhat.com>
Message-Id: <20200722094558.9828-5-david@redhat.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

c00f05a9

s390/vmemmap: extend modify_pagetable() to handle vmemmap · 9ec8fa8d

由 David Hildenbrand 提交于 7月 22, 2020

Extend our shiny new modify_pagetable() to handle !direct (vmemmap)
mappings. Convert vmemmap_populate() and implement vmemmap_free().

Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: NDavid Hildenbrand <david@redhat.com>
Message-Id: <20200722094558.9828-4-david@redhat.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

9ec8fa8d

s390/vmem: consolidate vmem_add_range() and vmem_remove_range() · 3e0d3e40

由 David Hildenbrand 提交于 7月 22, 2020

We want to have only a single pagetable walker and reuse the same
functionality for vmemmap handling. Let's start by consolidating
vmem_add_range() and vmem_remove_range(), converting it into a
recursive implementation.

A recursive implementation makes it easier to expand individual cases
without harming readability. In addition, we minimize traversing the
whole hierarchy over and over again.

One change is that we don't unmap large PMDs/PUDs when not completely
covered by the request, something that should never happen with direct
mappings, unless one would be removing in other granularity than added,
which would be broken already.

Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: NDavid Hildenbrand <david@redhat.com>
Message-Id: <20200722094558.9828-3-david@redhat.com>
Signed-off-by: NHeiko Carstens <hca@linux.ibm.com>

3e0d3e40

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功