1. 04 Feb 2020 (5 commits)
    • mm: ptdump: reduce level numbers by 1 in note_page() · f8f0d0b6
      Steven Price committed
      Rather than having to increment the 'depth' number by 1 in ptdump_hole(),
      let's change the meaning of 'level' in note_page() since that makes the
      code simpler.
      
      Note that for x86, the level numbers were previously increased by 1 in
      commit 45dcd209 ("x86/mm/dump_pagetables: Fix printout of p4d level")
      and the comment "Bit 7 has a different meaning" was not updated, so this
      change also makes the code match the comment again.
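      
      As an illustration of the renumbering (a standalone userspace sketch, not
      kernel code; it assumes the generic walker's 0 = PGD .. 4 = PTE depth
      convention):
      
          #include <stdio.h>
          
          static const char *level_name[] = { "PGD", "P4D", "PUD", "PMD", "PTE" };
          
          /* note_page() now takes the same 0-based number the walker uses... */
          static void note_page(int level)
          {
                  printf("%s entry\n", level_name[level]);
          }
          
          /* ...so ptdump_hole() can forward 'depth' without a '+ 1' fixup. */
          static void ptdump_hole(int depth)
          {
                  note_page(depth);
          }
          
          int main(void)
          {
                  ptdump_hole(0);         /* PGD-level hole */
                  ptdump_hole(4);         /* PTE-level hole */
                  return 0;
          }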
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-24-steven.price@arm.com
      Signed-off-by: Steven Price <steven.price@arm.com>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8f0d0b6
    • x86: mm: convert dump_pagetables to use walk_page_range · 2ae27137
      Steven Price committed
      Make use of the new functionality in walk_page_range to remove the arch
      page walking code and use the generic code to walk the page tables.
      
      The effective permissions are passed down the chain using new fields in
      struct pg_state.
      
      The KASAN optimisation is implemented by setting action=CONTINUE in the
      callbacks to skip an entire tree of entries.
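      
      A sketch of that callback pattern (illustrative only; range_is_kasan_shadow()
      is a made-up helper and the real arch/x86/mm/dump_pagetables.c callbacks
      differ in detail):
      
          static int ptdump_pmd_entry(pmd_t *pmd, unsigned long addr,
                                      unsigned long next, struct mm_walk *walk)
          {
                  /* Whole range backed by the shared KASAN zero shadow?
                   * Then tell the generic walker not to descend into the PTEs. */
                  if (range_is_kasan_shadow(addr, next))
                          walk->action = ACTION_CONTINUE;
                  return 0;
          }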
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-21-steven.price@arm.com
      Signed-off-by: Steven Price <steven.price@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ae27137
    • x86: mm: convert ptdump_walk_pgd_level_debugfs() to take an mm_struct · c5cfae12
      Steven Price committed
      To enable x86 to use the generic walk_page_range() function, the callers
      of ptdump_walk_pgd_level_debugfs() need to pass in the mm_struct.
      
      This means that ptdump_walk_pgd_level_core() is now always passed a valid
      pgd, so drop the support for pgd==NULL.
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-19-steven.price@arm.com
      Signed-off-by: Steven Price <steven.price@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5cfae12
    • x86: mm+efi: convert ptdump_walk_pgd_level() to take a mm_struct · e455248d
      Steven Price committed
      To enable x86 to use the generic walk_page_range() function, the callers
      of ptdump_walk_pgd_level() need to pass an mm_struct rather than the raw
      pgd_t pointer.  Luckily since commit 7e904a91 ("efi: Use efi_mm in x86
      as well as ARM") we now have an mm_struct for EFI on x86.
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-18-steven.price@arm.com
      Signed-off-by: Steven Price <steven.price@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e455248d
    • x86: mm: point to struct seq_file from struct pg_state · 74d2aaa1
      Steven Price committed
      mm/dump_pagetables.c passes both struct seq_file and struct pg_state down
      the chain of walk_*_level() functions to be passed to note_page().
      Instead place the struct seq_file in struct pg_state and access it from
      struct pg_state (which is private to this file) in note_page().
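      
      Sketch of the idea (field set abridged and names illustrative, not the exact
      x86 definition):
      
          struct pg_state {
                  int level;
                  pgprotval_t current_prot;
                  unsigned long start_address;
                  struct seq_file *seq;   /* note_page() prints here rather than
                                             taking the seq_file as a parameter */
                  /* ... */
          };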
      
      Link: http://lkml.kernel.org/r/20191218162402.45610-17-steven.price@arm.com
      Signed-off-by: Steven Price <steven.price@arm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Alexandre Ghiti <alex@ghiti.fr>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Liang, Kan" <kan.liang@linux.intel.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zong Li <zong.li@sifive.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74d2aaa1
  2. 28 Jan 2020 (1 commit)
  3. 25 Jan 2020 (1 commit)
  4. 24 Jan 2020 (2 commits)
    • x86/mpx: remove MPX from arch/x86 · 45fc24e8
      Dave Hansen committed
      From: Dave Hansen <dave.hansen@linux.intel.com>
      
      MPX is being removed from the kernel due to a lack of support
      in the toolchain going forward (gcc).
      
      This removes all the remaining (dead at this point) MPX handling
      code remaining in the tree.  The only remaining code is the XSAVE
      support for MPX state which is currently needed for KVM to handle
      VMs which might use MPX.
      
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: x86@kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      45fc24e8
    • x86/mpx: remove build infrastructure · 4ba68d00
      Dave Hansen committed
      From: Dave Hansen <dave.hansen@linux.intel.com>
      
      MPX is being removed from the kernel due to a lack of support
      in the toolchain going forward (gcc).
      
      Remove the Kconfig option and the Makefile line.  This makes
      arch/x86/mm/mpx.c and anything under an #ifdef for
      X86_INTEL_MPX dead code.
      
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: x86@kernel.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      4ba68d00
  5. 20 Jan 2020 (1 commit)
  6. 07 Jan 2020 (1 commit)
    • x86/context-tracking: Remove exception_enter/exit() from do_page_fault() · ee6352b2
      Frederic Weisbecker committed
      do_page_fault(), like other exceptions, is already covered by
      user_enter() and user_exit() when the exception triggers in userspace.
      
      As explained in:
      
        8c84014f ("x86/entry: Remove exception_enter() from most trap handlers")
      
      exception_enter/exit() only remained to handle possible page fault from
      kernel mode while context tracking is in CONTEXT_USER mode, ie: on
      kernel entry before we manage to call user_exit(). The only known
      offender was do_fast_syscall_32() fetching EBP register from where
      vDSO stashed it.
      
      Meanwhile this got fixed in:
      
        9999c8c0 ("x86/entry: Call enter_from_user_mode() with IRQs off")
      
      that moved enter_from_user_mode() before the call to get_user().
      
      So we can safely remove it now.
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Link: https://lkml.kernel.org/r/20191227163612.10039-2-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ee6352b2
  7. 06 Jan 2020 (1 commit)
  8. 05 Jan 2020 (1 commit)
    • mm/memory_hotplug: shrink zones when offlining memory · feee6b29
      David Hildenbrand committed
      We currently try to shrink a single zone when removing memory.  We use
      the zone of the first page of the memory we are removing.  If that
      memmap was never initialized (e.g., memory was never onlined), we will
      read garbage and can trigger kernel BUGs (due to a stale pointer):
      
          BUG: unable to handle page fault for address: 000000000000353d
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          PGD 0 P4D 0
          Oops: 0002 [#1] SMP PTI
          CPU: 1 PID: 7 Comm: kworker/u8:0 Not tainted 5.3.0-rc5-next-20190820+ #317
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4
          Workqueue: kacpi_hotplug acpi_hotplug_work_fn
          RIP: 0010:clear_zone_contiguous+0x5/0x10
          Code: 48 89 c6 48 89 c3 e8 2a fe ff ff 48 85 c0 75 cf 5b 5d c3 c6 85 fd 05 00 00 01 5b 5d c3 0f 1f 840
          RSP: 0018:ffffad2400043c98 EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000200000000 RCX: 0000000000000000
          RDX: 0000000000200000 RSI: 0000000000140000 RDI: 0000000000002f40
          RBP: 0000000140000000 R08: 0000000000000000 R09: 0000000000000001
          R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000
          R13: 0000000000140000 R14: 0000000000002f40 R15: ffff9e3e7aff3680
          FS:  0000000000000000(0000) GS:ffff9e3e7bb00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 000000000000353d CR3: 0000000058610000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           __remove_pages+0x4b/0x640
           arch_remove_memory+0x63/0x8d
           try_remove_memory+0xdb/0x130
           __remove_memory+0xa/0x11
           acpi_memory_device_remove+0x70/0x100
           acpi_bus_trim+0x55/0x90
           acpi_device_hotplug+0x227/0x3a0
           acpi_hotplug_work_fn+0x1a/0x30
           process_one_work+0x221/0x550
           worker_thread+0x50/0x3b0
           kthread+0x105/0x140
           ret_from_fork+0x3a/0x50
          Modules linked in:
          CR2: 000000000000353d
      
      Instead, shrink the zones when offlining memory or when onlining failed.
      Introduce and use remove_pfn_range_from_zone() for that.  We now
      properly shrink the zones, even if we have DIMMs whereby
      
       - Some memory blocks fall into no zone (never onlined)
      
       - Some memory blocks fall into multiple zones (offlined+re-onlined)
      
       - Multiple memory blocks that fall into different zones
      
      Drop the zone parameter (with a potential dubious value) from
      __remove_pages() and __remove_section().
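      
      The new helper's shape, per the description above (a sketch; see the patch
      for the exact definition and call sites):
      
          /* Called while offlining memory (or when onlining fails), i.e. at a
           * point where the pfn range's zone association is actually known -
           * not at removal time, where the memmap of the first page may never
           * have been initialized. */
          void remove_pfn_range_from_zone(struct zone *zone,
                                          unsigned long start_pfn,
                                          unsigned long nr_pages);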
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-6-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      feee6b29
  9. 31 Dec 2019 (1 commit)
    • x86/kasan: Print original address on #GP · 2f004eea
      Jann Horn committed
      Make #GP exceptions caused by out-of-bounds KASAN shadow accesses easier
      to understand by computing the address of the original access and
      printing that. More details are in the comments in the patch.
      
      This turns an error like this:
      
        kasan: CONFIG_KASAN_INLINE enabled
        kasan: GPF could be caused by NULL-ptr deref or user memory access
        general protection fault, probably for non-canonical address
            0xe017577ddf75b7dd: 0000 [#1] PREEMPT SMP KASAN PTI
      
      into this:
      
        general protection fault, probably for non-canonical address
            0xe017577ddf75b7dd: 0000 [#1] PREEMPT SMP KASAN PTI
        KASAN: maybe wild-memory-access in range
            [0x00badbeefbadbee8-0x00badbeefbadbeef]
      
      The hook is placed in architecture-independent code, but is currently
      only wired up to the X86 exception handler because I'm not sufficiently
      familiar with the address space layout and exception handling mechanisms
      on other architectures.
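      
      For illustration, the shadow-to-original address arithmetic behind the new
      output line, as a standalone userspace sketch (the constants are the x86-64
      KASAN defaults; the in-kernel hook lives in generic KASAN code and is wired
      up from the x86 #GP handler):
      
          #include <stdio.h>
          
          #define KASAN_SHADOW_OFFSET      0xdffffc0000000000UL
          #define KASAN_SHADOW_SCALE_SHIFT 3
          
          int main(void)
          {
                  unsigned long shadow = 0xe017577ddf75b7ddUL;  /* faulting address */
                  unsigned long first  = (shadow - KASAN_SHADOW_OFFSET)
                                         << KASAN_SHADOW_SCALE_SHIFT;
          
                  /* one shadow byte covers 8 bytes of the original mapping */
                  printf("KASAN: maybe wild-memory-access in range [0x%016lx-0x%016lx]\n",
                         first, first + (1UL << KASAN_SHADOW_SCALE_SHIFT) - 1);
                  return 0;
          }
      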
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: kasan-dev@googlegroups.com
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20191218231150.12139-4-jannh@google.com
      2f004eea
  10. 10 Dec 2019 (11 commits)
    • mm, x86/mm: Untangle address space layout definitions from basic pgtable type definitions · 186525bd
      Ingo Molnar committed
      - Untangle the somewhat incestuous way of how VMALLOC_START is used all across the
        kernel, but is, on x86, defined deep inside one of the lowest level page table headers.
        It doesn't help that vmalloc.h only includes a single asm header:
      
           #include <asm/page.h>           /* pgprot_t */
      
        So there was no existing cross-arch way to decouple address layout
        definitions from page.h details. I used this:
      
         #ifndef VMALLOC_START
         # include <asm/vmalloc.h>
         #endif
      
        This way every architecture that wants to simplify page.h can do so.
      
      - Also on x86 we had a couple of LDT related inline functions that used
        the late-stage address space layout positions - but these could be
        uninlined without real trouble - the end result is cleaner this way as
        well.
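      
      A sketch of what the opt-in looks like on the architecture side (path,
      guard name and addresses are made up for illustration):
      
          /* arch/foo/include/asm/vmalloc.h */
          #ifndef _ASM_FOO_VMALLOC_H
          #define _ASM_FOO_VMALLOC_H
          
          /* Address-space layout constants can live here, so <asm/page.h> and
           * the low-level pgtable headers no longer have to carry them. */
          #define VMALLOC_START   0xffffc90000000000UL
          #define VMALLOC_END     0xffffe8ffffffffffUL
          
          #endif
      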
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      186525bd
    • K
    • x86/mm/pat: Rename <asm/pat.h> => <asm/memtype.h> · eb243d1d
      Ingo Molnar committed
      pat.h is a file whose main purpose is to provide the memtype_*() APIs.
      
      PAT is the low level hardware mechanism - but the high level abstraction
      is memtype.
      
      So name the header <memtype.h> as well - this goes hand in hand with memtype.c
      and memtype_interval.c.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      eb243d1d
    • x86/mm/pat: Standardize on memtype_*() prefix for APIs · ecdd6ee7
      Ingo Molnar committed
      Half of our memtype APIs are memtype_ prefixed, the other half are _memtype suffixed:
      
      	reserve_memtype()
      	free_memtype()
      	kernel_map_sync_memtype()
      	io_reserve_memtype()
      	io_free_memtype()
      
      	memtype_check_insert()
      	memtype_erase()
      	memtype_lookup()
      	memtype_copy_nth_element()
      
      Use prefixes consistently, like most other modern kernel APIs:
      
      	reserve_memtype()		=> memtype_reserve()
      	free_memtype()			=> memtype_free()
      	kernel_map_sync_memtype()	=> memtype_kernel_map_sync()
      	io_reserve_memtype()		=> memtype_reserve_io()
      	io_free_memtype()		=> memtype_free_io()
      
      	memtype_check_insert()		=> memtype_check_insert()
      	memtype_erase()			=> memtype_erase()
      	memtype_lookup()		=> memtype_lookup()
      	memtype_copy_nth_element()	=> memtype_copy_nth_element()
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ecdd6ee7
    • x86/mm/pat: Move the memtype related files to arch/x86/mm/pat/ · f9b57cf8
      Ingo Molnar committed
      - pat.c offers, dominantly, the memtype APIs - so rename it to memtype.c.
      
      - pageattr.c is offering, primarily, the set_memory*() page attribute APIs,
        which is offered via the <asm/set_memory.h> header: name the .c file
        along the same pattern.
      
      I.e. perform these renames, and move them all next to each other in arch/x86/mm/pat/:
      
          pat.c             => memtype.c
          pat_internal.h    => memtype.h
          pat_interval.c    => memtype_interval.c
      
          pageattr.c        => set_memory.c
          pageattr-test.c   => cpa-test.c
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f9b57cf8
    • x86/mm/pat: Clean up PAT initialization flags · d891b921
      Ingo Molnar committed
      Right now we have these variables that impact the PAT initialization sequence:
      
        pat_disabled
        boot_cpu_done
        pat_initialized
        init_cm_done
      
      Some have a pat_ prefix, some not, and the naming is random,
      which makes their purpose rather opaque.
      
      Name them consistently and according to their role:
      
        pat_disabled
        pat_bp_initialized
        pat_bp_enabled
        pat_cm_initialized
      
      Also rename pat_bsp_init() => pat_bp_init(), to use the canonical
      abbreviation.
      
      Also add a warning for double calls of init_cache_modes(), the call chains
      leading to this are complex and I couldn't convince myself that we never
      call this function twice - so utilize the flag for a debug check.
      
      No change in functionality intended.
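      
      The double-call check mentioned above, roughly (flag name taken from the
      list above; exact form and placement in the patch may differ):
      
          void init_cache_modes(void)
          {
                  /* the call chains leading here are complex; flag double calls */
                  WARN_ON_ONCE(pat_cm_initialized);
                  pat_cm_initialized = true;
                  /* ... establish the cachemode translation tables ... */
          }
      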
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d891b921
    • x86/mm/pat: Harmonize 'struct memtype *' local variable and function parameter use · baf65855
      Ingo Molnar committed
      We have quite a zoo of 'struct memtype' variable nomenclature:
      
        new
        entry
        print_entry
        data
        match
        out
        memtype
      
      Beyond the randomness, some of these are outright confusing, especially
      when used in larger functions.
      
      Standardize them:
      
        entry
        entry_new
        entry_old
        entry_print
        entry_match
        entry_out
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      baf65855
    • x86/mm/pat: Simplify the free_memtype() control flow · 47553d42
      Ingo Molnar committed
      Simplify/streamline the quirky handling of the pat_pagerange_is_ram() logic,
      and get rid of the 'err' local variable.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      47553d42
    • x86/mm/pat: Create fixed width output in /sys/kernel/debug/x86/pat_memtype_list, similar to the E820 debug printouts · ef35b0fc
      Ingo Molnar committed
      
      Before:
      
        write-back @ 0xbdfa9000-0xbdfaa000
        write-back @ 0xbdfaa000-0xbdfab000
        write-back @ 0xbdfab000-0xbdfac000
        uncached-minus @ 0xc0000000-0xd0000000
        uncached-minus @ 0xd0900000-0xd0920000
        uncached-minus @ 0xd0920000-0xd0940000
        uncached-minus @ 0xd0940000-0xd0960000
        uncached-minus @ 0xd0960000-0xd0980000
        uncached-minus @ 0xd0980000-0xd0981000
        uncached-minus @ 0xd0990000-0xd0991000
        uncached-minus @ 0xd09a0000-0xd09a1000
        uncached-minus @ 0xd09b0000-0xd09b1000
        uncached-minus @ 0xd0d00000-0xd0d01000
        uncached-minus @ 0xd0d10000-0xd0d11000
        uncached-minus @ 0xd0d20000-0xd0d21000
        uncached-minus @ 0xd0d40000-0xd0d41000
        uncached-minus @ 0xfed00000-0xfed01000
        uncached-minus @ 0xfed1f000-0xfed20000
        uncached-minus @ 0xfed40000-0xfed41000
      
      After:
      
        PAT: [mem 0x00000000bdf8e000-0x00000000bdfa8000] write-back
        PAT: [mem 0x00000000bdfa9000-0x00000000bdfaa000] write-back
        PAT: [mem 0x00000000bdfaa000-0x00000000bdfab000] write-back
        PAT: [mem 0x00000000bdfab000-0x00000000bdfac000] write-back
        PAT: [mem 0x00000000c0000000-0x00000000d0000000] uncached-minus
        PAT: [mem 0x00000000d0900000-0x00000000d0920000] uncached-minus
        PAT: [mem 0x00000000d0920000-0x00000000d0940000] uncached-minus
        PAT: [mem 0x00000000d0940000-0x00000000d0960000] uncached-minus
        PAT: [mem 0x00000000d0960000-0x00000000d0980000] uncached-minus
        PAT: [mem 0x00000000d0980000-0x00000000d0981000] uncached-minus
        PAT: [mem 0x00000000d0990000-0x00000000d0991000] uncached-minus
        PAT: [mem 0x00000000d09a0000-0x00000000d09a1000] uncached-minus
        PAT: [mem 0x00000000d09b0000-0x00000000d09b1000] uncached-minus
        PAT: [mem 0x00000000fed1f000-0x00000000fed20000] uncached-minus
        PAT: [mem 0x00000000fed40000-0x00000000fed41000] uncached-minus
      
      The advantage is that it's easier to parse at a glance - and the tree is ordered
      by start address, which is now reflected in putting the start address in the first
      column.
      
      This is also now similar to how we print e820 entries:
      
        BIOS-e820: [mem 0x00000000bafda000-0x00000000bb3d3fff] usable
        BIOS-e820: [mem 0x00000000bb3d4000-0x00000000bdd2efff] reserved
        BIOS-e820: [mem 0x00000000bdd2f000-0x00000000bddccfff] ACPI NVS
        BIOS-e820: [mem 0x00000000bddcd000-0x00000000bdea0fff] ACPI data
        BIOS-e820: [mem 0x00000000bdea1000-0x00000000bdf2efff] ACPI NVS
      
      Since this is a debugfs file not used by tools there's no known ABI dependencies.
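      
      The new style is plain fixed-width formatting; a standalone sketch (field
      widths illustrative - the kernel uses seq_printf() on the debugfs file):
      
          #include <stdio.h>
          
          static void print_memtype(unsigned long long start, unsigned long long end,
                                    const char *type)
          {
                  printf("PAT: [mem 0x%016llx-0x%016llx] %s\n", start, end, type);
          }
          
          int main(void)
          {
                  print_memtype(0xbdfa9000ULL, 0xbdfaa000ULL, "write-back");
                  print_memtype(0xc0000000ULL, 0xd0000000ULL, "uncached-minus");
                  return 0;
          }
      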
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ef35b0fc
    • x86/mm/pat: Disambiguate PAT-disabled boot messages · 5557e831
      Ingo Molnar committed
      Right now we have these four types of PAT-disabled boot messages:
      
        x86/PAT: PAT support disabled.
        x86/PAT: PAT MSR is 0, disabled.
        x86/PAT: MTRRs disabled, skipping PAT initialization too.
        x86/PAT: PAT not supported by CPU.
      
      The first message is ambiguous in that it doesn't signal that PAT is off
      due to a boot option.
      
      The second message doesn't really make it clear that this is the MSR value
      during early bootup and it's the firmware environment that disabled PAT
      support.
      
      The fourth message doesn't really make it clear that we disable PAT support
      because CONFIG_MTRR is off in the kernel.
      
      Clarify, harmonize and fix the spelling in these user-visible messages:
      
        x86/PAT: PAT support disabled via boot option.
        x86/PAT: PAT support disabled by the firmware.
        x86/PAT: PAT support disabled because CONFIG_MTRR is disabled in the kernel.
        x86/PAT: PAT not supported by the CPU.
      
      Also add a fifth message, in case PAT support is disabled at build time:
      
        x86/PAT: PAT support disabled because CONFIG_X86_PAT is disabled in the kernel.
      
      Previously we'd just silently return from pat_init() without giving any indication
      that PAT support is off.
      
      Finally, clarify/extend some of the comments related to PAT initialization.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5557e831
    • x86/mm/pat: Update the comments in pat.c and pat_interval.c and refresh the code a bit · aee7f913
      Ingo Molnar committed
      Tidy up the code:
      
       - add comments explaining the PAT code, the role of the functions and the logic
      
       - fix various typos and grammar while at it
      
       - simplify the file-scope memtype_interval_*() namespace to interval_*()
      
       - simplify stylistic complications such as unnecessary linebreaks
         or convoluted control flow
      
       - use the simpler '#ifdef CONFIG_*' pattern instead of '#if defined(CONFIG_*)' pattern
      
       - remove the non-idiomatic newline between late_initcall() and its function definition
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      aee7f913
  11. 02 Dec 2019 (2 commits)
    • x86/kasan: support KASAN_VMALLOC · 0609ae01
      Daniel Axtens committed
      In the case where KASAN directly allocates memory to back vmalloc space,
      don't map the early shadow page over it.
      
      We prepopulate pgds/p4ds for the range that would otherwise be empty.
      This is required to get it synced to hardware on boot, allowing the
      lower levels of the page tables to be filled dynamically.
      
      Link: http://lkml.kernel.org/r/20191031093909.9228-5-dja@axtens.net
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Acked-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0609ae01
    • x86/mm/pat: Fix off-by-one bugs in interval tree search · 91298f1a
      Ingo Molnar committed
      There's a bug in the new PAT code, the conversion of memtype_check_conflict()
      is buggy:
      
         8d04a5f9: ("x86/mm/pat: Convert the PAT tree to a generic interval tree")
      
              dprintk("Overlap at 0x%Lx-0x%Lx\n", match->start, match->end);
              found_type = match->type;
      
      -       node = rb_next(&match->rb);
      -       while (node) {
      -               match = rb_entry(node, struct memtype, rb);
      -
      -               if (match->start >= end) /* Checked all possible matches */
      -                       goto success;
      -
      -               if (is_node_overlap(match, start, end) &&
      -                   match->type != found_type) {
      +       match = memtype_interval_iter_next(match, start, end);
      +       while (match) {
      +               if (match->type != found_type)
                              goto failure;
      -               }
      
      -               node = rb_next(&match->rb);
      +               match = memtype_interval_iter_next(match, start, end);
              }
      
      Note how the '>= end' condition to end the interval check, got converted
      into:
      
      +       match = memtype_interval_iter_next(match, start, end);
      
      This is subtly off by one, because the interval trees interfaces require
      closed interval parameters:
      
        include/linux/interval_tree_generic.h
      
       /*                                                                            \
        * Iterate over intervals intersecting [start;last]                           \
        *                                                                            \
        * Note that a node's interval intersects [start;last] iff:                   \
        *   Cond1: ITSTART(node) <= last                                             \
        * and                                                                        \
        *   Cond2: start <= ITLAST(node)                                             \
        */                                                                           \
      
        ...
      
                      if (ITSTART(node) <= last) {            /* Cond1 */           \
                              if (start <= ITLAST(node))      /* Cond2 */           \
                                      return node;    /* node is leftmost match */  \
      
      [start;last] is a closed interval (note that '<= last' check) - while the
      PAT 'end' parameter is 1 byte beyond the end of the range, because
      ioremap() and the other mapping APIs usually use the [start,end)
      half-open interval, derived from 'size'.
      
      This is what ioremap() does for example:
      
              /*
               * Mappings have to be page-aligned
               */
              offset = phys_addr & ~PAGE_MASK;
              phys_addr &= PHYSICAL_PAGE_MASK;
              size = PAGE_ALIGN(last_addr+1) - phys_addr;
      
              retval = reserve_memtype(phys_addr, (u64)phys_addr + size,
                                                      pcm, &new_pcm);
      
      phys_addr+size will be on a page boundary, after the last byte of the
      mapped interval.
      
      So the correct parameter to use in the interval tree searches is not
      'end' but 'end-1'.
      
      This could have relevance if conflicting PAT ranges are exactly adjacent,
      for example a future WC region is followed immediately by an already
      mapped UC- region - in this case memtype_check_conflict() would
      incorrectly deny the WC memtype region and downgrade the memtype to UC-.
      
      BTW., rather annoyingly this downgrading is done silently in
      memtype_check_insert():
      
      int memtype_check_insert(struct memtype *new,
                               enum page_cache_mode *ret_type)
      {
              int err = 0;
      
              err = memtype_check_conflict(new->start, new->end, new->type, ret_type);
              if (err)
                      return err;
      
              if (ret_type)
                      new->type = *ret_type;
      
              memtype_interval_insert(new, &memtype_rbroot);
              return 0;
      }
      
      So on such a conflict we'd just silently get UC- in *ret_type, and write
      it into the new region, never the wiser ...
      
      So assuming that the patch below fixes the primary bug the diagnostics
      side of ioremap() cache attribute downgrades would be another thing to
      fix.
      
      Anyway, I checked all the interval-tree iterations, and most of them are
      off by one - but I think the one related to memtype_check_conflict() is
      the one causing this particular performance regression.
      
      The only correct interval-tree searches were these two:
      
        arch/x86/mm/pat_interval.c:     match = memtype_interval_iter_first(&memtype_rbroot, 0, ULONG_MAX);
        arch/x86/mm/pat_interval.c:             match = memtype_interval_iter_next(match, 0, ULONG_MAX);
      
      The ULONG_MAX was hiding the off-by-one in plain sight. :-)
      
      Note that the bug was probably benign in the sense of implementing a too
      strict cache attribute conflict policy and downgrading cache attributes,
      so AFAICS the worst outcome of this bug would be a performance regression,
      not any instabilities.
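      
      The underlying pitfall, boiled down to a standalone illustration (not the
      PAT code itself): a closed-interval [start;last] overlap test reports a
      false conflict for exactly adjacent half-open [start,end) ranges unless
      both sides are converted with 'end - 1':
      
          #include <stdio.h>
          #include <stdbool.h>
          
          /* the Cond1/Cond2 test from include/linux/interval_tree_generic.h */
          static bool overlaps_closed(unsigned long it_start, unsigned long it_last,
                                      unsigned long start, unsigned long last)
          {
                  return it_start <= last && start <= it_last;
          }
          
          int main(void)
          {
                  /* existing UC- entry covering [0x1000, 0x2000) */
                  unsigned long e_start = 0x1000, e_end = 0x2000;
                  /* new, exactly adjacent WC request for [0x2000, 0x3000) */
                  unsigned long q_start = 0x2000, q_end = 0x3000;
          
                  printf("half-open ends passed as-is: overlap=%d (false conflict)\n",
                         overlaps_closed(e_start, e_end, q_start, q_end));
                  printf("converted with end - 1:      overlap=%d (adjacent, no conflict)\n",
                         overlaps_closed(e_start, e_end - 1, q_start, q_end - 1));
                  return 0;
          }
      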
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Reported-by: Kenneth R. Crudup <kenny@panix.com>
      Reported-by: Mariusz Ceier <mceier+kernel@gmail.com>
      Tested-by: Mariusz Ceier <mceier@gmail.com>
      Tested-by: Kenneth R. Crudup <kenny@panix.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191201144947.GA4167@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      91298f1a
  12. 27 Nov 2019 (3 commits)
    • x86/mm: Remove set_kernel_text_r[ow]() · c12af440
      Peter Zijlstra committed
      With the last and only user of these functions gone (ftrace) remove
      them as well to avoid ever growing new users.
      Tested-by: Alexei Starovoitov <ast@kernel.org>
      Tested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191111132457.819095320@infradead.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c12af440
    • x86/doublefault/32: Move #DF stack and TSS to cpu_entry_area · dc4e0021
      Andy Lutomirski committed
      There are three problems with the current layout of the doublefault
      stack and TSS.  First, the TSS is only cacheline-aligned, which is
      not enough -- if the hardware portion of the TSS (struct x86_hw_tss)
      crosses a page boundary, horrible things happen [0].  Second, the
      stack and TSS are global, so simultaneous double faults on different
      CPUs will cause massive corruption.  Third, the whole mechanism
      won't work if user CR3 is loaded, resulting in a triple fault [1].
      
      Let the doublefault stack and TSS share a page (which prevents the
      TSS from spanning a page boundary), make it percpu, and move it into
      cpu_entry_area.  Teach the stack dump code about the doublefault
      stack.
      
      [0] Real hardware will read past the end of the page onto the next
          *physical* page if a task switch happens.  Virtual machines may
          have any number of bugs, and I would consider it reasonable for
          a VM to summarily kill the guest if it tries to task-switch to
          a page-spanning TSS.
      
      [1] Real hardware triple faults.  At least some VMs seem to hang.
          I'm not sure what's going on.
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      dc4e0021
    • x86/mm/32: Sync only to VMALLOC_END in vmalloc_sync_all() · 9a62d200
      Joerg Roedel committed
      The job of vmalloc_sync_all() is to help the lazy freeing of vmalloc()
      ranges: before such vmap ranges are reused we make sure that they are
      unmapped from every task's page tables.
      
      This is really easy on pagetable setups where the kernel page tables
      are shared between all tasks - this is the case on 32-bit kernels
      with SHARED_KERNEL_PMD = 1.
      
      But on !SHARED_KERNEL_PMD 32-bit kernels this involves iterating
      over the pgd_list and clearing all pmd entries in the pgds that
      are cleared in the init_mm.pgd, which is the reference pagetable
      that the vmalloc() code uses.
      
      In that context the current practice of vmalloc_sync_all() iterating
      until FIXADDR_TOP is buggy:
      
              for (address = VMALLOC_START & PMD_MASK;
                   address >= TASK_SIZE_MAX && address < FIXADDR_TOP;
                   address += PMD_SIZE) {
                      struct page *page;
      
      Because iterating up to FIXADDR_TOP will involve a lot of non-vmalloc
      address ranges:
      
      	VMALLOC -> PKMAP -> LDT -> CPU_ENTRY_AREA -> FIX_ADDR
      
      This is mostly harmless for the FIX_ADDR and CPU_ENTRY_AREA ranges
      that don't clear their pmds, but it's lethal for the LDT range,
      which relies on having different mappings in different processes,
      and 'synchronizing' them in the vmalloc sense corrupts those
      pagetable entries (clearing them).
      
      This got particularly prominent with PTI, which turns SHARED_KERNEL_PMD
      off and makes this the dominant mapping mode on 32-bit.
      
      To make LDT working again vmalloc_sync_all() must only iterate over
      the volatile parts of the kernel address range that are identical
      between all processes.
      
      So the correct check in vmalloc_sync_all() is "address < VMALLOC_END"
      to make sure the VMALLOC areas are synchronized and the LDT
      mapping is not falsely overwritten.
      
      The CPU_ENTRY_AREA and the FIXMAP area are no longer synced either,
      but this is not really a problem since their PMDs get established
      during bootup and never change.
      
      This change fixes the ldt_gdt selftest in my setup.
      
      [ mingo: Fixed up the changelog to explain the logic and modified the
               copying to only happen up until VMALLOC_END. ]
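      
      The resulting loop bound, sketched from the description above (surrounding
      code as quoted earlier in this message; only the end condition changes):
      
          for (address = VMALLOC_START & PMD_MASK;
               address >= TASK_SIZE_MAX && address < VMALLOC_END;
               address += PMD_SIZE) {
                  /* sync this PMD of init_mm.pgd into every pgd on pgd_list */
          }
      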
      Reported-by: Borislav Petkov <bp@suse.de>
      Tested-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Cc: <stable@vger.kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Fixes: 7757d607: ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32")
      Link: https://lkml.kernel.org/r/20191126111119.GA110513@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9a62d200
  13. 25 Nov 2019 (2 commits)
    • locking/refcount: Consolidate implementations of refcount_t · fb041bb7
      Will Deacon committed
      The generic implementation of refcount_t should be good enough for
      everybody, so remove ARCH_HAS_REFCOUNT and REFCOUNT_FULL entirely,
      leaving the generic implementation enabled unconditionally.
      Signed-off-by: Will Deacon <will@kernel.org>
      Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Tested-by: Hanjun Guo <guohanjun@huawei.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191121115902.2551-9-will@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fb041bb7
    • x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise · 05b042a1
      Ingo Molnar committed
      
      When two recent commits that increased the size of the 'struct cpu_entry_area'
      were merged in -tip, the 32-bit defconfig build started failing on the following
      build time assert:
      
        ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_189’ declared with attribute error: BUILD_BUG_ON failed: CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE
        arch/x86/mm/cpu_entry_area.c:189:2: note: in expansion of macro ‘BUILD_BUG_ON’
        In function ‘setup_cpu_entry_area_ptes’,
      
      Which corresponds to the following build time assert:
      
      	BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE);
      
      The purpose of this assert is to sanity check the fixed-value definition of
      CPU_ENTRY_AREA_PAGES arch/x86/include/asm/pgtable_32_types.h:
      
      	#define CPU_ENTRY_AREA_PAGES    (NR_CPUS * 41)
      
      The '41' is supposed to match sizeof(struct cpu_entry_area)/PAGE_SIZE, which value
      we didn't want to define in such a low level header, because it would cause
      dependency hell.
      
      Every time the size of cpu_entry_area is changed, we have to adjust CPU_ENTRY_AREA_PAGES
      accordingly - and this assert is checking that constraint.
      
      But the assert is both imprecise and buggy, primarily because it doesn't
      include the single readonly IDT page that is mapped at CPU_ENTRY_AREA_BASE
      (which begins at a PMD boundary).
      
      This bug was hidden by the fact that by accident CPU_ENTRY_AREA_PAGES is defined
      too large upstream (v5.4-rc8):
      
      	#define CPU_ENTRY_AREA_PAGES    (NR_CPUS * 40)
      
      While 'struct cpu_entry_area' is 155648 bytes, or 38 pages. So we had two extra
      pages, which hid the bug.
      
      The following commit (not yet upstream) increased the size to 40 pages:
      
        x86/iopl: ("Restrict iopl() permission scope")
      
      ... but increased CPU_ENTRY_AREA_PAGES only to 41 - i.e. shortening the gap
      to just 1 extra page.
      
      Then another not-yet-upstream commit changed the size again:
      
        880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")
      
      Which increased the cpu_entry_area size from 38 to 39 pages, but
      didn't change CPU_ENTRY_AREA_PAGES (kept it at 40). This worked
      fine, because we still had a page left from the accidental 'reserve'.
      
      But when these two commits were merged into the same tree, the
      combined size of cpu_entry_area grew from 38 to 40 pages, while
      CPU_ENTRY_AREA_PAGES finally caught up to 40 as well.
      
      Which is fine in terms of functionality, but the assert broke:
      
      	BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE);
      
      because CPU_ENTRY_AREA_MAP_SIZE is the total size of the area,
      which is 1 page larger due to the IDT page.
      
      To fix all this, change the assert to two precise asserts:
      
      	BUILD_BUG_ON((CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE);
      	BUILD_BUG_ON(CPU_ENTRY_AREA_TOTAL_SIZE != CPU_ENTRY_AREA_MAP_SIZE);
      
      This takes the IDT page into account, and also connects the size-based
      define of CPU_ENTRY_AREA_TOTAL_SIZE with the address-subtraction based
      define of CPU_ENTRY_AREA_MAP_SIZE.
      
      Also clean up some of the names which made it rather confusing:
      
       - 'CPU_ENTRY_AREA_TOT_SIZE' wasn't actually the 'total' size of
         the cpu-entry-area, but the per-cpu array size, so rename this
         to CPU_ENTRY_AREA_ARRAY_SIZE.
      
       - Introduce CPU_ENTRY_AREA_TOTAL_SIZE that _is_ the total mapping
         size, with the IDT included.
      
       - Add comments where '+1' denotes the IDT mapping - it wasn't
         obvious and took me about 3 hours to decode...
      
      Finally, because this particular commit is actually applied after
      this patch:
      
        880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")
      
      Fix the CPU_ENTRY_AREA_PAGES value from 40 pages to the correct 39 pages.
      
      All future commits that change cpu_entry_area will have to adjust
      this value precisely.
      
      As a side note, we should probably attempt to remove CPU_ENTRY_AREA_PAGES
      and derive its value directly from the structure, without causing
      header hell - but that is an adventure for another day! :-)
      
      Fixes: 880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit")
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: stable@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      05b042a1
  14. 22 Nov 2019 (5 commits)
  15. 18 Nov 2019 (1 commit)
  16. 16 Nov 2019 (1 commit)
    • x86/tss: Fix and move VMX BUILD_BUG_ON() · 6b546e1c
      Thomas Gleixner committed
      The BUILD_BUG_ON(IO_BITMAP_OFFSET - 1 == 0x67) in the VMX code is bogus in
      two aspects:
      
      1) This wants to be in generic x86 code simply to catch issues even when
         VMX is disabled in Kconfig.
      
      2) The IO_BITMAP_OFFSET is not the right thing to check because it makes
         assumptions about the layout of tss_struct. Nothing requires that the
         I/O bitmap is placed right after x86_tss, which is the hardware mandated
         tss structure. It pointlessly makes restrictions on the struct
         tss_struct layout.
      
      The proper thing to check is:
      
          - Offset of x86_tss in tss_struct is 0
          - Size of x86_tss == 0x68
      
      Move it to the other build time TSS checks and make it do the right thing.
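      
      In code form, the two checks described above amount to roughly (a sketch;
      see the patch for the exact statements and where they end up):
      
          /* the hardware-mandated TSS must sit at offset 0 of tss_struct... */
          BUILD_BUG_ON(offsetof(struct tss_struct, x86_tss) != 0);
          /* ...and be exactly 0x68 bytes, as architecturally defined */
          BUILD_BUG_ON(sizeof(struct x86_hw_tss) != 0x68);
      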
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      6b546e1c
  17. 12 Nov 2019 (1 commit)