1. 18 Mar 2021, 1 commit
    • x86: Fix various typos in comments · d9f6e12f
      By Ingo Molnar
      Fix ~144 single-word typos in arch/x86/ code comments.
      
      Doing this in a single commit should reduce the churn.
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: linux-kernel@vger.kernel.org
      d9f6e12f
  2. 05 Jan 2021, 1 commit
    • x86/mm: Increase pgt_buf size for 5-level page tables · 167dcfc0
      By Lorenzo Stoakes
      pgt_buf is used to allocate page tables during the initial direct page
      mapping, which bootstraps the kernel into being able to allocate these
      before the direct mapping makes further pages available.
      
      INIT_PGD_PAGE_COUNT is set to 6 pages (doubled for KASLR) - 3 (PUD, PMD,
      PTE) for the 1 MiB ISA mapping and 3 more for the first direct mapping
      assignment in each case providing 2 MiB of address space.
      
      This has not been updated for 5-level page tables, which have an
      additional P4D page table level above the PUD.
      
      In most instances, this will not have a material impact as the first
      4 page levels allocated for the ISA mapping will provide sufficient
      address space to encompass all further address mappings.
      
      If the first direct mapping is within 512 GiB of the ISA mapping, only
      a PMD and a PTE need to be added in the case where the kernel is using
      4 KiB page tables (e.g. CONFIG_DEBUG_PAGEALLOC is enabled), and only a
      PMD if the kernel can use 2 MiB pages (the first allocation is limited
      to PMD_SIZE so a GiB page cannot be used there).
      
      However, if the machine has more than 512 GiB of RAM and the kernel is
      using 4 KiB pages, 3 further page tables are required.
      
      If the machine has more than 256 TiB of RAM at 4 KiB or 2 MiB page size,
      a further 3 or 4 page tables are required, respectively.
      
      Update INIT_PGD_PAGE_COUNT to reflect this.
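      
      A sketch of the sizing this implies, following the arithmetic above (the
      helper macro name is illustrative; only INIT_PGD_PAGE_COUNT comes from
      the text):
      
        /* Pages per early mapping: PUD/PMD/PTE, plus a P4D with 5-level paging. */
        #ifdef CONFIG_X86_5LEVEL
        # define INIT_PGD_PAGE_TABLES   4
        #else
        # define INIT_PGD_PAGE_TABLES   3
        #endif
      
        /* Two early mappings (ISA + first direct chunk), doubled for KASLR. */
        #ifdef CONFIG_RANDOMIZE_MEMORY
        # define INIT_PGD_PAGE_COUNT    (4 * INIT_PGD_PAGE_TABLES)
        #else
        # define INIT_PGD_PAGE_COUNT    (2 * INIT_PGD_PAGE_TABLES)
        #endif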
      
       [ bp: Sanitize text into passive voice without ambiguous personal pronouns. ]
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Dave Hansen <dave.hansen@intel.com>
      Link: https://lkml.kernel.org/r/20201215205641.34096-1-lstoakes@gmail.com
      167dcfc0
  3. 20 Nov 2020, 1 commit
  4. 17 Jun 2020, 1 commit
    • x86/mm: Fix -Wmissing-prototypes warnings for arch/x86/mm/init.c · d5249bc7
      By Benjamin Thiel
      Fix -Wmissing-prototypes warnings:
      
        arch/x86/mm/init.c:81:6:
        warning: no previous prototype for ‘x86_has_pat_wp’ [-Wmissing-prototypes]
        bool x86_has_pat_wp(void)
      
        arch/x86/mm/init.c:86:22:
        warning: no previous prototype for ‘pgprot2cachemode’ [-Wmissing-prototypes]
        enum page_cache_mode pgprot2cachemode(pgprot_t pgprot)
      
      by including the respective header containing prototypes. Also fix:
      
        arch/x86/mm/init.c:893:13:
        warning: no previous prototype for ‘mem_encrypt_free_decrypted_mem’ [-Wmissing-prototypes]
        void __weak mem_encrypt_free_decrypted_mem(void) { }
      
      by making it static inline for the !CONFIG_AMD_MEM_ENCRYPT case. This
      warning happens when CONFIG_AMD_MEM_ENCRYPT is not enabled (defconfig
      for example):
      
        ./arch/x86/include/asm/mem_encrypt.h:80:27:
        warning: inline function ‘mem_encrypt_free_decrypted_mem’ declared weak [-Wattributes]
        static inline void __weak mem_encrypt_free_decrypted_mem(void) { }
                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      It's ok to convert it to static inline because the function is used only
      on x86 and is not shared with other architectures, so drop the __weak too.
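      
      A sketch of the resulting header shape, based only on the warning output
      quoted above (the prototype side is hedged):
      
        /* arch/x86/include/asm/mem_encrypt.h */
        #ifdef CONFIG_AMD_MEM_ENCRYPT
        void __init mem_encrypt_free_decrypted_mem(void);
        #else
        /* x86-only, no other arch overrides this: static inline, no __weak. */
        static inline void mem_encrypt_free_decrypted_mem(void) { }
        #endif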
      
       [ bp: Massage and adjust __weak comments while at it. ]
      Signed-off-by: Benjamin Thiel <b.thiel@posteo.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20200606122629.2720-1-b.thiel@posteo.de
      d5249bc7
  5. 10 Jun 2020, 1 commit
    • x86/mm: simplify init_trampoline() and surrounding logic · 88107d33
      By Mike Rapoport
      There are three cases for the trampoline initialization:
      * 32-bit does nothing
      * 64-bit with kaslr disabled simply copies a PGD entry from the direct map
        to the trampoline PGD
      * 64-bit with kaslr enabled maps the real mode trampoline at PUD level
      
      These cases are currently differentiated by a bunch of ifdefs inside
      asm/include/pgtable.h, and the 64-bit with kaslr on case uses the
      pgd_index() helper.
      
      Replacing the ifdefs with a static function in arch/x86/mm/init.c gives
      clearer code and allows moving pgd_index() to the generic implementation
      in include/linux/pgtable.h.
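      
      A sketch of the static helper this describes; identifiers such as
      init_top_pgt and init_trampoline_kaslr() are assumptions derived from
      the three cases listed above:
      
        /* arch/x86/mm/init.c */
        static void __init init_trampoline(void)
        {
        #ifdef CONFIG_X86_64    /* 32-bit: nothing to do */
                if (!kaslr_memory_enabled())
                        /* copy the direct-map PGD entry into the trampoline PGD */
                        trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
                else
                        /* KASLR: map the real mode trampoline at PUD level */
                        init_trampoline_kaslr();
        #endif
        }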
      
      [rppt@linux.ibm.com: take CONFIG_RANDOMIZE_MEMORY into account in kaslr_enabled()]
        Link: http://lkml.kernel.org/r/20200525104045.GB13212@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vincent Chen <deanbo422@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200514170327.31389-8-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88107d33
  6. 04 Jun 2020, 1 commit
    • mm: use free_area_init() instead of free_area_init_nodes() · 9691a071
      By Mike Rapoport
      free_area_init() has effectively become a wrapper for
      free_area_init_nodes() and there is no point in keeping it.  Still, the
      free_area_init() name is shorter and more general as it does not imply
      the necessity to initialize multiple nodes.
      
      Rename free_area_init_nodes() to free_area_init(), update the callers and
      drop old version of free_area_init().
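      
      For illustration, the wrapper being removed amounts to this (a sketch of
      the pre-rename state; the parameter name is illustrative):
      
        /* Before: free_area_init() only forwarded to the multi-node version,
         * so the shorter name can simply take over. */
        void __init free_area_init(unsigned long *max_zone_pfn)
        {
                free_area_init_nodes(max_zone_pfn);
        }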
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Hoan Tran <hoan@os.amperecomputing.com>	[arm64]
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Brian Cain <bcain@codeaurora.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nick Hu <nickhu@andestech.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: http://lkml.kernel.org/r/20200412194859.12663-6-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9691a071
  7. 23 May 2020, 1 commit
  8. 27 Apr 2020, 2 commits
  9. 23 Apr 2020, 1 commit
  10. 20 Apr 2020, 2 commits
  11. 11 Apr 2020, 1 commit
  12. 05 Nov 2019, 1 commit
    • x86/mm: Report which part of kernel image is freed · 5494c3a6
      By Kees Cook
      The memory freeing report wasn't very useful for figuring out which
      parts of the kernel image were being freed. Add the details for clearer
      reporting in dmesg.
      
      Before:
      
        Freeing unused kernel image memory: 1348K
        Write protecting the kernel read-only data: 20480k
        Freeing unused kernel image memory: 2040K
        Freeing unused kernel image memory: 172K
      
      After:
      
        Freeing unused kernel image (initmem) memory: 1348K
        Write protecting the kernel read-only data: 20480k
        Freeing unused kernel image (text/rodata gap) memory: 2040K
        Freeing unused kernel image (rodata/data gap) memory: 172K
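      
      The "After" lines imply the freeing helper carries a label through to the
      report; a sketch of that shape (the signature is hedged from the messages
      above):
      
        /* Each caller names the image region being freed; free_init_pages()
         * folds the label into the "Freeing ... memory:" printout. */
        void free_kernel_image_pages(const char *what, void *begin, void *end)
        {
                free_init_pages(what, (unsigned long)begin, (unsigned long)end);
        }
      
        /* e.g. free_kernel_image_pages("unused kernel image (initmem)",
         *                              &__init_begin, &__init_end); */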
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: linux-alpha@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-c6x-dev@linux-c6x.org
      Cc: linux-ia64@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-s390@vger.kernel.org
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Cc: Segher Boessenkool <segher@kernel.crashing.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: x86-ml <x86@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: https://lkml.kernel.org/r/20191029211351.13243-28-keescook@chromium.org
      5494c3a6
  13. 30 Apr 2019, 1 commit
  14. 24 Apr 2019, 1 commit
    • x86/mm: Fix a crash with kmemleak_scan() · 0d02113b
      By Qian Cai
      The first kmemleak_scan() call after boot would trigger the crash below
      because this callpath:
      
        kernel_init
          free_initmem
            mem_encrypt_free_decrypted_mem
              free_init_pages
      
      unmaps memory inside the .bss when DEBUG_PAGEALLOC=y.
      
      kmemleak_init() will register the .data/.bss sections and then
      kmemleak_scan() will scan those addresses and dereference them looking
      for pointer references. If free_init_pages() frees and unmaps pages in
      those sections, kmemleak_scan() will crash if referencing one of those
      addresses:
      
        BUG: unable to handle kernel paging request at ffffffffbd402000
        CPU: 12 PID: 325 Comm: kmemleak Not tainted 5.1.0-rc4+ #4
        RIP: 0010:scan_block
        Call Trace:
         scan_gray_list
         kmemleak_scan
         kmemleak_scan_thread
         kthread
         ret_from_fork
      
      Since kmemleak_free_part() is tolerant to unknown objects (not tracked
      by kmemleak), it is fine to call it from free_init_pages() even if not
      all address ranges passed to this function are known to kmemleak.
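      
      A sketch of the fix shape (the helper name is hypothetical;
      kmemleak_free_part() is the real API named above):
      
        #include <linux/kmemleak.h>
      
        /* Called from free_init_pages() before a range is freed/unmapped:
         * kmemleak silently ignores ranges it never tracked, so this is safe
         * even when only parts of .data/.bss were registered. */
        static void unregister_from_kmemleak(unsigned long begin, unsigned long end)
        {
                kmemleak_free_part((void *)begin, end - begin);
        }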
      
       [ bp: Massage. ]
      
      Fixes: b3f0907c ("x86/mm: Add .bss..decrypted section to hold shared variables")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190423165811.36699-1-cai@lca.pw
      0d02113b
  15. 29 Dec 2018, 1 commit
  16. 11 Dec 2018, 1 commit
  17. 31 Oct 2018, 1 commit
  18. 16 Sep 2018, 1 commit
    • x86/mm: Add .bss..decrypted section to hold shared variables · b3f0907c
      By Brijesh Singh
      kvmclock defines a few static variables which are shared with the
      hypervisor during kvmclock initialization.
      
      When SEV is active, memory is encrypted with a guest-specific key, and
      if the guest OS wants to share the memory region with the hypervisor
      then it must clear the C-bit before sharing it.
      
      Currently, we use kernel_physical_mapping_init() to split large pages
      before clearing the C-bit on shared pages. But it fails when called from
      the kvmclock initialization (mainly because the memblock allocator is
      not ready that early during boot).
      
      Add a __bss_decrypted section attribute which can be used when defining
      such shared variables. The so-defined variables will be placed in the
      .bss..decrypted section. This section will be mapped with C=0 early
      during boot.
      
      The .bss..decrypted section has a big chunk of memory that may be unused
      when memory encryption is not active; free it in that case.
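      
      The attribute itself is a one-line section annotation; a sketch
      consistent with the description (the example variable is made up):
      
        /* Variables marked this way land in .bss..decrypted, which early boot
         * code maps with the C-bit clear, i.e. unencrypted under SEV. */
        #define __bss_decrypted __attribute__((__section__(".bss..decrypted")))
      
        /* Example: a kvmclock-style variable shared with the hypervisor. */
        static unsigned long shared_with_host __bss_decrypted;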
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: kvm@vger.kernel.org
      Link: https://lkml.kernel.org/r/1536932759-12905-2-git-send-email-brijesh.singh@amd.com
      b3f0907c
  19. 24 Aug 2018, 1 commit
  20. 21 Aug 2018, 2 commits
  21. 15 Aug 2018, 1 commit
  22. 07 Aug 2018, 1 commit
    • x86/mm/init: Remove freed kernel image areas from alias mapping · c40a56a7
      By Dave Hansen
      The kernel image is mapped into two places in the virtual address space
      (addresses without KASLR, of course):
      
      	1. The kernel direct map (0xffff880000000000)
      	2. The "high kernel map" (0xffffffff81000000)
      
      We actually execute out of #2.  If we get the address of a kernel symbol,
      it points to #2, but almost all physical-to-virtual translations point to
      #1.
      Parts of the "high kernel map" alias are mapped in the userspace page
      tables with the Global bit for performance reasons.  The parts that we map
      to userspace do not (er, should not) have secrets. When PTI is enabled then
      the global bit is usually not set in the high mapping and just used to
      compensate for poor performance on systems which lack PCID.
      
      This is fine, except that some areas in the kernel image that are adjacent
      to the non-secret-containing areas are unused holes.  We free these holes
      back into the normal page allocator and reuse them as normal kernel memory.
      The memory will, of course, get *used* via the normal map, but the alias
      mapping is kept.
      
      This otherwise unused alias mapping of the holes will, by default, keep
      the Global bit, be mapped out to userspace, and be vulnerable to Meltdown.
      
      Remove the alias mapping of these pages entirely.  This is likely to
      fracture the 2M page mapping the kernel image near these areas, but this
      should affect a minority of the area.
      
      The pageattr code changes *all* aliases mapping the physical pages that it
      operates on (by default).  We only want to modify a single alias, so we
      need to tweak its behavior.
      
      This unmapping behavior is currently dependent on PTI being in place.
      Going forward, we should at least consider doing this for all
      configurations.  Having an extra read-write alias for memory is not exactly
      ideal for debugging things like random memory corruption and this does
      undercut features like DEBUG_PAGEALLOC or future work like eXclusive Page
      Frame Ownership (XPFO).
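      
      A sketch of the single-alias tweak, assuming a pageattr flag that skips
      the walk over the other aliases (names follow pageattr conventions but
      are hedged):
      
        /* Clear _PAGE_PRESENT on this alias only; the direct-map alias of
         * the same physical pages stays intact and usable. */
        int set_memory_np_noalias(unsigned long addr, int numpages)
        {
                int cpa_flags = CPA_NO_CHECK_ALIAS;
      
                return change_page_attr_set_clr(&addr, numpages,
                                                __pgprot(0), __pgprot(_PAGE_PRESENT),
                                                0, cpa_flags, NULL);
        }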
      
      Before this patch:
      
      current_kernel:---[ High Kernel Mapping ]---
      current_kernel-0xffffffff80000000-0xffffffff81000000          16M                               pmd
      current_kernel-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
      current_kernel-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
      current_kernel-0xffffffff81e11000-0xffffffff82000000        1980K     RW                     NX pte
      current_kernel-0xffffffff82000000-0xffffffff82600000           6M     ro         PSE     GLB NX pmd
      current_kernel-0xffffffff82600000-0xffffffff82c00000           6M     RW         PSE         NX pmd
      current_kernel-0xffffffff82c00000-0xffffffff82e00000           2M     RW                     NX pte
      current_kernel-0xffffffff82e00000-0xffffffff83200000           4M     RW         PSE         NX pmd
      current_kernel-0xffffffff83200000-0xffffffffa0000000         462M                               pmd
      
        current_user:---[ High Kernel Mapping ]---
        current_user-0xffffffff80000000-0xffffffff81000000          16M                               pmd
        current_user-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
        current_user-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
        current_user-0xffffffff81e11000-0xffffffff82000000        1980K     RW                     NX pte
        current_user-0xffffffff82000000-0xffffffff82600000           6M     ro         PSE     GLB NX pmd
        current_user-0xffffffff82600000-0xffffffffa0000000         474M                               pmd
      
      After this patch:
      
      current_kernel:---[ High Kernel Mapping ]---
      current_kernel-0xffffffff80000000-0xffffffff81000000          16M                               pmd
      current_kernel-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
      current_kernel-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
      current_kernel-0xffffffff81e11000-0xffffffff82000000        1980K                               pte
      current_kernel-0xffffffff82000000-0xffffffff82400000           4M     ro         PSE     GLB NX pmd
      current_kernel-0xffffffff82400000-0xffffffff82488000         544K     ro                     NX pte
      current_kernel-0xffffffff82488000-0xffffffff82600000        1504K                               pte
      current_kernel-0xffffffff82600000-0xffffffff82c00000           6M     RW         PSE         NX pmd
      current_kernel-0xffffffff82c00000-0xffffffff82c0d000          52K     RW                     NX pte
      current_kernel-0xffffffff82c0d000-0xffffffff82dc0000        1740K                               pte
      
        current_user:---[ High Kernel Mapping ]---
        current_user-0xffffffff80000000-0xffffffff81000000          16M                               pmd
        current_user-0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
        current_user-0xffffffff81e00000-0xffffffff81e11000          68K     ro                 GLB x  pte
        current_user-0xffffffff81e11000-0xffffffff82000000        1980K                               pte
        current_user-0xffffffff82000000-0xffffffff82400000           4M     ro         PSE     GLB NX pmd
        current_user-0xffffffff82400000-0xffffffff82488000         544K     ro                     NX pte
        current_user-0xffffffff82488000-0xffffffff82600000        1504K                               pte
        current_user-0xffffffff82600000-0xffffffffa0000000         474M                               pmd
      
      [ tglx: Do not unmap on 32bit as there is only one mapping ]
      
      Fixes: 0f561fce ("x86/pti: Enable global pages for shared areas")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Link: https://lkml.kernel.org/r/20180802225831.5F6A2BFC@viggo.jf.intel.com
      c40a56a7
  23. 06 Aug 2018, 1 commit
    • x86/mm/init: Add helper for freeing kernel image pages · 6ea2738e
      By Dave Hansen
      When chunks of the kernel image are freed, free_init_pages() is used
      directly.  Consolidate the three sites that do this.  Also update the
      string to give an incrementally better description of that memory versus
      what was there before.
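      
      A sketch of the consolidated helper (the function name follows the commit
      title; internals are assumptions):
      
        /* One helper replaces the three open-coded free_init_pages() sites. */
        void free_kernel_image_pages(void *begin, void *end)
        {
                free_init_pages("unused kernel image",
                                (unsigned long)begin, (unsigned long)end);
        }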
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@google.com
      Cc: aarcange@redhat.com
      Cc: jgross@suse.com
      Cc: jpoimboe@redhat.com
      Cc: gregkh@linuxfoundation.org
      Cc: peterz@infradead.org
      Cc: hughd@google.com
      Cc: torvalds@linux-foundation.org
      Cc: bp@alien8.de
      Cc: luto@kernel.org
      Cc: ak@linux.intel.com
      Cc: Kees Cook <keescook@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180802225829.FE0E32EA@viggo.jf.intel.com
      6ea2738e
  24. 27 Jun 2018, 1 commit
    • x86/speculation/l1tf: Protect PAE swap entries against L1TF · 0d0f6249
      By Vlastimil Babka
      The PAE 3-level paging code currently doesn't mitigate L1TF by flipping the
      offset bits, and uses the high PTE word, thus bits 32-36 for type, 37-63 for
      offset. The lower word is zeroed, thus systems with less than 4GB memory are
      safe. With 4GB to 128GB the swap type selects the memory locations vulnerable
      to L1TF; with even more memory, also the swap offset influences the address.
      This might be a problem with 32bit PAE guests running on large 64bit hosts.
      
      By continuing to keep the whole swap entry in either high or low 32bit word of
      PTE we would limit the swap size too much. Thus this patch uses the whole PAE
      PTE with the same layout as the 64bit version does. The macros just become a
      bit tricky since they assume the arch-dependent swp_entry_t to be 32bit.
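      
      A conceptual sketch of the layout (not the exact kernel macros): the swap
      type sits in the top bits of the full 64-bit PTE and the offset is stored
      bit-inverted, as in the 64-bit L1TF scheme:
      
        typedef unsigned long long pae_pte_t;
      
        #define SWP_TYPE_BITS   5
        #define SWP_OFFSET_MASK ((1ULL << (64 - SWP_TYPE_BITS)) - 1)
      
        /* Inverting the offset keeps the PTE's "PFN" bits pointing outside
         * populated memory, so a speculative L1TF load hits nothing real. */
        static inline pae_pte_t pae_swp_entry(unsigned int type,
                                              unsigned long long offset)
        {
                return ((pae_pte_t)type << (64 - SWP_TYPE_BITS)) |
                       (~offset & SWP_OFFSET_MASK);
        }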
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      0d0f6249
  25. 21 Jun 2018, 2 commits
  26. 15 Jun 2018, 1 commit
  27. 12 Apr 2018, 1 commit
    • x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image · 39114b7a
      By Dave Hansen
      Summary:
      
      In current kernels, with PTI enabled, no pages are marked Global. This
      potentially increases TLB misses.  But, the mechanism by which the Global
      bit is set and cleared is rather haphazard.  This patch makes the process
      more explicit.  In the end, it leaves us with Global entries in the page
      tables for the areas truly shared by userspace and kernel and increases
      TLB hit rates.
      
      The place where this patch really shines is on systems without PCIDs.  In
      this case, we are using an lseek microbenchmark[1] to see how a reasonably
      non-trivial syscall behaves.  Higher is better:
      
        No Global pages (baseline): 6077741 lseeks/sec
        88 Global Pages (this set): 7528609 lseeks/sec (+23.9%)
      
      On a modern Skylake desktop with PCIDs, the benefits are tangible, but not
      huge for a kernel compile (lower is better):
      
        No Global pages (baseline): 186.951 seconds time elapsed  ( +-  0.35% )
        28 Global pages (this set): 185.756 seconds time elapsed  ( +-  0.09% )
                                     -1.195 seconds (-0.64%)
      
      I also re-checked everything using the lseek1 test[1]:
      
        No Global pages (baseline): 15783951 lseeks/sec
        28 Global pages (this set): 16054688 lseeks/sec
      			     +270737 lseeks/sec (+1.71%)
      
      The effect is more visible, but still modest.
      
      Details:
      
      The kernel page tables are inherited from head_64.S which rudely marks
      them as _PAGE_GLOBAL.  For PTI, we have been relying on the grace of
      $DEITY and some insane behavior in pageattr.c to clear _PAGE_GLOBAL.
      This patch tries to do better.
      
      First, stop filtering out "unsupported" bits from being cleared in the
      pageattr code.  It's fine to filter out *setting* these bits but it
      is insane to keep us from clearing them.
      
      Then, *explicitly* go clear _PAGE_GLOBAL from the kernel identity map.
      Do not rely on pageattr to do it magically.
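      
      A sketch of that explicit step (helper naming in the set_memory_*() style
      is an assumption):
      
        /* Drop _PAGE_GLOBAL from every mapping of the kernel image range
         * instead of trusting pageattr filtering to do it as a side effect. */
        static void __init clear_kernel_image_global(void)
        {
                unsigned long start = PFN_ALIGN((unsigned long)_text);
                unsigned long end   = PFN_ALIGN((unsigned long)_end);
      
                set_memory_nonglobal(start, (end - start) >> PAGE_SHIFT);
        }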
      
      After this patch, we can see that "GLB" shows up in each copy of the
      page tables, that we have the same number of global entries in each
      and that they are the *same* entries.
      
        /sys/kernel/debug/page_tables/current_kernel:11
        /sys/kernel/debug/page_tables/current_user:11
        /sys/kernel/debug/page_tables/kernel:11
      
        9caae8ad6a1fb53aca2407ec037f612d  current_kernel.GLB
        9caae8ad6a1fb53aca2407ec037f612d  current_user.GLB
        9caae8ad6a1fb53aca2407ec037f612d  kernel.GLB
      
      A quick visual audit also shows that all the entries make sense.
      0xfffffe0000000000 is the cpu_entry_area and 0xffffffff81c00000
      is the entry/exit text:
      
        0xfffffe0000000000-0xfffffe0000002000           8K     ro                 GLB NX pte
        0xfffffe0000002000-0xfffffe0000003000           4K     RW                 GLB NX pte
        0xfffffe0000003000-0xfffffe0000006000          12K     ro                 GLB NX pte
        0xfffffe0000006000-0xfffffe0000007000           4K     ro                 GLB x  pte
        0xfffffe0000007000-0xfffffe000000d000          24K     RW                 GLB NX pte
        0xfffffe000002d000-0xfffffe000002e000           4K     ro                 GLB NX pte
        0xfffffe000002e000-0xfffffe000002f000           4K     RW                 GLB NX pte
        0xfffffe000002f000-0xfffffe0000032000          12K     ro                 GLB NX pte
        0xfffffe0000032000-0xfffffe0000033000           4K     ro                 GLB x  pte
        0xfffffe0000033000-0xfffffe0000039000          24K     RW                 GLB NX pte
        0xffffffff81c00000-0xffffffff81e00000           2M     ro         PSE     GLB x  pmd
      
      [1.] https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205517.C80FBE05@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      39114b7a
  28. 10 Apr 2018, 1 commit
    • x86/mm: Introduce "default" kernel PTE mask · 8a57f484
      By Dave Hansen
      The __PAGE_KERNEL_* page permissions are "raw".  They contain bits
      that may or may not be supported on the current processor.  They need
      to be filtered by a mask (currently __supported_pte_mask) to turn them
      into a value that we can actually set in a PTE.
      
      These __PAGE_KERNEL_* values all contain _PAGE_GLOBAL.  But, with PTI,
      we want to be able to support _PAGE_GLOBAL (have the bit set in
      __supported_pte_mask) but not have it appear in any of these masks by
      default.
      
      This patch creates a new mask, __default_kernel_pte_mask, and applies
      it when creating all of the PAGE_KERNEL_* masks.  This makes
      PAGE_KERNEL_* safe to use anywhere (they only contain supported bits).
      It also ensures that PAGE_KERNEL_* contains _PAGE_GLOBAL on PTI=n
      kernels but clears _PAGE_GLOBAL when PTI=y.
      
      We also make __default_kernel_pte_mask a non-GPL exported symbol
      because there are plenty of driver-available interfaces that take
      PAGE_KERNEL_* permissions.
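      
      A sketch of the masking described (__default_kernel_pte_mask and
      PAGE_KERNEL are named in the text; the rest is illustrative):
      
        /* Non-GPL export: plenty of driver-available interfaces take
         * PAGE_KERNEL_* permissions. */
        pteval_t __default_kernel_pte_mask __read_mostly = ~0;
        EXPORT_SYMBOL(__default_kernel_pte_mask);
      
        /* Filter every PAGE_KERNEL_* through the default mask, so that
         * _PAGE_GLOBAL survives for PTI=n and is cleared for PTI=y. */
        #define default_pgprot(x)       __pgprot((x) & __default_kernel_pte_mask)
        #define PAGE_KERNEL             default_pgprot(__PAGE_KERNEL)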
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205506.030DB6B6@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8a57f484
  29. 05 Jan 2018, 1 commit
    • x86/tlb: Drop the _GPL from the cpu_tlbstate export · 1e547681
      By Thomas Gleixner
      The recent changes for PTI touch cpu_tlbstate from various tlb_flush
      inlines. cpu_tlbstate is exported as a GPL symbol, so this causes a
      regression when building out-of-tree drivers for certain graphics cards.
      
      Aside from that, the export was wrong since it was introduced, as it
      should have been EXPORT_PER_CPU_SYMBOL_GPL().
      
      Use the correct PER_CPU export and drop the _GPL to restore the previous
      state which allows users to utilize the cards they paid for.
      
      As always I'm really thrilled to make this kind of change to support the
      #friends (or however the hot hashtag of today is spelled) from that closet
      sauce graphics corp.
      
      Fixes: 1e02ce4c ("x86: Store a per-cpu shadow copy of CR4")
      Fixes: 6fd166aa ("x86/mm: Use/Fix PCID to optimize user/kernel switches")
      Reported-by: Kees Cook <keescook@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: stable@vger.kernel.org
      1e547681
  30. 24 Dec 2017, 4 commits
    • x86/mm: Use INVPCID for __native_flush_tlb_single() · 6cff64b8
      By Dave Hansen
      This uses INVPCID to shoot down individual lines of the user mapping
      instead of marking the entire user map as invalid. This
      could/might/possibly be faster.
      
      This for sure needs tlb_single_page_flush_ceiling to be redetermined;
      esp. since INVPCID is _slow_.
      
      A detailed performance analysis is available here:
      
        https://lkml.kernel.org/r/3062e486-3539-8a1f-5724-16199420be71@intel.com
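      
      A sketch of the targeted shootdown (helper names are hedged; the INVPCID
      "individual address" form invalidates a single (PCID, address) pair):
      
        /* Invalidate one user-mapping TLB entry instead of marking the whole
         * user PCID invalid; fall back when INVPCID is not usable. */
        static inline void flush_user_tlb_one(unsigned long addr)
        {
                u16 asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
      
                if (static_cpu_has(X86_FEATURE_INVPCID_SINGLE))
                        invpcid_flush_one(user_pcid(asid), addr);
                else
                        invalidate_user_asid(asid);
        }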
      
      [ Peterz: Split out from big combo patch ]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6cff64b8
    • x86/mm: Use/Fix PCID to optimize user/kernel switches · 6fd166aa
      By Peter Zijlstra
      We can use PCID to retain the TLBs across CR3 switches; including those now
      part of the user/kernel switch. This increases performance of kernel
      entry/exit at the cost of more expensive/complicated TLB flushing.
      
      Now that we have two address spaces, one for kernel and one for user space,
      we need two PCIDs per mm. We use the top PCID bit to indicate a user PCID
      (just like we use the PFN LSB for the PGD). Since we do TLB invalidation
      from kernel space, the existing code will only invalidate the kernel PCID;
      we augment that by marking the corresponding user PCID invalid, and upon
      switching back to userspace, use a flushing CR3 write for the switch.
      
      In order to access the user_pcid_flush_mask we use PER_CPU storage, which
      means the previously established SWAPGS vs CR3 ordering is now mandatory
      and required.
      
      Having to do this memory access does require additional registers; most
      sites have a functioning stack and we can spill one (RAX), while sites
      without a functional stack need to otherwise provide the second scratch
      register.
      
      Note: PCID is generally available on Intel Sandybridge and later CPUs.
      Note: Up until this point TLB flushing was broken in this series.
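      
      A sketch of the two-PCIDs-per-mm scheme (constants and helpers follow the
      description above and are hedged):
      
        #define X86_CR3_PTI_PCID_USER_BIT   11  /* top bit of the 12-bit PCID */
      
        static inline u16 kern_pcid(u16 asid)
        {
                /* hardware PCID 0 is reserved for non-PCID semantics */
                return asid + 1;
        }
      
        static inline u16 user_pcid(u16 asid)
        {
                /* the user variant of the same ASID has the top PCID bit set */
                return kern_pcid(asid) | (1 << X86_CR3_PTI_PCID_USER_BIT);
        }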
      
      Based-on-code-from: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6fd166aa
    • x86/mm/pti: Add infrastructure for page table isolation · aa8c6248
      By Thomas Gleixner
      Add the initial files for kernel page table isolation, with a minimal init
      function and the boot time detection for this misfeature.
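      
      A sketch of the boot time detection (abbreviated; the real check would
      also have to handle "pti=" options and unaffected CPUs):
      
        /* Honor "nopti" on the command line, otherwise force-enable PTI. */
        void __init pti_check_boottime_disable(void)
        {
                if (cmdline_find_option_bool(boot_command_line, "nopti"))
                        return;
      
                setup_force_cpu_cap(X86_FEATURE_PTI);
        }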
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      aa8c6248
    • x86/mm/pti: Disable global pages if PAGE_TABLE_ISOLATION=y · c313ec66
      By Dave Hansen
      Global pages stay in the TLB across context switches.  Since all contexts
      share the same kernel mapping, these mappings are marked as global pages
      so kernel entries in the TLB are not flushed out on a context switch.
      
      But, even having these entries in the TLB opens up something that an
      attacker can use, such as the double-page-fault attack:
      
         http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf
      
      That means that even when PAGE_TABLE_ISOLATION switches page tables
      on return to user space the global pages would stay in the TLB cache.
      
      Disable global pages so that kernel TLB entries can be flushed before
      returning to user space. This way, all accesses to kernel addresses from
      userspace result in a TLB miss independent of the existence of a kernel
      mapping.
      
      Suppress global pages via the __supported_pte_mask. The user space
      mappings set PAGE_GLOBAL for the minimal kernel mappings which are
      required for entry/exit. These mappings are set up manually so the
      filtering does not take place.
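      
      A sketch of that suppression (the function name is hypothetical; the text
      only states that the filtering happens via __supported_pte_mask):
      
        /* With PTI, _PAGE_GLOBAL never survives the supported-bit filter, so
         * ordinary kernel mappings lose the Global bit. The few entry/exit
         * mappings that need it are set up manually and bypass this mask. */
        static void __init pti_filter_global_bit(void)
        {
                if (boot_cpu_has(X86_FEATURE_PTI))
                        __supported_pte_mask &= ~_PAGE_GLOBAL;
                else
                        __supported_pte_mask |= _PAGE_GLOBAL;
        }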
      
      [ The __supported_pte_mask simplification was written by Thomas Gleixner. ]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: linux-mm@kvack.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c313ec66
  31. 16 Nov 2017, 2 commits
  32. 10 Nov 2017, 1 commit