1. 25 April 2018 (2 commits)
    • x86/pti: Fix boot warning from Global-bit setting · 58e65b51
      Authored by Dave Hansen
      commit 231df823c4f04176f607afc4576c989895cff40e
      
      The pageattr.c code attempts to process "faults" when it goes looking
      for PTEs to change and finds non-present entries.  It allows these
      faults in the linear map which is "expected to have holes", but
      WARN()s about them elsewhere, like when called on the kernel image.
      
      However, change_page_attr_clear() is now called on the kernel image in the
      process of trying to clear the Global bit.
      
      This trips the warning in __cpa_process_fault() if a non-present PTE is
      encountered in the kernel image.  The "holes" in the kernel image result
      from free_init_pages()'s use of set_memory_np().  These holes are totally
      fine, and result from normal operation, just as they would be in the kernel
      linear map.
      
      Just silence the warning when holes in the kernel image are encountered.
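
      A minimal sketch of the shape of the fix, assuming the within() helper
      and the _text/_end linker symbols used elsewhere in arch/x86 (the real
      patch may differ in detail): faults inside the kernel image are treated
      like faults in the linear map and consumed silently.

        /* In __cpa_process_fault(): holes in the linear map *or* the
         * kernel image are expected, so do not WARN() about them. */
        if (within(vaddr, PAGE_OFFSET,
                   PAGE_OFFSET + (max_pfn_mapped << PAGE_SHIFT)) ||
            within(vaddr, (unsigned long)_text, (unsigned long)_end)) {
                cpa->numpages = 1;
                cpa->pfn = __pa(vaddr) >> PAGE_SHIFT;
                return 0;
        }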
      
      Fixes: 39114b7a (x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image)
      Reported-by: Mariusz Ceier <mceier@gmail.com>
      Reported-by: Aaro Koskinen <aaro.koskinen@nokia.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Aaro Koskinen <aaro.koskinen@nokia.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180420222021.1C7D2B3F@viggo.jf.intel.com
      
    • x86/pti: Fix boot problems from Global-bit setting · d2479a30
      Authored by Dave Hansen
      commit 16dce603adc9de4237b7bf2ff5c5290f34373e7b
      
      Part of the global bit _setting_ patches also includes clearing the
      Global bit when it should not be enabled.  That is done with
      set_memory_nonglobal(), which uses change_page_attr_clear() in
      pageattr.c under the covers.
      
      The TLB flushing code inside pageattr.c has checks like
      BUG_ON(irqs_disabled()), looking for interrupt disabling that might
      cause deadlocks.  But these also trip in early boot on certain
      preempt configurations.  Just copy the existing BUG_ON() sequence from
      cpa_flush_range() to the other two sites and check for early boot.
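
      A sketch of the guarded check, assuming the existing
      early_boot_irqs_disabled flag that cpa_flush_range() already consults:
      disabled interrupts are only treated as a bug once early boot is over.

        /* Deadlock check, relaxed for early boot: */
        BUG_ON(irqs_disabled() && !early_boot_irqs_disabled);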
      
      Fixes: 39114b7a (x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image)
      Reported-by: Mariusz Ceier <mceier@gmail.com>
      Reported-by: Aaro Koskinen <aaro.koskinen@nokia.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Aaro Koskinen <aaro.koskinen@nokia.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: linux-mm@kvack.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180420222019.20C4A410@viggo.jf.intel.com
      
  2. 17 April 2018 (1 commit)
  3. 12 April 2018 (9 commits)
    • x86/pgtable: Don't set huge PUD/PMD on non-leaf entries · e3e28812
      Authored by Joerg Roedel
      The pmd_set_huge() and pud_set_huge() functions are used from
      the generic ioremap() code to establish large mappings where this
      is possible.
      
      But the generic ioremap() code does not check whether the
      PMD/PUD entries are already populated with a non-leaf entry,
      so that any page-table pages these entries point to will be
      lost.
      
      Further, on x86-32 with SHARED_KERNEL_PMD=0, this causes a
      BUG_ON() in vmalloc_sync_one() when PMD entries are synced
      from swapper_pg_dir to the current page-table. This happens
      because the PMD entry from swapper_pg_dir was promoted to a
      huge-page entry while the current PGD still contains the
      non-leaf entry. Because both entries are present and point
      to a different page, the BUG_ON() triggers.
      
      This was actually triggered with pti-x32 enabled in a KVM
      virtual machine by the graphics driver.
      
      A real and better fix for that would be to improve the
      page-table handling in the generic ioremap() code. But that is
      out-of-scope for this patch-set and left for later work.
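
      A sketch of the x86-side guard, assuming the existing pud_present()
      and pud_huge() helpers (the pmd variant is analogous): refuse to
      promote an entry that already points to a lower-level page table.

        /* In pud_set_huge() (pmd_set_huge() analogous): bail out if we
         * are on a populated non-leaf entry instead of overwriting it. */
        if (pud_present(*pud) && !pud_huge(*pud))
                return 0;
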
      Reported-by: David H. Gutteridge <dhgutteridge@sympatico.ca>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Waiman Long <llong@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180411152437.GC15462@8bytes.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/pti: Leave kernel text global for !PCID · 8c06c774
      Authored by Dave Hansen
      Global pages are bad for hardening because they potentially let an
      exploit read the kernel image via a Meltdown-style attack which
      makes it easier to find gadgets.
      
      But, global pages are good for performance because they reduce TLB
      misses when making user/kernel transitions, especially when PCIDs
      are not available, such as on older hardware, or where a hypervisor
      has disabled them for some reason.
      
      This patch implements a basic, sane policy: If you have PCIDs, you
      only map a minimal amount of kernel text global.  If you do not have
      PCIDs, you map all kernel text global.
      
      This policy effectively makes PCIDs something that not only adds
      performance but a little bit of hardening as well.
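
      A sketch of the policy check (helper name and details are illustrative,
      not the exact patch):

        /* With PCIDs, the TLB cost of non-global kernel text is low,
         * so prefer the hardening and keep the image non-global: */
        static inline bool pti_kernel_image_global_ok(void)
        {
                if (cpu_feature_enabled(X86_FEATURE_PCID))
                        return false;
                /* No PCIDs: map all kernel text global for performance. */
                return true;
        }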
      
      I ran a simple "lseek" microbenchmark[1] to test the benefit on
      a modern Atom microserver.  Most of the benefit comes from applying
      the series before this patch ("entry only"), but there is still a
      significant benefit from this patch.
      
        No Global Lines (baseline  ): 6077741 lseeks/sec
        88 Global Lines (entry only): 7528609 lseeks/sec (+23.9%)
        94 Global Lines (this patch): 8433111 lseeks/sec (+38.8%)
      
      [1.] https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205518.E3D989EB@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image · 39114b7a
      Authored by Dave Hansen
      Summary:
      
      In current kernels, with PTI enabled, no pages are marked Global. This
      potentially increases TLB misses.  But, the mechanism by which the Global
      bit is set and cleared is rather haphazard.  This patch makes the process
      more explicit.  In the end, it leaves us with Global entries in the page
      tables for the areas truly shared by userspace and kernel and increases
      TLB hit rates.
      
      The place this patch really shines is on systems without PCIDs.  In this
      case, we are using an lseek microbenchmark[1] to see how a reasonably
      non-trivial syscall behaves.  Higher is better:
      
        No Global pages (baseline): 6077741 lseeks/sec
        88 Global Pages (this set): 7528609 lseeks/sec (+23.9%)
      
      On a modern Skylake desktop with PCIDs, the benefits are tangible, but not
      huge for a kernel compile (lower is better):
      
        No Global pages (baseline): 186.951 seconds time elapsed  ( +-  0.35% )
        28 Global pages (this set): 185.756 seconds time elapsed  ( +-  0.09% )
                                     -1.195 seconds (-0.64%)
      
      I also re-checked everything using the lseek1 test[1]:
      
        No Global pages (baseline): 15783951 lseeks/sec
        28 Global pages (this set): 16054688 lseeks/sec
      			     +270737 lseeks/sec (+1.71%)
      
      The effect is more visible, but still modest.
      
      Details:
      
      The kernel page tables are inherited from head_64.S which rudely marks
      them as _PAGE_GLOBAL.  For PTI, we have been relying on the grace of
      $DEITY and some insane behavior in pageattr.c to clear _PAGE_GLOBAL.
      This patch tries to do better.
      
      First, stop filtering out "unsupported" bits from being cleared in the
      pageattr code.  It's fine to filter out *setting* these bits but it
      is insane to keep us from clearing them.
      
      Then, *explicitly* go clear _PAGE_GLOBAL from the kernel identity map.
      Do not rely on pageattr to do it magically.
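
      A sketch of what the explicit clearing might look like under PTI
      (call site and bounds are illustrative, not the exact patch):

        /* Strip _PAGE_GLOBAL from the kernel image mappings: */
        if (static_cpu_has(X86_FEATURE_PTI)) {
                unsigned long start = PFN_ALIGN(_text);
                unsigned long end   = (unsigned long)_end;

                set_memory_nonglobal(start, (end - start) >> PAGE_SHIFT);
        }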
      
      After this patch, we can see that "GLB" shows up in each copy of the
      page tables, that we have the same number of global entries in each
      and that they are the *same* entries.
      
        /sys/kernel/debug/page_tables/current_kernel:11
        /sys/kernel/debug/page_tables/current_user:11
        /sys/kernel/debug/page_tables/kernel:11
      
        9caae8ad6a1fb53aca2407ec037f612d  current_kernel.GLB
        9caae8ad6a1fb53aca2407ec037f612d  current_user.GLB
        9caae8ad6a1fb53aca2407ec037f612d  kernel.GLB
      
      A quick visual audit also shows that all the entries make sense.
      0xfffffe0000000000 is the cpu_entry_area and 0xffffffff81c00000
      is the entry/exit text:
      
        0xfffffe0000000000-0xfffffe0000002000           8K     ro                 GLB NX pte
        0xfffffe0000002000-0xfffffe0000003000           4K     RW                 GLB NX pte
        0xfffffe0000003000-0xfffffe0000006000          12K     ro                 GLB NX pte
        0xfffffe0000006000-0xfffffe0000007000           4K     ro                 GLB x  pte
        0xfffffe0000007000-0xfffffe000000d000          24K     RW                 GLB NX pte
        0xfffffe000002d000-0xfffffe000002e000           4K     ro                 GLB NX pte
        0xfffffe000002e000-0xfffffe000002f000           4K     RW                 GLB NX pte
        0xfffffe000002f000-0xfffffe0000032000          12K     ro                 GLB NX pte
        0xfffffe0000032000-0xfffffe0000033000           4K     ro                 GLB x  pte
        0xfffffe0000033000-0xfffffe0000039000          24K     RW                 GLB NX pte
        0xffffffff81c00000-0xffffffff81e00000           2M     ro         PSE     GLB x  pmd
      
      [1.] https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205517.C80FBE05@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/pti: Enable global pages for shared areas · 0f561fce
      Authored by Dave Hansen
      The entry/exit text and cpu_entry_area are mapped into userspace and
      the kernel.  But, they are not _PAGE_GLOBAL.  This creates unnecessary
      TLB misses.
      
      Add the _PAGE_GLOBAL flag for these areas.
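
      A sketch of the idea for the cpu_entry_area mappings, assuming the
      cea_set_pte() helper used there (exact flag plumbing may differ):

        /* Pages shared between user and kernel page tables can be
         * global; the bit is only honored when PGE is available. */
        if (boot_cpu_has(X86_FEATURE_PGE))
                pgprot_val(prot) |= _PAGE_GLOBAL;
        cea_set_pte(cea_vaddr, pa, prot);
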
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205515.2977EE7D@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Do not forbid _PAGE_RW before init for __ro_after_init · 639d6aaf
      Authored by Dave Hansen
      __ro_after_init data gets stuck in the .rodata section.  That's normally
      fine because the kernel itself manages the R/W properties.
      
      But, if we run __change_page_attr() on an area which is __ro_after_init,
      the .rodata checks will trigger and force the area to be immediately
      read-only, even if it is early-ish in boot.  This caused problems when
      trying to clear the _PAGE_GLOBAL bit for these areas in the PTI code:
      it cleared _PAGE_GLOBAL like I asked, but also took it upon itself
      to clear _PAGE_RW.  The kernel then oopses the next time it writes to
      a __ro_after_init data structure.
      
      To fix this, add the kernel_set_to_readonly check, just like we have
      for kernel text, just a few lines below in this function.
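
      A sketch of the check in static_protections(), mirroring the existing
      kernel-text case (helper names as used in pageattr.c; details assumed):

        /* Only force .rodata read-only once the kernel has actually
         * been switched to read-only: */
        if (kernel_set_to_readonly &&
            within(pfn, __pa_symbol(__start_rodata) >> PAGE_SHIFT,
                   __pa_symbol(__end_rodata) >> PAGE_SHIFT))
                pgprot_val(forbidden) |= _PAGE_RW;
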
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205514.8D898241@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Remove extra filtering in pageattr code · 1a54420a
      Authored by Dave Hansen
      The pageattr code has a mode where it can set or clear PTE bits in
      existing PTEs, so the page protections of the *new* PTEs come from
      one of two places:
      
        1. The set/clear masks: cpa->mask_clr / cpa->mask_set
        2. The existing PTE
      
      We filter ->mask_set/clr for supported PTE bits at entry to
      __change_page_attr() so we never need to filter them again.
      
      The only other place permissions can come from is an existing PTE
      and those already presumably have good bits.  We do not need to filter
      them again.
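
      A sketch of where the one remaining filter lives, assuming the
      canon_pgprot() helper (which masks against __supported_pte_mask):

        /* Filter the set-mask once, on entry to the pageattr code;
         * bits being *cleared* need no such filtering. */
        mask_set = canon_pgprot(mask_set);
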
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205511.BC072352@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Do not auto-massage page protections · fb43d6cb
      Authored by Dave Hansen
      A PTE is constructed from a physical address and a pgprotval_t.
      __PAGE_KERNEL, for instance, is a pgprot_t and must be converted
      into a pgprotval_t before it can be used to create a PTE.  This is
      done implicitly within functions like pfn_pte() by massage_pgprot().
      
      However, this makes it very challenging to set bits (and keep them
      set) if your bit is being filtered out by massage_pgprot().
      
      This moves the bit filtering out of pfn_pte() and friends.  For
      users of PAGE_KERNEL*, filtering will be done automatically inside
      those macros but for users of __PAGE_KERNEL*, they need to do their
      own filtering now.
      
      Note that we also just move pfn_pte/pmd/pud() over to check_pgprot()
      instead of massage_pgprot().  This way, we still *look* for
      unsupported bits and properly warn about them if we find them.  This
      might happen if an unfiltered __PAGE_KERNEL* value was passed in,
      for instance.
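
      A sketch of what check_pgprot() might look like, assuming the
      existing massage_pgprot() (the exact warning text is illustrative):

        static inline pgprotval_t check_pgprot(pgprot_t pgprot)
        {
                pgprotval_t massaged_val = massage_pgprot(pgprot);

                /* Still filter, but warn if the caller passed bits that
                 * __supported_pte_mask does not allow: */
                WARN_ONCE(pgprot_val(pgprot) != massaged_val,
                          "attempted to set unsupported pgprot: %016llx bits: %016llx supported: %016llx\n",
                          (u64)pgprot_val(pgprot),
                          (u64)pgprot_val(pgprot) ^ massaged_val,
                          (u64)__supported_pte_mask);

                return massaged_val;
        }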
      
      - printk format warning fix from: Arnd Bergmann <arnd@arndb.de>
      - boot crash fix from:            Tom Lendacky <thomas.lendacky@amd.com>
      - crash bisected by:              Mike Galbraith <efault@gmx.de>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reported-and-fixed-by: Arnd Bergmann <arnd@arndb.de>
      Fixed-by: Tom Lendacky <thomas.lendacky@amd.com>
      Bisected-by: Mike Galbraith <efault@gmx.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205509.77E1D7F6@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • xen, mm: allow deferred page initialization for xen pv domains · 6f84f8d1
      Authored by Pavel Tatashin
      Juergen Gross noticed that commit f7f99100 ("mm: stop zeroing memory
      during allocation in vmemmap") broke XEN PV domains when deferred struct
      page initialization is enabled.
      
      This is because the xen's PagePinned() flag is getting erased from
      struct pages when they are initialized later in boot.
      
      Juergen fixed this problem by disabling deferred pages on xen pv
      domains.  It is desirable, however, to have this feature available as it
      reduces boot time.  This fix re-enables the feature for PV domains, and
      fixes the problem the following way:
      
      The fix is to delay setting PagePinned flag until struct pages for all
      allocated memory are initialized, i.e.  until after free_all_bootmem().
      
      A new x86_init.hyper op init_after_bootmem() is called to let xen know
      that boot allocator is done, and hence struct pages for all the
      allocated memory are now initialized.  If deferred page initialization
      is enabled, the rest of struct pages are going to be initialized later
      in boot once page_alloc_init_late() is called.
      
      xen_after_bootmem() walks the page table's pages and marks them pinned.
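
      A sketch of the hook wiring, assuming the x86_init.hyper op table
      (default/no-op naming follows the existing x86_init conventions):

        /* Default is a no-op; Xen PV installs its own handler: */
        x86_init.hyper.init_after_bootmem = xen_after_bootmem;

        /* Called from mem_init(), once the boot allocator is done: */
        x86_init.hyper.init_after_bootmem();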
      
      Link: http://lkml.kernel.org/r/20180226160112.24724-2-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Tested-by: Juergen Gross <jgross@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Jinbum Park <jinb.park7@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Jia Zhang <zhang.jia@linux.alibaba.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exec: pass stack rlimit into mm layout functions · 8f2af155
      Authored by Kees Cook
      Patch series "exec: Pin stack limit during exec".
      
      Attempts to solve problems with the stack limit changing during exec
      continue to be frustrated[1][2].  In addition to the specific issues
      around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
      other places during exec where the stack limit is used and is assumed to
      be unchanging.  Given the many places it gets used and the fact that it
      can be manipulated/raced via setrlimit() and prlimit(), I think the only
      way to handle this is to move away from the "current" view of the stack
      limit and instead attach it to the bprm, and plumb this down into the
      functions that need to know the stack limits.  This series implements
      the approach.
      
      [1] 04e35f44 ("exec: avoid RLIMIT_STACK races with prlimit()")
      [2] 779f4e1c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
      [3] to security@kernel.org, "Subject: existing rlimit races?"
      
      This patch (of 3):
      
      Since it is possible that the stack rlimit can change externally during
      exec (either via another thread calling setrlimit() or another process
      calling prlimit()), provide a way to pass the rlimit down into the
      per-architecture mm layout functions so that the rlimit can stay in the
      bprm structure instead of sitting in the signal structure until exec is
      finalized.
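
      A sketch of the interface change this implies (illustrative; the full
      patch also threads the value through each architecture):

        /* Before: helpers re-read current's stack rlimit internally. */
        void arch_pick_mmap_layout(struct mm_struct *mm);

        /* After: the limit is passed in explicitly, so exec can hand
         * down a pinned copy from the bprm instead: */
        void arch_pick_mmap_layout(struct mm_struct *mm,
                                   struct rlimit *rlim_stack);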
      
      Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 10 April 2018 (3 commits)
    • x86/mm: Introduce "default" kernel PTE mask · 8a57f484
      Authored by Dave Hansen
      The __PAGE_KERNEL_* page permissions are "raw".  They contain bits
      that may or may not be supported on the current processor.  They need
      to be filtered by a mask (currently __supported_pte_mask) to turn them
      into a value that we can actually set in a PTE.
      
      These __PAGE_KERNEL_* values all contain _PAGE_GLOBAL.  But, with PTI,
      we want to be able to support _PAGE_GLOBAL (have the bit set in
      __supported_pte_mask) but not have it appear in any of these masks by
      default.
      
      This patch creates a new mask, __default_kernel_pte_mask, and applies
      it when creating all of the PAGE_KERNEL_* masks.  This makes
      PAGE_KERNEL_* safe to use anywhere (they only contain supported bits).
      It also ensures that PAGE_KERNEL_* contains _PAGE_GLOBAL on PTI=n
      kernels but clears _PAGE_GLOBAL when PTI=y.
      
      We also make __default_kernel_pte_mask a non-GPL exported symbol
      because there are plenty of driver-available interfaces that take
      PAGE_KERNEL_* permissions.
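
      A sketch of the mechanism, assuming the names from the description
      (the PTI hook point is illustrative):

        /* A second filter on top of __supported_pte_mask: */
        pteval_t __default_kernel_pte_mask __read_mostly = ~0;
        EXPORT_SYMBOL(__default_kernel_pte_mask);

        #define default_pgprot(x) __pgprot((x) & __default_kernel_pte_mask)
        #define PAGE_KERNEL       default_pgprot(__PAGE_KERNEL)

        /* PTI=y: _PAGE_GLOBAL stays supported, but is no longer default: */
        if (boot_cpu_has(X86_FEATURE_PTI))
                __default_kernel_pte_mask &= ~_PAGE_GLOBAL;
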
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205506.030DB6B6@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Undo double _PAGE_PSE clearing · 606c7193
      Authored by Dave Hansen
      When clearing _PAGE_PRESENT on a huge page, we need to be careful
      to also clear _PAGE_PSE, otherwise it might still get confused
      for a valid large page table entry.
      
      We do that near the spot where we *set* _PAGE_PSE.  That's fine,
      but it's unnecessary.  pgprot_large_2_4k() already did it.
      
      BTW, I also noticed that pgprot_large_2_4k() and
      pgprot_4k_2_large() are not symmetric.  pgprot_large_2_4k() clears
      _PAGE_PSE (because it is aliased to _PAGE_PAT) but
      pgprot_4k_2_large() does not put _PAGE_PSE back.  Bummer.
      
      Also, add some comments and change "promote" to "move".  "Promote"
      seems an odd word to use when we are logically moving a bit to a
      lower bit position.  Also add an extra blank line to make it clear
      which line the comment applies to.
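
      A sketch of the symmetric conversion, assuming the _PAGE_BIT_*
      definitions (PAT sits in _PAGE_PSE's bit position in 4k entries and at
      a higher bit position in large-page entries):

        static inline pgprot_t pgprot_4k_2_large(pgprot_t pgprot)
        {
                pgprotval_t val = pgprot_val(pgprot);
                pgprot_t new;

                /* Move the PAT bit up to its large-page position: */
                pgprot_val(new) = (val & ~(_PAGE_PAT | _PAGE_PAT_LARGE)) |
                        ((val & _PAGE_PAT) << (_PAGE_BIT_PAT_LARGE - _PAGE_BIT_PAT));
                return new;
        }
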
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205504.9B0F44A9@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm: Factor out pageattr _PAGE_GLOBAL setting · d1440b23
      Authored by Dave Hansen
      The pageattr code has a pattern repeated where it sets _PAGE_GLOBAL
      for present PTEs but clears it for non-present PTEs.  The intention
      is to keep _PAGE_GLOBAL from getting confused with _PAGE_PROTNONE,
      since _PAGE_GLOBAL is for present PTEs and _PAGE_PROTNONE is for
      non-present ones.
      
      But, this pattern makes no sense.  Effectively, it says, if you use
      the pageattr code, always set _PAGE_GLOBAL when _PAGE_PRESENT.
      canon_pgprot() will clear it if unsupported (because it masks the
      value with __supported_pte_mask) but we *always* set it. Even if
      canon_pgprot() did not filter _PAGE_GLOBAL, it would be OK.
      _PAGE_GLOBAL is ignored when CR4.PGE=0 by the hardware.
      
      This unconditional setting of _PAGE_GLOBAL is a problem when we have
      PTI and non-PTI and we want some areas to have _PAGE_GLOBAL and some
      not.
      
      This updated version of the code says (see the sketch below):
      1. Clear _PAGE_GLOBAL when !_PAGE_PRESENT
      2. Never set _PAGE_GLOBAL implicitly
      3. Allow _PAGE_GLOBAL to be in cpa.set_mask
      4. Allow _PAGE_GLOBAL to be inherited from previous PTE
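
      A sketch of the factored-out helper implementing rules 1 and 2,
      assuming a name like pgprot_clear_protnone_bits() (illustrative):

        static pgprot_t pgprot_clear_protnone_bits(pgprot_t prot)
        {
                /*
                 * _PAGE_GLOBAL on a non-present PTE would be read as
                 * _PAGE_PROTNONE, so clear it; never set it implicitly.
                 */
                if (!(pgprot_val(prot) & _PAGE_PRESENT))
                        pgprot_val(prot) &= ~_PAGE_GLOBAL;

                return prot;
        }
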
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205502.86E199DA@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 06 April 2018 (1 commit)
    • x86/mm/memory_hotplug: determine block size based on the end of boot memory · 078eb6aa
      Authored by Pavel Tatashin
      Memory sections are combined into "memory block" chunks.  These chunks
      are the units upon which memory can be added and removed.
      
      On x86, new memory may be added after the end of boot memory.
      Therefore, if the block size does not align with the end of boot
      memory, memory hot-plugging/hot-removing can be broken.
      
      Currently, whenever a machine is booted with more than 64G, the block
      size is unconditionally increased to 2G from the base 128M.  This is
      done in order to reduce the number of memory device files in sysfs:
      
      	/sys/devices/system/memory/memoryXXX
      
      We must use the largest allowed block size that aligns to the next
      address to be able to hotplug the next block of memory.
      
      So, when memory is larger or equal to 64G, we check the end address and
      find the largest block size that is still power of two but smaller or
      equal to 2G.
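
      A sketch of the selection loop, assuming max_pfn marks the end of boot
      memory and the 128M/2G constants from the description:

        static unsigned long probe_memory_block_size(void)
        {
                unsigned long boot_mem_end = max_pfn << PAGE_SHIFT;
                unsigned long bz;

                /* Largest power-of-2 block <= 2G that the end of boot
                 * memory is aligned to: */
                for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1)
                        if (IS_ALIGNED(boot_mem_end, bz))
                                break;

                return bz;
        }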
      
      Before the fix:
      Run qemu with:
      -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
      
      (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
      Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
      							size 0x80000000
      acpi PNP0C80:00: add_memory failed
      acpi PNP0C80:00: acpi_memory_enable_device() error
      acpi PNP0C80:00: Enumeration failure
      
      With the fix, memory is added successfully because the block size is set
      to 1G, which aligns with the start address 0x1040000000.
      
      [pasha.tatashin@oracle.com: v4]
        Link: http://lkml.kernel.org/r/20180215165920.8570-3-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20180213193159.14606-3-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 27 March 2018 (1 commit)
  7. 23 March 2018 (2 commits)
    • x86/mm: implement free pmd/pte page interfaces · 28ee90fe
      Authored by Toshi Kani
      Implement pud_free_pmd_page() and pmd_free_pte_page() on x86, which
      clear a given pud/pmd entry and free up lower level page table(s).
      
      The address range associated with the pud/pmd entry must have been
      purged by INVLPG.
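
      A sketch of the pmd-level variant, assuming the caller has already
      flushed the TLB for the range (the pud variant additionally frees each
      pte page underneath):

        int pmd_free_pte_page(pmd_t *pmd)
        {
                pte_t *pte;

                if (pmd_none(*pmd))
                        return 1;

                pte = (pte_t *)pmd_page_vaddr(*pmd);
                pmd_clear(pmd);
                free_page((unsigned long)pte);

                return 1;
        }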
      
      Link: http://lkml.kernel.org/r/20180314180155.19492-3-toshi.kani@hpe.com
      Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings")
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Reported-by: Lei Li <lious.lilei@hisilicon.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc: add interfaces to free unmapped page table · b6bdb751
      Authored by Toshi Kani
      On architectures with CONFIG_HAVE_ARCH_HUGE_VMAP set, ioremap() may
      create pud/pmd mappings.  A kernel panic was observed on arm64 systems
      with Cortex-A75 in the following steps as described by Hanjun Guo.
      
       1. ioremap a 4K size; a valid page table will be built,
       2. iounmap it; pte0 will be set to 0;
       3. ioremap the same address with 2M size; pgd/pmd is unchanged,
          then set a new value for the pmd;
       4. pte0 is leaked;
       5. the CPU may hit an exception because the old pmd is still in the
          TLB, which will lead to a kernel panic.
      
      This panic is not reproducible on x86.  INVLPG, called from iounmap,
      purges all levels of entries associated with the purged address on x86.
      x86 still has a memory leak, however.
      
      The patch changes the ioremap path to free unmapped page table(s) since
      doing so in the unmap path has the following issues:
      
       - The iounmap() path is shared with vunmap(). Since vmap() only
         supports pte mappings, making vunmap() to free a pte page is an
         overhead for regular vmap users as they do not need a pte page freed
         up.
      
       - Checking if all entries in a pte page are cleared in the unmap path
         is racy, and serializing this check is expensive.
      
       - The unmap path calls free_vmap_area_noflush() to do lazy TLB purges.
         Clearing a pud/pmd entry before the lazy TLB purges needs extra TLB
         purge.
      
      Add two interfaces, pud_free_pmd_page() and pmd_free_pte_page(), which
      clear a given pud/pmd entry and free up a page for the lower level
      entries.
      
      This patch implements their stub functions on x86 and arm64, which
      work as a workaround.
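
      A sketch of how the ioremap path might use the new interface (shown
      for the pmd level; names follow lib/ioremap.c conventions):

        /* Only try a huge mapping once any stale pte page under this
         * pmd has been cleared and freed: */
        if (ioremap_pmd_enabled() &&
            ((next - addr) == PMD_SIZE) &&
            IS_ALIGNED(phys_addr + addr, PMD_SIZE) &&
            pmd_free_pte_page(pmd)) {
                if (pmd_set_huge(pmd, phys_addr + addr, prot))
                        continue;
        }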
      
      [akpm@linux-foundation.org: fix typo in pmd_free_pte_page() stub]
      Link: http://lkml.kernel.org/r/20180314180155.19492-2-toshi.kani@hpe.com
      Fixes: e61ce6ad ("mm: change ioremap to set up huge I/O mappings")
      Reported-by: Lei Li <lious.lilei@hisilicon.com>
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Wang Xuefeng <wxf.wang@hisilicon.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Chintan Pandya <cpandya@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 20 March 2018 (5 commits)
  9. 15 March 2018 (3 commits)
    • x86/mm: Remove pointless checks in vmalloc_fault · 565977a3
      Authored by Toshi Kani
      vmalloc_fault() sets the user's pgd or p4d from the kernel page table.
      Once it is set, all tables underneath are identical.  There is no point
      in following the same page table with two separate pointers and making
      sure they see the same thing with BUG().
      
      Remove the pointless checks in vmalloc_fault().  Also rename the kernel
      pgd/p4d pointers to pgd_k/p4d_k so that their names are consistent in
      the file.
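
      A sketch of the resulting sync logic (simplified; the 4-level case
      folds the pgd into the p4d):

        pgd   = (pgd_t *)__va(read_cr3_pa()) + pgd_index(address);
        pgd_k = pgd_offset_k(address);

        if (pgd_none(*pgd_k))
                return -1;

        /* Copy the kernel entry; everything below is then shared: */
        if (pgd_none(*pgd))
                set_pgd(pgd, *pgd_k);
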
      Suggested-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Gratian Crisan <gratian.crisan@ni.com>
      Link: https://lkml.kernel.org/r/20180314205932.7193-1-toshi.kani@hpe.com
    • x86, memremap: fix altmap accounting at free · a7e6c701
      Authored by Dan Williams
      Commit 24b6d416 "mm: pass the vmem_altmap to vmemmap_free" converted
      the vmemmap_free() path to pass the altmap argument all the way through
      the call chain rather than looking it up based on the page.
      Unfortunately, that ends up over-freeing altmap-allocated pages in some
      cases since free_pagetable() is used to free both memmap space and pte
      space, where only the memmap stored in huge pages uses altmap
      allocations.
      
      Given that altmap allocations for memmap space are special-cased in
      vmemmap_populate_hugepages(), add a symmetric special case,
      free_hugepage_table(), to handle altmap freeing, and clean up the
      unneeded passing of altmap to leaf functions that do not require it.
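
      A sketch of the helper, assuming the existing free_pagetable() and
      vmem_altmap_free() primitives:

        static void free_hugepage_table(struct page *page,
                                        struct vmem_altmap *altmap)
        {
                /* Only huge-page memmap may come from the altmap: */
                if (altmap)
                        vmem_altmap_free(altmap, PMD_SIZE / PAGE_SIZE);
                else
                        free_pagetable(page, get_order(PMD_SIZE));
        }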
      
      Without this change the sanity check accounting in
      devm_memremap_pages_release() will throw a warning with the following
      signature.
      
       nd_pmem pfn10.1: devm_memremap_pages_release: failed to free all reserved pages
       WARNING: CPU: 44 PID: 3539 at kernel/memremap.c:310 devm_memremap_pages_release+0x1c7/0x220
       CPU: 44 PID: 3539 Comm: ndctl Tainted: G             L   4.16.0-rc1-linux-stable #7
       RIP: 0010:devm_memremap_pages_release+0x1c7/0x220
       [..]
       Call Trace:
        release_nodes+0x225/0x270
        device_release_driver_internal+0x15d/0x210
        bus_remove_device+0xe2/0x160
        device_del+0x130/0x310
        ? klist_release+0x56/0x100
        ? nd_region_notify+0xc0/0xc0 [libnvdimm]
        device_unregister+0x16/0x60
      
      This was missed in testing since not all configurations will trigger
      this warning.
      
      Fixes: 24b6d416 ("mm: pass the vmem_altmap to vmemmap_free")
      Reported-by: Jane Chu <jane.chu@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • x86/mm: Fix vmalloc_fault to use pXd_large · 18a95521
      Authored by Toshi Kani
      Gratian Crisan reported that vmalloc_fault() crashes when CONFIG_HUGETLBFS
      is not set, since the function inadvertently uses pXd_huge(), which always
      returns 0 in this case.  ioremap() does not depend on CONFIG_HUGETLBFS.
      
      Fix vmalloc_fault() to call pXd_large() instead.
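
      A sketch of the change, assuming the pmd_large()/pud_large() helpers,
      which test the PSE bit directly and do not depend on CONFIG_HUGETLBFS:

        /* Was pmd_huge(*pmd_k), which is always 0 with HUGETLBFS=n: */
        if (pmd_large(*pmd_k))
                return 0;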
      
      Fixes: f4eafd8b ("x86/mm: Fix vmalloc_fault() to handle large pages properly")
      Reported-by: Gratian Crisan <gratian.crisan@ni.com>
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Link: https://lkml.kernel.org/r/20180313170347.3829-2-toshi.kani@hpe.com
  10. 12 March 2018 (1 commit)
  11. 08 March 2018 (2 commits)
  12. 01 March 2018 (1 commit)
    • x86/cpu_entry_area: Sync cpu_entry_area to initial_page_table · 945fd17a
      Authored by Thomas Gleixner
      The separation of the cpu_entry_area from the fixmap missed the fact that
      on 32bit non-PAE kernels the cpu_entry_area mapping might not be covered in
      initial_page_table by the previous synchronizations.
      
      This results in suspend/resume failures because 32bit utilizes the
      initial page table for resume.  The absence of the cpu_entry_area
      mapping results in a triple fault, aka. insta reboot.
      
      With PAE enabled this works by chance because the PGD entry which covers
      the fixmap and other parts incidentally provides the cpu_entry_area
      mapping as well.
      
      Synchronize the initial page table after setting up the cpu entry
      area. Instead of adding yet another copy of the same code, move it to a
      function and invoke it from the various places.
      
      It needs to be investigated if the existing calls in setup_arch() and
      setup_per_cpu_areas() can be replaced by the later invocation from
      setup_cpu_entry_areas(), but that's beyond the scope of this fix.
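
      A sketch of the factored-out helper, assuming clone_pgd_range() and
      the KERNEL_PGD_* constants (the !PAE case also syncs the identity
      mapping; omitted here):

        void __init sync_initial_page_table(void)
        {
                /* Copy kernel PGD entries, including the cpu_entry_area: */
                clone_pgd_range(initial_page_table + KERNEL_PGD_BOUNDARY,
                                swapper_pg_dir + KERNEL_PGD_BOUNDARY,
                                KERNEL_PGD_PTRS);
        }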
      
      Fixes: 92a0f81d ("x86/cpu_entry_area: Move it out of the fixmap")
      Reported-by: Woody Suwalski <terraluna977@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Woody Suwalski <terraluna977@gmail.com>
      Cc: William Grant <william.grant@canonical.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1802282137290.1392@nanos.tec.linutronix.de
  13. 28 February 2018 (1 commit)
  14. 26 February 2018 (1 commit)
    • x86/mm: Consider effective protection attributes in W+X check · 672c0ae0
      Authored by Jan Beulich
      Using just the leaf page table entry flags would cause a false warning
      in case _PAGE_RW is clear or _PAGE_NX is set in a higher level entry.
      Hand through both the current entry's flags as well as the accumulated
      effective value (the latter as pgprotval_t instead of pgprot_t, as it's
      not an actual entry's value).
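
      A sketch of the accumulation rule, assuming a helper along these lines
      (name illustrative): writability must be granted at every level, while
      NX at any level makes the range non-executable.

        static inline pgprotval_t effective_prot(pgprotval_t prot1,
                                                 pgprotval_t prot2)
        {
                return (prot1 & prot2 & (_PAGE_USER | _PAGE_RW)) |
                       ((prot1 | prot2) & _PAGE_NX);
        }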
      
      This in particular eliminates the false W+X warning when running under
      Xen, as commit:
      
        2cc42bac ("x86-64/Xen: eliminate W+X mappings")
      
      had to make the necessary adjustment in L2 rather than L1 (the reason is
      explained there). I.e. _PAGE_RW is clear there in L1, but _PAGE_NX is
      set in L2.
      Signed-off-by: Jan Beulich <jbeulich@suse.com>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/5A8FDE8902000078001AABBB@prv-mh.provo.novell.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  15. 21 February 2018 (2 commits)
    • x86/mm: Optimize boot-time paging mode switching cost · 39b95522
      Authored by Kirill A. Shutemov
      By this point we have functioning boot-time switching between 4- and
      5-level paging modes.  But the naive approach comes at a cost.
      
      Numbers below are for kernel build, allmodconfig, 5 times.
      
      CONFIG_X86_5LEVEL=n:
      
       Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):
      
         17308719.892691      task-clock:u (msec)       #   26.772 CPUs utilized            ( +-  0.11% )
                       0      context-switches:u        #    0.000 K/sec
                       0      cpu-migrations:u          #    0.000 K/sec
             331,993,164      page-faults:u             #    0.019 M/sec                    ( +-  0.01% )
      43,614,978,867,455      cycles:u                  #    2.520 GHz                      ( +-  0.01% )
      39,371,534,575,126      stalled-cycles-frontend:u #   90.27% frontend cycles idle     ( +-  0.09% )
      28,363,350,152,428      instructions:u            #    0.65  insn per cycle
                                                        #    1.39  stalled cycles per insn  ( +-  0.00% )
       6,316,784,066,413      branches:u                #  364.948 M/sec                    ( +-  0.00% )
         250,808,144,781      branch-misses:u           #    3.97% of all branches          ( +-  0.01% )
      
           646.531974142 seconds time elapsed                                          ( +-  1.15% )
      
      CONFIG_X86_5LEVEL=y:
      
       Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):
      
         17411536.780625      task-clock:u (msec)       #   26.426 CPUs utilized            ( +-  0.10% )
                       0      context-switches:u        #    0.000 K/sec
                       0      cpu-migrations:u          #    0.000 K/sec
             331,868,663      page-faults:u             #    0.019 M/sec                    ( +-  0.01% )
      43,865,909,056,301      cycles:u                  #    2.519 GHz                      ( +-  0.01% )
      39,740,130,365,581      stalled-cycles-frontend:u #   90.59% frontend cycles idle     ( +-  0.05% )
      28,363,358,997,959      instructions:u            #    0.65  insn per cycle
                                                        #    1.40  stalled cycles per insn  ( +-  0.00% )
       6,316,784,937,460      branches:u                #  362.793 M/sec                    ( +-  0.00% )
         251,531,919,485      branch-misses:u           #    3.98% of all branches          ( +-  0.00% )
      
           658.886307752 seconds time elapsed                                          ( +-  0.92% )
      
      The patch tries to fix the performance regression by using
      cpu_feature_enabled(X86_FEATURE_LA57) instead of pgtable_l5_enabled in
      all hot code paths.  This statically patches the target code for
      additional performance.
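
      An illustrative sketch of the idea: cpu_feature_enabled() compiles to
      a branch that the alternatives framework patches at boot, so hot paths
      pay no memory load for the 4-vs-5-level decision.

        /* Hot-path check (illustrative shape): */
        static inline bool pgtable_l5(void)
        {
                return cpu_feature_enabled(X86_FEATURE_LA57);
        }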
      
      CONFIG_X86_5LEVEL=y + the patch:
      
       Performance counter stats for 'sh -c make -j100 -B -k >/dev/null' (5 runs):
      
         17381990.268506      task-clock:u (msec)       #   26.907 CPUs utilized            ( +-  0.19% )
                       0      context-switches:u        #    0.000 K/sec
                       0      cpu-migrations:u          #    0.000 K/sec
             331,862,625      page-faults:u             #    0.019 M/sec                    ( +-  0.01% )
      43,697,726,320,051      cycles:u                  #    2.514 GHz                      ( +-  0.03% )
      39,480,408,690,401      stalled-cycles-frontend:u #   90.35% frontend cycles idle     ( +-  0.05% )
      28,363,394,221,388      instructions:u            #    0.65  insn per cycle
                                                        #    1.39  stalled cycles per insn  ( +-  0.00% )
       6,316,794,985,573      branches:u                #  363.410 M/sec                    ( +-  0.00% )
         251,013,232,547      branch-misses:u           #    3.97% of all branches          ( +-  0.01% )
      
           645.991174661 seconds time elapsed                                          ( +-  1.19% )
      
      Unfortunately, this approach doesn't help with text size:
      
        vmlinux.before .text size:	8190319
        vmlinux.after .text size:	8200623
      
      The .text section is increased by about 10k. Not sure if we can do
      anything about this.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180216114948.68868-4-kirill.shutemov@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • x86/mm/sme, objtool: Annotate indirect call in sme_encrypt_execute() · 531bb52a
      Authored by Peter Zijlstra
      This is boot code and thus Spectre-safe: we run this _way_ before userspace
      comes along to have a chance to poison our branch predictor.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  16. 20 February 2018 (1 commit)
  17. 16 February 2018 (3 commits)
  18. 15 February 2018 (1 commit)