1. 03 May 2018 (2 commits)
    • bpf, x64: fix memleak when not converging on calls · 39f56ca9
      Committed by Daniel Borkmann
      The JIT logic in jit_subprogs() is as follows: for each subprog we
      allocate a bpf_prog via bpf_prog_alloc(), populate it (prog->is_func = 1
      here), and pass it to bpf_int_jit_compile(). If a failure occurs during
      JIT and prog->jited is not set, we bail out from attempting to JIT the
      whole program and punt to the interpreter instead. If JITing succeeds,
      we fix up BPF call offsets and do another pass through
      bpf_int_jit_compile() (extra_pass is true at that point) to complete
      JITing the calls. Since this requires passing JIT context around, addrs
      and jit_data from the x86 JIT are freed during the extra_pass in
      bpf_int_jit_compile() when calls are involved (if not, they can be
      freed immediately). However, if the JIT image did not converge in the
      original pass, we leak addrs and jit_data: image itself is NULL,
      prog->is_func is set and extra_pass is false in that case, so both
      become unreachable and are never cleaned up. Therefore we need to free
      them on !image as well (an illustrative sketch of the cleanup condition
      follows this entry). Only the x64 JIT is affected.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      39f56ca9
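
      An illustrative, self-contained model of the cleanup decision described
      in the entry above (the names image, is_func, extra_pass, addrs and
      jit_data follow the commit text; this is a sketch, not the kernel
      source):

      #include <stdbool.h>
      #include <stdlib.h>

      struct jit_state {
              int  *addrs;     /* per-instruction offsets */
              void *jit_data;  /* context kept between passes */
      };

      /* Decide whether the per-program JIT state can be released. Before the
       * fix, a NULL image with is_func set and extra_pass false kept the
       * state alive even though no later pass would ever free it. */
      static void maybe_free_state(struct jit_state *st, const void *image,
                                   bool is_func, bool extra_pass)
      {
              if (!image || !is_func || extra_pass) {
                      free(st->addrs);
                      free(st->jit_data);
                      st->addrs = NULL;
                      st->jit_data = NULL;
              }
              /* otherwise keep addrs/jit_data for the extra pass that
               * completes the BPF-to-BPF calls */
      }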
    • bpf, x64: fix memleak when not converging after image · 3aab8884
      Committed by Daniel Borkmann
      While reviewing the x64 JIT code, I noticed that we leak the previously
      allocated JIT image when proglen != oldproglen during the JIT passes.
      Prior to commit e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT
      compiler") we would just break out of the loop and use the image as the
      JITed prog, since it could only shrink in size anyway. After e0ee9c12,
      we instead bail out to the out_addrs label, where we free addrs and
      jit_data but not the image coming from bpf_jit_binary_alloc() (a sketch
      of the fixed error path follows this entry).
      
      Fixes: e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3aab8884
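
      A minimal model of the error path described in the entry above, with
      malloc()/free() standing in for the kernel's JIT image helpers
      (bpf_jit_binary_alloc() and its free counterpart); it illustrates the
      shape of the fix, not the kernel code:

      #include <stdlib.h>

      /* Returns the finished image, or NULL on failure. The addrs array is
       * owned by this function and released on every exit path. */
      static void *finish_jit(int *addrs, int proglen, int oldproglen)
      {
              void *image = malloc(proglen);   /* stand-in for the JIT image */

              if (!image)
                      goto out_addrs;

              if (proglen != oldproglen) {     /* passes did not converge */
                      free(image);             /* the fix: drop the image ... */
                      image = NULL;            /* ... before bailing out */
                      goto out_addrs;
              }
              free(addrs);                     /* success: image is kept */
              return image;

      out_addrs:
              free(addrs);
              return NULL;
      }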
  2. 25 April 2018 (1 commit)
    • bpf, x64: fix JIT emission for dead code · 1612a981
      Committed by Gianluca Borello
      Commit 2a5418a1 ("bpf: improve dead code sanitizing") replaced dead
      code with a series of ja-1 instructions, for safety. That made JIT
      compilation much more complex for some BPF programs. One example of
      such a program is:
      
      bool flag = false;
      ...
      /* A bunch of other code */
      ...
      if (flag)
              do_something();
      
      In some cases llvm is not able to remove the code for do_something()
      at compile time, so the generated BPF program ends up with a large
      number of dead instructions. In one specific real-life example, there
      are two series of ~500 and ~1000 dead instructions in the program.
      When the verifier replaces them with a series of ja-1 instructions, it
      causes an interesting behavior at JIT time.
      
      During the first pass, since all the instructions are estimated at 64
      bytes, the ja-1 instructions end up being translated as 5-byte JMP
      instructions (0xE9), since the jump offsets become increasingly large
      (> 127) as each instruction gets discovered to be 5 bytes instead of
      the estimated 64.
      
      Starting from the second pass, the first N instructions of the ja-1
      sequence get translated into 2-byte JMPs (0xEB) because the jump
      offsets become <= 127 this time. In particular, N is roughly
      127 / (5 - 2) ~= 42. So, each further pass makes the subsequent N JMP
      instructions shrink from 5 to 2 bytes, making the image shrink every
      time. This means that in order for the entire program to converge,
      there need to be, in the real example above, at least
      ~1000 / 42 ~= 24 passes just for translating the dead code. If we add
      this to the passes needed to translate the other, non-dead code, it
      brings such a program to 40+ passes, and the JIT doesn't complete.
      Ultimately the userspace loader fails because such a BPF program was
      supposed to be part of a prog array owner being JITed.
      
      While it is certainly possible to try to refactor such programs to
      help the compiler remove dead code, the behavior is not really
      intuitive and it puts further burden on the BPF developer, who is not
      expecting it. To make things worse, such programs worked just fine in
      all kernel releases prior to the ja-1 change.
      
      A possible approach to mitigate this behavior consists in noticing
      that for ja-1 instructions we don't really need to rely on the
      estimated sizes of the previous and current instructions: we know that
      a -1 BPF jump offset can always be safely translated into a 0xEB
      instruction with a jump offset of -2 (an illustrative sketch follows
      this entry).
      
      This fix brings the BPF program in the previous example back to
      completing in ~9 passes.
      
      Fixes: 2a5418a1 ("bpf: improve dead code sanitizing")
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      1612a981
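
      An illustrative emitter fragment for the entry above (a sketch, not
      the kernel JIT): a BPF JA with offset -1 only ever jumps back to
      itself, so it can always be emitted as the fixed two-byte form
      0xEB 0xFE (jmp rel8 with displacement -2), independent of how the
      surrounding instruction sizes are still being estimated:

      #include <stdint.h>
      #include <stddef.h>

      /* Emit an unconditional jump for a BPF JA instruction with the given
       * BPF-level offset; returns the number of x86 bytes written. */
      static size_t emit_ja(int16_t bpf_off, uint8_t *out)
      {
              if (bpf_off == -1) {
                      /* Jump to self: always 2 bytes, so its size never
                       * changes between passes. */
                      out[0] = 0xEB;           /* jmp rel8 */
                      out[1] = (uint8_t)-2;    /* displacement -2, i.e. 0xFE */
                      return 2;
              }
              /* General case (not shown): choose rel8 (0xEB) or rel32 (0xE9)
               * based on the current estimate of the target's distance. */
              return 0;
      }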
  3. 21 April 2018 (1 commit)
  4. 20 April 2018 (3 commits)
  5. 17 April 2018 (7 commits)
  6. 16 April 2018 (2 commits)
  7. 14 April 2018 (11 commits)
  8. 13 April 2018 (1 commit)
  9. 12 April 2018 (11 commits)
    • Revert "x86/asm: Allow again using asm.h when building for the 'bpf' clang target" · fd97d39b
      Committed by Arnaldo Carvalho de Melo
      This reverts commit ca26cffa.
      
      Newer clang versions accept the asm(_ASM_SP) construct, and now that
      the bpf-script-test-kbuild.c script, used in one of the 'perf test
      LLVM' subtests, no longer includes ptrace.h (which ended up pulling in
      arch/x86/include/asm/asm.h), we can revert this patch.
      Suggested-by: Yonghong Song <yhs@fb.com>
      Link: https://lkml.kernel.org/r/613f0a0d-c433-8f4d-dcc1-c9889deae39e@fb.com
      Acked-by: Yonghong Song <yhs@fb.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Dmitriy Vyukov <dvyukov@google.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wang Nan <wangnan0@huawei.com>
      Link: https://lkml.kernel.org/n/tip-nqozcv8loq40tkqpfw997993@git.kernel.org
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      fd97d39b
    • x86/pgtable: Don't set huge PUD/PMD on non-leaf entries · e3e28812
      Committed by Joerg Roedel
      The pmd_set_huge() and pud_set_huge() functions are used from
      the generic ioremap() code to establish large mappings where this
      is possible.
      
      But the generic ioremap() code does not check whether the PMD/PUD
      entries are already populated with a non-leaf entry, so any
      page-table pages those entries point to are lost.
      
      Further, on x86-32 with SHARED_KERNEL_PMD=0, this causes a BUG_ON()
      in vmalloc_sync_one() when PMD entries are synced from swapper_pg_dir
      to the current page table. This happens because the PMD entry from
      swapper_pg_dir was promoted to a huge-page entry while the current
      PGD still contains the non-leaf entry. Because both entries are
      present but point to different pages, the BUG_ON() triggers.
      
      This was actually triggered with pti-x32 enabled in a KVM
      virtual machine by the graphics driver.
      
      A real and better fix would be to improve the page-table handling in
      the generic ioremap() code, but that is out of scope for this patch
      set and left for later work (an illustrative sketch of the added
      guard follows this entry).
      Reported-by: David H. Gutteridge <dhgutteridge@sympatico.ca>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Laight <David.Laight@aculab.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eduardo Valentin <eduval@amazon.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Waiman Long <llong@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: aliguori@amazon.com
      Cc: daniel.gruss@iaik.tugraz.at
      Cc: hughd@google.com
      Cc: keescook@google.com
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180411152437.GC15462@8bytes.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e3e28812
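
      A self-contained model of the guard added by the entry above (the bit
      names are simplified stand-ins, not the kernel's PMD/PUD accessors):
      refuse to install a huge leaf over a slot that is already present and
      points to a lower-level table, since that table would otherwise be
      leaked and the promoted entry would disagree with other copies of it:

      #include <stdbool.h>
      #include <stdint.h>

      #define ENT_PRESENT  (1u << 0)
      #define ENT_HUGE     (1u << 7)   /* leaf / large-page bit */

      /* Return false when the entry already references a lower-level page
       * table; the caller then falls back to mapping with smaller pages. */
      static bool may_set_huge(uint32_t entry)
      {
              if ((entry & ENT_PRESENT) && !(entry & ENT_HUGE))
                      return false;
              return true;
      }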
    • x86/pti: Leave kernel text global for !PCID · 8c06c774
      Committed by Dave Hansen
      Global pages are bad for hardening because they potentially let an
      exploit read the kernel image via a Meltdown-style attack which
      makes it easier to find gadgets.
      
      But, global pages are good for performance because they reduce TLB
      misses when making user/kernel transitions, especially when PCIDs
      are not available, such as on older hardware, or where a hypervisor
      has disabled them for some reason.
      
      This patch implements a basic, sane policy: if you have PCIDs, you
      only map a minimal amount of kernel text global. If you do not have
      PCIDs, you map all kernel text global (a sketch of this policy
      follows this entry).
      
      This policy effectively makes PCIDs something that not only adds
      performance but a little bit of hardening as well.
      
      I ran a simple "lseek" microbenchmark[1] to test the benefit on
      a modern Atom microserver.  Most of the benefit comes from applying
      the series before this patch ("entry only"), but there is still a
      significant benefit from this patch.
      
        No Global Lines (baseline  ): 6077741 lseeks/sec
        88 Global Lines (entry only): 7528609 lseeks/sec (+23.9%)
        94 Global Lines (this patch): 8433111 lseeks/sec (+38.8%)
      
      [1.] https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205518.E3D989EB@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8c06c774
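
      A sketch of the policy described in the entry above, and nothing more
      (the helper name and the boolean are illustrative stand-ins, e.g. for
      a boot_cpu_has(X86_FEATURE_PCID) check):

      #include <stdbool.h>

      static bool have_pcid;   /* stand-in for the CPU-feature check */

      /* Should all kernel text be mapped with the Global bit? */
      static bool kernel_text_global_ok(void)
      {
              /* Without PCID, every user/kernel switch flushes the TLB, so
               * global kernel text buys back a lot of misses. */
              if (!have_pcid)
                      return true;
              /* With PCID the TLB cost is already low; keep only the minimal
               * entry/exit text global and retain the hardening benefit. */
              return false;
      }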
    • x86/pti: Never implicitly clear _PAGE_GLOBAL for kernel image · 39114b7a
      Committed by Dave Hansen
      Summary:
      
      In current kernels, with PTI enabled, no pages are marked Global. This
      potentially increases TLB misses.  But, the mechanism by which the Global
      bit is set and cleared is rather haphazard.  This patch makes the process
      more explicit.  In the end, it leaves us with Global entries in the page
      tables for the areas truly shared by userspace and kernel and increases
      TLB hit rates.
      
      The place this patch really shines is on systems without PCIDs. In
      this case, we are using an lseek microbenchmark[1] to see how a
      reasonably non-trivial syscall behaves. Higher is better:
      
        No Global pages (baseline): 6077741 lseeks/sec
        88 Global Pages (this set): 7528609 lseeks/sec (+23.9%)
      
      On a modern Skylake desktop with PCIDs, the benefits are tangible, but not
      huge for a kernel compile (lower is better):
      
        No Global pages (baseline): 186.951 seconds time elapsed  ( +-  0.35% )
        28 Global pages (this set): 185.756 seconds time elapsed  ( +-  0.09% )
                                     -1.195 seconds (-0.64%)
      
      I also re-checked everything using the lseek1 test[1]:
      
        No Global pages (baseline): 15783951 lseeks/sec
        28 Global pages (this set): 16054688 lseeks/sec
      			     +270737 lseeks/sec (+1.71%)
      
      The effect is more visible, but still modest.
      
      Details:
      
      The kernel page tables are inherited from head_64.S which rudely marks
      them as _PAGE_GLOBAL.  For PTI, we have been relying on the grace of
      $DEITY and some insane behavior in pageattr.c to clear _PAGE_GLOBAL.
      This patch tries to do better.
      
      First, stop filtering out "unsupported" bits from being cleared in the
      pageattr code.  It's fine to filter out *setting* these bits but it
      is insane to keep us from clearing them.
      
      Then, *explicitly* go clear _PAGE_GLOBAL from the kernel identity map.
      Do not rely on pageattr to do it magically (an illustrative sketch of
      the set-versus-clear filtering follows this entry).
      
      After this patch, we can see that "GLB" shows up in each copy of the
      page tables, that we have the same number of global entries in each
      and that they are the *same* entries.
      
        /sys/kernel/debug/page_tables/current_kernel:11
        /sys/kernel/debug/page_tables/current_user:11
        /sys/kernel/debug/page_tables/kernel:11
      
        9caae8ad6a1fb53aca2407ec037f612d  current_kernel.GLB
        9caae8ad6a1fb53aca2407ec037f612d  current_user.GLB
        9caae8ad6a1fb53aca2407ec037f612d  kernel.GLB
      
      A quick visual audit also shows that all the entries make sense.
      0xfffffe0000000000 is the cpu_entry_area and 0xffffffff81c00000
      is the entry/exit text:
      
        0xfffffe0000000000-0xfffffe0000002000           8K     ro                 GLB NX pte
        0xfffffe0000002000-0xfffffe0000003000           4K     RW                 GLB NX pte
        0xfffffe0000003000-0xfffffe0000006000          12K     ro                 GLB NX pte
        0xfffffe0000006000-0xfffffe0000007000           4K     ro                 GLB x  pte
        0xfffffe0000007000-0xfffffe000000d000          24K     RW                 GLB NX pte
        0xfffffe000002d000-0xfffffe000002e000           4K     ro                 GLB NX pte
        0xfffffe000002e000-0xfffffe000002f000           4K     RW                 GLB NX pte
        0xfffffe000002f000-0xfffffe0000032000          12K     ro                 GLB NX pte
        0xfffffe0000032000-0xfffffe0000033000           4K     ro                 GLB x  pte
        0xfffffe0000033000-0xfffffe0000039000          24K     RW                 GLB NX pte
        0xffffffff81c00000-0xffffffff81e00000           2M     ro         PSE     GLB x  pmd
      
      [1.] https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205517.C80FBE05@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      39114b7a
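
      An illustrative model of the two changes described in the entry above
      (simplified flags, not the kernel's pageattr code): bits being *set*
      are still filtered against the supported mask, bits being *cleared*
      are not, and _PAGE_GLOBAL is then cleared explicitly over the kernel
      image range:

      #include <stdint.h>
      #include <stddef.h>

      #define PTE_GLOBAL  (1ull << 8)

      static uint64_t supported_mask;   /* stand-in for __supported_pte_mask */

      /* Apply set/clear masks to one PTE value. */
      static uint64_t change_attr(uint64_t pte, uint64_t set, uint64_t clr)
      {
              set &= supported_mask;    /* setting unsupported bits is filtered */
              /* clr is intentionally NOT filtered: clearing is always safe */
              return (pte | set) & ~clr;
      }

      /* Explicitly drop the Global bit from every PTE of the kernel image. */
      static void clear_kernel_image_global(uint64_t *ptes, size_t nr)
      {
              for (size_t i = 0; i < nr; i++)
                      ptes[i] = change_attr(ptes[i], 0, PTE_GLOBAL);
      }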
    • x86/pti: Enable global pages for shared areas · 0f561fce
      Committed by Dave Hansen
      The entry/exit text and cpu_entry_area are mapped into userspace and
      the kernel.  But, they are not _PAGE_GLOBAL.  This creates unnecessary
      TLB misses.
      
      Add the _PAGE_GLOBAL flag for these areas.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205515.2977EE7D@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0f561fce
    • x86/mm: Do not forbid _PAGE_RW before init for __ro_after_init · 639d6aaf
      Committed by Dave Hansen
      __ro_after_init data gets stuck in the .rodata section.  That's normally
      fine because the kernel itself manages the R/W properties.
      
      But, if we run __change_page_attr() on an area which is
      __ro_after_init, the .rodata checks will trigger and force the area
      to be immediately read-only, even if it is early-ish in boot. This
      caused problems when trying to clear the _PAGE_GLOBAL bit for these
      areas in the PTI code: it cleared _PAGE_GLOBAL like I asked, but also
      took it upon itself to clear _PAGE_RW. The kernel then oopses the
      next time it writes to a __ro_after_init data structure.
      
      To fix this, add the kernel_set_to_readonly check, just like we have
      for kernel text, just a few lines below in this function (a sketch
      follows this entry).
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205514.8D898241@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      639d6aaf
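
      A simplified model of the check added by the entry above
      (kernel_set_to_readonly mirrors the flag named in the commit text;
      everything else is illustrative): _PAGE_RW is only forced off for
      .rodata once the kernel has actually been switched to read-only, so
      early attribute changes no longer strip RW as a side effect:

      #include <stdbool.h>
      #include <stdint.h>

      #define PTE_RW  (1ull << 1)

      static bool kernel_set_to_readonly;   /* set late in boot */

      /* Bits that must not be set on a .rodata mapping at this point. */
      static uint64_t rodata_forbidden_bits(void)
      {
              /* Before the fix this returned PTE_RW unconditionally, so an
               * early request to clear _PAGE_GLOBAL also made the area RO. */
              if (kernel_set_to_readonly)
                      return PTE_RW;
              return 0;
      }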
    • x86/mm: Comment _PAGE_GLOBAL mystery · 430d4005
      Committed by Dave Hansen
      I was mystified as to where the _PAGE_GLOBAL in the kernel page tables
      for kernel text came from.  I audited all the places I could find, but
      I missed one: head_64.S.
      
      The page tables that we create in here live for a long time, and they
      also have _PAGE_GLOBAL set, regardless of whether the processor
      supports it. It's harmless, and we got *lucky* that the pageattr code
      accidentally clears it when we wipe it out of __supported_pte_mask
      and then later try to mark kernel text read-only.
      
      Comment some of these properties to make it easier to find and
      understand in the future.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205513.079BB265@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      430d4005
    • x86/mm: Remove extra filtering in pageattr code · 1a54420a
      Committed by Dave Hansen
      The pageattr code has a mode where it can set or clear PTE bits in
      existing PTEs, so the page protections of the *new* PTEs come from
      one of two places:
      
        1. The set/clear masks: cpa->mask_clr / cpa->mask_set
        2. The existing PTE
      
      We filter ->mask_set/clr for supported PTE bits at entry to
      __change_page_attr() so we never need to filter them again.
      
      The only other place permissions can come from is an existing PTE,
      and those presumably already have good bits. We do not need to filter
      them again (an illustrative sketch of this invariant follows this
      entry).
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205511.BC072352@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1a54420a
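
      An illustrative model of the invariant described in the entry above
      (not the kernel code): the set mask is filtered once when the request
      is set up, and later combination with the bits of an existing PTE
      needs no further filtering:

      #include <stdint.h>

      static uint64_t supported_mask;   /* stand-in for __supported_pte_mask */

      struct cpa_request {
              uint64_t mask_set;
              uint64_t mask_clr;
      };

      /* Filter the masks exactly once, on entry. */
      static void cpa_init(struct cpa_request *cpa, uint64_t set, uint64_t clr)
      {
              cpa->mask_set = set & supported_mask;
              cpa->mask_clr = clr;
      }

      /* Existing PTE bits are presumed good, so no re-filtering here. */
      static uint64_t cpa_apply(const struct cpa_request *cpa, uint64_t old_pte)
      {
              return (old_pte & ~cpa->mask_clr) | cpa->mask_set;
      }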
    • x86/mm: Do not auto-massage page protections · fb43d6cb
      Committed by Dave Hansen
      A PTE is constructed from a physical address and a pgprotval_t.
      __PAGE_KERNEL, for instance, is a pgprot_t and must be converted
      into a pgprotval_t before it can be used to create a PTE.  This is
      done implicitly within functions like pfn_pte() by massage_pgprot().
      
      However, this makes it very challenging to set bits (and keep them
      set) if your bit is being filtered out by massage_pgprot().
      
      This moves the bit filtering out of pfn_pte() and friends. For users
      of PAGE_KERNEL*, filtering is still done automatically inside those
      macros, but users of __PAGE_KERNEL* need to do their own filtering
      now.
      
      Note that we also move pfn_pte/pmd/pud() over to check_pgprot()
      instead of massage_pgprot(). This way, we still *look* for
      unsupported bits and properly warn about them if we find them. This
      might happen if an unfiltered __PAGE_KERNEL* value was passed in, for
      instance (a sketch of the check-versus-massage split follows this
      entry).
      
      - printk format warning fix from: Arnd Bergmann <arnd@arndb.de>
      - boot crash fix from:            Tom Lendacky <thomas.lendacky@amd.com>
      - crash bisected by:              Mike Galbraith <efault@gmx.de>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reported-and-fixed-by: Arnd Bergmann <arnd@arndb.de>
      Fixed-by: Tom Lendacky <thomas.lendacky@amd.com>
      Bisected-by: Mike Galbraith <efault@gmx.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20180406205509.77E1D7F6@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      fb43d6cb
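
      A sketch of the split described in the entry above (plain integers
      instead of pgprot_t/pgprotval_t, and an example supported mask; not
      the kernel implementation): the silent filter stays available, while
      the checking variant used by pfn_pte() and friends warns when
      unsupported bits show up:

      #include <stdint.h>
      #include <stdio.h>

      static uint64_t supported_mask = 0x0fff;          /* example value only */

      /* Silent filter: drop unsupported bits without complaint. */
      static uint64_t massage(uint64_t prot)
      {
              return prot & supported_mask;
      }

      /* Checking filter: still drops the bits, but warns so that an
       * unfiltered __PAGE_KERNEL*-style value does not go unnoticed. */
      static uint64_t check(uint64_t prot)
      {
              uint64_t filtered = massage(prot);

              if (filtered != prot)
                      fprintf(stderr, "unsupported pgprot bits: %#llx\n",
                              (unsigned long long)(prot & ~supported_mask));
              return filtered;
      }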
    • xen, mm: allow deferred page initialization for xen pv domains · 6f84f8d1
      Committed by Pavel Tatashin
      Juergen Gross noticed that commit f7f99100 ("mm: stop zeroing memory
      during allocation in vmemmap") broke XEN PV domains when deferred struct
      page initialization is enabled.
      
      This is because Xen's PagePinned flag gets erased from struct pages
      when they are initialized later in boot.
      
      Juergen fixed this problem by disabling deferred pages on Xen PV
      domains. It is desirable, however, to have this feature available, as
      it reduces boot time. This fix re-enables the feature for PV domains
      and fixes the problem in the following way:
      
      The fix is to delay setting the PagePinned flag until struct pages
      for all allocated memory are initialized, i.e. until after
      free_all_bootmem().
      
      A new x86_init.hyper op, init_after_bootmem(), is called to let Xen
      know that the boot allocator is done and hence struct pages for all
      the allocated memory are now initialized. If deferred page
      initialization is enabled, the rest of the struct pages are going to
      be initialized later in boot, once page_alloc_init_late() is called.
      
      xen_after_bootmem() walks the page-table pages and marks them pinned
      (an illustrative sketch of this ordering follows this entry).
      
      Link: http://lkml.kernel.org/r/20180226160112.24724-2-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Tested-by: Juergen Gross <jgross@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Alok Kataria <akataria@vmware.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Jinbum Park <jinb.park7@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Jia Zhang <zhang.jia@linux.alibaba.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f84f8d1
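
      A self-contained model of the ordering fix described in the entry
      above (the hook and function names follow the commit text; the bodies
      are illustrative): pinning the page-table pages is deferred until the
      hook that runs once the boot allocator is done and struct pages exist
      for all allocated memory:

      #include <stdio.h>

      /* Stand-in for x86_init.hyper.init_after_bootmem. */
      static void (*init_after_bootmem)(void);

      static void xen_after_bootmem(void)
      {
              /* Safe now: struct pages for all allocated memory are
               * initialized, so the PagePinned marks will not be wiped by
               * later initialization. */
              printf("walking page-table pages and marking them pinned\n");
      }

      static void mem_init_late(void)
      {
              /* ... the free_all_bootmem() equivalent has finished here ... */
              if (init_after_bootmem)
                      init_after_bootmem();
      }

      int main(void)
      {
              init_after_bootmem = xen_after_bootmem;   /* PV domain hook */
              mem_init_late();
              return 0;
      }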
    • exec: pass stack rlimit into mm layout functions · 8f2af155
      Committed by Kees Cook
      Patch series "exec: Pin stack limit during exec".
      
      Attempts to solve problems with the stack limit changing during exec
      continue to be frustrated[1][2].  In addition to the specific issues
      around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
      other places during exec where the stack limit is used and is assumed to
      be unchanging.  Given the many places it gets used and the fact that it
      can be manipulated/raced via setrlimit() and prlimit(), I think the only
      way to handle this is to move away from the "current" view of the stack
      limit and instead attach it to the bprm, and plumb this down into the
      functions that need to know the stack limits.  This series implements
      the approach.
      
      [1] 04e35f44 ("exec: avoid RLIMIT_STACK races with prlimit()")
      [2] 779f4e1c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
      [3] to security@kernel.org, "Subject: existing rlimit races?"
      
      This patch (of 3):
      
      Since it is possible that the stack rlimit can change externally
      during exec (either via another thread calling setrlimit() or another
      process calling prlimit()), provide a way to pass the rlimit down
      into the per-architecture mm layout functions, so that the rlimit can
      stay in the bprm structure instead of sitting in the signal structure
      until exec is finalized (an illustrative sketch follows this entry).
      
      Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Cc: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f2af155
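
      A sketch of the plumbing only, for the entry above (the structure and
      function names here are simplified stand-ins for the kernel's bprm
      and arch_pick_mmap_layout() paths): the stack rlimit is snapshotted
      once at the start of exec and that copy is handed to the layout code,
      instead of re-reading a value another thread can change via
      setrlimit()/prlimit():

      #include <sys/resource.h>

      struct exec_state {
              struct rlimit rlim_stack;   /* pinned copy for this exec */
      };

      /* Snapshot RLIMIT_STACK once, before any layout decisions are made. */
      static int exec_begin(struct exec_state *st)
      {
              return getrlimit(RLIMIT_STACK, &st->rlim_stack);
      }

      /* Layout decisions consume the snapshot, not the live limit. */
      static unsigned long pick_mmap_base(const struct rlimit *rlim_stack)
      {
              unsigned long gap = rlim_stack->rlim_cur;

              return 0x7ffffffff000UL - gap;   /* illustrative arithmetic only */
      }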
  10. 11 April 2018 (1 commit)