1. 15 Jul 2017, 1 commit
  2. 11 Jul 2017, 3 commits
  3. 07 Jul 2017, 1 commit
  4. 22 Jun 2017, 2 commits
  5. 21 Jun 2017, 1 commit
    • ARM: 8683/1: ARM32: Support mremap() for sigpage/vDSO · 280e87e9
      Authored by Dmitry Safonov
      CRIU restores application mappings at the same place where they
      were before checkpointing. That means we need to move the vDSO
      and sigpage during restore to exactly the same place they
      occupied before C/R.
      
      Make the mremap() code update the mm->context.{sigpage,vdso} pointers
      during a VMA move. The sigpage is used for landing after handling
      a signal - if the pointer is not updated during the move, the
      application might crash on any signal after mremap().
      
      The vDSO pointer on ARM32 is currently used only for setting auxv;
      update it during mremap() anyway in case of future use.
      
      Without those updates, CRIU on ARM32 is currently not reliable.
      Historically, we fail checkpointing if we find a vDSO page on ARM32
      and suggest that the user disable CONFIG_VDSO.
      But that's not correct - it comes from x86, where signal handling
      finishes in the vDSO blob. For arm32 that role belongs to the sigpage,
      which is not disabled by `CONFIG_VDSO=n'.
      
      It looks like C/R has been working by luck, because userspace on ARM32
      currently always sets SA_RESTORER.
      Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
      Acked-by: Andy Lutomirski <luto@amacapital.net>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Christopher Covington <cov@codeaurora.org>
      Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
      280e87e9
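      A minimal sketch of the hook this change describes, assuming the
      vm_special_mapping .mremap callback interface (the function name and
      exact field are illustrative, not a quote of the patch):

        /* Called when the special mapping is moved by mremap(). */
        static int sigpage_mremap(const struct vm_special_mapping *sm,
                                  struct vm_area_struct *new_vma)
        {
                /* Keep the signal-return landing address in sync. */
                current->mm->context.sigpage = new_vma->vm_start;
                return 0;
        }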
  6. 19 Jun 2017, 1 commit
    • mm: larger stack guard gap, between vmas · 1be7107f
      Authored by Hugh Dickins
      The stack guard page is a useful feature to reduce the risk of the stack
      smashing into a different mapping. We have been using a single page gap,
      which is sufficient to prevent the stack from being adjacent to a
      different mapping. But this seems to be insufficient in light of actual
      stack usage in userspace. E.g. glibc uses alloca() allocations as large
      as 64kB in many commonly used functions. Others use constructs like
      gid_t buffer[NGROUPS_MAX], which is 256kB, or stack strings sized by
      MAX_ARG_STRLEN.
      
      This becomes especially dangerous for suid binaries with the default
      unlimited stack size limit, because those applications can be tricked
      into consuming a large portion of the stack, and a single glibc call
      could then jump over the guard page. These attacks are not theoretical,
      unfortunately.
      
      Make those attacks less probable by increasing the stack guard gap
      to 1MB (on systems with 4k pages; but make it depend on the page size
      because systems with larger base pages might cap stack allocations in
      the PAGE_SIZE units) which should cover larger alloca() and VLA stack
      allocations. It is obviously not a full fix because the problem is
      somewhat inherent, but it should reduce the attack surface a lot.
      
      One could argue that the gap size should be configurable from userspace,
      but that can be done later when somebody finds that the new 1MB is wrong
      for some special case applications.  For now, add a kernel command line
      option (stack_guard_gap) to specify the stack gap size (in page units).
      
      Implementation wise, first delete all the old code for stack guard page:
      because although we could get away with accounting one extra page in a
      stack vma, accounting a larger gap can break userspace - case in point,
      a program run with "ulimit -S -v 20000" failed when the 1MB gap was
      counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
      and strict non-overcommit mode.
      
      Instead of keeping gap inside the stack vma, maintain the stack guard
      gap as a gap between vmas: using vm_start_gap() in place of vm_start
      (or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
      places which need to respect the gap - mainly arch_get_unmapped_area(),
      and the vma tree's subtree_gap support for that.
      Original-patch-by: Oleg Nesterov <oleg@redhat.com>
      Original-patch-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Helge Deller <deller@gmx.de> # parisc
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1be7107f
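      A sketch of the vm_start_gap() idea described above, assuming a global
      stack_guard_gap holding the gap size in bytes (this paraphrases the
      helper rather than quoting the patch):

        static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
        {
                unsigned long vm_start = vma->vm_start;

                if (vma->vm_flags & VM_GROWSDOWN) {
                        vm_start -= stack_guard_gap;
                        if (vm_start > vma->vm_start)   /* underflow: clamp */
                                vm_start = 0;
                }
                return vm_start;
        }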
  7. 04 May 2017, 1 commit
  8. 25 Feb 2017, 5 commits
  9. 23 Feb 2017, 1 commit
    • powerpc: do not make the entire heap executable · 16e72e9b
      Authored by Denys Vlasenko
      On 32-bit powerpc the ELF PLT sections of binaries (built with
      --bss-plt, or with a toolchain which defaults to it) look like this:
      
        [17] .sbss             NOBITS          0002aff8 01aff8 000014 00  WA  0   0  4
        [18] .plt              NOBITS          0002b00c 01aff8 000084 00 WAX  0   0  4
        [19] .bss              NOBITS          0002b090 01aff8 0000a4 00  WA  0   0  4
      
      Which results in an ELF load header:
      
        Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
        LOAD           0x019c70 0x00029c70 0x00029c70 0x01388 0x014c4 RWE 0x10000
      
      This is all correct: the load region containing the PLT is marked as
      executable.  Note that the PLT starts at 0002b00c but the file mapping
      ends at 0002aff8, so the PLT falls in the 0 fill section described by
      the load header, and after a page boundary.
      
      Unfortunately the generic ELF loader ignores the X bit in the load
      headers when it creates the 0 filled non-file backed mappings.  It
      assumes all of these mappings are RW BSS sections, which is not the case
      for PPC.
      
      gcc/ld has an option (--secure-plt) to not do this; it is said to
      incur a small performance penalty.
      
      Currently, to support 32-bit binaries with the PLT in BSS, the kernel
      maps the *entire brk area* with executable rights for all binaries,
      even --secure-plt ones.
      
      Stop doing that.
      
      Teach the ELF loader to check the X bit in the relevant load header and
      create 0 filled anonymous mappings that are executable if the load
      header requests that.
      
      Test program showing the difference in /proc/$PID/maps:
      
      #include <stdio.h>
      #include <stdlib.h>
      #include <fcntl.h>
      #include <unistd.h>

      int main() {
      	char buf[16*1024];
      	char *p = malloc(123); /* make "[heap]" mapping appear */
      	int fd = open("/proc/self/maps", O_RDONLY);
      	int len = read(fd, buf, sizeof(buf));
      	write(1, buf, len);
      	printf("%p\n", p);
      	return 0;
      }
      
      Compiled using: gcc -mbss-plt -m32 -Os test.c -otest
      
      Unpatched ppc64 kernel:
      00100000-00120000 r-xp 00000000 00:00 0                                  [vdso]
      0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094                           /usr/lib/libc-2.17.so
      0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094                           /usr/lib/libc-2.17.so
      0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094                           /usr/lib/libc-2.17.so
      10000000-10010000 r-xp 00000000 fd:00 100674505                          /home/user/test
      10010000-10020000 r--p 00000000 fd:00 100674505                          /home/user/test
      10020000-10030000 rw-p 00010000 fd:00 100674505                          /home/user/test
      10690000-106c0000 rwxp 00000000 00:00 0                                  [heap]
      f7f70000-f7fa0000 r-xp 00000000 fd:00 67898089                           /usr/lib/ld-2.17.so
      f7fa0000-f7fb0000 r--p 00020000 fd:00 67898089                           /usr/lib/ld-2.17.so
      f7fb0000-f7fc0000 rw-p 00030000 fd:00 67898089                           /usr/lib/ld-2.17.so
      ffa90000-ffac0000 rw-p 00000000 00:00 0                                  [stack]
      0x10690008
      
      Patched ppc64 kernel:
      00100000-00120000 r-xp 00000000 00:00 0                                  [vdso]
      0fe10000-0ffd0000 r-xp 00000000 fd:00 67898094                           /usr/lib/libc-2.17.so
      0ffd0000-0ffe0000 r--p 001b0000 fd:00 67898094                           /usr/lib/libc-2.17.so
      0ffe0000-0fff0000 rw-p 001c0000 fd:00 67898094                           /usr/lib/libc-2.17.so
      10000000-10010000 r-xp 00000000 fd:00 100674505                          /home/user/test
      10010000-10020000 r--p 00000000 fd:00 100674505                          /home/user/test
      10020000-10030000 rw-p 00010000 fd:00 100674505                          /home/user/test
      10180000-101b0000 rw-p 00000000 00:00 0                                  [heap]
                        ^^^^ this has changed
      f7c60000-f7c90000 r-xp 00000000 fd:00 67898089                           /usr/lib/ld-2.17.so
      f7c90000-f7ca0000 r--p 00020000 fd:00 67898089                           /usr/lib/ld-2.17.so
      f7ca0000-f7cb0000 rw-p 00030000 fd:00 67898089                           /usr/lib/ld-2.17.so
      ff860000-ff890000 rw-p 00000000 00:00 0                                  [stack]
      0x10180008
      
      The patch was originally posted in 2012 by Jason Gunthorpe
      and apparently ignored:
      
      https://lkml.org/lkml/2012/9/30/138
      
      Lightly run-tested.
      
      Link: http://lkml.kernel.org/r/20161215131950.23054-1-dvlasenk@redhat.com
      Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Tested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16e72e9b
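      A hedged sketch of the decision the loader has to make for the
      zero-filled tail of a PT_LOAD segment: derive the mapping protection
      from the segment's p_flags instead of assuming RW (the helper name is
      illustrative):

        static int elf_prot_from_phdr(unsigned int p_flags)
        {
                int prot = 0;

                if (p_flags & PF_R)
                        prot |= PROT_READ;
                if (p_flags & PF_W)
                        prot |= PROT_WRITE;
                if (p_flags & PF_X)
                        prot |= PROT_EXEC;      /* keeps a BSS PLT executable */
                return prot;
        }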
  10. 20 Feb 2017, 1 commit
  11. 25 Dec 2016, 1 commit
  12. 08 Oct 2016, 6 commits
  13. 15 Sep 2016, 1 commit
  14. 26 Aug 2016, 1 commit
    • arm64: Introduce execute-only page access permissions · cab15ce6
      Authored by Catalin Marinas
      The ARMv8 architecture allows execute-only user permissions by clearing
      the PTE_UXN and PTE_USER bits. However, the kernel running on a CPU
      implementation without User Access Override (ARMv8.2 onwards) can still
      access such a page, so execute-only page permission does not protect
      against read(2)/write(2) etc. accesses. Systems requiring such
      protection must enable features like SECCOMP.
      
      This patch changes the arm64 __P100 and __S100 protection_map[] macros
      to the new __PAGE_EXECONLY attributes. A side effect is that
      pte_user() no longer triggers for __PAGE_EXECONLY since PTE_USER isn't
      set. To work around this, the check is done on the PTE_NG bit via the
      pte_ng() macro. VM_READ is also checked now for page faults.
      Reviewed-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      cab15ce6
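      From userspace, an execute-only mapping is requested in the usual way;
      a hedged example (assumes a kernel with this change, and note the
      caveat above about CPUs without UAO):

        #include <stdlib.h>
        #include <sys/mman.h>

        int main(void)
        {
                /* Stage code in a writable mapping first... */
                unsigned char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED)
                        return 1;
                /* ...copy instructions into p here... */

                /* ...then drop to execute-only: user reads/writes now fault. */
                if (mprotect(p, 4096, PROT_EXEC))
                        return 1;
                return 0;
        }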
  15. 03 Aug 2016, 1 commit
  16. 29 Jul 2016, 1 commit
  17. 27 Jul 2016, 3 commits
  18. 08 Jul 2016, 1 commit
    • x86/vdso: Add mremap hook to vm_special_mapping · b059a453
      Authored by Dmitry Safonov
      Add the possibility for 32-bit user-space applications to move
      the vDSO mapping.
      
      Previously, when a user-space app called mremap() for the vDSO
      address, in the syscall return path it would land on the previous
      address of the vDSO page, resulting in a segmentation fault.
      
      Now it lands fine and returns to userspace with a remapped vDSO.
      
      This will also fix the context.vdso pointer for 64-bit, which does
      not affect the user of vDSO after mremap() currently, but this
      may change in the future.
      
      As suggested by Andy, return -EINVAL for mremap() that would
      split the vDSO image: that operation cannot possibly result in
      a working system so reject it.
      
      Renamed and moved the text_mapping structure declaration inside
      map_vdso(), as it is used only there and now complements the
      vvar_mapping variable.
      
      There is still a problem for remapping the vDSO in glibc
      applications: the linker relocates addresses for syscalls
      on the vDSO page, so you need to relink with the new
      addresses.
      
      Without that the next syscall through glibc may fail:
      
        Program received signal SIGSEGV, Segmentation fault.
        #0  0xf7fd9b80 in __kernel_vsyscall ()
        #1  0xf7ec8238 in _exit () from /usr/lib32/libc.so.6
      Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: 0x7f454c46@gmail.com
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160628113539.13606-2-dsafonov@virtuozzo.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b059a453
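      A hedged userspace sketch of what the hook enables: find the vDSO via
      the aux vector and move it as a whole. VDSO_SIZE is an assumption here
      (read the real size from the ELF header or /proc/self/maps), a partial
      remap is rejected with -EINVAL as described above, and on 32-bit glibc
      the library calls after the move may fault exactly as the commit
      message warns:

        #define _GNU_SOURCE
        #include <sys/auxv.h>
        #include <sys/mman.h>
        #include <stdio.h>

        #define VDSO_SIZE (2 * 4096)    /* assumed size, see note above */

        int main(void)
        {
                void *vdso = (void *)getauxval(AT_SYSINFO_EHDR);
                /* Reserve a destination, then move the whole vDSO onto it. */
                void *dst = mmap(NULL, VDSO_SIZE, PROT_NONE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                void *moved = mremap(vdso, VDSO_SIZE, VDSO_SIZE,
                                     MREMAP_MAYMOVE | MREMAP_FIXED, dst);
                if (moved == MAP_FAILED)
                        perror("mremap");
                else
                        printf("vDSO now at %p\n", moved);
                return 0;
        }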
  19. 28 May 2016, 1 commit
    • mm: remove more IS_ERR_VALUE abuses · 5d22fc25
      Authored by Linus Torvalds
      The do_brk() and vm_brk() return value was "unsigned long" and returned
      the starting address on success, and an error value on failure.  The
      reasons are entirely historical, and go back to it basically behaving
      like the mmap() interface does.
      
      However, nobody actually wanted that interface, and it causes totally
      pointless IS_ERR_VALUE() confusion.
      
      What every single caller actually wants is just the simpler integer
      return of zero for success and negative error number on failure.
      
      So just convert to that much clearer and more common calling convention,
      and get rid of all the IS_ERR_VALUE() uses wrt vm_brk().
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d22fc25
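      A small before/after sketch of the calling convention change (caller
      code is illustrative, not taken from the patch):

        /* Before: vm_brk() returned the start address or an encoded error. */
        addr = vm_brk(start, len);
        if (IS_ERR_VALUE(addr))
                return (int)addr;

        /* After: plain 0 / -errno, like most kernel interfaces. */
        error = vm_brk(start, len);
        if (error)
                return error;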
  20. 24 May 2016, 4 commits
    • mm: make vm_brk killable · 2d6c9282
      Authored by Michal Hocko
      Now that all the callers handle vm_brk failure, we can change it to
      wait for mmap_sem killably, so that the oom_reaper does not get blocked
      just because vm_brk is stuck behind mmap_sem readers.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d6c9282
    • mm: make vm_munmap killable · ae798783
      Authored by Michal Hocko
      Almost all current users of vm_munmap ignore the return value and so
      do not handle potential errors.  This means that some VMAs might stay
      behind.  This patch doesn't try to solve those potential problems.
      Quite the contrary, it adds a new failure mode by using
      down_write_killable in vm_munmap.  This should be safer than other
      failure modes, though,
      because the process is guaranteed to die as soon as it leaves the kernel
      and exit_mmap will clean the whole address space.
      
      This will help in the OOM conditions when the oom victim might be stuck
      waiting for the mmap_sem for write which in turn can block oom_reaper
      which relies on the mmap_sem for read to make forward progress and
      reclaim the address space of the victim.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae798783
    • mm: make vm_mmap killable · 9fbeb5ab
      Authored by Michal Hocko
      All the callers of vm_mmap seem to check for the failure already and
      bail out in one way or another on the error which means that we can
      change it to use the killable version of vm_mmap_pgoff and return -EINTR if
      the current task gets killed while waiting for mmap_sem.  This also
      means that vm_mmap_pgoff can be killable by default and drop the
      additional parameter.
      
      This will help in the OOM conditions when the oom victim might be stuck
      waiting for the mmap_sem for write which in turn can block oom_reaper
      which relies on the mmap_sem for read to make forward progress and
      reclaim the address space of the victim.
      
      Please note that load_elf_binary ignores the vm_mmap error for the
      current->personality & MMAP_PAGE_ZERO case, but that shouldn't be a
      problem because the address is not used anywhere and we never return to
      userspace if we got killed.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9fbeb5ab
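      On the caller side nothing changes syntactically; the only new outcome
      is an -EINTR encoded in the returned address when the task is killed
      while waiting for mmap_sem.  A hedged sketch of a typical caller:

        addr = vm_mmap(file, 0, size, PROT_READ, MAP_PRIVATE, 0);
        if (IS_ERR_VALUE(addr))
                return (int)addr;       /* may now be -EINTR after a fatal signal */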
    • mm: make mmap_sem for write waits killable for mm syscalls · dc0ef0df
      Authored by Michal Hocko
      This is follow-up work for the oom_reaper [1].  As the async OOM killing
      depends on taking mmap_sem for read, we would really appreciate it if a
      holder for write didn't stand in the way.  This patchset changes many of
      the down_write calls to be killable, to help the cases where the writer
      is blocked waiting for readers to release the lock, and so help
      __oom_reap_task to process the oom victim.
      
      Most of the patches are really trivial because the lock is held in
      shallow syscall paths where we can return -EINTR trivially and allow the
      current task to die (note that EINTR will never reach userspace, as
      the task has a fatal signal pending).  Others seem easy as well because
      the callers already handle fatal errors, bail out and return to
      userspace, which should be sufficient to handle the failure gracefully.
      I am not familiar with all those code paths so a deeper review is really
      appreciated.
      
      As this work is touching more areas which are not directly connected I
      have tried to keep the CC list as small as possible and people who I
      believed would be familiar are CCed only to the specific patches (all
      should have received the cover though).
      
      This patchset is based on linux-next and it depends on
      down_write_killable for rw_semaphores which got merged into tip
      locking/rwsem branch and it is merged into this next tree.  I guess it
      would be easiest to route these patches via mmotm because of the
      dependency on the tip tree but if respective maintainers prefer other
      way I have no objections.
      
      I haven't covered all the down_write(mm->mmap_sem) instances here
      
        $ git grep "down_write(.*\<mmap_sem\>)" next/master | wc -l
        98
        $ git grep "down_write(.*\<mmap_sem\>)" | wc -l
        62
      
      I have tried to cover those which should be relatively easy to review in
      this series because this alone should be a nice improvement.  Other
      places can be changed on top.
      
      [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
      [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org
      
      This patch (of 18):
      
      This is the first step in making mmap_sem write waiters killable.  It
      focuses on the trivial ones, which take the lock early after entering
      the syscall and do not change any state beforehand.
      
      Therefore it is very easy to change them to use down_write_killable and
      immediately return with -EINTR.  This will allow the waiter to pass away
      without blocking the mmap_sem, which might be required for others to
      make forward progress.  E.g. the oom reaper will need the lock for
      reading to dismantle the OOM victim's address space.
      
      The only tricky function in this patch is vm_mmap_pgoff which has many
      call sites via vm_mmap.  To reduce the risk, keep vm_mmap with the
      original non-killable semantics for now.
      
      vm_munmap callers do not bother checking the return value so open code
      it into the munmap syscall path for now for simplicity.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc0ef0df
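      The conversion pattern the series applies is small; a hedged sketch of
      the shallow-syscall case described above (the syscall and worker names
      are illustrative):

        SYSCALL_DEFINE2(some_mm_syscall, unsigned long, addr, unsigned long, len)
        {
                struct mm_struct *mm = current->mm;
                int ret;

                /* Was: down_write(&mm->mmap_sem); -- could block an OOM victim. */
                if (down_write_killable(&mm->mmap_sem))
                        return -EINTR;  /* never seen by userspace: fatal signal pending */

                ret = do_the_work(mm, addr, len);

                up_write(&mm->mmap_sem);
                return ret;
        }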
  21. 21 May 2016, 1 commit
  22. 20 May 2016, 1 commit
  23. 18 Mar 2016, 1 commit