1. 02 Mar 2022, 3 commits
  2. 20 Jan 2022, 2 commits
    • fs/binfmt_elf: use PT_LOAD p_align values for static PIE · 9630f0d6
      Authored by H.J. Lu
      Extend commit ce81bb25 ("fs/binfmt_elf: use PT_LOAD p_align values
      for suitable start address") which fixed PIE binaries built with
      -Wl,-z,max-page-size=0x200000, to cover static PIE binaries.  This
      fixes:
      
          https://bugzilla.kernel.org/show_bug.cgi?id=215275
      
      Tested by verifying that static PIE binaries built with -Wl,-z,max-page-size=0x200000 load correctly.
      
      Link: https://lkml.kernel.org/r/20211209174052.370537-1-hjl.tools@gmail.com
      Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: Fangrui Song <maskray@google.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9630f0d6
    • fs/binfmt_elf: replace open-coded string copy with get_task_comm · 95af469c
      Authored by Yafang Shao
      It is better to use get_task_comm() instead of the open-coded string
      copy, as we do in other places.
      
      struct elf_prpsinfo is used to dump the task information in userspace
      coredump or kernel vmcore.  Below is the verification of vmcore,
      
        crash> ps
           PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
              0      0   0  ffffffff9d21a940  RU   0.0       0      0  [swapper/0]
        >     0      0   1  ffffa09e40f85e80  RU   0.0       0      0  [swapper/1]
        >     0      0   2  ffffa09e40f81f80  RU   0.0       0      0  [swapper/2]
        >     0      0   3  ffffa09e40f83f00  RU   0.0       0      0  [swapper/3]
        >     0      0   4  ffffa09e40f80000  RU   0.0       0      0  [swapper/4]
        >     0      0   5  ffffa09e40f89f80  RU   0.0       0      0  [swapper/5]
              0      0   6  ffffa09e40f8bf00  RU   0.0       0      0  [swapper/6]
        >     0      0   7  ffffa09e40f88000  RU   0.0       0      0  [swapper/7]
        >     0      0   8  ffffa09e40f8de80  RU   0.0       0      0  [swapper/8]
        >     0      0   9  ffffa09e40f95e80  RU   0.0       0      0  [swapper/9]
        >     0      0  10  ffffa09e40f91f80  RU   0.0       0      0  [swapper/10]
        >     0      0  11  ffffa09e40f93f00  RU   0.0       0      0  [swapper/11]
        >     0      0  12  ffffa09e40f90000  RU   0.0       0      0  [swapper/12]
        >     0      0  13  ffffa09e40f9bf00  RU   0.0       0      0  [swapper/13]
        >     0      0  14  ffffa09e40f98000  RU   0.0       0      0  [swapper/14]
        >     0      0  15  ffffa09e40f9de80  RU   0.0       0      0  [swapper/15]
      
      It works well as expected.
      
      Some comments are added to explain why we use the hard-coded 16.
      
      Link: https://lkml.kernel.org/r/20211120112738.45980-5-laoar.shao@gmail.com
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com>
      Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
      Cc: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95af469c
  3. 10 Nov 2021, 2 commits
    • a43e5e3a
    • binfmt_elf: reintroduce using MAP_FIXED_NOREPLACE · 5f501d55
      Authored by Kees Cook
      Commit b212921b ("elf: don't use MAP_FIXED_NOREPLACE for elf
      executable mappings") reverted back to using MAP_FIXED to map ELF LOAD
      segments because it was found that the segments in some binaries overlap
      and can cause MAP_FIXED_NOREPLACE to fail.
      
      The original intent of MAP_FIXED_NOREPLACE in the ELF loader was to
      prevent the silent clobbering of an existing mapping (e.g.  stack) by
      the ELF image, which could lead to exploitable conditions.  Quoting
      commit 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map"),
      which originally introduced the use of MAP_FIXED_NOREPLACE in the
      loader:
      
          Both load_elf_interp and load_elf_binary rely on elf_map to map
          segments [to a specific] address and they use MAP_FIXED to enforce
          that. This is however [a] dangerous thing prone to silent data
          corruption which can be even exploitable.
          ...
          Let's take CVE-2017-1000253 as an example ... we could end up mapping
          [the executable] over the existing stack ... The [stack layout] issue
          has been fixed since then ... So we should be safe and any [similar]
          attack should be impractical. On the other hand this is just too
          subtle [an] assumption ... it can break quite easily and [be] hard to
          spot.
          ...
          Address this [weakness] by changing MAP_FIXED to the newly added
          MAP_FIXED_NOREPLACE. This will mean that mmap will fail if there is
          an existing mapping clashing with the requested one [instead of
          silently] clobbering it.
      
      When processing ET_DYN binaries, the loader already calculates a total
      size for the image when the first segment is mapped, maps the entire
      image, and then unmaps the remainder before the remaining segments are
      individually mapped.
      
      To avoid the earlier problems (legitimate overlapping LOAD segments
      specified in the ELF), apply the same logic to ET_EXEC binaries as well.
      
      For both ET_EXEC and ET_DYN+INTERP use MAP_FIXED_NOREPLACE for the
      initial total size mapping and then use MAP_FIXED to build the final
      (possibly legitimately overlapping) mappings.  For ET_DYN w/out INTERP,
      continue to map at a system-selected address in the mmap region.
      
      Link: https://lkml.kernel.org/r/20210916215947.3993776-1-keescook@chromium.org
      Link: https://lore.kernel.org/lkml/1595869887-23307-2-git-send-email-anthony.yznaga@oracle.com
      Co-developed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
      Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Chen Jingwen <chenjingwen6@huawei.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@openvz.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f501d55
  4. 09 Oct 2021, 1 commit
    • coredump: Limit coredumps to a single thread group · 0258b5fd
      Authored by Eric W. Biederman
      Today, when a signal is delivered with a handler of SIG_DFL whose
      default behavior is to generate a core dump, not only that process but
      every process that shares the mm is killed.
      
      In the case of vfork this looks like a real world problem.  Consider
      the following well-defined sequence.
      
      	if (vfork() == 0) {
      		execve(...);
      		_exit(EXIT_FAILURE);
      	}
      
      If a signal that generates a core dump is received after vfork but
      before the execve changes the mm, the process that called vfork will
      also be killed (as the mm is shared).
      
      Similarly, if the execve fails after the point of no return, the kernel
      delivers SIGSEGV, which will kill both the exec'ing process and, because
      the mm is shared, the process that called vfork as well.
      
      As far as I can tell this behavior is a violation of people's
      reasonable expectations, POSIX, and is unnecessarily fragile when the
      system is low on memory.
      
      Solve this by making a userspace visible change to only kill a single
      process/thread group.  This is possible because Jann Horn recently
      modified[1] the coredump code so that the mm can safely be modified
      while the coredump is happening.  With LinuxThreads long gone I don't
      expect anyone to notice this behavior change in practice.
      
      To accomplish this, move the core_state pointer from mm_struct to
      signal_struct, which allows different thread groups to coredump
      simultaneously.
      
      In zap_threads remove the work to kill anything except for the current
      thread group.
      
      v2: Remove core_state from the VM_BUG_ON_MM print to fix
          compile failure when CONFIG_DEBUG_VM is enabled.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      
      [1] a07279c9 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
      Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
      History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      Link: https://lkml.kernel.org/r/87y27mvnke.fsf@disp2133
      Link: https://lkml.kernel.org/r/20211007144701.67592574@canb.auug.org.au
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      0258b5fd
  5. 04 Oct 2021, 1 commit
    • elf: don't use MAP_FIXED_NOREPLACE for elf interpreter mappings · 9b2f72cc
      Authored by Chen Jingwen
      In commit b212921b ("elf: don't use MAP_FIXED_NOREPLACE for elf
      executable mappings") we still leave MAP_FIXED_NOREPLACE in place for
      load_elf_interp.
      
      Unfortunately, this will cause the kernel to fail to start with:
      
          1 (init): Uhuuh, elf segment at 00003ffff7ffd000 requested but the memory is mapped already
          Failed to execute /init (error -17)
      
      The reason is that the elf interpreter (ld.so) has overlapping segments.
      
        readelf -l ld-2.31.so
        Program Headers:
          Type           Offset             VirtAddr           PhysAddr
                         FileSiz            MemSiz              Flags  Align
          LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                         0x000000000002c94c 0x000000000002c94c  R E    0x10000
          LOAD           0x000000000002dae0 0x000000000003dae0 0x000000000003dae0
                         0x00000000000021e8 0x0000000000002320  RW     0x10000
          LOAD           0x000000000002fe00 0x000000000003fe00 0x000000000003fe00
                         0x00000000000011ac 0x0000000000001328  RW     0x10000
      
      The reason for this problem is the same as described in commit
      ad55eac7 ("elf: enforce MAP_FIXED on overlaying elf segments").
      
      Not only executable binaries but also ELF interpreters (e.g. ld.so)
      can have overlapping ELF segments, so we had better drop
      MAP_FIXED_NOREPLACE and go back to MAP_FIXED in load_elf_interp.
      
      Fixes: 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map")
      Cc: <stable@vger.kernel.org> # v4.19
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Chen Jingwen <chenjingwen6@huawei.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b2f72cc
  6. 04 Sep 2021, 2 commits
    • binfmt: remove in-tree usage of MAP_DENYWRITE · 4589ff7c
      Authored by David Hildenbrand
      At exec time when we mmap the new executable via MAP_DENYWRITE we have it
      opened via do_open_execat() and already deny_write_access()'ed the file
      successfully. Once exec completes, we allow_write_access(); however,
      we set mm->exe_file in begin_new_exec() via set_mm_exe_file() and
      also deny_write_access() as long as mm->exe_file remains set. We'll
      effectively deny write access to our executable via mm->exe_file
      until mm->exe_file is changed -- when the process is removed, on new
      exec, or via sys_prctl(PR_SET_MM_MAP/EXE_FILE).
      
      Let's remove all usage of MAP_DENYWRITE, it's no longer necessary for
      mm->exe_file.
      
      In case of an elf interpreter, we'll now only deny write access to the file
      during exec. This is somewhat okay, because the interpreter behaves like
      (and sometimes is) a shared library; all shared libraries, especially the
      ones loaded directly in user space like via dlopen() won't ever be mapped
      via MAP_DENYWRITE, because we ignore that from user space completely;
      these shared libraries can always be modified while mapped and executed.
      Let's only special-case the main executable, denying write access while
      being executed by a process. This can be considered a minor user space
      visible change.
      
      While this is a cleanup, it also fixes part of a problem reported with
      VM_DENYWRITE on overlayfs, as VM_DENYWRITE is effectively unused with
      this patch and will be removed next:
        "Overlayfs did not honor positive i_writecount on realfile for
         VM_DENYWRITE mappings." [1]
      
      [1] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/
      Reported-by: Chengguang Xu <cgxu519@mykernel.net>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      4589ff7c
    • binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib() · 42be8b42
      Authored by David Hildenbrand
      uselib() is the legacy system call for loading shared libraries.
      Nowadays, applications use dlopen() to load shared libraries, completely
      implemented in user space via mmap().
      
      For example, glibc uses MAP_COPY to mmap shared libraries. While this
      maps to MAP_PRIVATE | MAP_DENYWRITE on Linux, Linux ignores any
      MAP_DENYWRITE specification from user space in mmap.
      
      With this change, all remaining in-tree users of MAP_DENYWRITE use it
      to map an executable. We will be able to open shared libraries loaded
      via uselib() writable, just as we already can via dlopen() from user
      space.
      
      This is one step in the direction of removing MAP_DENYWRITE from the
      kernel. This can be considered a minor user space visible change.
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      42be8b42
  7. 30 Jun 2021, 1 commit
  8. 18 Jun 2021, 1 commit
  9. 08 Mar 2021, 1 commit
    • coredump: don't bother with do_truncate() · d0f1088b
      Authored by Al Viro
      Have dump_skip() just remember how much needs to be skipped, and
      leave the actual seeks/writing of zeroes to the next dump_emit()
      or the end of coredump output, whichever comes first.
      Instead of playing with do_truncate() at the end, just
      write one NUL at the end of the last gap (if any).
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      d0f1088b
  10. 16 Feb 2021, 1 commit
  11. 06 Jan 2021, 1 commit
    • elf_prstatus: collect the common part (everything before pr_reg) into a struct · f2485a2d
      Authored by Al Viro
      Preparations for doing i386 compat elf_prstatus sanely: rather than
      duplicating the beginning of compat_elf_prstatus, take these fields into
      a separate structure (compat_elf_prstatus_common), so that it can be
      reused.  Due to the incestuous relationship between binfmt_elf.c and
      compat_binfmt_elf.c we need the same shape change done to the native
      struct elf_prstatus, gathering the fields prior to pr_reg into a new
      structure (struct elf_prstatus_common).
      
      Fortunately, offset of pr_reg is always a multiple of 16 with no padding
      right before it, so it's possible to turn all the stuff prior to it into
      a single member without disturbing the layout.
      
      [build fix from Geert Uytterhoeven folded in]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f2485a2d
  12. 05 Jan 2021, 1 commit
    • binfmt_elf: partially sanitize PRSTATUS_SIZE and SET_PR_FPVALID · 8a00dd00
      Authored by Al Viro
      On 64bit architectures that support 32bit processes there are
      two possible layouts for NT_PRSTATUS note in ELF coredumps.
      For one thing, several fields are 64bit for native processes
      and 32bit for compat ones (pr_sigpend, etc.).  For another,
      the register dump is obviously different - the size and number
      of registers are not going to be the same for 32bit and 64bit
      variants of processor.
      
      Usually that's handled by having two structures - elf_prstatus
      for native layout and compat_elf_prstatus for 32bit one.
      32bit processes are handled by fs/compat_binfmt_elf.c, which
      defines a macro called 'elf_prstatus' that expands to compat_elf_prstatus.
      Then it includes fs/binfmt_elf.c, which makes all references to
      struct elf_prstatus to be textually replaced with struct
      compat_elf_prstatus.  Ugly and somewhat brittle, but it works.
      
      However, amd64 is worse - there are _three_ possible layouts.
      One for native 64bit processes, another for i386 (32bit) processes
      and yet another for x32 (32bit address space with full 64bit
      registers).
      
      Both i386 and x32 processes are handled by fs/compat_binfmt_elf.c,
      with usual compat_binfmt_elf.c trickery.  However, the layouts
      for i386 and x32 are not identical - they have the common beginning,
      but the register dump part (pr_reg) is bigger on x32.  Worse, pr_reg
      is not the last field - it's followed by int pr_fpvalid, so that
      field ends up at different offsets for i386 and x32 layouts.
      
      Fortunately, there's not much code that cares about any of that -
      it's all encapsulated in fill_thread_core_info().  Since x32
      variant is bigger, we define compat_elf_prstatus to match that
      layout.  That way i386 processes have enough space to fit
      their layout into.
      
      Moreover, since these layouts are identical prior to pr_reg,
      we don't need to distinguish x32 and i386 cases when we are
      setting the fields prior to pr_reg.
      
      Filling pr_reg itself is done by calling ->get() method of
      appropriate regset, and that method knows what layout (and size)
      to use.
      
      We do need to distinguish x32 and i386 cases only for two
      things: setting ->pr_fpvalid (offset differs for x32 and
      i386) and choosing the right size for our note.
      
      The way it's done is Not Nice, for the lack of more accurate
      printable description.  There are two macros (PRSTATUS_SIZE and
      SET_PR_FPVALID), that default essentially to sizeof(struct elf_prstatus)
      and (S)->pr_fpvalid = 1.  On x86 asm/compat.h provides its own
      variants.
      
      Unfortunately, quite a few things go wrong there:
      	* PRSTATUS_SIZE doesn't use the normal test for process
      being an x32 one; it compares the size reported by regset with
      the size of pr_reg.
      	* it hardcodes the sizes of x32 and i386 variants (296 and 144
      resp.), so if some change in includes leads to asm/compat.h pulled
      in by fs/binfmt_elf.c we are in trouble - it will end up using
      the size of x32 variant for 64bit processes.
      	* it's in the wrong place; asm/compat.h couldn't define
      the structure for i386 layout, since it lacks quite a few types
      needed for it.  Hardcoded sizes are largely due to that.
      
      The proper fix would be to have an explicitly defined i386 variant
      of structure and have PRSTATUS_SIZE/SET_PR_FPVALID check for
      TIF_X32 to choose the variant that should be used.  Unfortunately,
      that requires some manipulations of headers; we'll do that later
      in the series, but for now let's go with the minimal variant -
      rename PRSTATUS_SIZE in asm/compat.h to COMPAT_PRSTATUS_SIZE,
      have fs/compat_binfmt_elf.c define PRSTATUS_SIZE to COMPAT_PRSTATUS_SIZE
      and use the normal TIF_X32 check in that macro.  The size of i386 variant
      is kept hardcoded for now.  Similar story for SET_PR_FPVALID.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      8a00dd00
  13. 11 Dec 2020, 1 commit
    • coredump: Document coredump code exclusively used by cell spufs · c39ab6de
      Authored by Eric W. Biederman
      Oleg Nesterov recently asked[1] why there is an unshare_files in
      do_coredump.  After digging through all of the callers of lookup_fd it
      turns out that it is
      arch/powerpc/platforms/cell/spufs/coredump.c:coredump_next_context
      that needs the unshare_files in do_coredump.
      
      Looking at the history[2] this code was also the only piece of coredump code
      that required the unshare_files when the unshare_files was added.
      
      Looking at that code it turns out that cell is also the only
      architecture that implements elf_coredump_extra_notes_size and
      elf_coredump_extra_notes_write.
      
      Looking at the gdb repo[3], support for cell has been removed[4] in binutils
      2.34.  Geoff Levand reports he is still getting questions on how to
      run modern kernels on the PS3 from people using 3rd-party firmware, so
      this code is not dead.  According to Wikipedia the last PS3 shipped in
      Japan sometime in 2017.  So it will probably be a little while before
      everyone's hardware dies.
      
      Add some comments briefly documenting the coredump code that exists
      only to support cell spufs to make it easier to understand the
      coredump code.  Eventually the hardware will be dead, or there won't
      be userspace tools, or the coredump code will be refactored and it
      will be too difficult to update a dead architecture; these comments
      make it easy to tell where to pull to remove cell spufs support.
      
      [1] https://lkml.kernel.org/r/20201123175052.GA20279@redhat.com
      [2] 179e037f ("do_coredump(): make sure that descriptor table isn't shared")
      [3] git://sourceware.org/git/binutils-gdb.git
      [4] abf516c6931a ("Remove Cell Broadband Engine debugging support").
      Link: https://lkml.kernel.org/r/87h7pdnlzv.fsf_-_@x220.int.ebiederm.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      c39ab6de
  14. 30 Oct 2020, 1 commit
  15. 26 Oct 2020, 2 commits
  16. 19 Oct 2020, 1 commit
    • binfmt_elf: take the mmap lock around find_extend_vma() · b2767d97
      Authored by Jann Horn
      create_elf_tables() runs after setup_new_exec(), so other tasks can
      already access our new mm and do things like process_madvise() on it.  (At
      the time I'm writing this commit, process_madvise() is not in mainline
      yet, but has been in akpm's tree for some time.)
      
      While I believe that there are currently no APIs that would actually allow
      another process to mess up our VMA tree (process_madvise() is limited to
      MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm
      under which no syscalls have been executed yet), this seems like an
      accident waiting to happen.
      
      Let's make sure that we always take the mmap lock around GUP paths as long
      as another process might be able to see the mm.
      
      (Yes, this diff looks suspicious because we drop the lock before doing
      anything with `vma`, but that's because we actually don't do anything with
      it apart from the NULL check.)
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michel Lespinasse <walken@google.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Link: https://lkml.kernel.org/r/CAG48ez1-PBCdv3y8pn-Ty-b+FmBSLwDuVKFSt8h7wARLy0dF-Q@mail.gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2767d97
  17. 17 Oct 2020, 4 commits
    • binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot · a07279c9
      Authored by Jann Horn
      In both binfmt_elf and binfmt_elf_fdpic, use a new helper
      dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
      VMA, if we have one) while protected by the mmap_lock, and then use that
      snapshot instead of walking the VMA list without locking.
      
      An alternative approach would be to keep the mmap_lock held across the
      entire core dumping operation; however, keeping the mmap_lock locked while
      we may be blocked for an unbounded amount of time (e.g.  because we're
      dumping to a FUSE filesystem or so) isn't really optimal; the mmap_lock
      blocks things like the ->release handler of userfaultfd, and we don't
      really want critical system daemons to grind to a halt just because
      someone "gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or
      something like that.
      
      Since both the normal ELF code and the FDPIC ELF code need this
      functionality (and if any other binfmt wants to add coredump support in
      the future, they'd probably need it, too), implement this with a common
      helper in fs/coredump.c.
      
      A downside of this approach is that we now need a bigger amount of kernel
      memory per userspace VMA in the normal ELF case, and that we need O(n)
      kernel memory in the FDPIC ELF case at all; but 40 bytes per VMA shouldn't
      be terribly bad.
      
      There currently is a data race between stack expansion and anything that
      reads ->vm_start or ->vm_end under the mmap_lock held in read mode; to
      mitigate that for core dumping, take the mmap_lock in write mode when
      taking a snapshot of the VMA hierarchy.  (If we only took the mmap_lock in
      read mode, we could end up with a corrupted core dump if someone does
      get_user_pages_remote() concurrently.  Not really a major problem, but
      taking the mmap_lock either way works here, so we might as well avoid the
      issue.) (This doesn't do anything about the existing data races with stack
      expansion in other mm code.)
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200827114932.3572699-6-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a07279c9
    • coredump: rework elf/elf_fdpic vma_dump_size() into common helper · 429a22e7
      Authored by Jann Horn
      At the moment, the binfmt_elf and binfmt_elf_fdpic code have slightly
      different code to figure out which VMAs should be dumped, and if so,
      whether the dump should contain the entire VMA or just its first page.
      
      Eliminate duplicate code by reworking the binfmt_elf version into a
      generic core dumping helper in coredump.c.
      
      As part of that, change the heuristic for detecting executable/library
      header pages to check whether the inode is executable instead of looking
      at the file mode.
      
      This is less problematic in terms of locking because it lets us avoid
      get_user() under the mmap_sem.  (And arguably it looks nicer and makes
      more sense in generic code.)
      
      Adjust a little bit based on the binfmt_elf_fdpic version: ->anon_vma is
      only meaningful under CONFIG_MMU, otherwise we have to assume that the VMA
      has been written to.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200827114932.3572699-5-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      429a22e7
    • coredump: refactor page range dumping into common helper · afc63a97
      Authored by Jann Horn
      Both fs/binfmt_elf.c and fs/binfmt_elf_fdpic.c need to dump ranges of
      pages into the coredump file.  Extract that logic into a common helper.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200827114932.3572699-4-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      afc63a97
    • fs/binfmt_elf: use PT_LOAD p_align values for suitable start address · ce81bb25
      Authored by Chris Kennelly
      Patch series "Selecting Load Addresses According to p_align", v3.
      
      The current ELF loading mechanism provides page-aligned mappings.  This
      can lead to the program being loaded in a way unsuitable for file-backed,
      transparent huge pages when handling PIE executables.
      
      While specifying -z,max-page-size=0x200000 to the linker will generate
      suitably aligned segments for huge pages on x86_64, the executable needs
      to be loaded at a suitably aligned address as well.  This alignment
      requires the binary's cooperation, as distinct segments need to be
      appropriately padded to be eligible for THP.
      
      For binaries built with increased alignment, this limits the number of
      bits usable for ASLR, but provides some randomization over using fixed
      load addresses/non-PIE binaries.
      
      This patch (of 2):
      
      The current ELF loading mechanism provides page-aligned mappings.  This
      can lead to the program being loaded in a way unsuitable for file-backed,
      transparent huge pages when handling PIE executables.
      
      For binaries built with increased alignment, this limits the number of
      bits usable for ASLR, but provides some randomization over using fixed
      load addresses/non-PIE binaries.
      
      Tested by verifying that a program linked with -Wl,-z,max-page-size=0x200000 loads correctly.
      
      [akpm@linux-foundation.org: fix max() warning]
      [ckennelly@google.com: augment comment]
        Link: https://lkml.kernel.org/r/20200821233848.3904680-2-ckennelly@google.com
      Signed-off-by: Chris Kennelly <ckennelly@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: Fangrui Song <maskray@google.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Link: https://lkml.kernel.org/r/20200820170541.1132271-1-ckennelly@google.com
      Link: https://lkml.kernel.org/r/20200820170541.1132271-2-ckennelly@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce81bb25
  18. 28 Jul, 2020 2 commits
    • A
      kill elf_fpxregs_t · 7a896028
      Submitted by Al Viro
      All uses are conditional upon ELF_CORE_COPY_XFPREGS, which has not
      been defined on any architecture since 2010.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      7a896028
    • A
      introduction of regset ->get() wrappers, switching ELF coredumps to those · b4e9c954
      Submitted by Al Viro
      Two new helpers: given a process and regset, dump into a buffer.
      regset_get() takes a buffer and size, regset_get_alloc() takes size
      and allocates a buffer.
      
      Return value in both cases is the amount of data actually dumped in
      case of success or -E...  on error.
      
      In both cases the size is capped by regset->n * regset->size, so
      ->get() is called with offset 0 and size no more than what regset
      expects.
      
      binfmt_elf.c callers of ->get() are switched to using those; the other
      caller (copy_regset_to_user()) will need some preparations to switch.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b4e9c954
  19. 05 Jun, 2020 1 commit
  20. 04 Jun, 2020 1 commit
  21. 29 May, 2020 1 commit
  22. 21 May, 2020 1 commit
    • E
      exec: Generic execfd support · b8a61c9e
      Submitted by Eric W. Biederman
      Most of the support for passing the file descriptor of an executable
      to an interpreter already lives in the generic code and in binfmt_elf.
      Rework the fields in binfmt_elf that deal with executable file
      descriptor passing to make executable file descriptor passing a first
      class concept.
      
      Move the fd_install from binfmt_misc into begin_new_exec after the new
      creds have been installed.  This means that accessing the file through
      /proc/<pid>/fd/N is able to see the creds for the new executable
      before allowing access to the new executable's files.
      
      Performing the install of the executables file descriptor after
      the point of no return also means that nothing special needs to
      be done on error.  The exiting of the process will close all
      of its open files.
      
      Move the would_dump from binfmt_misc into begin_new_exec right
      after would_dump is called on the bprm->file.  This makes it
      obvious this case exists and that no nesting of bprm->file is
      currently supported.
      
      In binfmt_misc the movement of fd_install into generic code means
      that its special error exit path is no longer needed.
      
      Link: https://lkml.kernel.org/r/87y2poyd91.fsf_-_@x220.int.ebiederm.org
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      b8a61c9e
  23. 08 May, 2020 2 commits
  24. 06 May, 2020 2 commits
  25. 08 Apr, 2020 4 commits