  1. 25 Nov, 2022 3 commits
    • [elf] unify regset and non-regset cases · e92edb85
      Al Viro committed
      The only real difference is in filling per-thread notes - getting
      the values of registers.  And this is the only part that is worth
      an ifdef - we don't need to duplicate the logic for gathering
      threads, filling other notes, etc.
      
      It would've been hard to do back when regset-based variant had been
      introduced, mostly due to sharing bits and pieces of helpers with
      aout coredumps.  As a result, too much had been duplicated and
      the copies had drifted apart since then.  Now it can be done cleanly...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • [elf][non-regset] use elf_core_copy_task_regs() for dumper as well · e961d370
      Al Viro committed
      elf_core_copy_regs() is equivalent to elf_core_copy_task_regs() of
      current on all architectures.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • [elf][non-regset] uninline elf_core_copy_task_fpregs() (and lose pt_regs argument) · bdbadfcc
      Al Viro committed
      Don't bother with pointless macros - we are not sharing it with aout coredumps
      anymore.  Just convert the underlying functions to the same arguments (nobody
      uses regs, actually) and call them elf_core_copy_task_fpregs().  And unexport
      the entire bunch, while we are at it.
      
      [added missing includes in arch/{csky,m68k,um}/kernel/process.c to avoid extra
      warnings about the lack of externs getting added to huge piles for those
      files.  Pointless, but...]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  2. 24 Oct, 2022 3 commits
  3. 16 Apr, 2022 2 commits
  4. 19 Mar, 2022 1 commit
    • binfmt_elf: Don't write past end of notes for regset gap · dd664099
      Rick Edgecombe committed
      In fill_thread_core_info() the ptrace accessible registers are collected
      to be written out as notes in a core file. The note array is allocated
      from a size calculated by iterating the user regset view, and counting the
      regsets that have a non-zero core_note_type. However, this only allows for
      there to be non-zero core_note_type at the end of the regset view. If
      there are any gaps in the middle, fill_thread_core_info() will overflow the
      note allocation, as it iterates over the size of the view and the
      allocation would be smaller than that.
      
      There doesn't appear to be any arch that has gaps such that they exceed
      the notes allocation, but the code is brittle and tries to support
      something it doesn't. It could be fixed by increasing the allocation size,
      but instead just have the note collecting code utilize the array better.
      This way the allocation can stay smaller.
      
      Even if no arch has gaps in its regset views, this
      introduces a change in the resulting indices of t->notes. It does not
      introduce any changes to the core file itself, because any blank notes are
      skipped in write_note_info().
      
      In case the allocation logic between fill_note_info() and
      fill_thread_core_info() ever diverges from the usage logic, warn and skip
      writing any notes that would overflow the array.
      
      This fix is derived from an earlier one[0] by Yu-cheng Yu.
      
      [0] https://lore.kernel.org/lkml/20180717162502.32274-1-yu-cheng.yu@intel.com/
      Co-developed-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Signed-off-by: Yu-cheng Yu <yu-cheng.yu@intel.com>
      Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20220317192013.13655-4-rick.p.edgecombe@intel.com
  5. 09 Mar, 2022 3 commits
  6. 04 Mar, 2022 1 commit
    • binfmt_elf: Introduce KUnit test · 9e1a3ce0
      Kees Cook committed
      Adds simple KUnit test for some binfmt_elf internals: specifically a
      regression test for the problem fixed by commit 8904d9cd90ee ("ELF:
      fix overflow in total mapping size calculation").
      
      $ ./tools/testing/kunit/kunit.py run --arch x86_64 \
          --kconfig_add CONFIG_IA32_EMULATION=y '*binfmt_elf'
      ...
      [19:41:08] ================== binfmt_elf (1 subtest) ==================
      [19:41:08] [PASSED] total_mapping_size_test
      [19:41:08] =================== [PASSED] binfmt_elf ====================
      [19:41:08] ============== compat_binfmt_elf (1 subtest) ===============
      [19:41:08] [PASSED] total_mapping_size_test
      [19:41:08] ================ [PASSED] compat_binfmt_elf ================
      [19:41:08] ============================================================
      [19:41:08] Testing complete. Passed: 2, Failed: 0, Crashed: 0, Skipped: 0, Errors: 0
      
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: David Gow <davidgow@google.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Magnus Groß" <magnus.gross@rwth-aachen.de>
      Cc: kunit-dev@googlegroups.com
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      ---
      v1: https://lore.kernel.org/lkml/20220224054332.1852813-1-keescook@chromium.org
      v2:
       - improve commit log
       - fix comment URL (Daniel)
       - drop redundant KUnit Kconfig help info (Daniel)
       - note in Kconfig help that COMPAT builds add a compat test (David)
  7. 02 Mar, 2022 5 commits
  8. 12 Feb, 2022 1 commit
  9. 20 Jan, 2022 2 commits
    • fs/binfmt_elf: use PT_LOAD p_align values for static PIE · 9630f0d6
      H.J. Lu committed
      Extend commit ce81bb25 ("fs/binfmt_elf: use PT_LOAD p_align values
      for suitable start address") which fixed PIE binaries built with
      -Wl,-z,max-page-size=0x200000, to cover static PIE binaries.  This
      fixes:
      
          https://bugzilla.kernel.org/show_bug.cgi?id=215275
      
      Tested by verifying static PIE binaries with -Wl,-z,max-page-size=0x200000 loading.
      
      Link: https://lkml.kernel.org/r/20211209174052.370537-1-hjl.tools@gmail.com
      Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ian Rogers <irogers@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: Fangrui Song <maskray@google.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/binfmt_elf: replace open-coded string copy with get_task_comm · 95af469c
      Yafang Shao committed
      It is better to use get_task_comm() instead of the open coded string
      copy as we do in other places.
      
      struct elf_prpsinfo is used to dump the task information in userspace
      coredump or kernel vmcore.  Below is the verification of vmcore,
      
        crash> ps
           PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
              0      0   0  ffffffff9d21a940  RU   0.0       0      0  [swapper/0]
        >     0      0   1  ffffa09e40f85e80  RU   0.0       0      0  [swapper/1]
        >     0      0   2  ffffa09e40f81f80  RU   0.0       0      0  [swapper/2]
        >     0      0   3  ffffa09e40f83f00  RU   0.0       0      0  [swapper/3]
        >     0      0   4  ffffa09e40f80000  RU   0.0       0      0  [swapper/4]
        >     0      0   5  ffffa09e40f89f80  RU   0.0       0      0  [swapper/5]
              0      0   6  ffffa09e40f8bf00  RU   0.0       0      0  [swapper/6]
        >     0      0   7  ffffa09e40f88000  RU   0.0       0      0  [swapper/7]
        >     0      0   8  ffffa09e40f8de80  RU   0.0       0      0  [swapper/8]
        >     0      0   9  ffffa09e40f95e80  RU   0.0       0      0  [swapper/9]
        >     0      0  10  ffffa09e40f91f80  RU   0.0       0      0  [swapper/10]
        >     0      0  11  ffffa09e40f93f00  RU   0.0       0      0  [swapper/11]
        >     0      0  12  ffffa09e40f90000  RU   0.0       0      0  [swapper/12]
        >     0      0  13  ffffa09e40f9bf00  RU   0.0       0      0  [swapper/13]
        >     0      0  14  ffffa09e40f98000  RU   0.0       0      0  [swapper/14]
        >     0      0  15  ffffa09e40f9de80  RU   0.0       0      0  [swapper/15]
      
      It works well as expected.
      
      Some comments are added to explain why we use the hard-coded 16.
      
      Link: https://lkml.kernel.org/r/20211120112738.45980-5-laoar.shao@gmail.com
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com>
      Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
      Cc: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
      Cc: Andrii Nakryiko <andrii@kernel.org>
      Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 10 Nov, 2021 2 commits
    • a43e5e3a
    • binfmt_elf: reintroduce using MAP_FIXED_NOREPLACE · 5f501d55
      Kees Cook committed
      Commit b212921b ("elf: don't use MAP_FIXED_NOREPLACE for elf
      executable mappings") reverted back to using MAP_FIXED to map ELF LOAD
      segments because it was found that the segments in some binaries overlap
      and can cause MAP_FIXED_NOREPLACE to fail.
      
      The original intent of MAP_FIXED_NOREPLACE in the ELF loader was to
      prevent the silent clobbering of an existing mapping (e.g.  stack) by
      the ELF image, which could lead to exploitable conditions.  Quoting
      commit 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map"),
      which originally introduced the use of MAP_FIXED_NOREPLACE in the
      loader:
      
          Both load_elf_interp and load_elf_binary rely on elf_map to map
          segments [to a specific] address and they use MAP_FIXED to enforce
          that. This is however [a] dangerous thing prone to silent data
          corruption which can be even exploitable.
          ...
          Let's take CVE-2017-1000253 as an example ... we could end up mapping
          [the executable] over the existing stack ... The [stack layout] issue
          has been fixed since then ... So we should be safe and any [similar]
          attack should be impractical. On the other hand this is just too
          subtle [an] assumption ... it can break quite easily and [be] hard to
          spot.
          ...
          Address this [weakness] by changing MAP_FIXED to the newly added
          MAP_FIXED_NOREPLACE. This will mean that mmap will fail if there is
          an existing mapping clashing with the requested one [instead of
          silently] clobbering it.
      
      When processing ET_DYN binaries the loader already calculates a total
      size for the image when the first segment is mapped, maps the entire
      image, and then unmaps the remainder before the remaining segments are
      then individually mapped.
      
      To avoid the earlier problems (legitimate overlapping LOAD segments
      specified in the ELF), apply the same logic to ET_EXEC binaries as well.
      
      For both ET_EXEC and ET_DYN+INTERP use MAP_FIXED_NOREPLACE for the
      initial total size mapping and then use MAP_FIXED to build the final
      (possibly legitimately overlapping) mappings.  For ET_DYN w/out INTERP,
      continue to map at a system-selected address in the mmap region.
      
      Link: https://lkml.kernel.org/r/20210916215947.3993776-1-keescook@chromium.org
      Link: https://lore.kernel.org/lkml/1595869887-23307-2-git-send-email-anthony.yznaga@oracle.com
      Co-developed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
      Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Chen Jingwen <chenjingwen6@huawei.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrei Vagin <avagin@openvz.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 09 Oct, 2021 1 commit
    • coredump: Limit coredumps to a single thread group · 0258b5fd
      Eric W. Biederman committed
      Today, when a signal is delivered with a handler of SIG_DFL whose
      default behavior is to generate a core dump, not only that process but
      every process that shares the mm is killed.
      
      In the case of vfork this looks like a real world problem.  Consider
      the following well defined sequence.
      
      	if (vfork() == 0) {
      		execve(...);
      		_exit(EXIT_FAILURE);
      	}
      
      If a signal that generates a core dump is received after vfork but
      before the execve changes the mm the process that called vfork will
      also be killed (as the mm is shared).
      
      Similarly if the execve fails after the point of no return the kernel
      delivers SIGSEGV which will kill both the exec'ing process and because
      the mm is shared the process that called vfork as well.
      
      As far as I can tell this behavior is a violation of people's
      reasonable expectations, POSIX, and is unnecessarily fragile when the
      system is low on memory.
      
      Solve this by making a userspace visible change to only kill a single
      process/thread group.  This is possible because Jann Horn recently
      modified[1] the coredump code so that the mm can safely be modified
      while the coredump is happening.  With LinuxThreads long gone I don't
      expect anyone to notice this behavior change in practice.
      
      To accomplish this move the core_state pointer from mm_struct to
      signal_struct, which allows different thread groups to coredump
      simultaneously.
      
      In zap_threads remove the work to kill anything except for the current
      thread group.
      
      v2: Remove core_state from the VM_BUG_ON_MM print to fix
          compile failure when CONFIG_DEBUG_VM is enabled.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      
      [1] a07279c9 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
      Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
      History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
      Link: https://lkml.kernel.org/r/87y27mvnke.fsf@disp2133
      Link: https://lkml.kernel.org/r/20211007144701.67592574@canb.auug.org.au
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  12. 04 Oct, 2021 1 commit
    • elf: don't use MAP_FIXED_NOREPLACE for elf interpreter mappings · 9b2f72cc
      Chen Jingwen committed
      In commit b212921b ("elf: don't use MAP_FIXED_NOREPLACE for elf
      executable mappings") we still leave MAP_FIXED_NOREPLACE in place for
      load_elf_interp.
      
      Unfortunately, this will cause the kernel to fail to start with:
      
          1 (init): Uhuuh, elf segment at 00003ffff7ffd000 requested but the memory is mapped already
          Failed to execute /init (error -17)
      
      The reason is that the elf interpreter (ld.so) has overlapping segments.
      
        readelf -l ld-2.31.so
        Program Headers:
          Type           Offset             VirtAddr           PhysAddr
                         FileSiz            MemSiz              Flags  Align
          LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                         0x000000000002c94c 0x000000000002c94c  R E    0x10000
          LOAD           0x000000000002dae0 0x000000000003dae0 0x000000000003dae0
                         0x00000000000021e8 0x0000000000002320  RW     0x10000
          LOAD           0x000000000002fe00 0x000000000003fe00 0x000000000003fe00
                         0x00000000000011ac 0x0000000000001328  RW     0x10000
      
      The reason for this problem is the same as described in commit
      ad55eac7 ("elf: enforce MAP_FIXED on overlaying elf segments").
      
      Not only executable binaries, elf interpreters (e.g. ld.so) can have
      overlapping elf segments, so we better drop MAP_FIXED_NOREPLACE and go
      back to MAP_FIXED in load_elf_interp.
      
      Fixes: 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map")
      Cc: <stable@vger.kernel.org> # v4.19
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Chen Jingwen <chenjingwen6@huawei.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 04 Sep, 2021 2 commits
    • binfmt: remove in-tree usage of MAP_DENYWRITE · 4589ff7c
      David Hildenbrand committed
      At exec time when we mmap the new executable via MAP_DENYWRITE we have it
      opened via do_open_execat() and already deny_write_access()'ed the file
      successfully. Once exec completes, we allow_write_access(); however,
      we set mm->exe_file in begin_new_exec() via set_mm_exe_file() and
      also deny_write_access() as long as mm->exe_file remains set. We'll
      effectively deny write access to our executable via mm->exe_file
      until mm->exe_file is changed -- when the process is removed, on new
      exec, or via sys_prctl(PR_SET_MM_MAP/EXE_FILE).
      
      Let's remove all usage of MAP_DENYWRITE, it's no longer necessary for
      mm->exe_file.
      
      In case of an elf interpreter, we'll now only deny write access to the file
      during exec. This is somewhat okay, because the interpreter behaves like
      (and sometimes is) a shared library; all shared libraries, especially the
      ones loaded directly in user space like via dlopen() won't ever be mapped
      via MAP_DENYWRITE, because we ignore that from user space completely;
      these shared libraries can always be modified while mapped and executed.
      Let's only special-case the main executable, denying write access while
      being executed by a process. This can be considered a minor user space
      visible change.
      
      While this is a cleanup, it also fixes part of a problem reported with
      VM_DENYWRITE on overlayfs, as VM_DENYWRITE is effectively unused with
      this patch and will be removed next:
        "Overlayfs did not honor positive i_writecount on realfile for
         VM_DENYWRITE mappings." [1]
      
      [1] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/
      Reported-by: Chengguang Xu <cgxu519@mykernel.net>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
    • binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib() · 42be8b42
      David Hildenbrand committed
      uselib() is the legacy system call for loading shared libraries.
      Nowadays, applications use dlopen() to load shared libraries, completely
      implemented in user space via mmap().
      
      For example, glibc uses MAP_COPY to mmap shared libraries. While this
      maps to MAP_PRIVATE | MAP_DENYWRITE on Linux, Linux ignores any
      MAP_DENYWRITE specification from user space in mmap.
      
      With this change, all remaining in-tree users of MAP_DENYWRITE use it
      to map an executable. We will be able to open shared libraries loaded
      via uselib() writable, just as we already can via dlopen() from user
      space.
      
      This is one step in the direction of removing MAP_DENYWRITE from the
      kernel. This can be considered a minor user space visible change.
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
  14. 30 Jun, 2021 1 commit
  15. 18 Jun, 2021 1 commit
  16. 08 Mar, 2021 1 commit
    • coredump: don't bother with do_truncate() · d0f1088b
      Al Viro committed
      Have dump_skip() just remember how much needs to be skipped, and
      leave the actual seeks/writing of zeroes to the next dump_emit()
      or the end of coredump output, whichever comes first.
      And instead of playing with do_truncate() at the end, just
      write one NUL at the end of the last gap (if any).
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  17. 16 Feb, 2021 1 commit
  18. 06 Jan, 2021 1 commit
    • elf_prstatus: collect the common part (everything before pr_reg) into a struct · f2485a2d
      Al Viro committed
      Preparation for doing i386 compat elf_prstatus sanely - rather than duplicating
      the beginning of compat_elf_prstatus, take these fields into a separate
      structure (compat_elf_prstatus_common), so that it could be reused.  Due to
      the incestuous relationship between binfmt_elf.c and compat_binfmt_elf.c we
      need the same shape change done to native struct elf_prstatus, gathering the
      fields prior to pr_reg into a new structure (struct elf_prstatus_common).
      
      Fortunately, offset of pr_reg is always a multiple of 16 with no padding
      right before it, so it's possible to turn all the stuff prior to it into
      a single member without disturbing the layout.
      
      [build fix from Geert Uytterhoeven folded in]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  19. 05 Jan, 2021 1 commit
    • binfmt_elf: partially sanitize PRSTATUS_SIZE and SET_PR_FPVALID · 8a00dd00
      Al Viro committed
      On 64bit architectures that support 32bit processes there are
      two possible layouts for NT_PRSTATUS note in ELF coredumps.
      For one thing, several fields are 64bit for native processes
      and 32bit for compat ones (pr_sigpend, etc.).  For another,
      the register dump is obviously different - the size and number
      of registers are not going to be the same for 32bit and 64bit
      variants of processor.
      
      Usually that's handled by having two structures - elf_prstatus
      for native layout and compat_elf_prstatus for 32bit one.
      32bit processes are handled by fs/compat_binfmt_elf.c, which
      defines a macro called 'elf_prstatus' that expands to compat_elf_prstatus.
      Then it includes fs/binfmt_elf.c, which makes all references to
      struct elf_prstatus to be textually replaced with struct
      compat_elf_prstatus.  Ugly and somewhat brittle, but it works.
      
      However, amd64 is worse - there are _three_ possible layouts.
      One for native 64bit processes, another for i386 (32bit) processes
      and yet another for x32 (32bit address space with full 64bit
      registers).
      
      Both i386 and x32 processes are handled by fs/compat_binfmt_elf.c,
      with usual compat_binfmt_elf.c trickery.  However, the layouts
      for i386 and x32 are not identical - they have the common beginning,
      but the register dump part (pr_reg) is bigger on x32.  Worse, pr_reg
      is not the last field - it's followed by int pr_fpvalid, so that
      field ends up at different offsets for i386 and x32 layouts.
      
      Fortunately, there's not much code that cares about any of that -
      it's all encapsulated in fill_thread_core_info().  Since x32
      variant is bigger, we define compat_elf_prstatus to match that
      layout.  That way i386 processes have enough space to fit
      their layout into.
      
      Moreover, since these layouts are identical prior to pr_reg,
      we don't need to distinguish x32 and i386 cases when we are
      setting the fields prior to pr_reg.
      
      Filling pr_reg itself is done by calling ->get() method of
      appropriate regset, and that method knows what layout (and size)
      to use.
      
      We do need to distinguish x32 and i386 cases only for two
      things: setting ->pr_fpvalid (offset differs for x32 and
      i386) and choosing the right size for our note.
      
      The way it's done is Not Nice, for the lack of more accurate
      printable description.  There are two macros (PRSTATUS_SIZE and
      SET_PR_FPVALID), that default essentially to sizeof(struct elf_prstatus)
      and (S)->pr_fpvalid = 1.  On x86 asm/compat.h provides its own
      variants.
      
      Unfortunately, quite a few things go wrong there:
      	* PRSTATUS_SIZE doesn't use the normal test for process
      being an x32 one; it compares the size reported by regset with
      the size of pr_reg.
      	* it hardcodes the sizes of x32 and i386 variants (296 and 144
      resp.), so if some change in includes leads to asm/compat.h pulled
      in by fs/binfmt_elf.c we are in trouble - it will end up using
      the size of x32 variant for 64bit processes.
      	* it's in the wrong place; asm/compat.h couldn't define
      the structure for i386 layout, since it lacks quite a few types
      needed for it.  Hardcoded sizes are largely due to that.
      
      The proper fix would be to have an explicitly defined i386 variant
      of structure and have PRSTATUS_SIZE/SET_PR_FPVALID check for
      TIF_X32 to choose the variant that should be used.  Unfortunately,
      that requires some manipulations of headers; we'll do that later
      in the series, but for now let's go with the minimal variant -
      rename PRSTATUS_SIZE in asm/compat.h to COMPAT_PRSTATUS_SIZE,
      have fs/compat_binfmt_elf.c define PRSTATUS_SIZE to COMPAT_PRSTATUS_SIZE
      and use the normal TIF_X32 check in that macro.  The size of i386 variant
      is kept hardcoded for now.  Similar story for SET_PR_FPVALID.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  20. 11 Dec, 2020 1 commit
    • coredump: Document coredump code exclusively used by cell spufs · c39ab6de
      Eric W. Biederman committed
      Oleg Nesterov recently asked[1] why there is an unshare_files in
      do_coredump.  After digging through all of the callers of lookup_fd it
      turns out that it is
      arch/powerpc/platforms/cell/spufs/coredump.c:coredump_next_context
      that needs the unshare_files in do_coredump.
      
      Looking at the history[2] this code was also the only piece of coredump code
      that required the unshare_files when the unshare_files was added.
      
      Looking at that code it turns out that cell is also the only
      architecture that implements elf_coredump_extra_notes_size and
      elf_coredump_extra_notes_write.
      
      I looked at the gdb repo[3]; support for cell was removed[4] in binutils
      2.34.  Geoff Levand reports he is still getting questions on how to
      run modern kernels on the PS3, from people using 3rd party firmware so
      this code is not dead.  According to Wikipedia the last PS3 shipped in
      Japan sometime in 2017.  So it will probably be a little while before
      everyone's hardware dies.
      
      Add some comments briefly documenting the coredump code that exists
      only to support cell spufs to make it easier to understand the
      coredump code.  Eventually the hardware will be dead, or there won't
      be userspace tools, or the coredump code will be refactored and it
      will be too difficult to update a dead architecture and these comments
      make it easy to tell where to pull to remove cell spufs support.
      
      [1] https://lkml.kernel.org/r/20201123175052.GA20279@redhat.com
      [2] 179e037f ("do_coredump(): make sure that descriptor table isn't shared")
      [3] git://sourceware.org/git/binutils-gdb.git
      [4] abf516c6931a ("Remove Cell Broadband Engine debugging support").
      Link: https://lkml.kernel.org/r/87h7pdnlzv.fsf_-_@x220.int.ebiederm.org
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
  21. 30 Oct, 2020 1 commit
  22. 26 Oct, 2020 2 commits
  23. 19 Oct, 2020 1 commit
    • binfmt_elf: take the mmap lock around find_extend_vma() · b2767d97
      Jann Horn committed
      create_elf_tables() runs after setup_new_exec(), so other tasks can
      already access our new mm and do things like process_madvise() on it.  (At
      the time I'm writing this commit, process_madvise() is not in mainline
      yet, but has been in akpm's tree for some time.)
      
      While I believe that there are currently no APIs that would actually allow
      another process to mess up our VMA tree (process_madvise() is limited to
      MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm
      under which no syscalls have been executed yet), this seems like an
      accident waiting to happen.
      
      Let's make sure that we always take the mmap lock around GUP paths as long
      as another process might be able to see the mm.
      
      (Yes, this diff looks suspicious because we drop the lock before doing
      anything with `vma`, but that's because we actually don't do anything with
      it apart from the NULL check.)
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michel Lespinasse <walken@google.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
      Link: https://lkml.kernel.org/r/CAG48ez1-PBCdv3y8pn-Ty-b+FmBSLwDuVKFSt8h7wARLy0dF-Q@mail.gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 17 Oct, 2020 2 commits
    • binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot · a07279c9
      Jann Horn committed
      In both binfmt_elf and binfmt_elf_fdpic, use a new helper
      dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
      VMA, if we have one) while protected by the mmap_lock, and then use that
      snapshot instead of walking the VMA list without locking.
      
      An alternative approach would be to keep the mmap_lock held across the
      entire core dumping operation; however, keeping the mmap_lock locked while
      we may be blocked for an unbounded amount of time (e.g.  because we're
      dumping to a FUSE filesystem or so) isn't really optimal; the mmap_lock
      blocks things like the ->release handler of userfaultfd, and we don't
      really want critical system daemons to grind to a halt just because
      someone "gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or
      something like that.
      
      Since both the normal ELF code and the FDPIC ELF code need this
      functionality (and if any other binfmt wants to add coredump support in
      the future, they'd probably need it, too), implement this with a common
      helper in fs/coredump.c.
      
      A downside of this approach is that we now need a bigger amount of kernel
      memory per userspace VMA in the normal ELF case, and that we need O(n)
      kernel memory in the FDPIC ELF case at all; but 40 bytes per VMA shouldn't
      be terribly bad.
      
      There currently is a data race between stack expansion and anything that
      reads ->vm_start or ->vm_end under the mmap_lock held in read mode; to
      mitigate that for core dumping, take the mmap_lock in write mode when
      taking a snapshot of the VMA hierarchy.  (If we only took the mmap_lock in
      read mode, we could end up with a corrupted core dump if someone does
      get_user_pages_remote() concurrently.  Not really a major problem, but
      taking the mmap_lock either way works here, so we might as well avoid the
      issue.) (This doesn't do anything about the existing data races with stack
      expansion in other mm code.)
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200827114932.3572699-6-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • coredump: rework elf/elf_fdpic vma_dump_size() into common helper · 429a22e7
      Jann Horn committed
      At the moment, the binfmt_elf and binfmt_elf_fdpic code have slightly
      different code to figure out which VMAs should be dumped, and if so,
      whether the dump should contain the entire VMA or just its first page.
      
      Eliminate duplicate code by reworking the binfmt_elf version into a
      generic core dumping helper in coredump.c.
      
      As part of that, change the heuristic for detecting executable/library
      header pages to check whether the inode is executable instead of looking
      at the file mode.
      
      This is less problematic in terms of locking because it lets us avoid
      get_user() under the mmap_sem.  (And arguably it looks nicer and makes
      more sense in generic code.)
      
      Adjust a little bit based on the binfmt_elf_fdpic version: ->anon_vma is
      only meaningful under CONFIG_MMU, otherwise we have to assume that the VMA
      has been written to.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Link: http://lkml.kernel.org/r/20200827114932.3572699-5-jannh@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>