提交 · e92edb85d87e9f8c8bead9cdb7cb2da4b51fd7f8 · openeuler / Kernel

25 11月, 2022 3 次提交

[elf] unify regset and non-regset cases · e92edb85

由 Al Viro 提交于 9月 05, 2022

The only real difference is in filling per-thread notes - getting
the values of registers.   And this is the only part that is worth
an ifdef - we don't need to duplicate the logics regarding gathering
threads, filling other notes, etc.

It would've been hard to do back when regset-based variant had been
introduced, mostly due to sharing bits and pieces of helpers with
aout coredumps.  As the result, too much had been duplicated and
the copies had drifted away since then.  Now it can be done cleanly...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e92edb85

[elf][non-regset] use elf_core_copy_task_regs() for dumper as well · e961d370

由 Al Viro 提交于 9月 04, 2022

elf_core_copy_regs() is equivalent to elf_core_copy_task_regs() of
current on all architectures.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e961d370

[elf][non-regset] uninline elf_core_copy_task_fpregs() (and lose pt_regs argument) · bdbadfcc

由 Al Viro 提交于 9月 03, 2022

Don't bother with pointless macros - we are not sharing it with aout coredumps
anymore.  Just convert the underlying functions to the same arguments (nobody
uses regs, actually) and call them elf_core_copy_task_fpregs().  And unexport
the entire bunch, while we are at it.

[added missing includes in arch/{csky,m68k,um}/kernel/process.c to avoid extra
warnings about the lack of externs getting added to huge piles for those
files.  Pointless, but...]
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

bdbadfcc

24 10月, 2022 3 次提交

[elf][regset] simplify thread list handling in fill_note_info() · 4b0e21d6

由 Al Viro 提交于 6月 08, 2020

fill_note_info() iterates through the list of threads collected in
mm->core_state->dumper, allocating a struct elf_thread_core_info
instance for each and linking those into a list.

We need the entry corresponding to current to be first in the
resulting list, so the logics for list insertion is
	if it's for current or list is empty
		insert in the head
	else
		insert after the first element

However, in mm->core_state->dumper the entry for current is guaranteed
to be the first one.  Which means that both parts of condition will
be true on the first iteration and neither will be true on all subsequent
ones.

Taking the first iteration out of the loop simplifies things nicely...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4b0e21d6

A
[elf][regset] clean fill_note_info() a bit · 922ef161
由 Al Viro 提交于 9月 05, 2022
```
*info is already initialized...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
922ef161

kill coredump_params->regs · 9a938eba

由 Al Viro 提交于 6月 08, 2020

it's always task_pt_regs(current)
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

9a938eba

16 4月, 2022 2 次提交

revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE" · aeb79237

由 Andrew Morton 提交于 4月 14, 2022

Despite Mike's attempted fix (925346c1), regressions reports
continue:

  https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/
  https://bugzilla.kernel.org/show_bug.cgi?id=215720
  https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info

So revert this patch.

Fixes: 9630f0d6 ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE")
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

aeb79237

revert "fs/binfmt_elf: fix PT_LOAD p_align values for loaders" · 354e923d

由 Andrew Morton 提交于 4月 14, 2022

Commit 925346c1 ("fs/binfmt_elf: fix PT_LOAD p_align values for
loaders") was an attempt to fix regressions due to 9630f0d6
("fs/binfmt_elf: use PT_LOAD p_align values for static PIE").

But regressionss continue to be reported:

  https://lore.kernel.org/lkml/cb5b81bd-9882-e5dc-cd22-54bdbaaefbbc@leemhuis.info/
  https://bugzilla.kernel.org/show_bug.cgi?id=215720
  https://lkml.kernel.org/r/b685f3d0-da34-531d-1aa9-479accd3e21b@leemhuis.info

This patch reverts the fix, so the original can also be reverted.

Fixes: 925346c1 ("fs/binfmt_elf: fix PT_LOAD p_align values for loaders")
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

354e923d

19 3月, 2022 1 次提交

binfmt_elf: Don't write past end of notes for regset gap · dd664099

由 Rick Edgecombe 提交于 3月 17, 2022

In fill_thread_core_info() the ptrace accessible registers are collected
to be written out as notes in a core file. The note array is allocated
from a size calculated by iterating the user regset view, and counting the
regsets that have a non-zero core_note_type. However, this only allows for
there to be non-zero core_note_type at the end of the regset view. If
there are any gaps in the middle, fill_thread_core_info() will overflow the
note allocation, as it iterates over the size of the view and the
allocation would be smaller than that.

There doesn't appear to be any arch that has gaps such that they exceed
the notes allocation, but the code is brittle and tries to support
something it doesn't. It could be fixed by increasing the allocation size,
but instead just have the note collecting code utilize the array better.
This way the allocation can stay smaller.

Even in the case of no arch's that have gaps in their regset views, this
introduces a change in the resulting indicies of t->notes. It does not
introduce any changes to the core file itself, because any blank notes are
skipped in write_note_info().

In case, the allocation logic between fill_note_info() and
fill_thread_core_info() ever diverges from the usage logic, warn and skip
writing any notes that would overflow the array.

This fix is derrived from an earlier one[0] by Yu-cheng Yu.

[0] https://lore.kernel.org/lkml/20180717162502.32274-1-yu-cheng.yu@intel.com/Co-developed-by: NYu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: NYu-cheng Yu <yu-cheng.yu@intel.com>
Signed-off-by: NRick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220317192013.13655-4-rick.p.edgecombe@intel.com

dd664099

09 3月, 2022 3 次提交

coredump: Use the vma snapshot in fill_files_note · 390031c9

由 Eric W. Biederman 提交于 3月 08, 2022

Matthew Wilcox reported that there is a missing mmap_lock in
file_files_note that could possibly lead to a user after free.

Solve this by using the existing vma snapshot for consistency
and to avoid the need to take the mmap_lock anywhere in the
coredump code except for dump_vma_snapshot.

Update the dump_vma_snapshot to capture vm_pgoff and vm_file
that are neeeded by fill_files_note.

Add free_vma_snapshot to free the captured values of vm_file.
Reported-by: NMatthew Wilcox <willy@infradead.org>
Link: https://lkml.kernel.org/r/20220131153740.2396974-1-willy@infradead.org
Cc: stable@vger.kernel.org
Fixes: a07279c9 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
Fixes: 2aa362c4 ("coredump: extend core dump note section to contain file names of mapped files")
Reviewed-by: NKees Cook <keescook@chromium.org>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

390031c9

coredump/elf: Pass coredump_params into fill_note_info · 9ec7d323

由 Eric W. Biederman 提交于 1月 31, 2022

Instead of individually passing cprm->siginfo and cprm->regs
into fill_note_info pass all of struct coredump_params.

This is preparation to allow fill_files_note to use the existing
vma snapshot.
Reviewed-by: NJann Horn <jannh@google.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

9ec7d323

coredump: Snapshot the vmas in do_coredump · 95c5436a

由 Eric W. Biederman 提交于 3月 08, 2022

Move the call of dump_vma_snapshot and kvfree(vma_meta) out of the
individual coredump routines into do_coredump itself.  This makes
the code less error prone and easier to maintain.

Make the vma snapshot available to the coredump routines
in struct coredump_params.  This makes it easier to
change and update what is captures in the vma snapshot
and will be needed for fixing fill_file_notes.
Reviewed-by: NJann Horn <jannh@google.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

95c5436a

04 3月, 2022 1 次提交

binfmt_elf: Introduce KUnit test · 9e1a3ce0

由 Kees Cook 提交于 2月 23, 2022

Adds simple KUnit test for some binfmt_elf internals: specifically a
regression test for the problem fixed by commit 8904d9cd90ee ("ELF:
fix overflow in total mapping size calculation").

$ ./tools/testing/kunit/kunit.py run --arch x86_64 \
    --kconfig_add CONFIG_IA32_EMULATION=y '*binfmt_elf'
...
[19:41:08] ================== binfmt_elf (1 subtest) ==================
[19:41:08] [PASSED] total_mapping_size_test
[19:41:08] =================== [PASSED] binfmt_elf ====================
[19:41:08] ============== compat_binfmt_elf (1 subtest) ===============
[19:41:08] [PASSED] total_mapping_size_test
[19:41:08] ================ [PASSED] compat_binfmt_elf ================
[19:41:08] ============================================================
[19:41:08] Testing complete. Passed: 2, Failed: 0, Crashed: 0, Skipped: 0, Errors: 0

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: David Gow <davidgow@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Magnus Groß" <magnus.gross@rwth-aachen.de>
Cc: kunit-dev@googlegroups.com
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: NKees Cook <keescook@chromium.org>
---
v1: https://lore.kernel.org/lkml/20220224054332.1852813-1-keescook@chromium.org
v2:
 - improve commit log
 - fix comment URL (Daniel)
 - drop redundant KUnit Kconfig help info (Daniel)
 - note in Kconfig help that COMPAT builds add a compat test (David)

9e1a3ce0

02 3月, 2022 5 次提交

fs/binfmt_elf: Refactor load_elf_binary function · 2b4bfbe0

由 Akira Kawata 提交于 1月 27, 2022

I delete load_addr because it is not used anymore. And I rename
load_addr_set to first_pt_load because it is used only to capture the
first iteration of the loop.
Signed-off-by: NAkira Kawata <akirakawata1@gmail.com>
Acked-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127124014.338760-3-akirakawata1@gmail.com

2b4bfbe0

fs/binfmt_elf: Fix AT_PHDR for unusual ELF files · 0da1d500

由 Akira Kawata 提交于 1月 27, 2022

BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=197921

As pointed out in the discussion of buglink, we cannot calculate AT_PHDR
as the sum of load_addr and exec->e_phoff.

: The AT_PHDR of ELF auxiliary vectors should point to the memory address
: of program header. But binfmt_elf.c calculates this address as follows:
:
: NEW_AUX_ENT(AT_PHDR, load_addr + exec->e_phoff);
:
: which is wrong since e_phoff is the file offset of program header and
: load_addr is the memory base address from PT_LOAD entry.
:
: The ld.so uses AT_PHDR as the memory address of program header. In normal
: case, since the e_phoff is usually 64 and in the first PT_LOAD region, it
: is the correct program header address.
:
: But if the address of program header isn't equal to the first PT_LOAD
: address + e_phoff (e.g.  Put the program header in other non-consecutive
: PT_LOAD region), ld.so will try to read program header from wrong address
: then crash or use incorrect program header.

This is because exec->e_phoff
is the offset of PHDRs in the file and the address of PHDRs in the
memory may differ from it. This patch fixes the bug by calculating the
address of program headers from PT_LOADs directly.
Signed-off-by: NAkira Kawata <akirakawata1@gmail.com>
Reported-by: Nkernel test robot <lkp@intel.com>
Acked-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220127124014.338760-2-akirakawata1@gmail.com

0da1d500

binfmt: move more stuff undef CONFIG_COREDUMP · d65bc29b

由 Alexey Dobriyan 提交于 2月 13, 2022

struct linux_binfmt::core_dump and struct min_coredump::min_coredump
are used under CONFIG_COREDUMP only. Shrink those embedded configs
a bit.
Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/YglbIFyN+OtwVyjW@localhost.localdomain

d65bc29b

ELF: fix overflow in total mapping size calculation · 10b19249

由 Alexey Dobriyan 提交于 10月 03, 2021

Kernel assumes that ELF program headers are ordered by mapping address,
but doesn't enforce it. It is possible to make mapping size extremely huge
by simply shuffling first and last PT_LOAD segments.

As long as PT_LOAD segments do not overlap, it is silly to require
sorting by v_addr anyway because mmap() doesn't care.

Don't assume PT_LOAD segments are sorted and calculate min and max
addresses correctly.
Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Tested-by: N"Magnus Groß" <magnus.gross@rwth-aachen.de>
Link: https://lore.kernel.org/all/Yfqm7HbucDjPbES+@fractal.localdomain/Signed-off-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/lkml/YVmd7D0M6G%2FDcP4O@localhost.localdomain

10b19249

binfmt_elf: Avoid total_mapping_size for ET_EXEC · 439a8468

由 Kees Cook 提交于 2月 28, 2022

Partially revert commit 5f501d55 ("binfmt_elf: reintroduce using
MAP_FIXED_NOREPLACE"), which applied the ET_DYN "total_mapping_size"
logic also to ET_EXEC.

At least ia64 has ET_EXEC PT_LOAD segments that are not virtual-address
contiguous (but _are_ file-offset contiguous). This would result in a
giant mapping attempting to cover the entire span, including the virtual
address range hole, and well beyond the size of the ELF file itself,
causing the kernel to refuse to load it. For example:

$ readelf -lW /usr/bin/gcc
...
Program Headers:
  Type Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   ...
...
  LOAD 0x000000 0x4000000000000000 0x4000000000000000 0x00b5a0 0x00b5a0 ...
  LOAD 0x00b5a0 0x600000000000b5a0 0x600000000000b5a0 0x0005ac 0x000710 ...
...
       ^^^^^^^^ ^^^^^^^^^^^^^^^^^^                    ^^^^^^^^ ^^^^^^^^

File offset range     : 0x000000-0x00bb4c
			0x00bb4c bytes

Virtual address range : 0x4000000000000000-0x600000000000bcb0
			0x200000000000bcb0 bytes

Remove the total_mapping_size logic for ET_EXEC, which reduces the
ET_EXEC MAP_FIXED_NOREPLACE coverage to only the first PT_LOAD (better
than nothing), and retains it for ET_DYN.

Ironically, this is the reverse of the problem that originally caused
problems with MAP_FIXED_NOREPLACE: overlapping PT_LOAD segments. Future
work could restore full coverage if load_elf_binary() were to perform
mappings in a separate phase from the loading (where it could resolve
both overlaps and holes).

Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Reported-by: Nmatoro <matoro_bugzilla_kernel@matoro.tk>
Fixes: 5f501d55 ("binfmt_elf: reintroduce using MAP_FIXED_NOREPLACE")
Link: https://lore.kernel.org/r/a3edd529-c42d-3b09-135c-7e98a15b150f@leemhuis.infoTested-by: Nmatoro <matoro_mailinglist_kernel@matoro.tk>
Link: https://lore.kernel.org/lkml/ce8af9c13bcea9230c7689f3c1e0e2cd@matoro.tkTested-By: NJohn Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Link: https://lore.kernel.org/lkml/49182d0d-708b-4029-da5f-bc18603440a6@physik.fu-berlin.de
Cc: stable@vger.kernel.org
Signed-off-by: NKees Cook <keescook@chromium.org>

439a8468

12 2月, 2022 1 次提交

fs/binfmt_elf: fix PT_LOAD p_align values for loaders · 925346c1

由 Mike Rapoport 提交于 2月 11, 2022

Rui Salvaterra reported that Aisleroit solitaire crashes with "Wrong
__data_start/_end pair" assertion from libgc after update to v5.17-rc1.

Bisection pointed to commit 9630f0d6 ("fs/binfmt_elf: use PT_LOAD
p_align values for static PIE") that fixed handling of static PIEs, but
made the condition that guards load_bias calculation to exclude loader
binaries.

Restoring the check for presence of interpreter fixes the problem.

Link: https://lkml.kernel.org/r/20220202121433.3697146-1-rppt@kernel.org
Fixes: 9630f0d6 ("fs/binfmt_elf: use PT_LOAD p_align values for static PIE")
Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
Reported-by: NRui Salvaterra <rsalvaterra@gmail.com>
Tested-by: NRui Salvaterra <rsalvaterra@gmail.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: "H.J. Lu" <hjl.tools@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

925346c1

20 1月, 2022 2 次提交

fs/binfmt_elf: use PT_LOAD p_align values for static PIE · 9630f0d6

由 H.J. Lu 提交于 1月 19, 2022

Extend commit ce81bb25 ("fs/binfmt_elf: use PT_LOAD p_align values
for suitable start address") which fixed PIE binaries built with
-Wl,-z,max-page-size=0x200000, to cover static PIE binaries.  This
fixes:

    https://bugzilla.kernel.org/show_bug.cgi?id=215275

Tested by verifying static PIE binaries with -Wl,-z,max-page-size=0x200000 loading.

Link: https://lkml.kernel.org/r/20211209174052.370537-1-hjl.tools@gmail.comSigned-off-by: NH.J. Lu <hjl.tools@gmail.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Sandeep Patil <sspatil@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9630f0d6

fs/binfmt_elf: replace open-coded string copy with get_task_comm · 95af469c

由 Yafang Shao 提交于 1月 19, 2022

It is better to use get_task_comm() instead of the open coded string
copy as we do in other places.

struct elf_prpsinfo is used to dump the task information in userspace
coredump or kernel vmcore.  Below is the verification of vmcore,

  crash> ps
     PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
        0      0   0  ffffffff9d21a940  RU   0.0       0      0  [swapper/0]
  >     0      0   1  ffffa09e40f85e80  RU   0.0       0      0  [swapper/1]
  >     0      0   2  ffffa09e40f81f80  RU   0.0       0      0  [swapper/2]
  >     0      0   3  ffffa09e40f83f00  RU   0.0       0      0  [swapper/3]
  >     0      0   4  ffffa09e40f80000  RU   0.0       0      0  [swapper/4]
  >     0      0   5  ffffa09e40f89f80  RU   0.0       0      0  [swapper/5]
        0      0   6  ffffa09e40f8bf00  RU   0.0       0      0  [swapper/6]
  >     0      0   7  ffffa09e40f88000  RU   0.0       0      0  [swapper/7]
  >     0      0   8  ffffa09e40f8de80  RU   0.0       0      0  [swapper/8]
  >     0      0   9  ffffa09e40f95e80  RU   0.0       0      0  [swapper/9]
  >     0      0  10  ffffa09e40f91f80  RU   0.0       0      0  [swapper/10]
  >     0      0  11  ffffa09e40f93f00  RU   0.0       0      0  [swapper/11]
  >     0      0  12  ffffa09e40f90000  RU   0.0       0      0  [swapper/12]
  >     0      0  13  ffffa09e40f9bf00  RU   0.0       0      0  [swapper/13]
  >     0      0  14  ffffa09e40f98000  RU   0.0       0      0  [swapper/14]
  >     0      0  15  ffffa09e40f9de80  RU   0.0       0      0  [swapper/15]

It works well as expected.

Some comments are added to explain why we use the hard-coded 16.

Link: https://lkml.kernel.org/r/20211120112738.45980-5-laoar.shao@gmail.comSuggested-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Arnaldo Carvalho de Melo <arnaldo.melo@gmail.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Michal Miroslaw <mirq-linux@rere.qmqm.pl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Kees Cook <keescook@chromium.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Dennis Dalessandro <dennis.dalessandro@cornelisnetworks.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

95af469c

10 11月, 2021 2 次提交

ELF: simplify STACK_ALLOC macro · a43e5e3a

由 Alexey Dobriyan 提交于 11月 08, 2021

"A -= B; A" is equivalent to "A -= B".

Link: https://lkml.kernel.org/r/YVmcP256fRMqCwgK@localhost.localdomainSigned-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a43e5e3a

binfmt_elf: reintroduce using MAP_FIXED_NOREPLACE · 5f501d55

由 Kees Cook 提交于 11月 08, 2021

Commit b212921b ("elf: don't use MAP_FIXED_NOREPLACE for elf
executable mappings") reverted back to using MAP_FIXED to map ELF LOAD
segments because it was found that the segments in some binaries overlap
and can cause MAP_FIXED_NOREPLACE to fail.

The original intent of MAP_FIXED_NOREPLACE in the ELF loader was to
prevent the silent clobbering of an existing mapping (e.g.  stack) by
the ELF image, which could lead to exploitable conditions.  Quoting
commit 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map"),
which originally introduced the use of MAP_FIXED_NOREPLACE in the
loader:

    Both load_elf_interp and load_elf_binary rely on elf_map to map
    segments [to a specific] address and they use MAP_FIXED to enforce
    that. This is however [a] dangerous thing prone to silent data
    corruption which can be even exploitable.
    ...
    Let's take CVE-2017-1000253 as an example ... we could end up mapping
    [the executable] over the existing stack ... The [stack layout] issue
    has been fixed since then ... So we should be safe and any [similar]
    attack should be impractical. On the other hand this is just too
    subtle [an] assumption ... it can break quite easily and [be] hard to
    spot.
    ...
    Address this [weakness] by changing MAP_FIXED to the newly added
    MAP_FIXED_NOREPLACE. This will mean that mmap will fail if there is
    an existing mapping clashing with the requested one [instead of
    silently] clobbering it.

Then processing ET_DYN binaries the loader already calculates a total
size for the image when the first segment is mapped, maps the entire
image, and then unmaps the remainder before the remaining segments are
then individually mapped.

To avoid the earlier problems (legitimate overlapping LOAD segments
specified in the ELF), apply the same logic to ET_EXEC binaries as well.

For both ET_EXEC and ET_DYN+INTERP use MAP_FIXED_NOREPLACE for the
initial total size mapping and then use MAP_FIXED to build the final
(possibly legitimately overlapping) mappings.  For ET_DYN w/out INTERP,
continue to map at a system-selected address in the mmap region.

Link: https://lkml.kernel.org/r/20210916215947.3993776-1-keescook@chromium.org
Link: https://lore.kernel.org/lkml/1595869887-23307-2-git-send-email-anthony.yznaga@oracle.comCo-developed-by: NAnthony Yznaga <anthony.yznaga@oracle.com>
Signed-off-by: NAnthony Yznaga <anthony.yznaga@oracle.com>
Signed-off-by: NKees Cook <keescook@chromium.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Chen Jingwen <chenjingwen6@huawei.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrei Vagin <avagin@openvz.org>
Cc: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

5f501d55

09 10月, 2021 1 次提交

coredump: Limit coredumps to a single thread group · 0258b5fd

由 Eric W. Biederman 提交于 9月 22, 2021

Today when a signal is delivered with a handler of SIG_DFL whose
default behavior is to generate a core dump not only that process but
every process that shares the mm is killed.

In the case of vfork this looks like a real world problem.  Consider
the following well defined sequence.

	if (vfork() == 0) {
		execve(...);
		_exit(EXIT_FAILURE);
	}

If a signal that generates a core dump is received after vfork but
before the execve changes the mm the process that called vfork will
also be killed (as the mm is shared).

Similarly if the execve fails after the point of no return the kernel
delivers SIGSEGV which will kill both the exec'ing process and because
the mm is shared the process that called vfork as well.

As far as I can tell this behavior is a violation of people's
reasonable expectations, POSIX, and is unnecessarily fragile when the
system is low on memory.

Solve this by making a userspace visible change to only kill a single
process/thread group.  This is possible because Jann Horn recently
modified[1] the coredump code so that the mm can safely be modified
while the coredump is happening.  With LinuxThreads long gone I don't
expect anyone to have a notice this behavior change in practice.

To accomplish this move the core_state pointer from mm_struct to
signal_struct, which allows different thread groups to coredump
simultatenously.

In zap_threads remove the work to kill anything except for the current
thread group.

v2: Remove core_state from the VM_BUG_ON_MM print to fix
    compile failure when CONFIG_DEBUG_VM is enabled.
Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>

[1] a07279c9 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Link: https://lkml.kernel.org/r/87y27mvnke.fsf@disp2133
Link: https://lkml.kernel.org/r/20211007144701.67592574@canb.auug.org.auReviewed-by: NKees Cook <keescook@chromium.org>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

0258b5fd

04 10月, 2021 1 次提交

elf: don't use MAP_FIXED_NOREPLACE for elf interpreter mappings · 9b2f72cc

由 Chen Jingwen 提交于 9月 28, 2021

In commit b212921b ("elf: don't use MAP_FIXED_NOREPLACE for elf
executable mappings") we still leave MAP_FIXED_NOREPLACE in place for
load_elf_interp.

Unfortunately, this will cause kernel to fail to start with:

    1 (init): Uhuuh, elf segment at 00003ffff7ffd000 requested but the memory is mapped already
    Failed to execute /init (error -17)

The reason is that the elf interpreter (ld.so) has overlapping segments.

  readelf -l ld-2.31.so
  Program Headers:
    Type           Offset             VirtAddr           PhysAddr
                   FileSiz            MemSiz              Flags  Align
    LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                   0x000000000002c94c 0x000000000002c94c  R E    0x10000
    LOAD           0x000000000002dae0 0x000000000003dae0 0x000000000003dae0
                   0x00000000000021e8 0x0000000000002320  RW     0x10000
    LOAD           0x000000000002fe00 0x000000000003fe00 0x000000000003fe00
                   0x00000000000011ac 0x0000000000001328  RW     0x10000

The reason for this problem is the same as described in commit
ad55eac7 ("elf: enforce MAP_FIXED on overlaying elf segments").

Not only executable binaries, elf interpreters (e.g. ld.so) can have
overlapping elf segments, so we better drop MAP_FIXED_NOREPLACE and go
back to MAP_FIXED in load_elf_interp.

Fixes: 4ed28639 ("fs, elf: drop MAP_FIXED usage from elf_map")
Cc: <stable@vger.kernel.org> # v4.19
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: NChen Jingwen <chenjingwen6@huawei.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9b2f72cc

04 9月, 2021 2 次提交

binfmt: remove in-tree usage of MAP_DENYWRITE · 4589ff7c

由 David Hildenbrand 提交于 4月 23, 2021

At exec time when we mmap the new executable via MAP_DENYWRITE we have it
opened via do_open_execat() and already deny_write_access()'ed the file
successfully. Once exec completes, we allow_write_acces(); however,
we set mm->exe_file in begin_new_exec() via set_mm_exe_file() and
also deny_write_access() as long as mm->exe_file remains set. We'll
effectively deny write access to our executable via mm->exe_file
until mm->exe_file is changed -- when the process is removed, on new
exec, or via sys_prctl(PR_SET_MM_MAP/EXE_FILE).

Let's remove all usage of MAP_DENYWRITE, it's no longer necessary for
mm->exe_file.

In case of an elf interpreter, we'll now only deny write access to the file
during exec. This is somewhat okay, because the interpreter behaves
(and sometime is) a shared library; all shared libraries, especially the
ones loaded directly in user space like via dlopen() won't ever be mapped
via MAP_DENYWRITE, because we ignore that from user space completely;
these shared libraries can always be modified while mapped and executed.
Let's only special-case the main executable, denying write access while
being executed by a process. This can be considered a minor user space
visible change.

While this is a cleanup, it also fixes part of a problem reported with
VM_DENYWRITE on overlayfs, as VM_DENYWRITE is effectively unused with
this patch and will be removed next:
  "Overlayfs did not honor positive i_writecount on realfile for
   VM_DENYWRITE mappings." [1]

[1] https://lore.kernel.org/r/YNHXzBgzRrZu1MrD@miu.piliscsaba.redhat.com/Reported-by: NChengguang Xu <cgxu519@mykernel.net>
Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NDavid Hildenbrand <david@redhat.com>

4589ff7c

binfmt: don't use MAP_DENYWRITE when loading shared libraries via uselib() · 42be8b42

由 David Hildenbrand 提交于 4月 22, 2021

uselib() is the legacy systemcall for loading shared libraries.
Nowadays, applications use dlopen() to load shared libraries, completely
implemented in user space via mmap().

For example, glibc uses MAP_COPY to mmap shared libraries. While this
maps to MAP_PRIVATE | MAP_DENYWRITE on Linux, Linux ignores any
MAP_DENYWRITE specification from user space in mmap.

With this change, all remaining in-tree users of MAP_DENYWRITE use it
to map an executable. We will be able to open shared libraries loaded
via uselib() writable, just as we already can via dlopen() from user
space.

This is one step into the direction of removing MAP_DENYWRITE from the
kernel. This can be considered a minor user space visible change.
Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NDavid Hildenbrand <david@redhat.com>

42be8b42

30 6月, 2021 1 次提交

binfmt: remove in-tree usage of MAP_EXECUTABLE · a4eec6a3

由 David Hildenbrand 提交于 6月 28, 2021

Ever since commit e9714acf ("mm: kill vma flag VM_EXECUTABLE and
mm->num_exe_file_vmas"), VM_EXECUTABLE is gone and MAP_EXECUTABLE is
essentially completely ignored.  Let's remove all usage of MAP_EXECUTABLE.

[akpm@linux-foundation.org: fix blooper in fs/binfmt_aout.c. per David]

Link: https://lkml.kernel.org/r/20210421093453.6904-3-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Greg Ungerer <gerg@linux-m68k.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kevin Brodsky <Kevin.Brodsky@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a4eec6a3

18 6月, 2021 1 次提交

sched: Change task_struct::state · 2f064a59

由 Peter Zijlstra 提交于 6月 11, 2021

Change the type and name of task_struct::state. Drop the volatile and
shrink it to an 'unsigned int'. Rename it in order to find all uses
such that we can use READ_ONCE/WRITE_ONCE as appropriate.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NDaniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: NWill Deacon <will@kernel.org>
Acked-by: NDaniel Thompson <daniel.thompson@linaro.org>
Link: https://lore.kernel.org/r/20210611082838.550736351@infradead.org

2f064a59

08 3月, 2021 1 次提交

coredump: don't bother with do_truncate() · d0f1088b

由 Al Viro 提交于 3月 08, 2020

have dump_skip() just remember how much needs to be skipped,
leave actual seeks/writing zeroes to the next dump_emit()
or the end of coredump output, whichever comes first.
And instead of playing with do_truncate() in the end, just
write one NUL at the end of the last gap (if any).
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

d0f1088b

16 2月, 2021 1 次提交

binfmt_misc: pass binfmt_misc flags to the interpreter · 2347961b

由 Laurent Vivier 提交于 1月 28, 2020

It can be useful to the interpreter to know which flags are in use.

For instance, knowing if the preserve-argv[0] is in use would
allow to skip the pathname argument.

This patch uses an unused auxiliary vector, AT_FLAGS, to add a
flag to inform interpreter if the preserve-argv[0] is enabled.

Note by Helge Deller:
The real-world user of this patch is qemu-user, which needs to know
if it has to preserve the argv[0]. See Debian bug #970460.
Signed-off-by: NLaurent Vivier <laurent@vivier.eu>
Reviewed-by: NYunQiang Su <ysu@wavecomp.com>
URL: http://bugs.debian.org/970460Signed-off-by: NHelge Deller <deller@gmx.de>

2347961b

06 1月, 2021 1 次提交

elf_prstatus: collect the common part (everything before pr_reg) into a struct · f2485a2d

由 Al Viro 提交于 6月 13, 2020

Preparations to doing i386 compat elf_prstatus sanely - rather than duplicating
the beginning of compat_elf_prstatus, take these fields into a separate
structure (compat_elf_prstatus_common), so that it could be reused.  Due to
the incestous relationship between binfmt_elf.c and compat_binfmt_elf.c we
need the same shape change done to native struct elf_prstatus, gathering the
fields prior to pr_reg into a new structure (struct elf_prstatus_common).

Fortunately, offset of pr_reg is always a multiple of 16 with no padding
right before it, so it's possible to turn all the stuff prior to it into
a single member without disturbing the layout.

[build fix from Geert Uytterhoeven folded in]
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

f2485a2d

05 1月, 2021 1 次提交

binfmt_elf: partially sanitize PRSTATUS_SIZE and SET_PR_FPVALID · 8a00dd00

由 Al Viro 提交于 6月 22, 2020

On 64bit architectures that support 32bit processes there are
two possible layouts for NT_PRSTATUS note in ELF coredumps.
For one thing, several fields are 64bit for native processes
and 32bit for compat ones (pr_sigpend, etc.).  For another,
the register dump is obviously different - the size and number
of registers are not going to be the same for 32bit and 64bit
variants of processor.

Usually that's handled by having two structures - elf_prstatus
for native layout and compat_elf_prstatus for 32bit one.
32bit processes are handled by fs/compat_binfmt_elf.c, which
defines a macro called 'elf_prstatus' that expands to compat_elf_prstatus.
Then it includes fs/binfmt_elf.c, which makes all references to
struct elf_prstatus to be textually replaced with struct
compat_elf_prstatus.  Ugly and somewhat brittle, but it works.

However, amd64 is worse - there are _three_ possible layouts.
One for native 64bit processes, another for i386 (32bit) processes
and yet another for x32 (32bit address space with full 64bit
registers).

Both i386 and x32 processes are handled by fs/compat_binfmt_elf.c,
with usual compat_binfmt_elf.c trickery.  However, the layouts
for i386 and x32 are not identical - they have the common beginning,
but the register dump part (pr_reg) is bigger on x32.  Worse, pr_reg
is not the last field - it's followed by int pr_fpvalid, so that
field ends up at different offsets for i386 and x32 layouts.

Fortunately, there's not much code that cares about any of that -
it's all encapsulated in fill_thread_core_info().  Since x32
variant is bigger, we define compat_elf_prstatus to match that
layout.  That way i386 processes have enough space to fit
their layout into.

Moreover, since these layouts are identical prior to pr_reg,
we don't need to distinguish x32 and i386 cases when we are
setting the fields prior to pr_reg.

Filling pr_reg itself is done by calling ->get() method of
appropriate regset, and that method knows what layout (and size)
to use.

We do need to distinguish x32 and i386 cases only for two
things: setting ->pr_fpvalid (offset differs for x32 and
i386) and choosing the right size for our note.

The way it's done is Not Nice, for the lack of more accurate
printable description.  There are two macros (PRSTATUS_SIZE and
SET_PR_FPVALID), that default essentially to sizeof(struct elf_prstatus)
and (S)->pr_fpvalid = 1.  On x86 asm/compat.h provides its own
variants.

Unfortunately, quite a few things go wrong there:
	* PRSTATUS_SIZE doesn't use the normal test for process
being an x32 one; it compares the size reported by regset with
the size of pr_reg.
	* it hardcodes the sizes of x32 and i386 variants (296 and 144
resp.), so if some change in includes leads to asm/compat.h pulled
in by fs/binfmt_elf.c we are in trouble - it will end up using
the size of x32 variant for 64bit processes.
	* it's in the wrong place; asm/compat.h couldn't define
the structure for i386 layout, since it lacks quite a few types
needed for it.  Hardcoded sizes are largely due to that.

The proper fix would be to have an explicitly defined i386 variant
of structure and have PRSTATUS_SIZE/SET_PR_FPVALID check for
TIF_X32 to choose the variant that should be used.  Unfortunately,
that requires some manipulations of headers; we'll do that later
in the series, but for now let's go with the minimal variant -
rename PRSTATUS_SIZE in asm/compat.h to COMPAT_PRSTATUS_SIZE,
have fs/compat_binfmt_elf.c define PRSTATUS_SIZE to COMPAT_PRSTATUS_SIZE
and use the normal TIF_X32 check in that macro.  The size of i386 variant
is kept hardcoded for now.  Similar story for SET_PR_FPVALID.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

8a00dd00

11 12月, 2020 1 次提交

coredump: Document coredump code exclusively used by cell spufs · c39ab6de

由 Eric W. Biederman 提交于 11月 25, 2020

Oleg Nesterov recently asked[1] why is there an unshare_files in
do_coredump. After digging through all of the callers of lookup_fd it
turns out that it is
arch/powerpc/platforms/cell/spufs/coredump.c:coredump_next_context
that needs the unshare_files in do_coredump.

Looking at the history[2] this code was also the only piece of coredump code
that required the unshare_files when the unshare_files was added.

Looking at that code it turns out that cell is also the only
architecture that implements elf_coredump_extra_notes_size and
elf_coredump_extra_notes_write.

I looked at the gdb repo[3] support for cell has been removed[4] in binutils
2.34. Geoff Levand reports he is still getting questions on how to
run modern kernels on the PS3, from people using 3rd party firmware so
this code is not dead. According to Wikipedia the last PS3 shipped in
Japan sometime in 2017. So it will probably be a little while before
everyone's hardware dies.

Add some comments briefly documenting the coredump code that exists
only to support cell spufs to make it easier to understand the
coredump code. Eventually the hardware will be dead, or their won't
be userspace tools, or the coredump code will be refactored and it
will be too difficult to update a dead architecture and these comments
make it easy to tell where to pull to remove cell spufs support.

[1] https://lkml.kernel.org/r/20201123175052.GA20279@redhat.com
[2] 179e037f ("do_coredump(): make sure that descriptor table isn't shared")
[3] git://sourceware.org/git/binutils-gdb.git
[4] abf516c6931a ("Remove Cell Broadband Engine debugging support").
Link: https://lkml.kernel.org/r/87h7pdnlzv.fsf_-_@x220.int.ebiederm.orgSigned-off-by: NEric W. Biederman <ebiederm@xmission.com>

c39ab6de

30 10月, 2020 1 次提交

fs: Replace zero-length array with flexible-array member · 5e01fdff

由 Gustavo A. R. Silva 提交于 8月 31, 2020

There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arraysSigned-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>

5e01fdff

26 10月, 2020 2 次提交

elf: Expose ELF header on arch_setup_additional_pages() · 9a29a671

由 Gabriel Krisman Bertazi 提交于 10月 03, 2020

Like it is done for SET_PERSONALITY with ARM, which requires the ELF
header to select correct personality parameters, x86 requires the
headers when selecting which VDSO to load, instead of relying on the
going-away TIF_IA32/X32 flags.

Add an indirection macro to arch_setup_additional_pages(), that x86 can
reimplement to receive the extra parameter just for ELF files. This
requires no changes to other architectures, who can continue to use the
original arch_setup_additional_pages for ELF and non-ELF binaries.
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201004032536.1229030-8-krisman@collabora.com

9a29a671

elf: Expose ELF header in compat_start_thread() · bc3d7bf6

由 Gabriel Krisman Bertazi 提交于 10月 03, 2020

Like it is done for SET_PERSONALITY with x86, which requires the ELF header
to select correct personality parameters, x86 requires the headers on
compat_start_thread() to choose starting CS for ELF32 binaries, instead of
relying on the going-away TIF_IA32/X32 flags.

Add an indirection macro to ELF invocations of START_THREAD, that x86 can
reimplement to receive the extra parameter just for ELF files. This
requires no changes to other architectures who don't need the header
information, they can continue to use the original start_thread for ELF and
non-ELF binaries, and it prevents affecting non-ELF code paths for x86.
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201004032536.1229030-6-krisman@collabora.com

bc3d7bf6

19 10月, 2020 1 次提交

binfmt_elf: take the mmap lock around find_extend_vma() · b2767d97

由 Jann Horn 提交于 10月 17, 2020

create_elf_tables() runs after setup_new_exec(), so other tasks can
already access our new mm and do things like process_madvise() on it.  (At
the time I'm writing this commit, process_madvise() is not in mainline
yet, but has been in akpm's tree for some time.)

While I believe that there are currently no APIs that would actually allow
another process to mess up our VMA tree (process_madvise() is limited to
MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm
under which no syscalls have been executed yet), this seems like an
accident waiting to happen.

Let's make sure that we always take the mmap lock around GUP paths as long
as another process might be able to see the mm.

(Yes, this diff looks suspicious because we drop the lock before doing
anything with `vma`, but that's because we actually don't do anything with
it apart from the NULL check.)
Signed-off-by: NJann Horn <jannh@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NMichel Lespinasse <walken@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Link: https://lkml.kernel.org/r/CAG48ez1-PBCdv3y8pn-Ty-b+FmBSLwDuVKFSt8h7wARLy0dF-Q@mail.gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b2767d97

17 10月, 2020 2 次提交

binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot · a07279c9

由 Jann Horn 提交于 10月 15, 2020

In both binfmt_elf and binfmt_elf_fdpic, use a new helper
dump_vma_snapshot() to take a snapshot of the VMA list (including the gate
VMA, if we have one) while protected by the mmap_lock, and then use that
snapshot instead of walking the VMA list without locking.

An alternative approach would be to keep the mmap_lock held across the
entire core dumping operation; however, keeping the mmap_lock locked while
we may be blocked for an unbounded amount of time (e.g. because we're
dumping to a FUSE filesystem or so) isn't really optimal; the mmap_lock
blocks things like the ->release handler of userfaultfd, and we don't
really want critical system daemons to grind to a halt just because
someone "gifted" them SCM_RIGHTS to an eternally-locked userfaultfd, or
something like that.

Since both the normal ELF code and the FDPIC ELF code need this
functionality (and if any other binfmt wants to add coredump support in
the future, they'd probably need it, too), implement this with a common
helper in fs/coredump.c.

A downside of this approach is that we now need a bigger amount of kernel
memory per userspace VMA in the normal ELF case, and that we need O(n)
kernel memory in the FDPIC ELF case at all; but 40 bytes per VMA shouldn't
be terribly bad.

There currently is a data race between stack expansion and anything that
reads ->vm_start or ->vm_end under the mmap_lock held in read mode; to
mitigate that for core dumping, take the mmap_lock in write mode when
taking a snapshot of the VMA hierarchy. (If we only took the mmap_lock in
read mode, we could end up with a corrupted core dump if someone does
get_user_pages_remote() concurrently. Not really a major problem, but
taking the mmap_lock either way works here, so we might as well avoid the
issue.) (This doesn't do anything about the existing data races with stack
expansion in other mm code.)
Signed-off-by: NJann Horn <jannh@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-6-jannh@google.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a07279c9

coredump: rework elf/elf_fdpic vma_dump_size() into common helper · 429a22e7

由 Jann Horn 提交于 10月 15, 2020

At the moment, the binfmt_elf and binfmt_elf_fdpic code have slightly
different code to figure out which VMAs should be dumped, and if so,
whether the dump should contain the entire VMA or just its first page.

Eliminate duplicate code by reworking the binfmt_elf version into a
generic core dumping helper in coredump.c.

As part of that, change the heuristic for detecting executable/library
header pages to check whether the inode is executable instead of looking
at the file mode.

This is less problematic in terms of locking because it lets us avoid
get_user() under the mmap_sem.  (And arguably it looks nicer and makes
more sense in generic code.)

Adjust a little bit based on the binfmt_elf_fdpic version: ->anon_vma is
only meaningful under CONFIG_MMU, otherwise we have to assume that the VMA
has been written to.
Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NJann Horn <jannh@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Link: http://lkml.kernel.org/r/20200827114932.3572699-5-jannh@google.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

429a22e7

openeuler / Kernel 大约 2 年 前同步成功

openeuler / Kernel
大约 2 年前同步成功