Commits · 34fdce6981b96920ced4e0ee56e9db3fb03a33f0 · openeuler / Kernel

01 May, 2020 · 1 commit

x86: Change {JMP,CALL}_NOSPEC argument · 34fdce69

Authored by Peter Zijlstra on Apr 22, 2020

In order to change the {JMP,CALL}_NOSPEC macros to call out-of-line
versions of the retpoline magic, we need to remove the '%' from the
argument, such that we can paste it onto symbol names.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20200428191700.151623523@infradead.org
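
As a rough C preprocessor analogue of the pasting this commit message describes (the real macros live at the assembler level, and every name below is a hypothetical stand-in), a bare register name can be glued onto a symbol prefix, whereas a '%'-prefixed one could not form a valid symbol name:

```c
/* Hypothetical stand-ins for out-of-line thunks, one per register. */
static const char *sketch_thunk_rax(void) { return "thunk for rax"; }
static const char *sketch_thunk_rbx(void) { return "thunk for rbx"; }

/*
 * Paste the bare register name onto a symbol prefix. This only works
 * because the argument is a plain identifier such as rax; an argument
 * written as %rax could not be pasted into a valid symbol name.
 */
#define SKETCH_THUNK(reg) sketch_thunk_##reg

/* Resolves to sketch_thunk_rax() at preprocessing time. */
static const char *sketch_call_via_rax(void)
{
    return SKETCH_THUNK(rax)();
}
```

Calling sketch_call_via_rax() returns the string from sketch_thunk_rax(); SKETCH_THUNK(rbx) selects the other stand-in the same way.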

34fdce69

08 Apr, 2020 · 1 commit

sparc,x86: vdso: remove meaningless undefining CONFIG_OPTIMIZE_INLINING · 12a5b00a

Authored by Masahiro Yamada on Apr 06, 2020

The code, #undef CONFIG_OPTIMIZE_INLINING, is not working as expected
because <linux/compiler_types.h> is parsed before vclock_gettime.c since
28128c61 ("kconfig.h: Include compiler types to avoid missed struct
attributes").

Since then, <linux/compiler_types.h> is included really early by using the
'-include' option.  So, you cannot negate the decision of
<linux/compiler_types.h> in this way.

You can confirm it by checking the pre-processed code, like this:

  $ make arch/x86/entry/vdso/vdso32/vclock_gettime.i

There is no difference with/without CONFIG_CC_OPTIMIZE_FOR_SIZE.

It is about two years since 28128c61.  Nobody has reported a problem
(or, nobody has even noticed the fact that this code is not working).

It is ugly and unreliable to attempt to undefine a CONFIG option from C
files, and anyway the inlining heuristic is up to the compiler.

Just remove the broken code.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Miller <davem@davemloft.net>
Link: http://lkml.kernel.org/r/20200220110807.32534-1-masahiroy@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

12a5b00a
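
For the CONFIG_OPTIMIZE_INLINING commit above, a minimal sketch of why a late #undef has no effect may help. It assumes a single translation unit in which an "early header" section stands in for a force-included file; the macro names are hypothetical and are not the kernel's:

```c
/* Stand-in for a CONFIG option set by the build system (hypothetical name). */
#define CONFIG_SKETCH_OPTIMIZE_INLINING 1

/*
 * Stand-in for a header that is processed first (as with '-include').
 * The decision is taken here, while the option is still defined.
 */
#ifdef CONFIG_SKETCH_OPTIMIZE_INLINING
#define SKETCH_INLINE_MODE 1   /* leave inlining decisions to the compiler */
#else
#define SKETCH_INLINE_MODE 0   /* force inlining */
#endif

/* Stand-in for the C file: this #undef comes too late to change anything. */
#undef CONFIG_SKETCH_OPTIMIZE_INLINING

static int sketch_inline_mode(void)
{
    return SKETCH_INLINE_MODE;   /* the value chosen by the early header */
}
```

Because SKETCH_INLINE_MODE was already defined when the option was undefined, sketch_inline_mode() still reports the early decision.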

27 Mar, 2020 · 1 commit

x86/vdso: Discard .note.gnu.property sections in vDSO · 4caffe6a

Authored by H.J. Lu on Mar 26, 2020

With the command-line option -mx86-used-note=yes which can also be
enabled at binutils build time with:

  --enable-x86-used-note  generate GNU x86 used ISA and feature properties

the x86 assembler in binutils 2.32 and above generates a program property
note in a note section, .note.gnu.property, to encode used x86 ISAs and
features.  But the kernel linker script only contains a single NOTE segment:

  PHDRS
  {
   text PT_LOAD FLAGS(5) FILEHDR PHDRS; /* PF_R|PF_X */
   dynamic PT_DYNAMIC FLAGS(4); /* PF_R */
   note PT_NOTE FLAGS(4); /* PF_R */
   eh_frame_hdr 0x6474e550;
  }

The NOTE segment generated by the vDSO linker script is aligned to 4 bytes.
But the .note.gnu.property section must be aligned to 8 bytes on x86-64:

  [hjl@gnu-skx-1 vdso]$ readelf -n vdso64.so

  Displaying notes found in: .note
    Owner                Data size 	Description
    Linux                0x00000004	Unknown note type: (0x00000000)
     description data: 06 00 00 00
  readelf: Warning: note with invalid namesz and/or descsz found at offset 0x20
  readelf: Warning:  type: 0x78, namesize: 0x00000100, descsize: 0x756e694c, alignment: 8

Since the .note.gnu.property section in the vDSO is not checked by the
dynamic linker, discard the .note.gnu.property sections in the vDSO.

 [ bp: Massage. ]

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20200326174314.254662-1-hjl.tools@gmail.com
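
The effect of the alignment mismatch can be sketched with a little arithmetic. The helper names are hypothetical, and the example assumes the first note is the "Linux" note from the readelf output (name size 6 including the NUL, descriptor size 4) starting at offset 0:

```c
#include <stddef.h>
#include <stdint.h>

/* Round n up to a multiple of align (align must be a power of two). */
static size_t sketch_align_up(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/*
 * Offset of the next note, given the offset of the current one.
 * An ELF note is a 12-byte header (namesz, descsz, type) followed by
 * the padded name and the padded descriptor.
 */
static size_t sketch_next_note(size_t off, uint32_t namesz, uint32_t descsz,
                               size_t align)
{
    size_t desc_off = sketch_align_up(off + 12 + namesz, align);

    return sketch_align_up(desc_off + descsz, align);
}
```

With 4-byte alignment the next note would start at 0x18, while a reader assuming 8-byte alignment would look at 0x20, which is consistent with the offset named in the readelf warning.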

4caffe6a

25 Mar, 2020 · 1 commit

.gitignore: add SPDX License Identifier · d198b34f

Authored by Masahiro Yamada on Mar 03, 2020

Add SPDX License Identifier to all .gitignore files.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

d198b34f

21 Mar, 2020 · 14 commits

x86/entry: Rename ___preempt_schedule · 46db36ab

Authored by Peter Zijlstra on Mar 20, 2020

Because moar '_' isn't always moar readable.

git grep -l "___preempt_schedule\(_notrace\)*" | while read file;
do
	sed -ie 's/___preempt_schedule\(_notrace\)*/preempt_schedule\1_thunk/g' $file;
done

Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20200320115858.995685950@infradead.org

46db36ab

x86/entry: Drop asmlinkage from syscalls · 0f78ff17

Authored by Brian Gerst on Mar 13, 2020

asmlinkage is no longer required since the syscall ABI is now fully under
x86 architecture control. This makes the 32-bit native syscalls a bit more
efficient by passing in regs via EAX instead of on the stack.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200313195144.164260-18-brgerst@gmail.com

0f78ff17

x86/entry/32: Enable pt_regs based syscalls · 25c619e5

Authored by Brian Gerst on Mar 13, 2020

Enable pt_regs based syscalls for 32-bit. This makes the 32-bit native
kernel consistent with the 64-bit kernel, and improves the syscall
interface by not needing to push all 6 potential arguments onto the stack.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Link: https://lkml.kernel.org/r/20200313195144.164260-17-brgerst@gmail.com
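
A minimal sketch of the pt_regs-based calling style this commit enables, assuming a heavily simplified register frame (the real struct pt_regs has more fields, and every name below is hypothetical). The wrapper receives one pointer and reads only the argument registers it needs:

```c
/* Hypothetical, simplified register frame. */
struct sketch_pt_regs {
    unsigned long bx, cx, dx, si, di, bp;  /* i386 syscall arguments 1..6 */
    unsigned long ax;                      /* syscall number / return value */
};

/* The actual implementation takes ordinary C arguments. */
static long sketch_do_add3(long a, long b, long c)
{
    return a + b + c;
}

/* The table entry receives only the frame pointer and unpacks what it uses. */
static long sketch_sys_add3(const struct sketch_pt_regs *regs)
{
    return sketch_do_add3((long)regs->bx, (long)regs->cx, (long)regs->dx);
}
```

A three-argument call reads three registers from the frame, so the entry code has no need to push all six potential arguments for every syscall.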

25c619e5

x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments · 121b32a5

Authored by Brian Gerst on Mar 13, 2020

For the 32-bit syscall interface, 64-bit arguments (loff_t) are passed via
a pair of 32-bit registers. These register pairs end up in consecutive stack
slots, which matches the C ABI for 64-bit arguments. But when accessing the
registers directly from pt_regs, the wrapper needs to manually reassemble the
64-bit value. These wrappers already exist for 32-bit compat, so make them
available to 32-bit native in preparation for enabling pt_regs-based syscalls.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Link: https://lkml.kernel.org/r/20200313195144.164260-16-brgerst@gmail.com
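
The manual reassembly this commit message mentions amounts to a shift and an OR. The helper below is a hypothetical sketch; which register carries which half depends on the individual syscall, so the parameter order here is an assumption:

```c
#include <stdint.h>

/* Rebuild a 64-bit value from the two 32-bit halves read from the frame. */
static uint64_t sketch_join_u64(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | lo;
}
```

For example, the halves 0x89abcdef (low) and 0x01234567 (high) combine into 0x0123456789abcdef.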

121b32a5

x86/entry/32: Rename 32-bit specific syscalls · 866128a9

Authored by Brian Gerst on Mar 13, 2020

Rename the syscalls that only exist for 32-bit from x86_* to ia32_* to make it
clear they are for 32-bit only.  Also rename the functions to match the syscall
name.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Link: https://lkml.kernel.org/r/20200313195144.164260-15-brgerst@gmail.com

866128a9

x86/entry/32: Clean up syscall_32.tbl · a845a6cf

Authored by Brian Gerst on Mar 13, 2020

After removal of the __ia32_ prefix, remove compat entries that are now
identical to the native entry.

Converted with this script and fixing up whitespace:

while read nr abi name entry compat; do
    if [ "${nr:0:1}" = "#" ]; then
        echo $nr $abi $name $entry $compat
        continue
    fi
    if [ "$entry" = "$compat" ]; then
        compat=""
    fi
    echo "$nr	$abi	$name		$entry		$compat"
done

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200313195144.164260-14-brgerst@gmail.com

a845a6cf

x86/entry: Remove ABI prefixes from functions in syscall tables · cab56d34

Authored by Brian Gerst on Mar 13, 2020

Move the ABI prefixes to the __SYSCALL_[abi]() macros.  This allows removal
of the need to strip the prefix for UML.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200313195144.164260-13-brgerst@gmail.com

cab56d34

x86/entry/64: Add __SYSCALL_COMMON() · 8210efcb

Authored by Brian Gerst on Mar 13, 2020

Add a __SYSCALL_COMMON() macro to the syscall table, which simplifies syscalltbl.sh.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200313195144.164260-12-brgerst@gmail.com

8210efcb

x86/entry: Remove syscall qualifier support · b5592e5c

Authored by Brian Gerst on Mar 13, 2020

Syscall qualifier support is no longer needed.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Link: https://lkml.kernel.org/r/20200313195144.164260-11-brgerst@gmail.com

b5592e5c

x86/entry/64: Remove ptregs qualifier from syscall table · d3b1b776

Authored by Brian Gerst on Mar 13, 2020

Now that the fast syscall path is removed, the ptregs qualifier is unused.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Link: https://lkml.kernel.org/r/20200313195144.164260-10-brgerst@gmail.com

d3b1b776

x86/entry: Move max syscall number calculation to syscallhdr.sh · 08720988

Authored by Brian Gerst on Mar 13, 2020

Instead of using an array in asm-offsets to calculate the max syscall
number, calculate it when writing out the syscall headers.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200313195144.164260-9-brgerst@gmail.com

08720988

x86/entry/64: Split X32 syscall table into its own file · 2e487c35

Authored by Brian Gerst on Mar 13, 2020

Since X32 has its own syscall table now, move it to a separate file.

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Link: https://lkml.kernel.org/r/20200313195144.164260-8-brgerst@gmail.com

2e487c35

x86/entry/64: Move sys_ni_syscall stub to common.c · cc42c045

Authored by Brian Gerst on Mar 13, 2020

so it can be available to multiple syscall tables.  Also directly return
-ENOSYS instead of bouncing to the generic sys_ni_syscall().

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20200313195144.164260-7-brgerst@gmail.com
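
The idea of one shared "not implemented" stub serving several tables can be sketched as follows. The table contents, sizes, and function names are hypothetical illustrations and do not reflect the kernel's actual tables:

```c
#include <errno.h>
#include <stddef.h>

typedef long (*sketch_syscall_fn)(void);

/* Shared stub: answers -ENOSYS directly instead of calling a generic helper. */
static long sketch_ni_syscall(void)
{
    return -ENOSYS;
}

/* A hypothetical implemented syscall. */
static long sketch_sys_getanswer(void)
{
    return 42;
}

#define SKETCH_NR_SYSCALLS 4

/* Two hypothetical tables that both point unimplemented slots at the stub. */
static const sketch_syscall_fn sketch_table_a[SKETCH_NR_SYSCALLS] = {
    sketch_ni_syscall, sketch_sys_getanswer, sketch_ni_syscall, sketch_ni_syscall,
};
static const sketch_syscall_fn sketch_table_b[SKETCH_NR_SYSCALLS] = {
    sketch_ni_syscall, sketch_ni_syscall, sketch_sys_getanswer, sketch_ni_syscall,
};

/* Out-of-range numbers get the same answer as unimplemented slots. */
static long sketch_dispatch(const sketch_syscall_fn *table, size_t nr)
{
    if (nr >= SKETCH_NR_SYSCALLS)
        return -ENOSYS;
    return table[nr]();
}
```

Slot 1 is implemented only in the first table, so the same number yields 42 through sketch_table_a and -ENOSYS through sketch_table_b.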

cc42c045

x86/entry/64: Use syscall wrappers for x32_rt_sigreturn · 27dd84fa

Authored by Brian Gerst on Mar 13, 2020

Add missing syscall wrapper for x32_rt_sigreturn().

Signed-off-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dominik Brodowski <linux@dominikbrodowski.net>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200313195144.164260-6-brgerst@gmail.com

27dd84fa

10 Mar, 2020 · 2 commits

x86/entry/64: Trace irqflags unconditionally as ON when returning to user space · 810f80a6

Authored by Thomas Gleixner on Mar 08, 2020

User space cannot disable interrupts any longer so trace return to user space
unconditionally as IRQS_ON.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200308222609.314596327@linutronix.de

810f80a6

x86/entry/32: Remove unused label restore_nocheck · 74a4882d

Authored by Thomas Gleixner on Mar 08, 2020

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/20200308222609.219366430@linutronix.de

74a4882d

29 Feb, 2020 · 1 commit

x86/entry/32: Remove the 0/-1 distinction from exception entries · e441a2ae

Authored by Thomas Gleixner on Feb 27, 2020

Nothing cares about the -1 "mark as interrupt" in the errorcode of
exception entries. It's only used to fill the error code when a signal is
delivered, but this is already inconsistent vs. 64 bit as there all
exceptions which do not have an error code set it to 0. So if 32 bit
applications would care about this, then they would have noticed more than
a decade ago.

Just use 0 consistently for all exceptions which do not have an errorcode.

This does not break /proc/$PID/syscall either, because this interface examines
the error code / syscall number which is on the stack and that is set to -1
(no syscall) in common_exception unconditionally for all exceptions. The
push in the entry stub is just there to fill the hardware error code slot
on the stack for consistency of the stack layout.

A transient observation of 0 is possible, but that's true for the other
exceptions which use 0 already as well and that interface is an unreliable
snapshot of dubious correctness anyway.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Link: https://lkml.kernel.org/r/87mu94m7ky.fsf@nanos.tec.linutronix.de

e441a2ae

27 Feb, 2020 · 3 commits

x86/entry/entry_32: Route int3 through common_exception · ac3607f9

Authored by Thomas Gleixner on Feb 25, 2020

int3 is not using the common_exception path for purely historical reasons,
but there is no reason to keep it the only exception which is different.

Make it use common_exception so the upcoming changes to autogenerate the
entry stubs do not have to special case int3.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220217.042369808@linutronix.de

ac3607f9

x86/entry/32: Force MCE through do_mce() · 840371be

Authored by Thomas Gleixner on Feb 25, 2020

Remove the pointless difference between 32 and 64 bit to make further
unifications simpler.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/20200225220216.428188397@linutronix.de

840371be

x86/entry/32: Add missing ASM_CLAC to general_protection entry · 3d51507f

Authored by Thomas Gleixner on Feb 25, 2020

All exception entry points must have ASM_CLAC right at the
beginning. The general_protection entry is missing one.

Fixes: e59d1b0a ("x86-32, smap: Add STAC/CLAC instructions to 32-bit kernel entry")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Reviewed-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20200225220216.219537887@linutronix.de

3d51507f

18 Feb, 2020 · 1 commit

x86/syscalls: Add prototypes for C syscall callbacks · 99ce3255

Authored by Benjamin Thiel on Jan 23, 2020

.. in order to fix a couple of -Wmissing-prototypes warnings.

No functional change.

 [ bp: Massage commit message and drop newlines. ]

Signed-off-by: Benjamin Thiel <b.thiel@posteo.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200123152754.20149-1-b.thiel@posteo.de

99ce3255

17 Feb, 2020 · 2 commits

x86/vdso: Use generic VDSO clock mode storage · b95a8a27

Authored by Thomas Gleixner on Feb 07, 2020

Switch to the generic VDSO clock mode storage.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> (VDSO parts)
Acked-by: Juergen Gross <jgross@suse.com> (Xen parts)
Acked-by: Paolo Bonzini <pbonzini@redhat.com> (KVM parts)
Link: https://lkml.kernel.org/r/20200207124403.152039903@linutronix.de

b95a8a27

x86/vdso: Move VDSO clocksource state tracking to callback · eec399dd

Authored by Thomas Gleixner on Feb 07, 2020

All architectures which use the generic VDSO code have their own storage
for the VDSO clock mode. That's pointless and just requires duplicate code.

X86 abuses the function which retrieves the architecture specific clock
mode storage to mark the clocksource as used in the VDSO. That's silly
because this is invoked on every tick when the VDSO data is updated.

Move this functionality to the clocksource::enable() callback so it gets
invoked once when the clocksource is installed. This allows the clock mode
storage to be made generic.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michael Kelley <mikelley@microsoft.com> (Hyper-V parts)
Reviewed-by: Vincenzo Frascino <vincenzo.frascino@arm.com> (VDSO parts)
Acked-by: Juergen Gross <jgross@suse.com> (Xen parts)
Link: https://lkml.kernel.org/r/20200207124402.934519777@linutronix.de

eec399dd

04 Feb, 2020 · 1 commit

kbuild: rename hostprogs-y/always to hostprogs/always-y · 5f2fb52f

Authored by Masahiro Yamada on Feb 02, 2020

In the old days, the "host-progs" syntax was used for specifying host
programs. It was renamed to the current "hostprogs-y" in 2004.

It is typically useful in scripts/Makefile because it allows Kbuild to
selectively compile host programs based on the kernel configuration.

This commit renames as follows:

  always       ->  always-y
  hostprogs-y  ->  hostprogs

So, scripts/Makefile will look like this:

  always-$(CONFIG_BUILD_BIN2C) += ...
  always-$(CONFIG_KALLSYMS)    += ...
      ...
  hostprogs := $(always-y) $(always-m)

I think this makes more sense because a host program is always a host
program, irrespective of the kernel configuration. We want to specify
which ones to compile by CONFIG options, so always-y will be handier.

The "always", "hostprogs-y", "hostprogs-m" will be kept for backward
compatibility for a while.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>

5f2fb52f

18 Jan, 2020 · 1 commit

open: introduce openat2(2) syscall · fddb5d43

Authored by Aleksa Sarai on Jan 18, 2020

/* Background. */
For a very long time, extending openat(2) with new features has been
incredibly frustrating. This stems from the fact that openat(2) is
possibly the most famous counter-example to the mantra "don't silently
accept garbage from userspace" -- it doesn't check whether unknown flags
are present[1].

This means that (generally) the addition of new flags to openat(2) has
been fraught with backwards-compatibility issues (O_TMPFILE has to be
defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
kernels gave errors, since it's insecure to silently ignore the
flag[2]). All new security-related flags therefore have a tough road to
being added to openat(2).

Userspace also has a hard time figuring out whether a particular flag is
supported on a particular kernel. While it is now possible with
contemporary kernels (thanks to [3]), older kernels will expose unknown
flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
openat(2) time matches modern syscall designs and is far more
fool-proof.

In addition, the newly-added path resolution restriction LOOKUP flags
(which we would like to expose to user-space) don't feel related to the
pre-existing O_* flag set -- they affect all components of path lookup.
We'd therefore like to add a new flag argument.

Adding a new syscall allows us to finally fix the flag-ignoring problem,
and we can make it extensible enough so that we will hopefully never
need an openat3(2).

/* Syscall Prototype. */
  /*
   * open_how is an extensible structure (similar in interface to
   * clone3(2) or sched_setattr(2)). The size parameter must be set to
   * sizeof(struct open_how), to allow for future extensions. All future
   * extensions will be appended to open_how, with their zero value
   * acting as a no-op default.
   */
  struct open_how { /* ... */ };

  int openat2(int dfd, const char *pathname,
              struct open_how *how, size_t size);

/* Description. */
The initial version of 'struct open_how' contains the following fields:

  flags
    Used to specify openat(2)-style flags. However, any unknown flag
    bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
    will result in -EINVAL. In addition, this field is 64-bits wide to
    allow for more O_ flags than currently permitted with openat(2).

  mode
    The file mode for O_CREAT or O_TMPFILE.

    Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.

  resolve
    Restrict path resolution (in contrast to O_* flags they affect all
    path components). The current set of flags are as follows (at the
    moment, all of the RESOLVE_ flags are implemented as just passing
    the corresponding LOOKUP_ flag).

    RESOLVE_NO_XDEV       => LOOKUP_NO_XDEV
    RESOLVE_NO_SYMLINKS   => LOOKUP_NO_SYMLINKS
    RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
    RESOLVE_BENEATH       => LOOKUP_BENEATH
    RESOLVE_IN_ROOT       => LOOKUP_IN_ROOT

open_how does not contain an embedded size field, because it is of
little benefit (userspace can figure out the kernel open_how size at
runtime fairly easily without it). It also only contains u64s (even
though ->mode arguably should be a u16) to avoid having padding fields
which are never used in the future.

Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
is no longer permitted for openat(2). As far as I can tell, this has
always been a bug and appears to not be used by userspace (and I've not
seen any problems on my machines by disallowing it). If it turns out
this breaks something, we can special-case it and only permit it for
openat(2) but not openat2(2).

After input from Florian Weimer, the new open_how and flag definitions
are inside a separate header from uapi/linux/fcntl.h, to avoid problems
that glibc has with importing that header.

/* Testing. */
In a follow-up patch there are over 200 selftests which ensure that this
syscall has the correct semantics and will correctly handle several
attack scenarios.

In addition, I've written a userspace library[4] which provides
convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
syscalls). During the development of this patch, I've run numerous
verification tests using libpathrs (showing that the API is reasonably
usable by userspace).

/* Future Work. */
Additional RESOLVE_ flags have been suggested during the review period.
These can be easily implemented separately (such as blocking auto-mount
during resolution).

Furthermore, there are some other proposed changes to the openat(2)
interface (the most obvious example is magic-link hardening[5]) which
would be a good opportunity to add a way for userspace to restrict how
O_PATH file descriptors can be re-opened.

Another possible avenue of future work would be some kind of
CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
which openat2(2) flags and fields are supported by the current kernel
(to avoid userspace having to go through several guesses to figure it
out).

[1]: https://lwn.net/Articles/588444/
[2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
[3]: commit 629e014b ("fs: completely ignore unknown open flags")
[4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
[5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
[6]: https://youtu.be/ggD-eb3yPVs

Suggested-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fddb5d43
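
For the openat2(2) commit above, the rule that future fields are appended and that their zero value is a no-op can be sketched as a size-aware copy. This is an illustration under stated assumptions, not the kernel's code: the commit message does not show how the check is implemented, and the struct and function names below are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout described in the commit message: three 64-bit fields. */
struct sketch_open_how {
    uint64_t flags;
    uint64_t mode;
    uint64_t resolve;
};

/*
 * Copy a caller-supplied struct of usize bytes into *dst.
 *  - shorter input: the missing tail fields read as zero (the no-op default);
 *  - longer input: accepted only if every unknown trailing byte is zero.
 * Returns 0 on success or -E2BIG when an unknown field is non-zero.
 */
static int sketch_copy_how(struct sketch_open_how *dst, const void *src,
                           size_t usize)
{
    const size_t ksize = sizeof(*dst);
    const unsigned char *bytes = src;
    size_t i;

    for (i = ksize; i < usize; i++)
        if (bytes[i] != 0)
            return -E2BIG;

    memset(dst, 0, ksize);
    memcpy(dst, src, usize < ksize ? usize : ksize);
    return 0;
}
```

An older caller passing a 16-byte struct gets resolve == 0, and a newer caller passing a 32-byte struct succeeds only while the extra 8 bytes are zero.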

14 Jan, 2020 · 7 commits

x86/vdso: Zap vvar pages when switching to a time namespace · 70ddf651

Authored by Dmitry Safonov on Nov 12, 2019

The VVAR page layout depends on whether a task belongs to the root or
non-root time namespace. Whenever a task changes its namespace, the VVAR
page tables are cleared and then they will be re-faulted with a
corresponding layout.

Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-27-dima@arista.com

70ddf651

x86/vdso: On timens page fault prefault also VVAR page · e6b28ec6

Authored by Dmitry Safonov on Nov 12, 2019

As the timens page has offsets to data on the VVAR page, the VVAR page is
going to be accessed shortly. Set it up together with the timens page in one
page fault as an optimization.

Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-26-dima@arista.com

e6b28ec6

x86/vdso: Handle faults on timens page · af34ebeb

Authored by Dmitry Safonov on Nov 12, 2019

If a task belongs to a time namespace then the VVAR page which contains
the system wide VDSO data is replaced with a namespace specific page
which has the same layout as the VVAR page.

Co-developed-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-25-dima@arista.com

af34ebeb

x86/vdso: Add time namespace page · 550a77a7

Authored by Dmitry Safonov on Nov 12, 2019

To support time namespaces in the VDSO with a minimal impact on regular non
time namespace affected tasks, the namespace handling needs to be hidden in
a slow path.

The most obvious place is vdso_seq_begin(). If a task belongs to a time
namespace then the VVAR page which contains the system wide VDSO data is
replaced with a namespace specific page which has the same layout as the
VVAR page. That page has vdso_data->seq set to 1 to enforce the slow path
and vdso_data->clock_mode set to VCLOCK_TIMENS to enforce the time
namespace handling path.

The extra check in the case that vdso_data->seq is odd, e.g. a concurrent
update of the VDSO data is in progress, is not really affecting regular
tasks which are not part of a time namespace as the task is spin waiting
for the update to finish and vdso_data->seq to become even again.

If a time namespace task hits that code path, it invokes the corresponding
time getter function which retrieves the real VVAR page, reads host time
and then adds the offset for the requested clock which is stored in the
special VVAR page.

Allocate the time namespace page among VVAR pages and place vdso_data on
it. Provide __arch_get_timens_vdso_data() helper for VDSO code to get the
code-relative position of VVARs on that special page.

Co-developed-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-23-dima@arista.com
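
The read path this commit message describes can be sketched roughly as below. It is a single-threaded illustration under loud assumptions: the structure, the field names other than seq and clock_mode, and the constant values are hypothetical, and the memory barriers a real sequence-counter reader needs are omitted.

```c
#include <stdint.h>

enum { SKETCH_VCLOCK_TSC = 1, SKETCH_VCLOCK_TIMENS = 2 };  /* hypothetical values */

struct sketch_vdso_data {
    uint32_t seq;         /* odd during an update, or pinned to 1 on a timens page */
    int      clock_mode;
    uint64_t now_ns;      /* host time (real VVAR page only) */
    int64_t  offset_ns;   /* per-namespace offset (timens page only) */
};

/*
 * vd is the page mapped at the VVAR position; real is the real VVAR page.
 * Regular tasks have vd == real and normally take the fast path.
 */
static uint64_t sketch_gettime(const struct sketch_vdso_data *vd,
                               const struct sketch_vdso_data *real)
{
    for (;;) {
        uint32_t seq = vd->seq;

        if (seq & 1) {
            /* Slow path: either an update is running or this is a timens page. */
            if (vd->clock_mode == SKETCH_VCLOCK_TIMENS)
                return real->now_ns + (uint64_t)vd->offset_ns;
            continue;  /* regular task: wait for seq to become even again */
        }

        uint64_t t = vd->now_ns;

        if (vd->seq == seq)   /* no update raced with the read */
            return t;
    }
}
```

A timens page with seq == 1 and an offset of -100 ns turns a host time of 1000 ns into 900 ns, while a regular task reading the real page sees 1000 ns.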

550a77a7

x86/vdso: Provide vdso_data offset on vvar_page · 64b302ab

Authored by Dmitry Safonov on Nov 12, 2019

VDSO support for time namespaces needs to set up a page with the same
layout as VVAR. That timens page will be placed at the position of the VVAR
page inside the namespace. That page has vdso_data->seq set to 1 to enforce
the slow path and vdso_data->clock_mode set to VCLOCK_TIMENS to enforce
the time namespace handling path.

To prepare the time namespace page the kernel needs to know the vdso_data
offset.  Provide arch_get_vdso_data() helper for locating vdso_data on VVAR
page.

Co-developed-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-22-dima@arista.com

64b302ab

x86/vdso: Restrict splitting VVAR VMA · 6f74acfd

Authored by Dmitry Safonov on Nov 12, 2019

Forbid splitting VVAR VMA resulting in a stricter ABI and reducing the
amount of corner-cases to consider while working further on VDSO time
namespace support.

As the offset from timens to VVAR page is computed at compile time, the pages
in VVAR should stay together and not be partially mremap()'ed.

Co-developed-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20191112012724.250792-20-dima@arista.com

6f74acfd

arch: wire up pidfd_getfd syscall · 9a2cef09

Authored by Sargun Dhillon on Jan 07, 2020

This wires up the pidfd_getfd syscall for all architectures.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20200107175927.4558-4-sargun@sargun.me
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>

9a2cef09

09 Jan, 2020 · 1 commit

x86/entry/64: Add instruction suffix to SYSRET · b2b1d94c

Authored by Jan Beulich on Dec 16, 2019

ignore_sysret() contains an unsuffixed SYSRET instruction. gas correctly
interprets this as SYSRETL, but leaving it up to gas to guess when there
is no register operand that implies a size is bad practice, and upstream
gas is likely to warn about this in the future. Use SYSRETL explicitly.
This does not change the assembled output.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Link: https://lkml.kernel.org/r/038a7c35-062b-a285-c6d2-653b56585844@suse.com

b2b1d94c

29 Dec, 2019 · 1 commit

x86/vdso: Provide missing include file · bff47c23

Authored by Valdis Klētnieks on Dec 05, 2019

When building with C=1, sparse issues a warning:

  CHECK   arch/x86/entry/vdso/vdso32-setup.c
  arch/x86/entry/vdso/vdso32-setup.c:28:28: warning: symbol 'vdso32_enabled' was not declared. Should it be static?

Provide the missing header file.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/36224.1575599767@turing-police

bff47c23

27 Nov, 2019 · 2 commits

x86/entry/32: Remove unused 'restore_all_notrace' local label · 3e1b4358

Authored by Borislav Petkov on Nov 24, 2019

Signed-off-by: Borislav Petkov <bp@alien8.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

3e1b4358

x86/doublefault/32: Rewrite the x86_32 #DF handler and unify with 64-bit · 7d8d8cfd

Authored by Andy Lutomirski on Nov 20, 2019

The old x86_32 doublefault_fn() was old and crufty, and it did not
even try to recover.  do_double_fault() is much nicer.  Rewrite the
32-bit double fault code to sanitize CPU state and call
do_double_fault().  This is mostly an exercise in i386 archaeology.

With this patch applied, 32-bit double faults get a real stack trace,
just like 64-bit double faults.

[ mingo: merged the patch to a later kernel base. ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>

7d8d8cfd

openeuler / Kernel: last synced successfully almost 2 years ago