提交 · 1446e1df9eb183fdf81c3f0715402f1d7595d4cb · openeuler / Kernel

02 12月, 2020 1 次提交

kernel: Implement selective syscall userspace redirection · 1446e1df

由 Gabriel Krisman Bertazi 提交于 11月 27, 2020

Introduce a mechanism to quickly disable/enable syscall handling for a
specific process and redirect to userspace via SIGSYS.  This is useful
for processes with parts that require syscall redirection and parts that
don't, but who need to perform this boundary crossing really fast,
without paying the cost of a system call to reconfigure syscall handling
on each boundary transition.  This is particularly important for Windows
games running over Wine.

The proposed interface looks like this:

  prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <off>, <length>, [selector])

The range [<offset>,<offset>+<length>) is a part of the process memory
map that is allowed to by-pass the redirection code and dispatch
syscalls directly, such that in fast paths a process doesn't need to
disable the trap nor the kernel has to check the selector.  This is
essential to return from SIGSYS to a blocked area without triggering
another SIGSYS from rt_sigreturn.

selector is an optional pointer to a char-sized userspace memory region
that has a key switch for the mechanism. This key switch is set to
either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the
redirection without calling the kernel.

The feature is meant to be set per-thread and it is disabled on
fork/clone/execv.

Internally, this doesn't add overhead to the syscall hot path, and it
requires very little per-architecture support.  I avoided using seccomp,
even though it duplicates some functionality, due to previous feedback
that maybe it shouldn't mix with seccomp since it is not a security
mechanism.  And obviously, this should never be considered a security
mechanism, since any part of the program can by-pass it by using the
syscall dispatcher.

For the sysinfo benchmark, which measures the overhead added to
executing a native syscall that doesn't require interception, the
overhead using only the direct dispatcher region to issue syscalls is
pretty much irrelevant.  The overhead of using the selector goes around
40ns for a native (unredirected) syscall in my system, and it is (as
expected) dominated by the supervisor-mode user-address access.  In
fact, with SMAP off, the overhead is consistently less than 5ns on my
test box.
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NAndy Lutomirski <luto@kernel.org>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201127193238.821364-4-krisman@collabora.com

1446e1df

25 11月, 2020 1 次提交

entry: Fix boot for !CONFIG_GENERIC_ENTRY · 5903f61e

由 Gabriel Krisman Bertazi 提交于 11月 23, 2020

A copy-pasta mistake tries to set SYSCALL_WORK flags instead of TIF
flags for !CONFIG_GENERIC_ENTRY.  Also, add safeguards to catch this at
compilation time.

Fixes: 3136b93c ("entry: Expose helpers to migrate TIF to SYSCALL_WORK flags")
Reported-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
Suggested-by: NJann Horn <jannh@google.com>
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/87a6v8qd9p.fsf_-_@collabora.com

5903f61e

19 11月, 2020 1 次提交

context_tracking: Don't implement exception_enter/exit() on CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK · 179a9cf7

由 Frederic Weisbecker 提交于 11月 17, 2020

The typical steps with context tracking are:

1) Task runs in userspace
2) Task enters the kernel (syscall/exception/IRQ)
3) Task switches from context tracking state CONTEXT_USER to
   CONTEXT_KERNEL (user_exit())
4) Task does stuff in kernel
5) Task switches from context tracking state CONTEXT_KERNEL to
   CONTEXT_USER (user_enter())
6) Task exits the kernel

If an exception fires between 5) and 6), the pt_regs and the context
tracking disagree on the context of the faulted/trapped instruction.
CONTEXT_KERNEL must be set before the exception handler, that's
unconditional for those handlers that want to be able to call into
schedule(), but CONTEXT_USER must be restored when the exception exits
whereas pt_regs tells that we are resuming to kernel space.

This can't be fixed with storing the context tracking state in a per-cpu
or per-task variable since another exception may fire onto the current
one and overwrite the saved state. Also the task can schedule. So it
has to be stored in a per task stack.

This is how exception_enter()/exception_exit() paper over the problem:

5) Task switches from context tracking state CONTEXT_KERNEL to
   CONTEXT_USER (user_enter())
5.1) Exception fires
5.2) prev_state = exception_enter() // save CONTEXT_USER to prev_state
                                    // and set CONTEXT_KERNEL
5.3) Exception handler
5.4) exception_enter(prev_state) // restore CONTEXT_USER
5.5) Exception resumes
6) Task exits the kernel

The condition to live without exception_enter()/exception_exit() is to
forbid exceptions and IRQs between 2) and 3) and between 5) and 6), or if
any is allowed to trigger, it won't call into context tracking, eg: NMIs,
and it won't schedule. These requirements are met by architectures
supporting CONFIG_HAVE_CONTEXT_TRACKING_OFFSTACK and those can
therefore afford not to implement this hack.
Signed-off-by: NFrederic Weisbecker <frederic@kernel.org>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201117151637.259084-3-frederic@kernel.org

179a9cf7

17 11月, 2020 8 次提交

entry: Drop usage of TIF flags in the generic syscall code · 29915524

由 Gabriel Krisman Bertazi 提交于 11月 16, 2020

Now that the flags migration in the common syscall entry code is complete
and the code relies exclusively on thread_info::syscall_work, clean up the
accesses to TI flags in that path.
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NAndy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-10-krisman@collabora.com

29915524

audit: Migrate to use SYSCALL_WORK flag · 785dc4eb

由 Gabriel Krisman Bertazi 提交于 11月 16, 2020

On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.

Define SYSCALL_WORK_SYSCALL_AUDIT, use it in the generic entry code and
convert the code which uses the TIF specific helper functions to use the
new *_syscall_work() helpers which either resolve to the new mode for users
of the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NAndy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-9-krisman@collabora.com

785dc4eb

ptrace: Migrate TIF_SYSCALL_EMU to use SYSCALL_WORK flag · 64eb35f7

由 Gabriel Krisman Bertazi 提交于 11月 16, 2020

Define SYSCALL_WORK_SYSCALL_EMU, use it in the generic entry code and
convert the code which uses the TIF specific helper functions to use the
new *_syscall_work() helpers which either resolve to the new mode for users
of the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NAndy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-8-krisman@collabora.com

64eb35f7

ptrace: Migrate to use SYSCALL_TRACE flag · 64c19ba2

由 Gabriel Krisman Bertazi 提交于 11月 16, 2020

Define SYSCALL_WORK_SYSCALL_TRACE, use it in the generic entry code and
convert the code which uses the TIF specific helper functions to use the
new *_syscall_work() helpers which either resolve to the new mode for users
of the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NAndy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-7-krisman@collabora.com

64c19ba2

tracepoints: Migrate to use SYSCALL_WORK flag · 524666cb

由 Gabriel Krisman Bertazi 提交于 11月 16, 2020

Define SYSCALL_WORK_SYSCALL_TRACEPOINT, use it in the generic entry code
and convert the code which uses the TIF specific helper functions to use
the new *_syscall_work() helpers which either resolve to the new mode for
users of the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NAndy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-6-krisman@collabora.com

524666cb

seccomp: Migrate to use SYSCALL_WORK flag · 23d67a54

由 Gabriel Krisman Bertazi 提交于 11月 16, 2020

On architectures using the generic syscall entry code the architecture
independent syscall work is moved to flags in thread_info::syscall_work.
This removes architecture dependencies and frees up TIF bits.

Define SYSCALL_WORK_SECCOMP, use it in the generic entry code and convert
the code which uses the TIF specific helper functions to use the new
*_syscall_work() helpers which either resolve to the new mode for users of
the generic entry code or to the TIF based functions for the other
architectures.
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NAndy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-5-krisman@collabora.com

23d67a54

entry: Wire up syscall_work in common entry code · b86678cf

由 Gabriel Krisman Bertazi 提交于 11月 16, 2020

Prepare the common entry code to use the SYSCALL_WORK flags. They will
be defined in subsequent patches for each type of syscall
work. SYSCALL_WORK_ENTRY/EXIT are defined for the transition, as they
will replace the TIF_ equivalent defines.
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NAndy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-4-krisman@collabora.com

b86678cf

entry: Expose helpers to migrate TIF to SYSCALL_WORK flags · 3136b93c

由 Gabriel Krisman Bertazi 提交于 11月 16, 2020

With the goal to split the syscall work related flags into a separate
field that is architecture independent, expose transitional helpers that
resolve to either the TIF flags or to the corresponding SYSCALL_WORK
flags. This will allow architectures to migrate only when they port to
the generic syscall entry code.
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NAndy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20201116174206.2639648-3-krisman@collabora.com

3136b93c

16 11月, 2020 1 次提交

entry: Fix spelling/typo errors in irq entry code · 78a56e04

由 Ira Weiny 提交于 11月 04, 2020

s/reguired/required/
s/Interupts/Interrupts/
s/quiescient/quiescent/
s/assemenbly/assembly/
Signed-off-by: NIra Weiny <ira.weiny@intel.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201104230157.3378023-1-ira.weiny@intel.com

78a56e04

05 11月, 2020 1 次提交

x86/entry: Move nmi entry/exit into common code · b6be002b

由 Thomas Gleixner 提交于 11月 02, 2020

Lockdep state handling on NMI enter and exit is nothing specific to X86. It's
not any different on other architectures. Also the extra state type is not
necessary, irqentry_state_t can carry the necessary information as well.

Move it to common code and extend irqentry_state_t to carry lockdep state.

[ Ira: Make exit_rcu and lockdep a union as they are mutually exclusive
  between the IRQ and NMI exceptions, and add kernel documentation for
  struct irqentry_state_t ]
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NIra Weiny <ira.weiny@intel.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201102205320.1458656-7-ira.weiny@intel.com

b6be002b

31 10月, 2020 1 次提交

net/mlx5: Replace zero-length array with flexible-array member · 29056207

由 Gustavo A. R. Silva 提交于 10月 27, 2020

There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9/process/deprecated.html#zero-length-and-one-element-arraysSigned-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>

29056207

30 10月, 2020 7 次提交

debugfs: remove return value of debugfs_create_devm_seqfile() · 0d519cbf

由 Greg Kroah-Hartman 提交于 10月 23, 2020

No one checks the return value of debugfs_create_devm_seqfile(), as it's
not needed, so make the return value void, so that no one tries to do so
in the future.

Link: https://lore.kernel.org/r/20201023131037.2500765-1-gregkh@linuxfoundation.orgSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

0d519cbf

fs: Replace zero-length array with flexible-array member · 5e01fdff

由 Gustavo A. R. Silva 提交于 8月 31, 2020

There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arraysSigned-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>

5e01fdff

platform/chrome: cros_ec_proto: Replace zero-length array with flexible-array member · 12008883

由 Gustavo A. R. Silva 提交于 8月 31, 2020

There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arraysSigned-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>

12008883

platform/chrome: cros_ec_commands: Replace zero-length array with flexible-array member · 88354105

由 Gustavo A. R. Silva 提交于 8月 31, 2020

There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arraysSigned-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>

88354105

mailbox: zynqmp-ipi-message: Replace zero-length array with flexible-array member · 277ffd6c

由 Gustavo A. R. Silva 提交于 8月 31, 2020

There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arraysSigned-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>

277ffd6c

dmaengine: ti-cppi5: Replace zero-length array with flexible-array member · a4147d85

由 Gustavo A. R. Silva 提交于 8月 31, 2020

There is a regular need in the kernel to provide a way to declare having a
dynamically sized set of trailing elements in a structure. Kernel code should
always use “flexible array members”[1] for these cases. The older style of
one-element or zero-length arrays should no longer be used[2].

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arraysSigned-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>

a4147d85

include: jhash/signal: Fix fall-through warnings for Clang · 4169e889

由 Gustavo A. R. Silva 提交于 9月 02, 2020

In preparation to enable -Wimplicit-fallthrough for Clang, explicitly
add break statements instead of letting the code fall through to the
next case.

This patch adds four break statements that, together, fix almost 40,000
warnings when building Linux 5.10-rc1 with Clang 12.0.0 and this[1] change
reverted. Notice that in order to enable -Wimplicit-fallthrough for Clang,
such change[1] is meant to be reverted at some point. So, this patch helps
to move in that direction.

Something important to mention is that there is currently a discrepancy
between GCC and Clang when dealing with switch fall-through to empty case
statements or to cases that only contain a break/continue/return
statement[2][3][4].

Now that the -Wimplicit-fallthrough option has been globally enabled[5],
any compiler should really warn on missing either a fallthrough annotation
or any of the other case-terminating statements (break/continue/return/
goto) when falling through to the next case statement. Making exceptions
to this introduces variation in case handling which may continue to lead
to bugs, misunderstandings, and a general lack of robustness. The point
of enabling options like -Wimplicit-fallthrough is to prevent human error
and aid developers in spotting bugs before their code is even built/
submitted/committed, therefore eliminating classes of bugs. So, in order
to really accomplish this, we should, and can, move in the direction of
addressing any error-prone scenarios and get rid of the unintentional
fallthrough bug-class in the kernel, entirely, even if there is some minor
redundancy. Better to have explicit case-ending statements than continue to
have exceptions where one must guess as to the right result. The compiler
will eliminate any actual redundancy.

[1] commit e2079e93 ("kbuild: Do not enable -Wimplicit-fallthrough for clang for now")
[2] https://github.com/ClangBuiltLinux/linux/issues/636
[3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91432
[4] https://godbolt.org/z/xgkvIh
[5] commit a035d552 ("Makefile: Globally enable fall-through warning")
Co-developed-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NGustavo A. R. Silva <gustavoars@kernel.org>

4169e889

29 10月, 2020 6 次提交

cpufreq: Introduce cpufreq_driver_test_flags() · a62f68f5

由 Rafael J. Wysocki 提交于 10月 23, 2020

Add a helper function to test the flags of the cpufreq driver in use
againt a given flags mask.

In particular, this will be needed to test the
CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag in the schedutil
governor.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

a62f68f5

entry: Add support for TIF_NOTIFY_SIGNAL · 12db8b69

由 Jens Axboe 提交于 10月 26, 2020

Add TIF_NOTIFY_SIGNAL handling in the generic entry code, which if set,
will return true if signal_pending() is used in a wait loop. That causes an
exit of the loop so that notify_signal tracehooks can be run. If the wait
loop is currently inside a system call, the system call is restarted once
task_work has been processed.

In preparation for only having arch_do_signal() handle syscall restarts if
_TIF_SIGPENDING isn't set, rename it to arch_do_signal_or_restart().  Pass
in a boolean that tells the architecture specific signal handler if it
should attempt to get a signal, or just process a potential syscall
restart.

For !CONFIG_GENERIC_ENTRY archs, add the TIF_NOTIFY_SIGNAL handling to
get_signal(). This is done to minimize the needed architecture changes to
support this feature.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NOleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20201026203230.386348-3-axboe@kernel.dk

12db8b69

signal: Add task_sigpending() helper · 5c251e9d

由 Jens Axboe 提交于 10月 26, 2020

This is in preparation for maintaining signal_pending() as the decider of
whether or not a schedule() loop should be broken, or continue sleeping.
This is different than the core signal use cases, which really need to know
whether an actual signal is pending or not. task_sigpending() returns
non-zero if TIF_SIGPENDING is set.

Only core kernel use cases should care about the distinction between
the two, make sure those use the task_sigpending() helper.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NOleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20201026203230.386348-2-axboe@kernel.dk

5c251e9d

misc: mic: remove the MIC drivers · 80ade22c

由 Sudeep Dutt 提交于 10月 27, 2020

This patch removes the MIC drivers from the kernel tree
since the corresponding devices have been discontinued.

Removing the dma and char-misc changes in one patch and
merging via the char-misc tree is best to avoid any
potential build breakage.

Cc: Nikhil Rao <nikhil.rao@intel.com>
Reviewed-by: NAshutosh Dixit <ashutosh.dixit@intel.com>
Signed-off-by: NSudeep Dutt <sudeep.dutt@intel.com>
Acked-By: NVinod Koul <vkoul@kernel.org>
Reviewed-by: NSherry Sun <sherry.sun@nxp.com>
Link: https://lore.kernel.org/r/8c1443136563de34699d2c084df478181c205db4.1603854416.git.sudeep.dutt@intel.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

80ade22c

jbd2: fix a kernel-doc markup · ea4b01d9

由 Mauro Carvalho Chehab 提交于 10月 27, 2020

The kernel-doc markup that documents _fc_replay_callback is
missing an asterisk, causing this warning:

	../include/linux/jbd2.h:1271: warning: Function parameter or member 'j_fc_replay_callback' not described in 'journal_s'

When building the docs.

Fixes: 609f928af48f ("jbd2: fast commit recovery path")
Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/6055927ada2015b55b413cdd2670533bdc9a8da2.1603791716.git.mchehab+huawei@kernel.orgSigned-off-by: NTheodore Ts'o <tytso@mit.edu>

ea4b01d9

ext4: make num of fast commit blocks configurable · e029c5f2

由 Harshad Shirwadkar 提交于 10月 26, 2020

This patch reserves a field in the jbd2 superblock for number of fast
commit blocks. When this value is non-zero, Ext4 uses this field to
set the number of fast commit blocks.

Fixes: 6866d7b3 ("ext4/jbd2: add fast commit initialization")
Signed-off-by: NHarshad Shirwadkar <harshadshirwadkar@gmail.com>
Link: https://lore.kernel.org/r/20201027044915.2553163-2-harshadshirwadkar@gmail.comSigned-off-by: NTheodore Ts'o <tytso@mit.edu>

e029c5f2

28 10月, 2020 4 次提交

module: use hidden visibility for weak symbol references · 13150bc5

由 Ard Biesheuvel 提交于 10月 27, 2020

Geert reports that commit be288182 ("arm64/build: Assert for
unwanted sections") results in build errors on arm64 for configurations
that have CONFIG_MODULES disabled.

The commit in question added ASSERT()s to the arm64 linker script to
ensure that linker generated sections such as .got.plt etc are empty,
but as it turns out, there are corner cases where the linker does emit
content into those sections. More specifically, weak references to
function symbols (which can remain unsatisfied, and can therefore not
be emitted as relative references) will be emitted as GOT and PLT
entries when linking the kernel in PIE mode (which is the case when
CONFIG_RELOCATABLE is enabled, which is on by default).

What happens is that code such as

	struct device *(*fn)(struct device *dev);
	struct device *iommu_device;

	fn = symbol_get(mdev_get_iommu_device);
	if (fn) {
		iommu_device = fn(dev);

essentially gets converted into the following when CONFIG_MODULES is off:

	struct device *iommu_device;

	if (&mdev_get_iommu_device) {
		iommu_device = mdev_get_iommu_device(dev);

where mdev_get_iommu_device is emitted as a weak symbol reference into
the object file. The first reference is decorated with an ordinary
ABS64 data relocation (which yields 0x0 if the reference remains
unsatisfied). However, the indirect call is turned into a direct call
covered by a R_AARCH64_CALL26 relocation, which is converted into a
call via a PLT entry taking the target address from the associated
GOT entry.

Given that such GOT and PLT entries are unnecessary for fully linked
binaries such as the kernel, let's give these weak symbol references
hidden visibility, so that the linker knows that the weak reference
via R_AARCH64_CALL26 can simply remain unsatisfied.
Signed-off-by: NArd Biesheuvel <ardb@kernel.org>
Tested-by: NGeert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: NFangrui Song <maskray@google.com>
Acked-by: NJessica Yu <jeyu@kernel.org>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20201027151132.14066-1-ardb@kernel.orgSigned-off-by: NWill Deacon <will@kernel.org>

13150bc5

usb: fix kernel-doc markups · cbdc0f54

由 Mauro Carvalho Chehab 提交于 10月 23, 2020

There is a common comment marked, instead, with kernel-doc
notation.

Also, some identifiers have different names between their
prototypes and the kernel-doc markup.
Signed-off-by: NMauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: NFelipe Balbi <balbi@kernel.org>
Link: https://lore.kernel.org/r/0b964be3884def04fcd20ea5c12cb90d0014871c.1603469755.git.mchehab+huawei@kernel.orgSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

cbdc0f54

KVM: arm64: ARM_SMCCC_ARCH_WORKAROUND_1 doesn't return SMCCC_RET_NOT_REQUIRED · 1de111b5

由 Stephen Boyd 提交于 10月 23, 2020

According to the SMCCC spec[1](7.5.2 Discovery) the
ARM_SMCCC_ARCH_WORKAROUND_1 function id only returns 0, 1, and
SMCCC_RET_NOT_SUPPORTED.

 0 is "workaround required and safe to call this function"
 1 is "workaround not required but safe to call this function"
 SMCCC_RET_NOT_SUPPORTED is "might be vulnerable or might not be, who knows, I give up!"

SMCCC_RET_NOT_SUPPORTED might as well mean "workaround required, except
calling this function may not work because it isn't implemented in some
cases". Wonderful. We map this SMC call to

 0 is SPECTRE_MITIGATED
 1 is SPECTRE_UNAFFECTED
 SMCCC_RET_NOT_SUPPORTED is SPECTRE_VULNERABLE

For KVM hypercalls (hvc), we've implemented this function id to return
SMCCC_RET_NOT_SUPPORTED, 0, and SMCCC_RET_NOT_REQUIRED. One of those
isn't supposed to be there. Per the code we call
arm64_get_spectre_v2_state() to figure out what to return for this
feature discovery call.

 0 is SPECTRE_MITIGATED
 SMCCC_RET_NOT_REQUIRED is SPECTRE_UNAFFECTED
 SMCCC_RET_NOT_SUPPORTED is SPECTRE_VULNERABLE

Let's clean this up so that KVM tells the guest this mapping:

 0 is SPECTRE_MITIGATED
 1 is SPECTRE_UNAFFECTED
 SMCCC_RET_NOT_SUPPORTED is SPECTRE_VULNERABLE

Note: SMCCC_RET_NOT_AFFECTED is 1 but isn't part of the SMCCC spec

Fixes: c118bbb5 ("arm64: KVM: Propagate full Spectre v2 workaround state to KVM guests")
Signed-off-by: NStephen Boyd <swboyd@chromium.org>
Acked-by: NMarc Zyngier <maz@kernel.org>
Acked-by: NWill Deacon <will@kernel.org>
Cc: Andre Przywara <andre.przywara@arm.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://developer.arm.com/documentation/den0028/latest [1]
Link: https://lore.kernel.org/r/20201023154751.1973872-1-swboyd@chromium.orgSigned-off-by: NWill Deacon <will@kernel.org>

1de111b5

cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS driver flag · 1c534352

由 Rafael J. Wysocki 提交于 10月 23, 2020

Generally, a cpufreq driver may need to update some internal upper
and lower frequency boundaries on policy max and min changes,
respectively, but currently this does not work if the target
frequency does not change along with the policy limit.

Namely, if the target frequency does not change along with the
policy min or max, the "target_freq == policy->cur" check in
__cpufreq_driver_target() prevents driver callbacks from being
invoked and they do not even have a chance to update the
corresponding internal boundary.

This particularly affects the "powersave" and "performance"
governors that always set the target frequency to one of the
policy limits and it never changes when the other limit is updated.

To allow cpufreq the drivers needing to update internal frequency
boundaries on policy limits changes to avoid this issue, introduce
a new driver flag, CPUFREQ_NEED_UPDATE_LIMITS, that (when set) will
neutralize the check mentioned above.
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: NViresh Kumar <viresh.kumar@linaro.org>

1c534352

27 10月, 2020 1 次提交

RDMA/mlx5: Fix devlink deadlock on net namespace deletion · fbdd0049

由 Parav Pandit 提交于 10月 26, 2020

When a mlx5 core devlink instance is reloaded in different net namespace,
its associated IB device is deleted and recreated.

Example sequence is:
$ ip netns add foo
$ devlink dev reload pci/0000:00:08.0 netns foo
$ ip netns del foo

mlx5 IB device needs to attach and detach the netdevice to it through the
netdev notifier chain during load and unload sequence.  A below call graph
of the unload flow.

cleanup_net()
   down_read(&pernet_ops_rwsem); <- first sem acquired
     ops_pre_exit_list()
       pre_exit()
         devlink_pernet_pre_exit()
           devlink_reload()
             mlx5_devlink_reload_down()
               mlx5_unload_one()
               [...]
                 mlx5_ib_remove()
                   mlx5_ib_unbind_slave_port()
                     mlx5_remove_netdev_notifier()
                       unregister_netdevice_notifier()
                         down_write(&pernet_ops_rwsem);<- recurrsive lock

Hence, when net namespace is deleted, mlx5 reload results in deadlock.

When deadlock occurs, devlink mutex is also held. This not only deadlocks
the mlx5 device under reload, but all the processes which attempt to
access unrelated devlink devices are deadlocked.

Hence, fix this by mlx5 ib driver to register for per net netdev notifier
instead of global one, which operats on the net namespace without holding
the pernet_ops_rwsem.

Fixes: 4383cfcc ("net/mlx5: Add devlink reload")
Link: https://lore.kernel.org/r/20201026134359.23150-1-parav@nvidia.comSigned-off-by: NParav Pandit <parav@nvidia.com>
Signed-off-by: NLeon Romanovsky <leonro@nvidia.com>
Signed-off-by: NJason Gunthorpe <jgg@nvidia.com>

fbdd0049

26 10月, 2020 5 次提交

elf: Expose ELF header on arch_setup_additional_pages() · 9a29a671

由 Gabriel Krisman Bertazi 提交于 10月 03, 2020

Like it is done for SET_PERSONALITY with ARM, which requires the ELF
header to select correct personality parameters, x86 requires the
headers when selecting which VDSO to load, instead of relying on the
going-away TIF_IA32/X32 flags.

Add an indirection macro to arch_setup_additional_pages(), that x86 can
reimplement to receive the extra parameter just for ELF files. This
requires no changes to other architectures, who can continue to use the
original arch_setup_additional_pages for ELF and non-ELF binaries.
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201004032536.1229030-8-krisman@collabora.com

9a29a671

elf: Expose ELF header in compat_start_thread() · bc3d7bf6

由 Gabriel Krisman Bertazi 提交于 10月 03, 2020

Like it is done for SET_PERSONALITY with x86, which requires the ELF header
to select correct personality parameters, x86 requires the headers on
compat_start_thread() to choose starting CS for ELF32 binaries, instead of
relying on the going-away TIF_IA32/X32 flags.

Add an indirection macro to ELF invocations of START_THREAD, that x86 can
reimplement to receive the extra parameter just for ELF files. This
requires no changes to other architectures who don't need the header
information, they can continue to use the original start_thread for ELF and
non-ELF binaries, and it prevents affecting non-ELF code paths for x86.
Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201004032536.1229030-6-krisman@collabora.com

bc3d7bf6

time: Prevent undefined behaviour in timespec64_to_ns() · cb477557

由 Zeng Tao 提交于 9月 01, 2020

UBSAN reports:

Undefined behaviour in ./include/linux/time64.h:127:27
signed integer overflow:
17179869187 * 1000000000 cannot be represented in type 'long long int'
Call Trace:
 timespec64_to_ns include/linux/time64.h:127 [inline]
 set_cpu_itimer+0x65c/0x880 kernel/time/itimer.c:180
 do_setitimer+0x8e/0x740 kernel/time/itimer.c:245
 __x64_sys_setitimer+0x14c/0x2c0 kernel/time/itimer.c:336
 do_syscall_64+0xa1/0x540 arch/x86/entry/common.c:295

Commit bd40a175 ("y2038: itimer: change implementation to timespec64")
replaced the original conversion which handled time clamping correctly with
timespec64_to_ns() which has no overflow protection.

Fix it in timespec64_to_ns() as this is not necessarily limited to the
usage in itimers.

[ tglx: Added comment and adjusted the fixes tag ]

Fixes: 361a3bf0 ("time64: Add time64.h header and define struct timespec64")
Signed-off-by: NZeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NArnd Bergmann <arnd@arndb.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/1598952616-6416-1-git-send-email-prime.zeng@hisilicon.com

cb477557

treewide: Convert macro and uses of __section(foo) to __section("foo") · 33def849

由 Joe Perches 提交于 10月 21, 2020

Use a more generic form for __section that requires quotes to avoid
complications with clang and gcc differences.

Remove the quote operator # from compiler_attributes.h __section macro.

Convert all unquoted __section(foo) uses to quoted __section("foo").
Also convert __attribute__((section("foo"))) uses to __section("foo")
even if the __attribute__ has multiple list entry forms.

Conversion done using the script at:

https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.plSigned-off-by: NJoe Perches <joe@perches.com>
Reviewed-by: NNick Desaulniers <ndesaulniers@gooogle.com>
Reviewed-by: NMiguel Ojeda <ojeda@kernel.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

33def849

mm: remove kzfree() compatibility definition · 23224e45

由 Eric Biggers 提交于 10月 23, 2020

Commit 453431a5 ("mm, treewide: rename kzfree() to
kfree_sensitive()") renamed kzfree() to kfree_sensitive(),
but it left a compatibility definition of kzfree() to avoid
being too disruptive.

Since then a few more instances of kzfree() have slipped in.

Just get rid of them and remove the compatibility definition
once and for all.
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

23224e45

25 10月, 2020 2 次提交

random32: add noise from network and scheduling activity · 3744741a

由 Willy Tarreau 提交于 8月 10, 2020

With the removal of the interrupt perturbations in previous random32
change (random32: make prandom_u32() output unpredictable), the PRNG
has become 100% deterministic again. While SipHash is expected to be
way more robust against brute force than the previous Tausworthe LFSR,
there's still the risk that whoever has even one temporary access to
the PRNG's internal state is able to predict all subsequent draws till
the next reseed (roughly every minute). This may happen through a side
channel attack or any data leak.

This patch restores the spirit of commit f227e3ec ("random32: update
the net random state on interrupt and activity") in that it will perturb
the internal PRNG's statee using externally collected noise, except that
it will not pick that noise from the random pool's bits nor upon
interrupt, but will rather combine a few elements along the Tx path
that are collectively hard to predict, such as dev, skb and txq
pointers, packet length and jiffies values. These ones are combined
using a single round of SipHash into a single long variable that is
mixed with the net_rand_state upon each invocation.

The operation was inlined because it produces very small and efficient
code, typically 3 xor, 2 add and 2 rol. The performance was measured
to be the same (even very slightly better) than before the switch to
SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC
(i40e), the connection rate dropped from 556k/s to 555k/s while the
SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.

Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
Cc: George Spelvin <lkml@sdf.org>
Cc: Amit Klein <aksecurity@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tytso@mit.edu
Cc: Florian Westphal <fw@strlen.de>
Cc: Marc Plumb <lkml.mplumb@gmail.com>
Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: NWilly Tarreau <w@1wt.eu>

3744741a

random32: make prandom_u32() output unpredictable · c51f8f88

由 George Spelvin 提交于 8月 09, 2020

Non-cryptographic PRNGs may have great statistical properties, but
are usually trivially predictable to someone who knows the algorithm,
given a small sample of their output.  An LFSR like prandom_u32() is
particularly simple, even if the sample is widely scattered bits.

It turns out the network stack uses prandom_u32() for some things like
random port numbers which it would prefer are *not* trivially predictable.
Predictability led to a practical DNS spoofing attack.  Oops.

This patch replaces the LFSR with a homebrew cryptographic PRNG based
on the SipHash round function, which is in turn seeded with 128 bits
of strong random key.  (The authors of SipHash have *not* been consulted
about this abuse of their algorithm.)  Speed is prioritized over security;
attacks are rare, while performance is always wanted.

Replacing all callers of prandom_u32() is the quick fix.
Whether to reinstate a weaker PRNG for uses which can tolerate it
is an open question.

Commit f227e3ec ("random32: update the net random state on interrupt
and activity") was an earlier attempt at a solution.  This patch replaces
it.
Reported-by: NAmit Klein <aksecurity@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tytso@mit.edu
Cc: Florian Westphal <fw@strlen.de>
Cc: Marc Plumb <lkml.mplumb@gmail.com>
Fixes: f227e3ec ("random32: update the net random state on interrupt and activity")
Signed-off-by: NGeorge Spelvin <lkml@sdf.org>
Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
[ willy: partial reversal of f227e3ec; moved SIPROUND definitions
  to prandom.h for later use; merged George's prandom_seed() proposal;
  inlined siprand_u32(); replaced the net_rand_state[] array with 4
  members to fix a build issue; cosmetic cleanups to make checkpatch
  happy; fixed RANDOM32_SELFTEST build ]
Signed-off-by: NWilly Tarreau <w@1wt.eu>

c51f8f88

23 10月, 2020 1 次提交

vdpa: introduce config op to get valid iova range · 3f1b623a

由 Jason Wang 提交于 10月 23, 2020

This patch introduce a config op to get valid iova range from the vDPA
device.
Signed-off-by: NJason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/r/20201023090043.14430-2-jasowang@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>

3f1b623a

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功