提交 · cb44e4f061e16be65b8a16505e121490c66d30d0 · openeuler / Kernel

25 5月, 2022 1 次提交

lockdown: also lock down previous kgdb use · eadb2f47

由 Daniel Thompson 提交于 5月 23, 2022

KGDB and KDB allow read and write access to kernel memory, and thus
should be restricted during lockdown.  An attacker with access to a
serial port (for example, via a hypervisor console, which some cloud
vendors provide over the network) could trigger the debugger so it is
important that the debugger respect the lockdown mode when/if it is
triggered.

Fix this by integrating lockdown into kdb's existing permissions
mechanism.  Unfortunately kgdb does not have any permissions mechanism
(although it certainly could be added later) so, for now, kgdb is simply
and brutally disabled by immediately exiting the gdb stub without taking
any action.

For lockdowns established early in the boot (e.g. the normal case) then
this should be fine but on systems where kgdb has set breakpoints before
the lockdown is enacted than "bad things" will happen.

CVE: CVE-2022-21499
Co-developed-by: NStephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: NStephen Brennan <stephen.s.brennan@oracle.com>
Reviewed-by: NDouglas Anderson <dianders@chromium.org>
Signed-off-by: NDaniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

eadb2f47

23 5月, 2022 1 次提交

LSM: Remove double path_rename hook calls for RENAME_EXCHANGE · 100f59d9

由 Mickaël Salaün 提交于 5月 06, 2022

In order to be able to identify a file exchange with renameat2(2) and
RENAME_EXCHANGE, which will be useful for Landlock [1], propagate the
rename flags to LSMs.  This may also improve performance because of the
switch from two set of LSM hook calls to only one, and because LSMs
using this hook may optimize the double check (e.g. only one lock,
reduce the number of path walks).

AppArmor, Landlock and Tomoyo are updated to leverage this change.  This
should not change the current behavior (same check order), except
(different level of) speed boosts.

[1] https://lore.kernel.org/r/20220221212522.320243-1-mic@digikod.net

Cc: James Morris <jmorris@namei.org>
Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
Cc: Serge E. Hallyn <serge@hallyn.com>
Acked-by: NJohn Johansen <john.johansen@canonical.com>
Acked-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: NPaul Moore <paul@paul-moore.com>
Signed-off-by: NMickaël Salaün <mic@digikod.net>
Link: https://lore.kernel.org/r/20220506161102.525323-7-mic@digikod.net

100f59d9

20 5月, 2022 1 次提交

topology: Remove unused cpu_cluster_mask() · 991d8d81

由 Dietmar Eggemann 提交于 5月 13, 2022

default_topology[] uses cpu_clustergroup_mask() for the CLS level
(guarded by CONFIG_SCHED_CLUSTER) which is currently provided by x86
(arch/x86/kernel/smpboot.c) and arm64 (drivers/base/arch_topology.c).

Fixes: 778c558f ("sched: Add cluster scheduler level in core and
related Kconfig for ARM64")
Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NBarry Song <baohua@kernel.org>
Link: https://lore.kernel.org/r/20220513093433.425163-1-dietmar.eggemann@arm.com

991d8d81

19 5月, 2022 9 次提交

random: move randomize_page() into mm where it belongs · 5ad7dd88

由 Jason A. Donenfeld 提交于 5月 14, 2022

randomize_page is an mm function. It is documented like one. It contains
the history of one. It has the naming convention of one. It looks
just like another very similar function in mm, randomize_stack_top().
And it has always been maintained and updated by mm people. There is no
need for it to be in random.c. In the "which shape does not look like
the other ones" test, pointing to randomize_page() is correct.

So move randomize_page() into mm/util.c, right next to the similar
randomize_stack_top() function.

This commit contains no actual code changes.

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>

5ad7dd88

random: remove mostly unused async readiness notifier · 6701de6c

由 Jason A. Donenfeld 提交于 5月 15, 2022

The register_random_ready_notifier() notifier is somewhat complicated,
and was already recently rewritten to use notifier blocks. It is only
used now by one consumer in the kernel, vsprintf.c, for which the async
mechanism is really overly complex for what it actually needs. This
commit removes register_random_ready_notifier() and unregister_random_
ready_notifier(), because it just adds complication with little utility,
and changes vsprintf.c to just check on `!rng_is_initialized() &&
!rng_has_arch_random()`, which will eventually be true. Performance-
wise, that code was already using a static branch, so there's basically
no overhead at all to this change.

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Petr Mladek <pmladek@suse.com> # for vsprintf.c
Reviewed-by: NPetr Mladek <pmladek@suse.com>
Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>

6701de6c

random: remove get_random_bytes_arch() and add rng_has_arch_random() · 248561ad

由 Jason A. Donenfeld 提交于 5月 14, 2022

The RNG incorporates RDRAND into its state at boot and every time it
reseeds, so there's no reason for callers to use it directly. The
hashing that the RNG does on it is preferable to using the bytes raw.

The only current use case of get_random_bytes_arch() is vsprintf's
siphash key for pointer hashing, which uses it to initialize the pointer
secret earlier than usual if RDRAND is available. In order to replace
this narrow use case, just expose whether RDRAND is mixed into the RNG,
with a new function called rng_has_arch_random(). With that taken care
of, there are no users of get_random_bytes_arch() left, so it can be
removed.

Later, if trust_cpu gets turned on by default (as most distros are
doing), this one use of rng_has_arch_random() can probably go away as
well.

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Acked-by: Petr Mladek <pmladek@suse.com> # for vsprintf.c
Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>

248561ad

random: make consistent use of buf and len · a1940263

由 Jason A. Donenfeld 提交于 5月 13, 2022

The current code was a mix of "nbytes", "count", "size", "buffer", "in",
and so forth. Instead, let's clean this up by naming input parameters
"buf" (or "ubuf") and "len", so that you always understand that you're
reading this variety of function argument.
Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>

a1940263

random: use proper return types on get_random_{int,long}_wait() · 7c3a8a1d

由 Jason A. Donenfeld 提交于 5月 13, 2022

Before these were returning signed values, but the API is intended to be
used with unsigned values.
Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>

7c3a8a1d

random: remove extern from functions in header · 7782cfec

由 Jason A. Donenfeld 提交于 5月 13, 2022

Accoriding to the kernel style guide, having `extern` on functions in
headers is old school and deprecated, and doesn't add anything. So remove
them from random.h, and tidy up the file a little bit too.
Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>

7782cfec

riscv/efi_stub: Add support for RISCV_EFI_BOOT_PROTOCOL · 3f68e695

由 Sunil V L 提交于 5月 19, 2022

Add support for getting the boot hart ID from the Linux EFI stub using
RISCV_EFI_BOOT_PROTOCOL. This method is preferred over the existing DT
based approach since it works irrespective of DT or ACPI.

The specification of the protocol is hosted at:
https://github.com/riscv-non-isa/riscv-uefiSigned-off-by: NSunil V L <sunilvl@ventanamicro.com>
Acked-by: NPalmer Dabbelt <palmer@rivosinc.com>
Reviewed-by: NHeinrich Schuchardt <heinrich.schuchardt@canonical.com>
Link: https://lore.kernel.org/r/20220519051512.136724-2-sunilvl@ventanamicro.com
[ardb: minor tweaks for coding style and whitespace]
Signed-off-by: NArd Biesheuvel <ardb@kernel.org>

3f68e695

libceph: fix potential use-after-free on linger ping and resends · 75dbb685

由 Ilya Dryomov 提交于 5月 14, 2022

request_reinit() is not only ugly as the comment rightfully suggests,
but also unsafe.  Even though it is called with osdc->lock held for
write in all cases, resetting the OSD request refcount can still race
with handle_reply() and result in use-after-free.  Taking linger ping
as an example:

    handle_timeout thread                     handle_reply thread

                                              down_read(&osdc->lock)
                                              req = lookup_request(...)
                                              ...
                                              finish_request(req)  # unregisters
                                              up_read(&osdc->lock)
                                              __complete_request(req)
                                                linger_ping_cb(req)

      # req->r_kref == 2 because handle_reply still holds its ref

    down_write(&osdc->lock)
    send_linger_ping(lreq)
      req = lreq->ping_req  # same req
      # cancel_linger_request is NOT
      # called - handle_reply already
      # unregistered
      request_reinit(req)
        WARN_ON(req->r_kref != 1)  # fires
        request_init(req)
          kref_init(req->r_kref)

                   # req->r_kref == 1 after kref_init

                                              ceph_osdc_put_request(req)
                                                kref_put(req->r_kref)

            # req->r_kref == 0 after kref_put, req is freed

        <further req initialization/use> !!!

This happens because send_linger_ping() always (re)uses the same OSD
request for watch ping requests, relying on cancel_linger_request() to
unregister it from the OSD client and rip its messages out from the
messenger.  send_linger() does the same for watch/notify registration
and watch reconnect requests.  Unfortunately cancel_request() doesn't
guarantee that after it returns the OSD client would be completely done
with the OSD request -- a ref could still be held and the callback (if
specified) could still be invoked too.

The original motivation for request_reinit() was inability to deal with
allocation failures in send_linger() and send_linger_ping().  Switching
to using osdc->req_mempool (currently only used by CephFS) respects that
and allows us to get rid of request_reinit().

Cc: stable@vger.kernel.org
Signed-off-by: NIlya Dryomov <idryomov@gmail.com>
Reviewed-by: NXiubo Li <xiubli@redhat.com>
Acked-by: NJeff Layton <jlayton@kernel.org>

75dbb685

nvme: add support for TP4084 - Time-to-Ready Enhancements · 354201c5

由 Christoph Hellwig 提交于 5月 16, 2022

Add support for using longer timeouts during controller initialization
and letting the controller come up with namespaces that are not ready
for I/O yet.  We skip these not ready namespaces during scanning and
only bring them online once anoter scan is kicked off by the AEN that
is set when the NRDY bit gets set in the  I/O Command Set Independent
Identify Namespace Data Structure.   This asynchronous probing avoids
blocking the kernel boot when controllers take a very long time to
recover after unclean shutdowns (up to minutes).
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: NHannes Reinecke <hare@suse.de>

354201c5

18 5月, 2022 5 次提交

random: handle latent entropy and command line from random_init() · 2f14062b

由 Jason A. Donenfeld 提交于 5月 05, 2022

Currently, start_kernel() adds latent entropy and the command line to
the entropy bool *after* the RNG has been initialized, deferring when
it's actually used by things like stack canaries until the next time
the pool is seeded. This surely is not intended.

Rather than splitting up which entropy gets added where and when between
start_kernel() and random_init(), just do everything in random_init(),
which should eliminate these kinds of bugs in the future.

While we're at it, rename the awkwardly titled "rand_initialize()" to
the more standard "random_init()" nomenclature.
Reviewed-by: NDominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>

2f14062b

random32: use real rng for non-deterministic randomness · d4150779

由 Jason A. Donenfeld 提交于 5月 11, 2022

random32.c has two random number generators in it: one that is meant to
be used deterministically, with some predefined seed, and one that does
the same exact thing as random.c, except does it poorly. The first one
has some use cases. The second one no longer does and can be replaced
with calls to random.c's proper random number generator.

The relatively recent siphash-based bad random32.c code was added in
response to concerns that the prior random32.c was too deterministic.
Out of fears that random.c was (at the time) too slow, this code was
anonymously contributed. Then out of that emerged a kind of shadow
entropy gathering system, with its own tentacles throughout various net
code, added willy nilly.

Stop👏making👏bespoke👏random👏number👏generators👏.

Fortunately, recent advances in random.c mean that we can stop playing
with this sketchiness, and just use get_random_u32(), which is now fast
enough. In micro benchmarks using RDPMC, I'm seeing the same median
cycle count between the two functions, with the mean being _slightly_
higher due to batches refilling (which we can optimize further need be).
However, when doing *real* benchmarks of the net functions that actually
use these random numbers, the mean cycles actually *decreased* slightly
(with the median still staying the same), likely because the additional
prandom code means icache misses and complexity, whereas random.c is
generally already being used by something else nearby.

The biggest benefit of this is that there are many users of prandom who
probably should be using cryptographically secure random numbers. This
makes all of those accidental cases become secure by just flipping a
switch. Later on, we can do a tree-wide cleanup to remove the static
inline wrapper functions that this commit adds.

There are also some low-ish hanging fruits for making this even faster
in the future: a get_random_u16() function for use in the networking
stack will give a 2x performance boost there, using SIMD for ChaCha20
will let us compute 4 or 8 or 16 blocks of output in parallel, instead
of just one, giving us large buffers for cheap, and introducing a
get_random_*_bh() function that assumes irqs are already disabled will
shave off a few cycles for ordinary calls. These are things we can chip
away at down the road.
Acked-by: NJakub Kicinski <kuba@kernel.org>
Acked-by: NTheodore Ts'o <tytso@mit.edu>
Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>

d4150779

siphash: use one source of truth for siphash permutations · e73aaae2

由 Jason A. Donenfeld 提交于 5月 07, 2022

The SipHash family of permutations is currently used in three places:

- siphash.c itself, used in the ordinary way it was intended.
- random32.c, in a construction from an anonymous contributor.
- random.c, as part of its fast_mix function.

Each one of these places reinvents the wheel with the same C code, same
rotation constants, and same symmetry-breaking constants.

This commit tidies things up a bit by placing macros for the
permutations and constants into siphash.h, where each of the three .c
users can access them. It also leaves a note dissuading more users of
them from emerging.
Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>

e73aaae2

locking/atomic: Add generic try_cmpxchg64 support · 0aa7be05

由 Uros Bizjak 提交于 5月 15, 2022

Add generic support for try_cmpxchg64{,_acquire,_release,_relaxed}
and their falbacks involving cmpxchg64.
Signed-off-by: NUros Bizjak <ubizjak@gmail.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20220515184205.103089-2-ubizjak@gmail.com

0aa7be05

audit,io_uring,io-wq: call __audit_uring_exit for dummy contexts · 69e9cd66

由 Julian Orth 提交于 5月 17, 2022

Not calling the function for dummy contexts will cause the context to
not be reset. During the next syscall, this will cause an error in
__audit_syscall_entry:

	WARN_ON(context->context != AUDIT_CTX_UNUSED);
	WARN_ON(context->name_count);
	if (context->context != AUDIT_CTX_UNUSED || context->name_count) {
		audit_panic("unrecoverable error in audit_syscall_entry()");
		return;
	}

These problematic dummy contexts are created via the following call
chain:

       exit_to_user_mode_prepare
    -> arch_do_signal_or_restart
    -> get_signal
    -> task_work_run
    -> tctx_task_work
    -> io_req_task_submit
    -> io_issue_sqe
    -> audit_uring_entry

Cc: stable@vger.kernel.org
Fixes: 5bd2182d ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: NJulian Orth <ju.orth@gmail.com>
[PM: subject line tweaks]
Signed-off-by: NPaul Moore <paul@paul-moore.com>

69e9cd66

17 5月, 2022 1 次提交

nvme: split the enum used for various register constants · e626f37e

由 Christoph Hellwig 提交于 5月 16, 2022

Instead of having one big enum add one for each register or field.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>

e626f37e

16 5月, 2022 3 次提交

net: fix dev_fill_forward_path with pppoe + bridge · cf2df74e

由 Felix Fietkau 提交于 5月 09, 2022

When calling dev_fill_forward_path on a pppoe device, the provided destination
address is invalid. In order for the bridge fdb lookup to succeed, the pppoe
code needs to update ctx->daddr to the correct value.
Fix this by storing the address inside struct net_device_path_ctx

Fixes: f6efc675 ("net: ppp: resolve forwarding path for bridge pppoe devices")
Signed-off-by: NFelix Fietkau <nbd@nbd.name>
Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>

cf2df74e

nvme: add missing status values to verbose logging · ca2d8992

由 Max Gurtovoy 提交于 4月 28, 2022

Log a few more path related status codes.
Signed-off-by: NMax Gurtovoy <mgurtovoy@nvidia.com>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ca2d8992

cdrom: remove the unused driver specific disc change ioctl · 03fea699

由 Paul Gortmaker 提交于 5月 15, 2022

This was only used by the ide-cd driver, which went away in
commit b7fb14d3 ("ide: remove the legacy ide driver")
so we might as well take advantage of that and get rid of
this hook as well.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Phillip Potter <phil@philpotter.co.uk>
Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
Link: https://lore.kernel.org/all/20220427132436.12795-2-paul.gortmaker@windriver.comSigned-off-by: NPhillip Potter <phil@philpotter.co.uk>
Link: https://lore.kernel.org/r/20220515205833.944139-3-phil@philpotter.co.ukSigned-off-by: NJens Axboe <axboe@kernel.dk>

03fea699

14 5月, 2022 2 次提交

timekeeping: Add raw clock fallback for random_get_entropy() · 1366992e

由 Jason A. Donenfeld 提交于 4月 10, 2022

The addition of random_get_entropy_fallback() provides access to
whichever time source has the highest frequency, which is useful for
gathering entropy on platforms without available cycle counters. It's
not necessarily as good as being able to quickly access a cycle counter
that the CPU has, but it's still something, even when it falls back to
being jiffies-based.

In the event that a given arch does not define get_cycles(), falling
back to the get_cycles() default implementation that returns 0 is really
not the best we can do. Instead, at least calling
random_get_entropy_fallback() would be preferable, because that always
needs to return _something_, even falling back to jiffies eventually.
It's not as though random_get_entropy_fallback() is super high precision
or guaranteed to be entropic, but basically anything that's not zero all
the time is better than returning zero all the time.

Finally, since random_get_entropy_fallback() is used during extremely
early boot when randomizing freelists in mm_init(), it can be called
before timekeeping has been initialized. In that case there really is
nothing we can do; jiffies hasn't even started ticking yet. So just give
up and return 0.
Suggested-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Theodore Ts'o <tytso@mit.edu>

1366992e

security: declare member holding string literal const · 1af0e4a0

由 Christian Göttsche 提交于 2月 17, 2022

The struct security_hook_list member lsm is assigned in
security_add_hooks() with string literals passed from the individual
security modules.  Declare the function parameter and the struct member
const to signal their immutability.

Reported by Clang [-Wwrite-strings]:

    security/selinux/hooks.c:7388:63: error: passing 'const char [8]'
      to parameter of type 'char *' discards qualifiers
      [-Werror,-Wincompatible-pointer-types-discards-qualifiers]
            security_add_hooks(selinux_hooks,
                               ARRAY_SIZE(selinux_hooks), selinux);
                                                          ^~~~~~~~~
    ./include/linux/lsm_hooks.h:1629:11: note: passing argument to
      parameter 'lsm' here
                                    char *lsm);
                                          ^
Signed-off-by: NChristian Göttsche <cgzones@googlemail.com>
Reviewed-by: NPaul Moore <paul@paul-moore.com>
Reviewed-by: NCasey Schaufler <casey@schaufler-ca.com>
Signed-off-by: NPaul Moore <paul@paul-moore.com>

1af0e4a0

12 5月, 2022 3 次提交

stop_machine: Add stop_core_cpuslocked() for per-core operations · 2760f5a4

由 Peter Zijlstra 提交于 5月 06, 2022

Hardware core level testing features require near simultaneous execution
of WRMSR instructions on all threads of a core to initiate a test.

Provide a customized cut down version of stop_machine_cpuslocked() that
just operates on the threads of a single core.
Suggested-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220506225410.1652287-4-tony.luck@intel.comSigned-off-by: NHans de Goede <hdegoede@redhat.com>

2760f5a4

block: Fix the bio.bi_opf comment · 5d2ae142

由 Bart Van Assche 提交于 5月 11, 2022

Commit ef295ecf modified the Linux kernel such that the bottom bits
of the bi_opf member contain the operation instead of the topmost bits.
That commit did not update the comment next to bi_opf. Hence this patch.

From commit ef295ecf:
-#define bio_op(bio)    ((bio)->bi_opf >> BIO_OP_SHIFT)
+#define bio_op(bio)    ((bio)->bi_opf & REQ_OP_MASK)

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Fixes: ef295ecf ("block: better op and flags encoding")
Signed-off-by: NBart Van Assche <bvanassche@acm.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220511235152.1082246-1-bvanassche@acm.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

5d2ae142

block: reorder the REQ_ flags · 5ce7729f

由 Christoph Hellwig 提交于 5月 12, 2022

Keep the op-specific flag last so that they are clearly separate from
the generic flags. Various recent commits just kept adding new flags
at the end.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220512061408.1826595-1-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

5ce7729f

11 5月, 2022 4 次提交

platform_data/mlxreg: Add field for notification callback · abcebcd3

由 Michael Shych 提交于 4月 30, 2022

Add notification callback to inform caller that platform driver probing
has been completed. It allows to caller to perform some initialization
flow steps depending on specific driver probing completion.
Signed-off-by: NMichael Shych <michaelsh@nvidia.com>
Reviewed-by: NVadim Pasternak <vadimp@nvidia.com>
Link: https://lore.kernel.org/r/20220430115809.54565-2-michaelsh@nvidia.comSigned-off-by: NHans de Goede <hdegoede@redhat.com>

abcebcd3

lockdep: Delete local_irq_enable_in_hardirq() · deaf7c4b

由 Thomas Gleixner 提交于 5月 09, 2022

No more users and there is no desire to grow new ones.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8735hir0j4.ffs@tglx

deaf7c4b

fs,io_uring: add infrastructure for uring-cmd · ee692a21

由 Jens Axboe 提交于 5月 11, 2022

file_operations->uring_cmd is a file private handler.
This is somewhat similar to ioctl but hopefully a lot more sane and
useful as it can be used to enable many io_uring capabilities for the
underlying operation.

IORING_OP_URING_CMD is a file private kind of request. io_uring doesn't
know what is in this command type, it's for the provider of ->uring_cmd()
to deal with.
Co-developed-by: NKanchan Joshi <joshi.k@samsung.com>
Signed-off-by: NKanchan Joshi <joshi.k@samsung.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220511054750.20432-2-joshi.k@samsung.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

ee692a21

randomize_kstack: Improve docs on requirements/rationale · 1ff29758

由 Kees Cook 提交于 5月 10, 2022

There were some recent questions about where and why to use the
random_kstack routines when applying them to new architectures[1].
Update the header comments to reflect the design choices for the
routines.

[1] https://lore.kernel.org/lkml/1652173338.7bltwybi0c.astroid@bobo.none

Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: NKees Cook <keescook@chromium.org>

1ff29758

10 5月, 2022 1 次提交

fscrypt: add new helper functions for test_dummy_encryption · 218d921b

由 Eric Biggers 提交于 4月 30, 2022

Unfortunately the design of fscrypt_set_test_dummy_encryption() doesn't
work properly for the new mount API, as it combines too many steps into
one function:

  - Parse the argument to test_dummy_encryption
  - Check the setting against the filesystem instance
  - Apply the setting to the filesystem instance

The new mount API has split these into separate steps.  ext4 partially
worked around this by duplicating some of the logic, but it still had
some bugs.  To address this, add some new helper functions that split up
the steps of fscrypt_set_test_dummy_encryption():

  - fscrypt_parse_test_dummy_encryption()
  - fscrypt_dummy_policies_equal()
  - fscrypt_add_test_dummy_key()

While we're add it, also add a function fscrypt_is_dummy_policy_set()
which will be useful to avoid some #ifdef's.
Signed-off-by: NEric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20220501050857.538984-5-ebiggers@kernel.org

218d921b

09 5月, 2022 2 次提交

entry: Rename arch_check_user_regs() to arch_enter_from_user_mode() · 6d97af48

由 Sven Schnelle 提交于 5月 04, 2022

arch_check_user_regs() is used at the moment to verify that struct pt_regs
contains valid values when entering the kernel from userspace. s390 needs
a place in the generic entry code to modify a cpu data structure when
switching from userspace to kernel mode. As arch_check_user_regs() is
exactly this, rename it to arch_enter_from_user_mode().

When entering the kernel from userspace, arch_check_user_regs() is
used to verify that struct pt_regs contains valid values. Note that
the NMI codepath doesn't call this function. s390 needs a place in the
generic entry code to modify a cpu data structure when switching from
userspace to kernel mode. As arch_check_user_regs() is exactly this,
rename it to arch_enter_from_user_mode().
Signed-off-by: NSven Schnelle <svens@linux.ibm.com>
Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Link: https://lore.kernel.org/r/20220504062351.2954280-2-tmricht@linux.ibm.comSigned-off-by: NHeiko Carstens <hca@linux.ibm.com>

6d97af48

blk-mq: remove the error_count from struct request · 2e3afb42

由 Willy Tarreau 提交于 5月 08, 2022

The last two users were floppy.c and ataflop.c respectively, it was
verified that no other drivers makes use of this, so let's remove it.
Suggested-by: NLinus Torvalds <torvalds@linuxfoundation.org>
Cc: Minh Yuan <yuanmingbuaa@gmail.com>
Cc: Denis Efremov <efremov@linux.com>,
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NWilly Tarreau <w@1wt.eu>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2e3afb42

08 5月, 2022 7 次提交

stackleak: rework poison scanning · 77cf2b6d

由 Mark Rutland 提交于 4月 27, 2022

Currently we over-estimate the region of stack which must be erased.

To determine the region to be erased, we scan downwards for a contiguous
block of poison values (or the low bound of the stack). There are a few
minor problems with this today:

* When we find a block of poison values, we include this block within
  the region to erase.

  As this is included within the region to erase, this causes us to
  redundantly overwrite 'STACKLEAK_SEARCH_DEPTH' (128) bytes with
  poison.

* As the loop condition checks 'poison_count <= depth', it will run an
  additional iteration after finding the contiguous block of poison,
  decrementing 'erase_low' once more than necessary.

  As this is included within the region to erase, this causes us to
  redundantly overwrite an additional unsigned long with poison.

* As we always decrement 'erase_low' after checking an element on the
  stack, we always include the element below this within the region to
  erase.

  As this is included within the region to erase, this causes us to
  redundantly overwrite an additional unsigned long with poison.

  Note that this is not a functional problem. As the loop condition
  checks 'erase_low > task_stack_low', we'll never clobber the
  STACK_END_MAGIC. As we always decrement 'erase_low' after this, we'll
  never fail to erase the element immediately above the STACK_END_MAGIC.

In total, this can cause us to erase `128 + 2 * sizeof(unsigned long)`
bytes more than necessary, which is unfortunate.

This patch reworks the logic to find the address immediately above the
poisoned region, by finding the lowest non-poisoned address. This is
factored into a stackleak_find_top_of_poison() helper both for clarity
and so that this can be shared with the LKDTM test in subsequent
patches.
Signed-off-by: NMark Rutland <mark.rutland@arm.com>
Cc: Alexander Popov <alex.popov@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220427173128.2603085-8-mark.rutland@arm.com

77cf2b6d

stackleak: rework stack high bound handling · 0cfa2ccd

由 Mark Rutland 提交于 4月 27, 2022

Prior to returning to userspace, we reset current->lowest_stack to a
reasonable high bound. Currently we do this by subtracting the arbitrary
value `THREAD_SIZE/64` from the top of the stack, for reasons lost to
history.

Looking at configurations today:

* On i386 where THREAD_SIZE is 8K, the bound will be 128 bytes. The
  pt_regs at the top of the stack is 68 bytes (with 0 to 16 bytes of
  padding above), and so this covers an additional portion of 44 to 60
  bytes.

* On x86_64 where THREAD_SIZE is at least 16K (up to 32K with KASAN) the
  bound will be at least 256 bytes (up to 512 with KASAN). The pt_regs
  at the top of the stack is 168 bytes, and so this cover an additional
  88 bytes of stack (up to 344 with KASAN).

* On arm64 where THREAD_SIZE is at least 16K (up to 64K with 64K pages
  and VMAP_STACK), the bound will be at least 256 bytes (up to 1024 with
  KASAN). The pt_regs at the top of the stack is 336 bytes, so this can
  fall within the pt_regs, or can cover an additional 688 bytes of
  stack.

Clearly the `THREAD_SIZE/64` value doesn't make much sense -- in the
worst case, this will cause more than 600 bytes of stack to be erased
for every syscall, even if actual stack usage were substantially
smaller.

This patches makes this slightly less nonsensical by consistently
resetting current->lowest_stack to the base of the task pt_regs. For
clarity and for consistency with the handling of the low bound, the
generation of the high bound is split into a helper with commentary
explaining why.

Since the pt_regs at the top of the stack will be clobbered upon the
next exception entry, we don't need to poison these at exception exit.
By using task_pt_regs() as the high stack boundary instead of
current_top_of_stack() we avoid some redundant poisoning, and the
compiler can share the address generation between the poisoning and
resetting of `current->lowest_stack`, making the generated code more
optimal.

It's not clear to me whether the existing `THREAD_SIZE/64` offset was a
dodgy heuristic to skip the pt_regs, or whether it was attempting to
minimize the number of times stackleak_check_stack() would have to
update `current->lowest_stack` when stack usage was shallow at the cost
of unconditionally poisoning a small portion of the stack for every exit
to userspace.

For now I've simply removed the offset, and if we need/want to minimize
updates for shallow stack usage it should be easy to add a better
heuristic atop, with appropriate commentary so we know what's going on.
Signed-off-by: NMark Rutland <mark.rutland@arm.com>
Cc: Alexander Popov <alex.popov@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220427173128.2603085-7-mark.rutland@arm.com

0cfa2ccd

stackleak: rework stack low bound handling · 9ec79840

由 Mark Rutland 提交于 4月 27, 2022

In stackleak_task_init(), stackleak_track_stack(), and
__stackleak_erase(), we open-code skipping the STACK_END_MAGIC at the
bottom of the stack. Each case is implemented slightly differently, and
only the __stackleak_erase() case is commented.

In stackleak_task_init() and stackleak_track_stack() we unconditionally
add sizeof(unsigned long) to the lowest stack address. In
stackleak_task_init() we use end_of_stack() for this, and in
stackleak_track_stack() we use task_stack_page(). In __stackleak_erase()
we handle this by detecting if `kstack_ptr` has hit the stack end
boundary, and if so, conditionally moving it above the magic.

This patch adds a new stackleak_task_low_bound() helper which is used in
all three cases, which unconditionally adds sizeof(unsigned long) to the
lowest address on the task stack, with commentary as to why. This uses
end_of_stack() as stackleak_task_init() did prior to this patch, as this
is consistent with the code in kernel/fork.c which initializes the
STACK_END_MAGIC value.

In __stackleak_erase() we no longer need to check whether we've spilled
into the STACK_END_MAGIC value, as stackleak_track_stack() ensures that
`current->lowest_stack` stops immediately above this, and similarly the
poison scan will stop immediately above this.

For stackleak_task_init() and stackleak_track_stack() this results in no
change to code generation. For __stackleak_erase() the generated
assembly is slightly simpler and shorter.
Signed-off-by: NMark Rutland <mark.rutland@arm.com>
Cc: Alexander Popov <alex.popov@linux.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220427173128.2603085-5-mark.rutland@arm.com

9ec79840

randstruct: Move seed generation into scripts/basic/ · be2b34fa

由 Kees Cook 提交于 5月 03, 2022

To enable Clang randstruct support, move the structure layout
randomization seed generation out of scripts/gcc-plugins/ into
scripts/basic/ so it happens early enough that it can be used by either
compiler implementation. The gcc-plugin still builds its own header file,
but now does so from the common "randstruct.seed" file.

Cc: linux-hardening@vger.kernel.org
Signed-off-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220503205503.3054173-6-keescook@chromium.org

be2b34fa

randstruct: Reorganize Kconfigs and attribute macros · 595b893e

由 Kees Cook 提交于 5月 03, 2022

In preparation for Clang supporting randstruct, reorganize the Kconfigs,
move the attribute macros, and generalize the feature to be named
CONFIG_RANDSTRUCT for on/off, CONFIG_RANDSTRUCT_FULL for the full
randomization mode, and CONFIG_RANDSTRUCT_PERFORMANCE for the cache-line
sized mode.

Cc: linux-hardening@vger.kernel.org
Signed-off-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220503205503.3054173-4-keescook@chromium.org

595b893e

netfs: Eliminate Clang randstruct warning · 3b5eed3c

由 Kees Cook 提交于 5月 03, 2022

Clang's structure layout randomization feature gets upset when it sees
struct inode (which is randomized) cast to struct netfs_i_context. This
is due to seeing the inode pointer as being treated as an array of inodes,
rather than "something else, following struct inode".

Since netfs can't use container_of() (since it doesn't know what the
true containing struct is), it uses this direct offset instead. Adjust
the code to better reflect what is happening: an arbitrary pointer is
being adjusted and cast to something else: use a "void *" for the math.
The resulting binary output is the same, but Clang no longer sees an
unexpected cross-structure cast:

In file included from ../fs/nfs/inode.c:50:
In file included from ../fs/nfs/fscache.h:15:
In file included from ../include/linux/fscache.h:18:
../include/linux/netfs.h:298:9: error: casting from randomized structure pointer type 'struct inode *' to 'struct netfs_i_context *'
        return (struct netfs_i_context *)(inode + 1);
               ^
1 error generated.

Cc: David Howells <dhowells@redhat.com>
Signed-off-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220503205503.3054173-2-keescook@chromium.orgReviewed-by: NJeff Layton <jlayton@kernel.org>
Link: https://lore.kernel.org/lkml/7562f8eccd7cc0e447becfe9912179088784e3b9.camel@kernel.org

3b5eed3c

SUNRPC: Ensure that the gssproxy client can start in a connected state · fd13359f

由 Trond Myklebust 提交于 5月 07, 2022

Ensure that the gssproxy client connects to the server from the gssproxy
daemon process context so that the AF_LOCAL socket connection is done
using the correct path and namespaces.

Fixes: 1d658336 ("SUNRPC: Add RPC based upcall mechanism for RPCGSS auth")
Cc: stable@vger.kernel.org
Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>

fd13359f

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功