提交 · 257bf26ef10399376b5140aa867b1eee21058ed8 · openeuler / Kernel

12 11月, 2021 37 次提交

userfaultfd: prevent concurrent API initialization · 257bf26e

由 Nadav Amit 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit e826df5864e449f89c2d056aab998c813d40c932

--------------------------------

[ Upstream commit 22e5fe2a ]

userfaultfd assumes that the enabled features are set once and never
changed after UFFDIO_API ioctl succeeded.

However, currently, UFFDIO_API can be called concurrently from two
different threads, succeed on both threads and leave userfaultfd's
features in non-deterministic state.  Theoretically, other uffd operations
(ioctl's and page-faults) can be dispatched while adversely affected by
such changes of features.

Moreover, the writes to ctx->state and ctx->features are not ordered,
which can - theoretically, again - let userfaultfd_ioctl() think that
userfaultfd API completed, while the features are still not initialized.

To avoid races, it is arguably best to get rid of ctx->state.  Since there
are only 2 states, record the API initialization in ctx->features as the
uppermost bit and remove ctx->state.

Link: https://lkml.kernel.org/r/20210808020724.1022515-3-namit@vmware.com
Fixes: 9cd75c3c ("userfaultfd: non-cooperative: add ability to report non-PF events from uffd descriptor")
Signed-off-by: NNadav Amit <namit@vmware.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

257bf26e

PCI: Return ~0 data on pciconfig_read() CAP_SYS_ADMIN failure · 264c05e2

由 Krzysztof Wilczyński 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 60242386c7ef9b084fc138e3d2b7bb99042fd779

--------------------------------

commit a8bd29bd upstream.

The pciconfig_read() syscall reads PCI configuration space using
hardware-dependent config accessors.

If the read fails on PCI, most accessors don't return an error; they
pretend the read was successful and got ~0 data from the device, so the
syscall returns success with ~0 data in the buffer.

When the accessor does return an error, pciconfig_read() normally fills the
user's buffer with ~0 and returns an error in errno.  But after
e4585da2 ("pci syscall.c: Switch to refcounting API"), we don't fill
the buffer with ~0 for the EPERM "user lacks CAP_SYS_ADMIN" error.

Userspace may rely on the ~0 data to detect errors, but after e4585da2,
that would not detect CAP_SYS_ADMIN errors.

Restore the original behaviour of filling the buffer with ~0 when the
CAP_SYS_ADMIN check fails.

[bhelgaas: commit log, fold in Nathan's fix
https://lore.kernel.org/r/20210803200836.500658-1-nathan@kernel.org]
Fixes: e4585da2 ("pci syscall.c: Switch to refcounting API")
Link: https://lore.kernel.org/r/20210729233755.1509616-1-kw@linux.comSigned-off-by: NKrzysztof Wilczyński <kw@linux.com>
Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

264c05e2

block: bfq: fix bfq_set_next_ioprio_data() · 41b4ec95

由 Damien Le Moal 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 35a276f173356448cd07d83238d2bbc30796b267

--------------------------------

commit a680dd72 upstream.

For a request that has a priority level equal to or larger than
IOPRIO_BE_NR, bfq_set_next_ioprio_data() prints a critical warning but
defaults to setting the request new_ioprio field to IOPRIO_BE_NR. This
is not consistent with the warning and the allowed values for priority
levels. Fix this by setting the request new_ioprio field to
IOPRIO_BE_NR - 1, the lowest priority level allowed.

Cc: <stable@vger.kernel.org>
Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20210811033702.368488-2-damien.lemoal@wdc.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

41b4ec95

arm64: head: avoid over-mapping in map_memory · 35fab104

由 Mark Rutland 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 79d15fc5646f79e769b90957370a84c0ba981c85

--------------------------------

commit 90268574 upstream.

The `compute_indices` and `populate_entries` macros operate on inclusive
bounds, and thus the `map_memory` macro which uses them also operates
on inclusive bounds.

We pass `_end` and `_idmap_text_end` to `map_memory`, but these are
exclusive bounds, and if one of these is sufficiently aligned (as a
result of kernel configuration, physical placement, and KASLR), then:

* In `compute_indices`, the computed `iend` will be in the page/block *after*
  the final byte of the intended mapping.

* In `populate_entries`, an unnecessary entry will be created at the end
  of each level of table. At the leaf level, this entry will map up to
  SWAPPER_BLOCK_SIZE bytes of physical addresses that we did not intend
  to map.

As we may map up to SWAPPER_BLOCK_SIZE bytes more than intended, we may
violate the boot protocol and map physical address past the 2MiB-aligned
end address we are permitted to map. As we map these with Normal memory
attributes, this may result in further problems depending on what these
physical addresses correspond to.

The final entry at each level may require an additional table at that
level. As EARLY_ENTRIES() calculates an inclusive bound, we allocate
enough memory for this.

Avoid the extraneous mapping by having map_memory convert the exclusive
end address to an inclusive end address by subtracting one, and do
likewise in EARLY_ENTRIES() when calculating the number of required
tables. For clarity, comments are updated to more clearly document which
boundaries the macros operate on.  For consistency with the other
macros, the comments in map_memory are also updated to describe `vstart`
and `vend` as virtual addresses.

Fixes: 0370b31e ("arm64: Extend early page table code to allow for larger kernels")
Cc: <stable@vger.kernel.org> # 4.16.x
Signed-off-by: NMark Rutland <mark.rutland@arm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Steve Capper <steve.capper@arm.com>
Cc: Will Deacon <will@kernel.org>
Acked-by: NWill Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210823101253.55567-1-mark.rutland@arm.comSigned-off-by: NCatalin Marinas <catalin.marinas@arm.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

35fab104

bpf: Fix pointer arithmetic mask tightening under state pruning · 31f34e09

由 Daniel Borkmann 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit a386fceed74e9f1125104eea7e66cff187edeff7

--------------------------------

commit e042aa53 upstream.

In 7fedb63a ("bpf: Tighten speculative pointer arithmetic mask") we
narrowed the offset mask for unprivileged pointer arithmetic in order to
mitigate a corner case where in the speculative domain it is possible to
advance, for example, the map value pointer by up to value_size-1 out-of-
bounds in order to leak kernel memory via side-channel to user space.

The verifier's state pruning for scalars leaves one corner case open
where in the first verification path R_x holds an unknown scalar with an
aux->alu_limit of e.g. 7, and in a second verification path that same
register R_x, here denoted as R_x', holds an unknown scalar which has
tighter bounds and would thus satisfy range_within(R_x, R_x') as well as
tnum_in(R_x, R_x') for state pruning, yielding an aux->alu_limit of 3:
Given the second path fits the register constraints for pruning, the final
generated mask from aux->alu_limit will remain at 7. While technically
not wrong for the non-speculative domain, it would however be possible
to craft similar cases where the mask would be too wide as in 7fedb63a.

One way to fix it is to detect the presence of unknown scalar map pointer
arithmetic and force a deeper search on unknown scalars to ensure that
we do not run into a masking mismatch.
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
[OP: adjusted context for 4.19]
Signed-off-by: NOvidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

31f34e09

bpf: verifier: Allocate idmap scratch in verifier env · 4b498490

由 Lorenz Bauer 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit a4fe956b03316516ce0eca037d480d270b9bed4c

--------------------------------

commit c9e73e3d upstream.

func_states_equal makes a very short lived allocation for idmap,
probably because it's too large to fit on the stack. However the
function is called quite often, leading to a lot of alloc / free
churn. Replace the temporary allocation with dedicated scratch
space in struct bpf_verifier_env.
Signed-off-by: NLorenz Bauer <lmb@cloudflare.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NEdward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/bpf/20210429134656.122225-4-lmb@cloudflare.com
[OP: adjusted context for 4.19]
Signed-off-by: NOvidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

4b498490

selftests/bpf: fix tests due to const spill/fill · d15e8e49

由 Alexei Starovoitov 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit fc578c64398df4f93fd7a6143e218281cc7c4348

--------------------------------

commit fc559a70 upstream.

fix tests that incorrectly assumed that the verifier
cannot track constants through stack.
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
Acked-by: NAndrii Nakryiko <andriin@fb.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
[OP: backport to 4.19]
Signed-off-by: NOvidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

d15e8e49

selftests/bpf: Test variable offset stack access · 55219307

由 Andrey Ignatov 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 2a28675d4b5cff2bfa0b3475c53edba1f5ec8401

--------------------------------

commit 8ff80e96 upstream.

Test different scenarios of indirect variable-offset stack access: out of
bound access (>0), min_off below initialized part of the stack,
max_off+size above initialized part of the stack, initialized stack.

Example of output:
  ...
  #856/p indirect variable-offset stack access, out of bound OK
  #857/p indirect variable-offset stack access, max_off+size > max_initialized OK
  #858/p indirect variable-offset stack access, min_off < min_initialized OK
  #859/p indirect variable-offset stack access, ok OK
  ...
Signed-off-by: NAndrey Ignatov <rdna@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
[OP: backport to 4.19]
Signed-off-by: NOvidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

55219307

bpf: Sanity check max value for var_off stack access · e0b7ecca

由 Andrey Ignatov 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 7667818ef188832f69e2cf9cfed56e03a8f7436b

--------------------------------

stable inclusion
from linux-4.19.207
commit 7667818ef188832f69e2cf9cfed56e03a8f7436b

--------------------------------

commit 107c26a7 upstream.

As discussed in [1] max value of variable offset has to be checked for
overflow on stack access otherwise verifier would accept code like this:

  0: (b7) r2 = 6
  1: (b7) r3 = 28
  2: (7a) *(u64 *)(r10 -16) = 0
  3: (7a) *(u64 *)(r10 -8) = 0
  4: (79) r4 = *(u64 *)(r1 +168)
  5: (c5) if r4 s< 0x0 goto pc+4
   R1=ctx(id=0,off=0,imm=0) R2=inv6 R3=inv28
   R4=inv(id=0,umax_value=9223372036854775807,var_off=(0x0;
   0x7fffffffffffffff)) R10=fp0,call_-1 fp-8=mmmmmmmm fp-16=mmmmmmmm
  6: (17) r4 -= 16
  7: (0f) r4 += r10
  8: (b7) r5 = 8
  9: (85) call bpf_getsockopt#57
  10: (b7) r0 = 0
  11: (95) exit

, where R4 obviosly has unbounded max value.

Fix it by checking that reg->smax_value is inside (-BPF_MAX_VAR_OFF;
BPF_MAX_VAR_OFF) range.

reg->smax_value is used instead of reg->umax_value because stack
pointers are calculated using negative offset from fp. This is opposite
to e.g. map access where offset must be non-negative and where
umax_value is used.

Also dedicated verbose logs are added for both min and max bound check
failures to have diagnostics consistent with variable offset handling in
check_map_access().

[1] https://marc.info/?l=linux-netdev&m=155433357510597&w=2

Fixes: 2011fccf ("bpf: Support variable offset stack access from helpers")
Reported-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAndrey Ignatov <rdna@fb.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NOvidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

e0b7ecca

bpf: Reject indirect var_off stack access in unpriv mode · ce768c92

由 Andrey Ignatov 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 14cf676ba6a0fb5e495baf843d750070f627d7e8

--------------------------------

commit 088ec26d upstream.

Proper support of indirect stack access with variable offset in
unprivileged mode (!root) requires corresponding support in Spectre
masking for stack ALU in retrieve_ptr_limit().

There are no use-case for variable offset in unprivileged mode though so
make verifier reject such accesses for simplicity.

Pointer arithmetics is one (and only?) way to cause variable offset and
it's already rejected in unpriv mode so that verifier won't even get to
helper function whose argument contains variable offset, e.g.:

  0: (7a) *(u64 *)(r10 -16) = 0
  1: (7a) *(u64 *)(r10 -8) = 0
  2: (61) r2 = *(u32 *)(r1 +0)
  3: (57) r2 &= 4
  4: (17) r2 -= 16
  5: (0f) r2 += r10
  variable stack access var_off=(0xfffffffffffffff0; 0x4) off=-16 size=1R2
  stack pointer arithmetic goes out of range, prohibited for !root

Still it looks like a good idea to reject variable offset indirect stack
access for unprivileged mode in check_stack_boundary() explicitly.

Fixes: 2011fccf ("bpf: Support variable offset stack access from helpers")
Reported-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAndrey Ignatov <rdna@fb.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
[OP: drop comment in retrieve_ptr_limit()]
Signed-off-by: NOvidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

ce768c92

bpf: Reject indirect var_off stack access in raw mode · 0220ff68

由 Andrey Ignatov 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 148339cb18ed38a4b1de2269d1a4227fd51376ef

--------------------------------

commit f2bcd05e upstream.

It's hard to guarantee that whole memory is marked as initialized on
helper return if uninitialized stack is accessed with variable offset
since specific bounds are unknown to verifier. This may cause
uninitialized stack leaking.

Reject such an access in check_stack_boundary to prevent possible
leaking.

There are no known use-cases for indirect uninitialized stack access
with variable offset so it shouldn't break anything.

Fixes: 2011fccf ("bpf: Support variable offset stack access from helpers")
Reported-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAndrey Ignatov <rdna@fb.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NOvidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

0220ff68

bpf: Support variable offset stack access from helpers · d8d61987

由 Andrey Ignatov 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit caee4103f09e3dcce2ec0d0faf6ff245a6cffbad

--------------------------------

commit 2011fccf upstream.

Currently there is a difference in how verifier checks memory access for
helper arguments for PTR_TO_MAP_VALUE and PTR_TO_STACK with regard to
variable part of offset.

check_map_access, that is used for PTR_TO_MAP_VALUE, can handle variable
offsets just fine, so that BPF program can call a helper like this:

  some_helper(map_value_ptr + off, size);

, where offset is unknown at load time, but is checked by program to be
in a safe rage (off >= 0 && off + size < map_value_size).

But it's not the case for check_stack_boundary, that is used for
PTR_TO_STACK, and same code with pointer to stack is rejected by
verifier:

  some_helper(stack_value_ptr + off, size);

For example:
  0: (7a) *(u64 *)(r10 -16) = 0
  1: (7a) *(u64 *)(r10 -8) = 0
  2: (61) r2 = *(u32 *)(r1 +0)
  3: (57) r2 &= 4
  4: (17) r2 -= 16
  5: (0f) r2 += r10
  6: (18) r1 = 0xffff888111343a80
  8: (85) call bpf_map_lookup_elem#1
  invalid variable stack read R2 var_off=(0xfffffffffffffff0; 0x4)

Add support for variable offset access to check_stack_boundary so that
if offset is checked by program to be in a safe range it's accepted by
verifier.
Signed-off-by: NAndrey Ignatov <rdna@fb.com>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
[OP: replace reg_state(env, regno) helper with "cur_regs(env) + regno"]
Signed-off-by: NOvidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Conflicts:
  kernel/bpf/verifier.c
[yyl: adjust context]
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

d8d61987

bpf: correct slot_type marking logic to allow more stack slot sharing · a171f137

由 Jiong Wang 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 79aba0ac3df1a604e843780b17c37646e175b4f8

--------------------------------

commit 0bae2d4d upstream.

Verifier is supposed to support sharing stack slot allocated to ptr with
SCALAR_VALUE for privileged program. However this doesn't happen for some
cases.

The reason is verifier is not clearing slot_type STACK_SPILL for all bytes,
it only clears part of them, while verifier is using:

  slot_type[0] == STACK_SPILL

as a convention to check one slot is ptr type.

So, the consequence of partial clearing slot_type is verifier could treat a
partially overridden ptr slot, which should now be a SCALAR_VALUE slot,
still as ptr slot, and rejects some valid programs.

Before this patch, test_xdp_noinline.o under bpf selftests, bpf_lxc.o and
bpf_netdev.o under Cilium bpf repo, when built with -mattr=+alu32 are
rejected due to this issue. After this patch, they all accepted.

There is no processed insn number change before and after this patch on
Cilium bpf programs.
Reviewed-by: NJakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: NJiong Wang <jiong.wang@netronome.com>
Reviewed-by: NDaniel Borkmann <daniel@iogearbox.net>
Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
[OP: adjusted context for 4.19]
Signed-off-by: NOvidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

a171f137

PCI/MSI: Skip masking MSI-X on Xen PV · 23536a74

由 Marek Marczykowski-Górecki 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 34b23fc32c05f14fd21caa32544f3720e1527a8d

--------------------------------

commit 1a519dc7 upstream.

When running as Xen PV guest, masking MSI-X is a responsibility of the
hypervisor. The guest has no write access to the relevant BAR at all - when
it tries to, it results in a crash like this:

    BUG: unable to handle page fault for address: ffffc9004069100c
    #PF: supervisor write access in kernel mode
    #PF: error_code(0x0003) - permissions violation
    RIP: e030:__pci_enable_msix_range.part.0+0x26b/0x5f0
     e1000e_set_interrupt_capability+0xbf/0xd0 [e1000e]
     e1000_probe+0x41f/0xdb0 [e1000e]
     local_pci_probe+0x42/0x80
    (...)

The recently introduced function msix_mask_all() does not check the global
variable pci_msi_ignore_mask which is set by XEN PV to bypass the masking
of MSI[-X] interrupts.

Add the check to make this function XEN PV compatible.

Fixes: 7d5ec3d3 ("PCI/MSI: Mask all unused MSI-X entries")
Signed-off-by: NMarek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NBjorn Helgaas <bhelgaas@google.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210826170342.135172-1-marmarek@invisiblethingslab.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

23536a74

tty: Fix data race between tiocsti() and flush_to_ldisc() · 47baf1c5

由 Nguyen Dinh Phi 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit f7bffefa322a3d5a292c0b7a9b93302b392928f6

--------------------------------

commit bb2853a6 upstream.

The ops->receive_buf() may be accessed concurrently from these two
functions.  If the driver flushes data to the line discipline
receive_buf() method while tiocsti() is waiting for the
ops->receive_buf() to finish its work, the data race will happen.

For example:
tty_ioctl			|tty_ldisc_receive_buf
 ->tioctsi			| ->tty_port_default_receive_buf
				|  ->tty_ldisc_receive_buf
   ->hci_uart_tty_receive	|   ->hci_uart_tty_receive
    ->h4_recv                   |    ->h4_recv

In this case, the h4 receive buffer will be overwritten by the
latecomer, and we will lost the data.

Hence, change tioctsi() function to use the exclusive lock interface
from tty_buffer to avoid the data race.

Reported-by: syzbot+97388eb9d31b997fe1d0@syzkaller.appspotmail.com
Reviewed-by: NJiri Slaby <jirislaby@kernel.org>
Signed-off-by: NNguyen Dinh Phi <phind.uet@gmail.com>
Link: https://lore.kernel.org/r/20210823000641.2082292-1-phind.uet@gmail.com
Cc: stable <stable@vger.kernel.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

47baf1c5

net: sched: Fix qdisc_rate_table refcount leak when get tcf_block failed · ee6324bb

由 Xiyu Yang 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 5129766a9797ea212087314fafc68268f1bf8515

--------------------------------

[ Upstream commit c6607012 ]

The reference counting issue happens in one exception handling path of
cbq_change_class(). When failing to get tcf_block, the function forgets
to decrease the refcount of "rtab" increased by qdisc_put_rtab(),
causing a refcount leak.

Fix this issue by jumping to "failure" label when get tcf_block failed.

Fixes: 6529eaba ("net: sched: introduce tcf block infractructure")
Signed-off-by: NXiyu Yang <xiyuyang19@fudan.edu.cn>
Reviewed-by: NCong Wang <cong.wang@bytedance.com>
Link: https://lore.kernel.org/r/1630252681-71588-1-git-send-email-xiyuyang19@fudan.edu.cnSigned-off-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

ee6324bb

tty: serial: fsl_lpuart: fix the wrong mapbase value · bda78699

由 Andy Duan 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 600624ecbe35e0359806b107f247811b1f79ad25

--------------------------------

[ Upstream commit d5c38948 ]

Register offset needs to be applied on mapbase also.
dma_tx/rx_request use the physical address of UARTDATA.
Register offset is currently only applied to membase (the
corresponding virtual addr) but not on mapbase.

Fixes: 24b1e5f0 ("tty: serial: lpuart: add imx7ulp support")
Reviewed-by: NLeonard Crestez <leonard.crestez@nxp.com>
Signed-off-by: NAdriana Reus <adriana.reus@nxp.com>
Signed-off-by: NSherry Sun <sherry.sun@nxp.com>
Signed-off-by: NAndy Duan <fugang.duan@nxp.com>
Link: https://lore.kernel.org/r/20210819021033.32606-1-sherry.sun@nxp.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

bda78699

CIFS: Fix a potencially linear read overflow · 9d46c552

由 Len Baker 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit bea655491daf39f1934a71bf576bf3499092d3a4

--------------------------------

[ Upstream commit f980d055 ]

strlcpy() reads the entire source buffer first. This read may exceed the
destination size limit. This is both inefficient and can lead to linear
read overflows if a source string is not NUL-terminated.

Also, the strnlen() call does not avoid the read overflow in the strlcpy
function when a not NUL-terminated string is passed.

So, replace this block by a call to kstrndup() that avoids this type of
overflow and does the same.

Fixes: 066ce689 ("cifs: rename cifs_strlcpy_to_host and make it use new functions")
Signed-off-by: NLen Baker <len.baker@gmx.com>
Reviewed-by: NPaulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: NJeff Layton <jlayton@kernel.org>
Signed-off-by: NSteve French <stfrench@microsoft.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

9d46c552

PCI: PM: Enable PME if it can be signaled from D3cold · 17e4b893

由 Rafael J. Wysocki 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit d4bcd29b50d1a33111d3bacd7974f0d0bbec484f

--------------------------------

[ Upstream commit 0e00392a ]

PME signaling is only enabled by __pci_enable_wake() if the target
device can signal PME from the given target power state (to avoid
pointless reconfiguration of the device), but if the hierarchy above
the device goes into D3cold, the device itself will end up in D3cold
too, so if it can signal PME from D3cold, it should be enabled to
do so in __pci_enable_wake().

[Note that if the device does not end up in D3cold and it cannot
 signal PME from the original target power state, it will not signal
 PME, so in that case the behavior does not change.]

Link: https://lore.kernel.org/linux-pm/3149540.aeNJFYEL58@kreacher/
Fixes: 5bcc2fb4 ("PCI PM: Simplify PCI wake-up code")
Reported-by: NMika Westerberg <mika.westerberg@linux.intel.com>
Reported-by: NUtkarsh H Patel <utkarsh.h.patel@intel.com>
Reported-by: NKoba Ko <koba.ko@canonical.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: NMika Westerberg <mika.westerberg@linux.intel.com>
Tested-by: NMika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

17e4b893

PCI: PM: Avoid forcing PCI_D0 for wakeup reasons inconsistently · 4f78495b

由 Rafael J. Wysocki 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit c5c62f4c936407fb734cae64700f20b68273b059

--------------------------------

[ Upstream commit da9f2150 ]

It is inconsistent to return PCI_D0 from pci_target_state() instead
of the original target state if 'wakeup' is true and the device
cannot signal PME from D0.

This only happens when the device cannot signal PME from the original
target state and any shallower power states (including D0) and that
case is effectively equivalent to the one in which PME singaling is
not supported at all.  Since the original target state is returned in
the latter case, make the function do that in the former one too.

Link: https://lore.kernel.org/linux-pm/3149540.aeNJFYEL58@kreacher/
Fixes: 666ff6f8 ("PCI/PM: Avoid using device_may_wakeup() for runtime PM")
Reported-by: NMika Westerberg <mika.westerberg@linux.intel.com>
Reported-by: NUtkarsh H Patel <utkarsh.h.patel@intel.com>
Reported-by: NKoba Ko <koba.ko@canonical.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: NMika Westerberg <mika.westerberg@linux.intel.com>
Tested-by: NMika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

4f78495b

tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos · 6b40e3fa

由 Martin KaFai Lau 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 15bf24c5fb7948564397b450ead0441143d42af2

--------------------------------

[ Upstream commit 525e2f9f ]

st->bucket stores the current bucket number.
st->offset stores the offset within this bucket that is the sk to be
seq_show().  Thus, st->offset only makes sense within the same
st->bucket.

These two variables are an optimization for the common no-lseek case.
When resuming the seq_file iteration (i.e. seq_start()),
tcp_seek_last_pos() tries to continue from the st->offset
at bucket st->bucket.

However, it is possible that the bucket pointed by st->bucket
has changed and st->offset may end up skipping the whole st->bucket
without finding a sk.  In this case, tcp_seek_last_pos() currently
continues to satisfy the offset condition in the next (and incorrect)
bucket.  Instead, regardless of the offset value, the first sk of the
next bucket should be returned.  Thus, "bucket == st->bucket" check is
added to tcp_seek_last_pos().

The chance of hitting this is small and the issue is a decade old,
so targeting for the next tree.

Fixes: a8b690f9 ("tcp: Fix slowness in read /proc/net/tcp")
Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
Signed-off-by: NAndrii Nakryiko <andrii@kernel.org>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Acked-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: NYonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200541.1033917-1-kafai@fb.comSigned-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

6b40e3fa

fcntl: fix potential deadlock for &fasync_struct.fa_lock · 36d7274a

由 Desmond Cheong Zhi Xi 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 64dd1fbb0bb743ccd2fb420c441f4ca9732f598f

--------------------------------

[ Upstream commit 2f488f69 ]

There is an existing lock hierarchy of
&dev->event_lock --> &fasync_struct.fa_lock --> &f->f_owner.lock
from the following call chain:

  input_inject_event():
    spin_lock_irqsave(&dev->event_lock,...);
    input_handle_event():
      input_pass_values():
        input_to_handler():
          evdev_events():
            evdev_pass_values():
              spin_lock(&client->buffer_lock);
              __pass_event():
                kill_fasync():
                  kill_fasync_rcu():
                    read_lock(&fa->fa_lock);
                    send_sigio():
                      read_lock_irqsave(&fown->lock,...);

&dev->event_lock is HARDIRQ-safe, so interrupts have to be disabled
while grabbing &fasync_struct.fa_lock, otherwise we invert the lock
hierarchy. However, since kill_fasync which calls kill_fasync_rcu is
an exported symbol, it may not necessarily be called with interrupts
disabled.

As kill_fasync_rcu may be called with interrupts disabled (for
example, in the call chain above), we replace calls to
read_lock/read_unlock on &fasync_struct.fa_lock in kill_fasync_rcu
with read_lock_irqsave/read_unlock_irqrestore.
Signed-off-by: NDesmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Signed-off-by: NJeff Layton <jlayton@kernel.org>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

36d7274a

hrtimer: Avoid double reprogramming in __hrtimer_start_range_ns() · 14c177da

由 Thomas Gleixner 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit e6c3fefc6bb11bef1bd8adfe37e0f317303ab751

--------------------------------

[ Upstream commit 627ef5ae ]

If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then
the timer has to be canceled first and then added back. If the timer is the
first expiring timer then on removal the clockevent device is reprogrammed
to the next expiring timer to avoid that the pending expiry fires needlessly.

If the new expiry time ends up to be the first expiry again then the clock
event device has to reprogrammed again.

Avoid this by checking whether the timer is the first to expire and in that
case, keep the timer on the current CPU and delay the reprogramming up to
the point where the timer has been enqueued again.
Reported-by: NLorenzo Colitti <lorenzo@google.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135157.873137732@linutronix.deSigned-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

14c177da

sched/deadline: Fix missing clock update in migrate_task_rq_dl() · 9d89e2e2

由 Dietmar Eggemann 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 293fe77dbfe68c5794be4e76c9212003e9304d24

--------------------------------

[ Upstream commit b4da13aa ]

A missing clock update is causing the following warning:

rq->clock_update_flags < RQCF_ACT_SKIP
WARNING: CPU: 112 PID: 2041 at kernel/sched/sched.h:1453
sub_running_bw.isra.0+0x190/0x1a0
...
CPU: 112 PID: 2041 Comm: sugov:112 Tainted: G W 5.14.0-rc1 #1
Hardware name: WIWYNN Mt.Jade Server System
B81.030Z1.0007/Mt.Jade Motherboard, BIOS 1.6.20210526 (SCP:
1.06.20210526) 2021/05/26
...
Call trace:
  sub_running_bw.isra.0+0x190/0x1a0
  migrate_task_rq_dl+0xf8/0x1e0
  set_task_cpu+0xa8/0x1f0
  try_to_wake_up+0x150/0x3d4
  wake_up_q+0x64/0xc0
  __up_write+0xd0/0x1c0
  up_write+0x4c/0x2b0
  cppc_set_perf+0x120/0x2d0
  cppc_cpufreq_set_target+0xe0/0x1a4 [cppc_cpufreq]
  __cpufreq_driver_target+0x74/0x140
  sugov_work+0x64/0x80
  kthread_worker_fn+0xe0/0x230
  kthread+0x138/0x140
  ret_from_fork+0x10/0x18

The task causing this is the `cppc_fie` DL task introduced by
commit 1eb5dde6 ("cpufreq: CPPC: Add support for frequency
invariance").

With CONFIG_ACPI_CPPC_CPUFREQ_FIE=y and schedutil cpufreq governor on
slow-switching system (like on this Ampere Altra WIWYNN Mt. Jade Arm
Server):

DL task `curr=sugov:112` lets `p=cppc_fie` migrate and since the latter
is in `non_contending` state, migrate_task_rq_dl() calls

  sub_running_bw()->__sub_running_bw()->cpufreq_update_util()->
  rq_clock()->assert_clock_updated()

on p.

Fix this by updating the clock for a non_contending task in
migrate_task_rq_dl() before calling sub_running_bw().
Reported-by: NBruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NDaniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: NJuri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20210804135925.3734605-1-dietmar.eggemann@arm.comSigned-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

9d89e2e2

sched/deadline: Fix reset_on_fork reporting of DL tasks · 935318be

由 Quentin Perret 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit bc0d543f53e0d73059cf10733be0109984dadd44

--------------------------------

[ Upstream commit f9509153 ]

It is possible for sched_getattr() to incorrectly report the state of
the reset_on_fork flag when called on a deadline task.

Indeed, if the flag was set on a deadline task using sched_setattr()
with flags (SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_KEEP_PARAMS), then
p->sched_reset_on_fork will be set, but __setscheduler() will bail out
early, which means that the dl_se->flags will not get updated by
__setscheduler_params()->__setparam_dl(). Consequently, if
sched_getattr() is then called on the task, __getparam_dl() will
override kattr.sched_flags with the now out-of-date copy in dl_se->flags
and report the stale value to userspace.

To fix this, make sure to only copy the flags that are relevant to
sched_deadline to and from the dl_se->flags field.
Signed-off-by: NQuentin Perret <qperret@google.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210727101103.2729607-2-qperret@google.comSigned-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

935318be

locking/mutex: Fix HANDOFF condition · 2bf428f5

由 Peter Zijlstra 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit b7a041073f4e72083dcd688a003a59bf9d9c9124

--------------------------------

[ Upstream commit 048661a1 ]

Yanfei reported that setting HANDOFF should not depend on recomputing
@first, only on @first state. Which would then give:

  if (ww_ctx || !first)
    first = __mutex_waiter_is_first(lock, &waiter);
  if (first)
    __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);

But because 'ww_ctx || !first' is basically 'always' and the test for
first is relatively cheap, omit that first branch entirely.
Reported-by: NYanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NWaiman Long <longman@redhat.com>
Reviewed-by: NYanfei Xu <yanfei.xu@windriver.com>
Link: https://lore.kernel.org/r/20210630154114.896786297@infradead.orgSigned-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

2bf428f5

ipv4/icmp: l3mdev: Perform icmp error route lookup on source device routing table (v2) · 82b68d94

由 Mathieu Desnoyers 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit 69178f2d652e64991dc812be24d0776a731f447d

--------------------------------

commit e1e84eb5 upstream.

As per RFC792, ICMP errors should be sent to the source host.

However, in configurations with Virtual Routing and Forwarding tables,
looking up which routing table to use is currently done by using the
destination net_device.

commit 9d1a6c4e ("net: icmp_route_lookup should use rt dev to
determine L3 domain") changes the interface passed to
l3mdev_master_ifindex() and inet_addr_type_dev_table() from skb_in->dev
to skb_dst(skb_in)->dev. This effectively uses the destination device
rather than the source device for choosing which routing table should be
used to lookup where to send the ICMP error.

Therefore, if the source and destination interfaces are within separate
VRFs, or one in the global routing table and the other in a VRF, looking
up the source host in the destination interface's routing table will
fail if the destination interface's routing table contains no route to
the source host.

One observable effect of this issue is that traceroute does not work in
the following cases:

- Route leaking between global routing table and VRF
- Route leaking between VRFs

Preferably use the source device routing table when sending ICMP error
messages. If no source device is set, fall-back on the destination
device routing table. Else, use the main routing table (index 0).

[ It has been pointed out that a similar issue may exist with ICMP
errors triggered when forwarding between network namespaces. It would
be worthwhile to investigate, but is outside of the scope of this
investigation. ]

[ It has also been pointed out that a similar issue exists with
unreachable / fragmentation needed messages, which can be triggered by
changing the MTU of eth1 in r1 to 1400 and running:

ip netns exec h1 ping -s 1450 -Mdo -c1 172.16.2.2

Some investigation points to raw_icmp_error() and raw_err() as being
involved in this last scenario. The focus of this patch is TTL expired
ICMP messages, which go through icmp_route_lookup.
Investigation of failure modes related to raw_icmp_error() is beyond
this investigation's scope. ]

Fixes: 9d1a6c4e ("net: icmp_route_lookup should use rt dev to determine L3 domain")
Link: https://tools.ietf.org/html/rfc792Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NJakub Kicinski <kuba@kernel.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

82b68d94

perf/x86/intel/pt: Fix mask of num_address_ranges · ce780060

由 Xiaoyao Li 提交于 11月 12, 2021

stable inclusion
from linux-4.19.207
commit f109e8a1678ce920b7f0df7865f2a31754ec5d1c

--------------------------------

[ Upstream commit c53c6b74 ]

Per SDM, bit 2:0 of CPUID(0x14,1).EAX[2:0] reports the number of
configurable address ranges for filtering, not bit 1:0.
Signed-off-by: NXiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com>
Link: https://lkml.kernel.org/r/20210824040622.4081502-1-xiaoyao.li@intel.comSigned-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

ce780060

Revert "EMMC: ascend customized emmc host" · 3af94489