Commits · 5.10.0-60.36.0 · openeuler / Kernel

May 31, 2022 · 38 commits

net, xdp: Update pkt_type if generic XDP changes unicast MAC · 5dbf395c

Committed by Martin Willi on May 31, 2022

mainline inclusion
from mainline-v5.13-rc1
commit 22b60343
category: bugfix
bugzilla: 186880 https://gitee.com/openeuler/kernel/issues/I5A7W4

--------------------------------

If a generic XDP program changes the destination MAC address from/to
multicast/broadcast, the skb->pkt_type is updated to properly handle
the packet when passed up the stack. When changing the MAC from/to
the NIC's MAC, PACKET_HOST/OTHERHOST is not updated, though, making
the behavior different from that of native XDP.

Remember the PACKET_HOST/OTHERHOST state before calling the program
in generic XDP, and update pkt_type accordingly if the destination
MAC address has changed. As eth_type_trans() assumes a default
pkt_type of PACKET_HOST, restore that before calling it.

The use case for this is when an XDP program wants to push received
packets up the stack by rewriting the MAC to the NIC's MAC, for
example by cluster nodes sharing MAC addresses.

Fixes: 29724956 ("net: fix generic XDP to handle if eth header was mangled")
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210419141559.8611-1-martin@strongswan.org
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Conflict:
	net/core/dev.c
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
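
Illustration (not part of the patch): a minimal userspace sketch of the
idea described above, assuming a simplified model in which the packet type
is re-derived only when the program changed the host or multicast status of
the destination MAC. All toy_* names are hypothetical and do not exist in
the kernel.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_ETH_ALEN 6

/* Toy stand-ins for the kernel's PACKET_* values (hypothetical names). */
enum toy_pkt_type { TOY_HOST, TOY_BROADCAST, TOY_MULTICAST, TOY_OTHERHOST };

static bool toy_is_multicast(const uint8_t *mac)
{
	/* Group bit of the first octet; broadcast has it set as well. */
	return (mac[0] & 0x01) != 0;
}

static bool toy_is_broadcast(const uint8_t *mac)
{
	static const uint8_t bcast[TOY_ETH_ALEN] = {
		0xff, 0xff, 0xff, 0xff, 0xff, 0xff
	};

	return memcmp(mac, bcast, TOY_ETH_ALEN) == 0;
}

static bool toy_mac_equal(const uint8_t *a, const uint8_t *b)
{
	return memcmp(a, b, TOY_ETH_ALEN) == 0;
}

/* Classify from scratch, starting from the default of "host". */
static enum toy_pkt_type toy_classify(const uint8_t *dest,
				      const uint8_t *dev_addr)
{
	if (toy_is_broadcast(dest))
		return TOY_BROADCAST;
	if (toy_is_multicast(dest))
		return TOY_MULTICAST;
	if (!toy_mac_equal(dest, dev_addr))
		return TOY_OTHERHOST;
	return TOY_HOST;
}

/*
 * Remember the host/multicast state before the program ran and
 * reclassify only if that state differs afterwards.
 */
static enum toy_pkt_type toy_after_xdp(enum toy_pkt_type old_type,
				       const uint8_t *dest_before,
				       const uint8_t *dest_after,
				       const uint8_t *dev_addr)
{
	bool orig_host = toy_mac_equal(dest_before, dev_addr);
	bool orig_mcast = toy_is_multicast(dest_before);

	if (orig_host != toy_mac_equal(dest_after, dev_addr) ||
	    orig_mcast != toy_is_multicast(dest_after))
		return toy_classify(dest_after, dev_addr);
	return old_type;
}
```

In this model, rewriting a foreign unicast MAC to the NIC's own MAC turns
an "other host" packet into a "host" packet, which is the use case the
commit message describes.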


KVM: x86/mmu: fix NULL pointer dereference on guest INVPCID · 0f9554c6

Committed by Paolo Bonzini on May 31, 2022

mainline inclusion
from mainline-v5.18
commit 9f46c187
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I59I19
CVE: CVE-2022-1789

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9f46c187e2e680ecd9de7983e4d081c3391acc76

--------------------------------

With shadow paging enabled, the INVPCID instruction results in a call
to kvm_mmu_invpcid_gva.  If INVPCID is executed with CR0.PG=0, the
invlpg callback is not set and the result is a NULL pointer dereference.
Fix it trivially by checking for mmu->invlpg before every call.

There are other possibilities:

- check for CR0.PG, because KVM (like all Intel processors after P5)
  flushes guest TLB on CR0.PG changes so that INVPCID/INVLPG are a
  nop with paging disabled

- check for EFER.LMA, because KVM syncs and flushes when switching
  MMU contexts outside of 64-bit mode

All of these are tricky, go for the simple solution.  This is CVE-2022-1789.

Reported-by: Yongkang Jia <kangel@zju.edu.cn>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Reviewed-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
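
Illustration (not part of the patch): a minimal sketch of the "check the
callback before every call" approach, assuming a simplified MMU context
whose invlpg pointer may be NULL. The toy_* names and the callback
signature are hypothetical and are not the KVM API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an MMU context. */
struct toy_mmu {
	/* May be NULL, for example when guest paging is disabled. */
	void (*invlpg)(unsigned long gva, unsigned long *flushed);
};

/* Toy callback: record which address was invalidated. */
static void toy_invlpg(unsigned long gva, unsigned long *flushed)
{
	*flushed = gva;
}

/* Returns true if the callback ran, false if it was skipped. */
static bool toy_invalidate_gva(const struct toy_mmu *mmu, unsigned long gva,
			       unsigned long *flushed)
{
	if (!mmu->invlpg)
		return false;	/* nothing to call, avoid the NULL dereference */

	mmu->invlpg(gva, flushed);
	return true;
}
```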


sched/psi: report zeroes for CPU full at the system level · c8d90d02

Committed by Chengming Zhou on May 31, 2022

mainline inclusion
from mainline-v5.18-rc4
commit 890d550d
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I562O9
CVE: NA
backport: openEuler-22.03-LTS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=890d550d7dbac7a31ecaa78732aa22be282bb6b8

--------------------------------

Martin found the /proc/pressure/cpu output confusing, and found no
hint about the CPU "full" line in the psi documentation.

% cat /proc/pressure/cpu
some avg10=0.92 avg60=0.91 avg300=0.73 total=933490489
full avg10=0.22 avg60=0.23 avg300=0.16 total=358783277

The PSI_CPU_FULL state was introduced by commit e7fcd762
("psi: Add PSI_CPU_FULL state"), which is mainly for the cgroup level,
but it is also counted at the system level as a side effect.

Naturally, the FULL state doesn't exist for the CPU resource at
the system level. These "full" numbers can come from CPU idle
schedule latency. For example, if t1 is the time when a task wakes up
on an idle CPU and t2 is the time when the CPU picks and switches to it,
the delta (t2 - t1) is accounted as CPU_FULL state.

Another case in which all processes can be stalled is when all cgroups
have been throttled at the same time, which is unlikely to happen.

Anyway, the CPU_FULL metric is meaningless and confusing at the
system level. So this patch will report zeroes for CPU full
at the system level, and update psi Documentation accordingly.

Fixes: e7fcd762 ("psi: Add PSI_CPU_FULL state")
Reported-by: Martin Steigerwald <Martin.Steigerwald@proact.de>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20220408121914.82855-1-zhouchengming@bytedance.com
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


psi: Fix PSI_MEM_FULL state when tasks are in memstall and doing reclaim · 66f30a83

Committed by Brian Chen on May 31, 2022

mainline inclusion
from mainline-v5.17-rc1
commit cb0e52b7
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I562O9
CVE: NA
backport: openEuler-22.03-LTS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf

--------------------------------

We've noticed cases where tasks in a cgroup are stalled on memory but
there is little memory FULL pressure since tasks stay on the runqueue
in reclaim.

A simple example involves a single threaded program that keeps leaking
and touching large amounts of memory. It runs in a cgroup with swap
enabled, memory.high set at 10M and cpu.max ratio set at 5%. Though
there is significant CPU pressure and memory SOME, there is barely any
memory FULL since the task enters reclaim and stays on the runqueue.
However, this memory-bound task is effectively stalled on memory and
we expect memory FULL to match memory SOME in this scenario.

The code is confused about memstall && running, thinking there is a
stalled task and a productive task when there's only one task: a
reclaimer that's counted as both. To fix this, we redefine the
condition for PSI_MEM_FULL to check that all running tasks are in an
active memstall instead of checking that there are no running tasks.

        case PSI_MEM_FULL:
-               return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
+               return unlikely(tasks[NR_MEMSTALL] &&
+                       tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);

This will capture reclaimers. It will also capture tasks that called
psi_memstall_enter() and are about to sleep, but this should be
negligible noise.

Signed-off-by: Brian Chen <brianchen118@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/r/20211110213312.310243-1-brianchen118@gmail.com
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
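
Illustration (not part of the patch): a small userspace model that compares
the old and new PSI_MEM_FULL conditions from the diff above. The index names
follow the diff; the counter array layout and the mem_full_*() helpers are
assumptions made for this sketch.

```c
#include <stdbool.h>

/* Indices named after the diff above; the array itself is a toy. */
enum { NR_RUNNING, NR_MEMSTALL, NR_MEMSTALL_RUNNING, NR_TOY_COUNTS };

/* Old condition: some task is stalled and nothing is runnable. */
static bool mem_full_old(const unsigned int *tasks)
{
	return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
}

/* New condition: every runnable task is itself in an active memstall. */
static bool mem_full_new(const unsigned int *tasks)
{
	return tasks[NR_MEMSTALL] &&
	       tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING];
}
```

For a lone reclaimer (one running task that is also in memstall), the old
condition reports no FULL pressure while the new one does. When a second,
productive task is running, both conditions report no FULL pressure.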


psi: Fix psi state corruption when schedule() races with cgroup move · b378d69f

Committed by Johannes Weiner on May 31, 2022

mainline inclusion
from mainline-v5.13-rc1
commit d583d360
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I562O9
CVE: NA
backport: openEuler-22.03-LTS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d583d360a620e6229422b3455d0be082b8255f5e

--------------------------------

4117cebf ("psi: Optimize task switch inside shared cgroups")
introduced a race condition that corrupts internal psi state. This
manifests as kernel warnings, sometimes followed by bogusly high IO
pressure:

  psi: task underflow! cpu=1 t=2 tasks=[0 0 0 0] clear=c set=0
  (schedule() decreasing RUNNING and ONCPU, both of which are 0)

  psi: incosistent task state! task=2412744:systemd cpu=17 psi_flags=e clear=3 set=0
  (cgroup_move_task() clearing MEMSTALL and IOWAIT, but task is MEMSTALL | RUNNING | ONCPU)

What the offending commit does is batch the two psi callbacks in
schedule() to reduce the number of cgroup tree updates. When prev is
deactivated and removed from the runqueue, nothing is done in psi at
first; when the task switch completes, TSK_RUNNING and TSK_IOWAIT are
updated along with TSK_ONCPU.

However, the deactivation and the task switch inside schedule() aren't
atomic: pick_next_task() may drop the rq lock for load balancing. When
this happens, cgroup_move_task() can run after the task has been
physically dequeued, but the psi updates are still pending. Since it
looks at the task's scheduler state, it doesn't move everything to the
new cgroup that the task switch that follows is about to clear from
it. cgroup_move_task() will leak the TSK_RUNNING count in the old
cgroup, and psi_sched_switch() will underflow it in the new cgroup.

A similar thing can happen for iowait. TSK_IOWAIT is usually set when
a p->in_iowait task is dequeued, but again this update is deferred to
the switch. cgroup_move_task() can see an unqueued p->in_iowait task
and move a non-existent TSK_IOWAIT. This results in the inconsistent
task state warning, as well as a counter underflow that will result in
permanent IO ghost pressure being reported.

Fix this bug by making cgroup_move_task() use task->psi_flags instead
of looking at the potentially mismatching scheduler state.

[ We used the scheduler state historically in order to not rely on
  task->psi_flags for anything but debugging. But that ship has sailed
  anyway, and this is simpler and more robust.

  We previously already batched TSK_ONCPU clearing with the
  TSK_RUNNING update inside the deactivation call from schedule(). But
  that ordering was safe and didn't result in TSK_ONCPU corruption:
  unlike most places in the scheduler, cgroup_move_task() only checked
  task_current() and handled TSK_ONCPU if the task was still queued. ]

Fixes: 4117cebf ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210503174917.38579-1-hannes@cmpxchg.org
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


psi: Reduce calls to sched_clock() in psi · 5f48fa66

Committed by Shakeel Butt on May 31, 2022

mainline inclusion
from mainline-v5.13-rc1
commit df774306
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I562O9
CVE: NA
backport: openEuler-22.03-LTS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=df77430639c9cf73559bac0f25084518bf9a812d

--------------------------------

We noticed that the cost of psi increases with the increase in the
levels of the cgroups. Particularly the cost of cpu_clock() sticks out
as the kernel calls it multiple times as it traverses up the cgroup
tree. This patch reduces the calls to cpu_clock().

Performed perf bench on Intel Broadwell with 3 levels of cgroup.

Before the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.747 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 3.516 [sec]

       3.516689 usecs/op
         284358 ops/sec

After the patch:

$ perf bench sched all
 # Running sched/messaging benchmark...
 # 20 sender and receiver processes per group
 # 10 groups == 400 processes run

     Total time: 0.640 [sec]

 # Running sched/pipe benchmark...
 # Executed 1000000 pipe operations between two processes

     Total time: 3.329 [sec]

       3.329820 usecs/op
         300316 ops/sec

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210321205156.4186483-1-shakeelb@google.com
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
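
Illustration (not part of the patch): a toy model of the general idea,
assuming the saving comes from reading the clock once and reusing the value
while walking up the cgroup tree. toy_clock() merely counts its own calls;
it stands in for cpu_clock() and every toy_* name is hypothetical.

```c
#include <stddef.h>

/* Hypothetical cgroup-like node with a parent pointer. */
struct toy_group {
	struct toy_group *parent;
	unsigned long long last_update;
};

static unsigned int toy_clock_calls;

/* Fake clock that records how often it is read. */
static unsigned long long toy_clock(void)
{
	toy_clock_calls++;
	return 1000ULL + toy_clock_calls;
}

/* Before: one clock read per level of the hierarchy. */
static void toy_update_per_level(struct toy_group *g)
{
	for (; g; g = g->parent)
		g->last_update = toy_clock();
}

/* After: read the clock once and pass the value up the tree. */
static void toy_update_once(struct toy_group *g)
{
	unsigned long long now = toy_clock();

	for (; g; g = g->parent)
		g->last_update = now;
}
```

With three levels of cgroups, as in the benchmark setup above, the first
variant reads the clock three times per update and the second reads it once.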


psi: Optimize task switch inside shared cgroups · 67c22ceb

Committed by Chengming Zhou on May 31, 2022

mainline inclusion
from mainline-v5.13-rc1
commit 4117cebf
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I562O9
CVE: NA
backport: openEuler-22.03-LTS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4117cebf1a9fcbf35b9aabf0e37b6c5eea296798

--------------------------------

Commit 36b238d5 ("psi: Optimize switching tasks inside shared
cgroups") updates only the cgroups whose state actually changes during
a task switch, but only in the task preempt case, not in the task sleep
case.

We actually don't need to clear and set the TSK_ONCPU state for the common
cgroups of the next and prev tasks in the sleep case either, which can save
many psi_group_change() calls, especially when most activity comes from one
leaf cgroup.

sleep before:
psi_dequeue()
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
psi_task_switch()
  while ((group = iterate_groups(next)))  # all ancestors
    psi_group_change(next, .set=TSK_ONCPU)

sleep after:
psi_dequeue()
  nop
psi_task_switch()
  while ((group = iterate_groups(next)))  # until (prev & next)
    psi_group_change(next, .set=TSK_ONCPU)
  while ((group = iterate_groups(prev)))  # all ancestors
    psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)

When a voluntary sleep switches to another task, we remove one call of
psi_group_change() for every common cgroup ancestor of the two tasks.

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-5-zhouchengming@bytedance.com
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
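
Illustration (not part of the patch): worked arithmetic for the "sleep
before" and "sleep after" pseudo-code above, assuming that each loop
iteration costs exactly one psi_group_change() call. The toy_* helpers and
their depth parameters are hypothetical.

```c
/*
 * prev_depth / next_depth: number of cgroup ancestors of each task.
 * common: number of ancestors the two tasks share (at most the smaller
 * of the two depths).
 */

/* Before: prev clears in all its ancestors, next sets in all of its own. */
static unsigned int toy_calls_before(unsigned int prev_depth,
				     unsigned int next_depth)
{
	return prev_depth + next_depth;
}

/* After: next stops at the first common ancestor, prev still walks all. */
static unsigned int toy_calls_after(unsigned int prev_depth,
				    unsigned int next_depth,
				    unsigned int common)
{
	return (next_depth - common) + prev_depth;
}
```

For two tasks in the same leaf cgroup three levels deep, this gives 6 calls
before and 3 after, that is, one call saved per common ancestor.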


psi: Pressure states are unlikely · 8f0aae7b

Committed by Johannes Weiner on May 31, 2022

mainline inclusion
from mainline-v5.13-rc1
commit fddc8bab
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I562O9
CVE: NA
backport: openEuler-22.03-LTS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fddc8bab531e217806b84906681324377d741c6c

--------------------------------

Move the unlikely branches out of line. This eliminates undesirable
jumps during wakeup and sleeps for workloads that aren't under any
sort of resource pressure.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-4-zhouchengming@bytedance.com
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


psi: Use ONCPU state tracking machinery to detect reclaim · 0eca26dc

Committed by Chengming Zhou on May 31, 2022

mainline inclusion
from mainline-v5.13-rc1
commit 7fae6c81
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I562O9
CVE: NA
backport: openEuler-22.03-LTS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7fae6c8171d20ac55402930ee8ae760cf85dff7b

--------------------------------

Move the reclaim detection from the timer tick to the task state
tracking machinery using the recently added ONCPU state. And we
also add task psi_flags changes checking in the psi_task_switch()
optimization to update the parents properly.

In terms of performance and cost, this ONCPU task state tracking
is not cheaper than the previous timer tick in aggregate. But the code is
simpler and shorter this way, so it's a maintainability win. Johannes
did some testing with perf bench, and the performance and cost
changes would be acceptable for real workloads.

Thanks to Johannes Weiner for pointing out the psi_task_switch()
optimization things and the clearer changelog.

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-3-zhouchengming@bytedance.com
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


psi: Add PSI_CPU_FULL state · 352a2f66

Committed by Chengming Zhou on May 31, 2022

mainline inclusion
from mainline-v5.13-rc1
commit e7fcd762
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I562O9
CVE: NA
backport: openEuler-22.03-LTS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e7fcd762282332f765af2035a9568fb126fa3c01

--------------------------------

The FULL state doesn't exist for the CPU resource at the system level,
but it does exist at the cgroup level: it means that all non-idle tasks in a
cgroup are delayed on the CPU resource, which is either used by others
outside of the cgroup or throttled by the cgroup cpu.max configuration.

Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-2-zhouchengming@bytedance.com
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


block/psi: remove PSI annotations from direct IO · 6c220e98

Committed by Pavel Begunkov on May 31, 2022

mainline inclusion
from mainline-v5.12-rc1
commit 0cf41e5e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I562O9
CVE: NA
backport: openEuler-22.03-LTS

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0cf41e5e9bafc185490624c3e321c915885a91f3

--------------------------------

Direct IO does not operate on the current working set of pages managed
by the kernel, so it should not be accounted as memory stall to PSI
infrastructure.

The block layer and iomap direct IO use bio_iov_iter_get_pages()
to build bios, and they are the only users of it, so to avoid PSI
tracking for them, clear the BIO_WORKINGSET flag. Do the same for
dio_bio_submit() because fs/direct_io constructs bios by hand directly
calling bio_add_page().

Reported-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Christoph Hellwig <hch@infradead.org>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


psi: make kabi compatibility for psi in struct cgroup · 2b61a0d5

Committed by Chen Wandun on May 31, 2022

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I562O9
CVE: NA
backport: openEuler-22.03-LTS

--------------------------------

The cgroup structures are all allocated by the core kernel
code at run time. They are also accessed only by the cgroup core code,
so changes made to the cgroup structure should not affect
third-party kernel modules. However, a number of important kernel
data structures do contain pointers to a cgroup structure, so
the kABI signature has to be maintained.

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: zhangjialin 00591957 <zhangjialin11@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


psi, tracepoint: introduce tracepoints for psi_memstall_{enter, leave} · 7b00adb6

Committed by Chen Wandun on May 31, 2022

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I562O9
CVE: NA
backport: openEuler-22.03-LTS

--------------------------------

Two tracepoints are added so that we can easily use other tools
such as eBPF, ftrace, and perf to monitor the memstall data
and do some analysis.

The output of these tracepoints looks like this:
      kcompactd0-58      [001] ....   902.709565: psi_memstall_enter: kcompactd
         kswapd0-132     [003] ....   902.709569: psi_memstall_leave: balance_pgdat
      kcompactd0-58      [001] ....   902.775230: psi_memstall_leave: kcompactd
         kswapd0-132     [003] ....  1337.754598: psi_memstall_enter: balance_pgdat
         kswapd0-132     [003] ....  1337.756076: psi_memstall_leave: balance_pgdat
      kcompactd0-58      [003] ....  1337.756213: psi_memstall_enter: kcompactd
      kcompactd0-58      [003] ....  1337.893188: psi_memstall_leave: kcompactd

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


psi: fix wrong iteration in iterate_groups · 915777f4

Committed by Chen Wandun on May 31, 2022

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I562O9
CVE: NA
backport: openEuler-22.03-LTS

--------------------------------

The cgroup used to update psi info is obtained differently in
cgroup v1 and cgroup v2.

task_cgroup can only be used in cgroup v1, so add a branch to handle this.

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


config: change CONFIG_DMATEST from y to m · 34722d01

Committed by Chao Liu on May 31, 2022

euler inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I59Q07
CVE: NA

--------------------------------

Simple DMA test client. Say N unless you're debugging a DMA device driver.
It doesn't need to be built into vmlinux, so change it to m.

Signed-off-by: Chao Liu <liuchao173@huawei.com>
Reviewed-by: Kai Liu <kai.liu@suse.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


perf: Fix sys_perf_event_open() race against self · a9419879

Committed by Peter Zijlstra on May 31, 2022

stable inclusion
from stable-v5.10.118
commit 3ee8e109c3c316073a3e0f83ec0769c7ee8a7375
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I593PQ
CVE: CVE-2022-1729

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3ee8e109c3c316073a3e0f83ec0769c7ee8a7375

--------------------------------

commit 3ac6487e upstream.

Norbert reported that it's possible to race sys_perf_event_open() such
that the loser ends up in a different context from the group leader,
triggering many WARNs.

The move_group case checks for races against itself, but the
!move_group case doesn't, seemingly relying on the previous
group_leader->ctx == ctx check. However, that check is racy due to not
holding any locks at that time.

Therefore, re-check the result after acquiring locks and bail
if they no longer match.

Additionally, clarify the not_move_group case from the
move_group-vs-move_group race.

Fixes: f63a8daa ("perf: Fix event->ctx locking")
Reported-by: Norbert Slusarek <nslusarek@gmx.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Reviewed-by: Yang Jihong <yangjihong1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


blk-mq: fix kabi broken by "blk-mq: Use request queue-wide tags for tagset-wide sbitmap" · b97f541e

Committed by Yufen Yu on May 31, 2022

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I597XM
CVE: NA

---------------------------
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


blk-mq: fix use-after-free in blk_mq_exit_sched · be94d1e5

Committed by Ming Lei on May 31, 2022

mainline inclusion
from mainline-v5.14-rc1
commit f0c1c4d2
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I597XM
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f0c1c4d2864ed614f90d2da1bab1a1c42907b940

--------------------------------

tagset can't be used after blk_cleanup_queue() has returned because
freeing the tagset usually follows blk_cleanup_queue(). Commit d97e594c
("blk-mq: Use request queue-wide tags for tagset-wide sbitmap") adds a
check on q->tag_set->flags in blk_mq_exit_sched(), which causes a
use-after-free.

Fix it by using hctx->flags.

Reported-by: syzbot+77ba3d171a25c56756ea@syzkaller.appspotmail.com
Fixes: d97e594c ("blk-mq: Use request queue-wide tags for tagset-wide sbitmap")
Cc: John Garry <john.garry@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: John Garry <john.garry@huawei.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20210609063046.122843-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


blk-mq: Use request queue-wide tags for tagset-wide sbitmap · 02faa4ed

Committed by John Garry on May 31, 2022

mainline inclusion
from mainline-v5.14-rc1
commit d97e594c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I597XM
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d97e594c51660bea510a387731637b894651e4b5

--------------------------------

The tags used for an IO scheduler are currently per hctx.

As such, when q->nr_hw_queues grows, so does the request queue total IO
scheduler tag depth.

This may cause problems for SCSI MQ HBAs whose total driver depth is
fixed.

Ming and Yanhui report higher CPU usage and lower throughput in scenarios
where the fixed total driver tag depth is appreciably lower than the total
scheduler tag depth:
https://lore.kernel.org/linux-block/440dfcfc-1a2c-bd98-1161-cec4d78c6dfc@huawei.com/T/#mc0d6d4f95275a2743d1c8c3e4dc9ff6c9aa3a76b

In that scenario, since the scheduler tag is got first, much contention
is introduced since a driver tag may not be available after we have got
the sched tag.

Improve this scenario by introducing request queue-wide tags for when
a tagset-wide sbitmap is used. The static sched requests are still
allocated per hctx, as requests are initialised per hctx, as in
blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...).

For simplicity of resizing the request queue sbitmap when updating the
request queue depth, just init at the max possible size, so we don't need
to deal with possibly swapping out a new sbitmap for the old one if
we need to grow.

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-3-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
conflict:
	block/blk-mq-sched.c
	block/blk-mq-sched.h
	block/blk-mq-tag.c
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


blk-mq: Some tag allocation code refactoring · ce8b3a1f

Committed by John Garry on May 31, 2022

mainline inclusion
from mainline-v5.14-rc1
commit 56b68085
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I597XM
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=56b68085e536eff2676108f2f8356889a7dbbf55

--------------------------------

The tag allocation code to alloc the sbitmap pairs is common for regular
bitmap tags and shared sbitmap, so refactor into a common function.

Also remove superfluous "flags" argument from blk_mq_init_shared_sbitmap().

Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/1620907258-30910-2-git-send-email-john.garry@huawei.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
conflict:
	block/blk-mq.c
Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


arm64: Add memmap reserve range check to avoid conflict · 6075cf62

Committed by Peng Liu on May 31, 2022

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59AN8
CVE: NA

--------------------------------

The user-specified memmap-reserve range may overlap an in-use memory
region, and it is hard for users to avoid this because of KASLR. Thus, a
memmap-reserve range that overlaps in-use memory should be ignored.
Furthermore, to be consistent with INITRD, a range that is not in a memory
region will also be ignored.

Fixes: d05cfbd9 ("arm64: Add support for memmap kernel parameters")
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
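
Illustration (not part of the patch): a userspace sketch of the two checks
described above, assuming that "in a memory region" means fully contained in
one region and that any overlap with an in-use region disqualifies the
request. The toy_* types and functions are hypothetical and are not the
memblock API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open physical range [base, base + size). */
struct toy_range {
	uint64_t base;
	uint64_t size;
};

static bool toy_overlaps(const struct toy_range *a, const struct toy_range *b)
{
	return a->base < b->base + b->size && b->base < a->base + a->size;
}

static bool toy_contained(const struct toy_range *inner,
			  const struct toy_range *outer)
{
	return inner->base >= outer->base &&
	       inner->base + inner->size <= outer->base + outer->size;
}

/*
 * Accept a requested reserve range only if it lies inside some memory
 * region and does not overlap any region that is already in use.
 */
static bool toy_reserve_ok(const struct toy_range *req,
			   const struct toy_range *mem, size_t nmem,
			   const struct toy_range *used, size_t nused)
{
	bool in_memory = false;
	size_t i;

	for (i = 0; i < nmem; i++) {
		if (toy_contained(req, &mem[i])) {
			in_memory = true;
			break;
		}
	}
	if (!in_memory)
		return false;	/* not backed by memory: ignore */

	for (i = 0; i < nused; i++) {
		if (toy_overlaps(req, &used[i]))
			return false;	/* conflicts with in-use memory: ignore */
	}
	return true;
}
```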


ext4: add reserved GDT blocks check · acba9828

Committed by Zhang Yi on May 31, 2022

hulk inclusion
category: bugfix
bugzilla: 186835, https://gitee.com/openeuler/kernel/issues/I59KJ1
CVE: NA

---------------------------

We captured a NULL pointer issue when resizing a corrupt ext4 image whose
resize_inode feature had just been cleared (without running e2fsck). It can
be reproduced with the following steps. The problem is that when the
resize_inode feature is cleared, ext4_resize_fs() converts the filesystem
to meta_bg mode, but es->s_reserved_gdt_blocks is not cleared along with
it, so we could mistakenly call reserve_backup_gdb() and pass an
uninitialized resize_inode to it when adding new group descriptors.

 mkfs.ext4 /dev/sda 3G
 tune2fs -O ^resize_inode /dev/sda #forget to run requested e2fsck
 mount /dev/sda /mnt
 resize2fs /dev/sda 8G

 ========
 BUG: kernel NULL pointer dereference, address: 0000000000000028
 CPU: 19 PID: 3243 Comm: resize2fs Not tainted 5.18.0-rc7-00001-gfde086c5ebfd #748
 ...
 RIP: 0010:ext4_flex_group_add+0xe08/0x2570
 ...
 Call Trace:
  <TASK>
  ext4_resize_fs+0xbec/0x1660
  __ext4_ioctl+0x1749/0x24e0
  ext4_ioctl+0x12/0x20
  __x64_sys_ioctl+0xa6/0x110
  do_syscall_64+0x3b/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7f2dd739617b
 ========

The fix is simple, add a check in ext4_resize_fs() to make sure that the
es->s_reserved_gdt_blocks is zero when the resize_inode feature is
disabled.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
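
Illustration (not part of the patch): a minimal sketch of the consistency
check described above, assuming a simplified superblock with only the two
fields that matter here. struct toy_super and toy_resize_allowed() are
hypothetical and are not the ext4 API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of the superblock state. */
struct toy_super {
	bool has_resize_inode;		/* resize_inode feature flag */
	uint16_t reserved_gdt_blocks;	/* models s_reserved_gdt_blocks */
};

/*
 * Refuse to resize when the feature is disabled but reserved GDT blocks
 * are still recorded: the two fields contradict each other.
 */
static bool toy_resize_allowed(const struct toy_super *sb)
{
	if (!sb->has_resize_inode && sb->reserved_gdt_blocks != 0)
		return false;
	return true;
}
```

The image produced by the reproducer above corresponds to the rejected
case: the feature bit is gone but the reserved block count is still
non-zero.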


ax25: Fix UAF bugs in ax25 timers · 07480352