- 28 April 2023, 1 commit
-
-
Committed by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2 CVE: NA -------------------------------- Since struct mempolicy is commonly used by external users, use a wrapper to fix the KABI breakage in struct mempolicy. Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
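The changelog does not carry the diff, so below is a minimal sketch of the wrapper pattern it describes, assuming the usual KABI approach of keeping the exported struct layout fixed and carrying new state in a containing struct; the wrapper name and the home_node field are illustrative assumptions, not the actual patch.

    /*
     * Illustrative sketch only -- not the actual openEuler diff.
     * The KABI-visible struct mempolicy keeps its old layout; any new
     * state (e.g. a backported home_node field, hypothetical here) is
     * carried in a private wrapper that embeds it.
     */
    struct mempolicy_wrapper {
            struct mempolicy policy;   /* layout seen by external modules stays fixed */
            int home_node;             /* hypothetical new field */
    };

    static inline struct mempolicy_wrapper *mpol_to_wrapper(struct mempolicy *pol)
    {
            return container_of(pol, struct mempolicy_wrapper, policy);
    }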
-
- 27 April 2023, 16 commits
-
-
Committed by Mathieu Desnoyers
mainline inclusion from mainline-v6.2-rc1 commit 38ce7c9b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=38ce7c9bdfc228c14d7621ba36d3eebedd9d4f76 -------------------------------- When encountering any vma in the range with policy other than MPOL_BIND or MPOL_PREFERRED_MANY, an error is returned without issuing a mpol_put on the policy just allocated with mpol_dup(). This allows arbitrary users to leak kernel memory. Link: https://lkml.kernel.org/r/20221215194621.202816-1-mathieu.desnoyers@efficios.com Fixes: c6018b4b ("mm/mempolicy: add set_mempolicy_home_node syscall") Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: NRandy Dunlap <rdunlap@infradead.org> Reviewed-by: N"Huang, Ying" <ying.huang@intel.com> Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: <stable@vger.kernel.org> [5.17+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
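The fix itself is essentially one missing mpol_put(); a hedged sketch of its shape, reconstructed from the changelog rather than copied from the upstream diff:

    /* Inside set_mempolicy_home_node(), while walking the VMAs in range: */
    new = mpol_dup(vma_policy(vma));
    if (IS_ERR(new)) {
            err = PTR_ERR(new);
            break;
    }
    if (new->mode != MPOL_BIND && new->mode != MPOL_PREFERRED_MANY) {
            mpol_put(new);          /* the missing put that caused the leak */
            err = -EOPNOTSUPP;
            break;
    }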
-
Committed by Arnaldo Carvalho de Melo
mainline inclusion from mainline-v5.17-rc1 commit 6e10e219 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6e10e21915c1ab6eaa145f7b5ebaf4500af1b011 -------------------------------- To pick the changes in these csets: 21b084fd ("mm/mempolicy: wire up syscall set_mempolicy_home_node") That add support for this new syscall in tools such as 'perf trace'. For instance, this is now possible: [root@five ~]# perf trace -e set_mempolicy_home_node ^C[root@five ~]# [root@five ~]# perf trace -v -e set_mempolicy_home_node Using CPUID AuthenticAMD-25-21-0 event qualifier tracepoint filter: (common_pid != 253729 && common_pid != 3585) && (id == 450) mmap size 528384B ^C[root@five ~] [root@five ~]# perf trace -v -e set* --max-events 5 Using CPUID AuthenticAMD-25-21-0 event qualifier tracepoint filter: (common_pid != 253734 && common_pid != 3585) && (id == 38 || id == 54 || id == 105 || id == 106 || id == 109 || id == 112 || id == 113 || id == 114 || id == 116 || id == 117 || id == 119 || id == 122 || id == 123 || id == 141 || id == 160 || id == 164 || id == 170 || id == 171 || id == 188 || id == 205 || id == 218 || id == 238 || id == 273 || id == 308 || id == 450) mmap size 528384B 0.000 ( 0.008 ms): bash/253735 setpgid(pid: 253735 (bash), pgid: 253735 (bash)) = 0 6849.011 ( 0.008 ms): bash/16046 setpgid(pid: 253736 (bash), pgid: 253736 (bash)) = 0 6849.080 ( 0.005 ms): bash/253736 setpgid(pid: 253736 (bash), pgid: 253736 (bash)) = 0 7437.718 ( 0.009 ms): gnome-shell/253737 set_robust_list(head: 0x7f34b527e920, len: 24) = 0 13445.986 ( 0.010 ms): bash/16046 setpgid(pid: 253738 (bash), pgid: 253738 (bash)) = 0 [root@five ~]# That is the filter expression attached to the raw_syscalls:sys_{enter,exit} tracepoints. 
$ find tools/perf/arch/ -name "syscall*tbl" | xargs grep -w set_mempolicy_home_node tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl:450 common set_mempolicy_home_node sys_set_mempolicy_home_node tools/perf/arch/powerpc/entry/syscalls/syscall.tbl:450 nospu set_mempolicy_home_node sys_set_mempolicy_home_node tools/perf/arch/s390/entry/syscalls/syscall.tbl:450 common set_mempolicy_home_node sys_set_mempolicy_home_node sys_set_mempolicy_home_node tools/perf/arch/x86/entry/syscalls/syscall_64.tbl:450 common set_mempolicy_home_node sys_set_mempolicy_home_node $ $ grep -w set_mempolicy_home_node /tmp/build/perf/arch/x86/include/generated/asm/syscalls_64.c [450] = "set_mempolicy_home_node", $ This addresses these perf build warnings: Warning: Kernel ABI header at 'tools/include/uapi/asm-generic/unistd.h' differs from latest version at 'include/uapi/asm-generic/unistd.h' diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h Warning: Kernel ABI header at 'tools/perf/arch/x86/entry/syscalls/syscall_64.tbl' differs from latest version at 'arch/x86/entry/syscalls/syscall_64.tbl' diff -u tools/perf/arch/x86/entry/syscalls/syscall_64.tbl arch/x86/entry/syscalls/syscall_64.tbl Warning: Kernel ABI header at 'tools/perf/arch/powerpc/entry/syscalls/syscall.tbl' differs from latest version at 'arch/powerpc/kernel/syscalls/syscall.tbl' diff -u tools/perf/arch/powerpc/entry/syscalls/syscall.tbl arch/powerpc/kernel/syscalls/syscall.tbl Warning: Kernel ABI header at 'tools/perf/arch/s390/entry/syscalls/syscall.tbl' differs from latest version at 'arch/s390/kernel/syscalls/syscall.tbl' diff -u tools/perf/arch/s390/entry/syscalls/syscall.tbl arch/s390/kernel/syscalls/syscall.tbl Warning: Kernel ABI header at 'tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl' differs from latest version at 'arch/mips/kernel/syscalls/syscall_n64.tbl' diff -u tools/perf/arch/mips/entry/syscalls/syscall_n64.tbl arch/mips/kernel/syscalls/syscall_n64.tbl Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
-
Committed by Feng Tang
mainline inclusion from mainline-v6.1-rc1 commit d2226ebd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d2226ebd5484afcf9f9b71b394ec1567a7730eb1 -------------------------------- Muchun Song found that after the MPOL_PREFERRED_MANY policy was introduced in commit b27abacc ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes"), the semantics of policy_nodemask_current() for this new policy changed: it now returns 'preferred' nodes instead of 'allowed' nodes. With the changed semantics of policy_nodemask_current(), a task with the MPOL_PREFERRED_MANY policy could fail to get its reservation even though it can fall back to other nodes (either defined by cpusets or all online nodes) for that reservation, failing mmap calls unnecessarily early. The fix is to not consider MPOL_PREFERRED_MANY for reservations at all, because it, unlike MPOL_BIND, does not pose any actual hard constraint. Michal suggested that policy_nodemask_current() is only used by hugetlb and could be moved into the hugetlb code with a more explicit name to enforce the 'allowed' semantics, for which only the MPOL_BIND policy matters. apply_policy_zone() is made extern so it can be called from hugetlb code, and its return value is changed to bool. [1]. https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/ Link: https://lkml.kernel.org/r/20220805005903.95563-1-feng.tang@intel.com Fixes: b27abacc ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes") Signed-off-by: Feng Tang <feng.tang@intel.com> Reported-by: Muchun Song <songmuchun@bytedance.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ben Widawsky <bwidawsk@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: include/linux/mempolicy.h mm/hugetlb.c Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
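A hedged sketch of the hugetlb-local helper this commit describes (the 'allowed' semantics where only MPOL_BIND matters); the helper name and the simplified body are reconstructed from the commit description and may differ from the backport:

    static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
    {
    #ifdef CONFIG_NUMA
            struct mempolicy *mpol = get_task_policy(current);

            /* Only MPOL_BIND expresses a hard "allowed nodes" constraint. */
            if (mpol->mode == MPOL_BIND &&
                apply_policy_zone(mpol, gfp_zone(gfp)))
                    return &mpol->nodes;
    #endif
            return NULL;
    }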
-
Committed by Aneesh Kumar K.V
mainline inclusion from mainline-v5.17-rc1 commit 21b084fd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=21b084fdf2a49ca1634e8e360e9ab6f9ff0dee11 -------------------------------- Link: https://lkml.kernel.org/r/20211202123810.267175-4-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: <linux-api@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
-
Committed by Aneesh Kumar K.V
mainline inclusion from mainline-v5.17-rc1 commit c6018b4b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c6018b4b254971863bd0ad36bb5e7d0fa0f0ddb0 -------------------------------- This syscall can be used to set a home node for the MPOL_BIND and MPOL_PREFERRED_MANY memory policy. Users should use this syscall after setting up a memory policy for the specified range as shown below. mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp, new_nodes->size + 1, 0); sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size, home_node, 0); The syscall allows specifying a home node/preferred node from which kernel will fulfill memory allocation requests first. For address range with MPOL_BIND memory policy, if nodemask specifies more than one node, page allocations will come from the node in the nodemask with sufficient free memory that is closest to the home node/preferred node. For MPOL_PREFERRED_MANY if the nodemask specifies more than one node, page allocation will come from the node in the nodemask with sufficient free memory that is closest to the home node/preferred node. If there is not enough memory in all the nodes specified in the nodemask, the allocation will be attempted from the closest numa node to the home node in the system. This helps applications to hint at a memory allocation preference node and fallback to _only_ a set of nodes if the memory is not available on the preferred node. Fallback allocation is attempted from the node which is nearest to the preferred node. This helps applications to have control on memory allocation numa nodes and avoids default fallback to slow memory NUMA nodes. For example a system with NUMA nodes 1,2 and 3 with DRAM memory and 10, 11 and 12 of slow memory new_nodes = numa_bitmask_alloc(nr_nodes); numa_bitmask_setbit(new_nodes, 1); numa_bitmask_setbit(new_nodes, 2); numa_bitmask_setbit(new_nodes, 3); p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0); mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp, new_nodes->size + 1, 0); sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0); This will allocate from nodes closer to node 2 and will make sure the kernel will only allocate from nodes 1, 2, and 3. Memory will not be allocated from slow memory nodes 10, 11, and 12. This differs from default MPOL_BIND behavior in that with default MPOL_BIND the allocation will be attempted from node closer to the local node. One of the reasons to specify a home node is to allow allocations from cpu less NUMA node and its nearby NUMA nodes. With MPOL_PREFERRED_MANY on the other hand will first try to allocate from the closest node to node 2 from the node list 1, 2 and 3. If those nodes don't have enough memory, kernel will allocate from slow memory node 10, 11 and 12 which ever is closer to node 2. 
Link: https://lkml.kernel.org/r/20211202123810.267175-3-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: <linux-api@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: include/linux/mempolicy.h mm/mempolicy.c Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
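Putting the snippet above into a self-contained userspace program looks roughly like this; syscall number 450 is taken from the syscall tables quoted in the perf commit later in this series, and a raw syscall() is used because a libc wrapper may not exist yet (link with -lnuma for mbind()):

    #define _GNU_SOURCE
    #include <numaif.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef __NR_set_mempolicy_home_node
    #define __NR_set_mempolicy_home_node 450
    #endif

    int main(void)
    {
            size_t len = 64 * 4096;
            /* nodes 1, 2 and 3, as in the example above */
            unsigned long nodes = (1UL << 1) | (1UL << 2) | (1UL << 3);
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            /* Restrict the range to nodes 1-3, then hint node 2 as the home node. */
            if (mbind(p, len, MPOL_BIND, &nodes, 8 * sizeof(nodes), 0))
                    return 1;
            if (syscall(__NR_set_mempolicy_home_node,
                        (unsigned long)p, len, 2UL, 0UL))
                    return 1;
            return 0;
    }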
-
Committed by Aneesh Kumar K.V
mainline inclusion from mainline-v5.17-rc1 commit c0455116 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c04551162167368022a61899843821bbf015b473 -------------------------------- Patch series "mm: add new syscall set_mempolicy_home_node", v6. This patch (of 3): A followup patch will enable setting a home node with MPOL_PREFERRED_MANY memory policy. To facilitate that switch to using policy_node helper. There is no functional change in this patch. Link: https://lkml.kernel.org/r/20211202123810.267175-1-aneesh.kumar@linux.ibm.com Link: https://lkml.kernel.org/r/20211202123810.267175-2-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: <linux-api@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
-
Committed by Ben Widawsky
mainline inclusion from mainline-v5.15-rc1 commit a38a59fd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a38a59fdfa10be55d08e4530923d950e739ac6a2 -------------------------------- Adds a new mode to the existing mempolicy modes, MPOL_PREFERRED_MANY. MPOL_PREFERRED_MANY will be adequately documented in the internal admin-guide with this patch. Eventually, the man pages for mbind(2), get_mempolicy(2), set_mempolicy(2) and numactl(8) will also have text about this mode. Those shall contain the canonical reference. NUMA systems continue to become more prevalent. New technologies like PMEM make finer grain control over memory access patterns increasingly desirable. MPOL_PREFERRED_MANY allows userspace to specify a set of nodes that will be tried first when performing allocations. If those allocations fail, all remaining nodes will be tried. It's a straight forward API which solves many of the presumptive needs of system administrators wanting to optimize workloads on such machines. The mode will work either per VMA, or per thread. [Michal Hocko: refine kernel doc for MPOL_PREFERRED_MANY] Link: https://lore.kernel.org/r/20200630212517.308045-13-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-5-git-send-email-feng.tang@intel.comSigned-off-by: NBen Widawsky <ben.widawsky@intel.com> Signed-off-by: NFeng Tang <feng.tang@intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
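A minimal userspace sketch of the new mode, assuming a numaif.h that may not define MPOL_PREFERRED_MANY yet (the fallback value 5 matches the upstream uapi header; link with -lnuma):

    #include <numaif.h>

    #ifndef MPOL_PREFERRED_MANY
    #define MPOL_PREFERRED_MANY 5      /* value from the upstream uapi header */
    #endif

    /*
     * Prefer nodes 0 and 3; unlike MPOL_BIND, allocation falls back to the
     * remaining nodes instead of invoking the OOM killer when they are full.
     */
    static int prefer_nodes_0_and_3(void)
    {
            unsigned long nodes = (1UL << 0) | (1UL << 3);

            return set_mempolicy(MPOL_PREFERRED_MANY, &nodes, 8 * sizeof(nodes));
    }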
-
Committed by Ben Widawsky
mainline inclusion from mainline-v5.15-rc1 commit cfcaa66f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cfcaa66f803233c50e17239469f6c96136a673a1 -------------------------------- Implement the missing huge page allocation functionality while obeying the preferred node semantics. This is similar to the implementation for general page allocation, as it uses a fallback mechanism to try multiple preferred nodes first, and then all other nodes. To avoid adding too many "#ifdef CONFIG_NUMA" check, add a helper function in mempolicy.h to check whether a mempolicy is MPOL_PREFERRED_MANY. [akpm@linux-foundation.org: fix compiling issue when merging with other hugetlb patch] [Thanks to 0day bot for catching the !CONFIG_NUMA compiling issue] [mhocko@suse.com: suggest to remove the #ifdef CONFIG_NUMA check] [ben.widawsky@intel.com: add helpers to avoid ifdefs] Link: https://lore.kernel.org/r/20200630212517.308045-12-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-4-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/20210809024430.GA46432@shbuild999.sh.intel.com [nathan@kernel.org: initialize page to NULL in alloc_buddy_huge_page_with_mpol()] Link: https://lkml.kernel.org/r/20210810200632.3812797-1-nathan@kernel.org Link: https://lore.kernel.org/r/20200630212517.308045-12-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-4-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/20210809024430.GA46432@shbuild999.sh.intel.comSigned-off-by: NBen Widawsky <ben.widawsky@intel.com> Signed-off-by: NFeng Tang <feng.tang@intel.com> Signed-off-by: NNathan Chancellor <nathan@kernel.org> Co-developed-by: NFeng Tang <feng.tang@intel.com> Suggested-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: include/linux/mempolicy.h mm/hugetlb.c Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
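The helper mentioned above lives in mempolicy.h so that callers need no "#ifdef CONFIG_NUMA"; a hedged sketch of its shape, reconstructed from the commit description (treat the exact name and placement as an assumption):

    #ifdef CONFIG_NUMA
    static inline bool mpol_is_preferred_many(struct mempolicy *pol)
    {
            return pol->mode == MPOL_PREFERRED_MANY;
    }
    #else
    static inline bool mpol_is_preferred_many(struct mempolicy *pol)
    {
            return false;           /* no NUMA policies without CONFIG_NUMA */
    }
    #endif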
-
Committed by Feng Tang
mainline inclusion from mainline-v5.15-rc1 commit 4c54d949 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4c54d94908e089e9741513797eac30a8b8217034 -------------------------------- The semantics of MPOL_PREFERRED_MANY are similar to MPOL_PREFERRED, in that it will first try to allocate memory from the preferred node(s) and fall back to all nodes in the system when the first try fails. Add a dedicated function alloc_pages_preferred_many() for it, just like for the 'interleave' policy; it will be used by the two general memory allocation APIs alloc_pages() and alloc_pages_vma(). Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-3-git-send-email-feng.tang@intel.com Suggested-by: Michal Hocko <mhocko@suse.com> Originally-by: Ben Widawsky <ben.widawsky@intel.com> Co-developed-by: Ben Widawsky <ben.widawsky@intel.com> Signed-off-by: Ben Widawsky <ben.widawsky@intel.com> Signed-off-by: Feng Tang <feng.tang@intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
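A simplified sketch of the two-pass allocation the commit describes; the mainline function uses __alloc_pages(), while a 5.10-based backport would more likely go through __alloc_pages_nodemask(), so treat the entry point as an assumption:

    static struct page *alloc_pages_preferred_many(gfp_t gfp, unsigned int order,
                                                   int nid, struct mempolicy *pol)
    {
            gfp_t preferred_gfp;
            struct page *page;

            /* First pass: the preferred nodes only, without heavy reclaim or OOM. */
            preferred_gfp = gfp | __GFP_NOWARN;
            preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
            page = __alloc_pages(preferred_gfp, order, nid, &pol->nodes);
            if (!page)
                    /* Second pass: fall back to all nodes in the system. */
                    page = __alloc_pages(gfp, order, nid, NULL);

            return page;
    }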
-
Committed by Dave Hansen
mainline inclusion from mainline-v5.15-rc1 commit b27abacc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b27abaccf8e8b012f126da0c2a1ab32723ec8b9f -------------------------------- Patch series "Introduce multi-preference mempolicy", v7. This patch series introduces the concept of the MPOL_PREFERRED_MANY mempolicy. This mempolicy mode can be used with either the set_mempolicy(2) or mbind(2) interfaces. Like the MPOL_PREFERRED interface, it allows an application to set a preference for nodes which will fulfil memory allocation requests. Unlike the MPOL_PREFERRED mode, it takes a set of nodes. Like the MPOL_BIND interface, it works over a set of nodes. Unlike MPOL_BIND, it will not cause a SIGSEGV or invoke the OOM killer if those preferred nodes are not available. Along with these patches are patches for libnuma, numactl, numademo, and memhog. They still need some polish, but can be found here: https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many It allows new usage: `numactl -P 0,3,4` The goal of the new mode is to enable some use-cases when using tiered memory usage models which I've lovingly named. 1a. The Hare - The interconnect is fast enough to meet bandwidth and latency requirements allowing preference to be given to all nodes with "fast" memory. 1b. The Indiscriminate Hare - An application knows it wants fast memory (or perhaps slow memory), but doesn't care which node it runs on. The application can prefer a set of nodes and then xpu bind to the local node (cpu, accelerator, etc). This reverses the nodes are chosen today where the kernel attempts to use local memory to the CPU whenever possible. This will attempt to use the local accelerator to the memory. 2. The Tortoise - The administrator (or the application itself) is aware it only needs slow memory, and so can prefer that. Much of this is almost achievable with the bind interface, but the bind interface suffers from an inability to fallback to another set of nodes if binding fails to all nodes in the nodemask. Like MPOL_BIND a nodemask is given. Inherently this removes ordering from the preference. > /* Set first two nodes as preferred in an 8 node system. */ > const unsigned long nodes = 0x3 > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8); > /* Mimic interleave policy, but have fallback *. > const unsigned long nodes = 0xaa > set_mempolicy(MPOL_PREFER_MANY, &nodes, 8); Some internal discussion took place around the interface. There are two alternatives which we have discussed, plus one I stuck in: 1. Ordered list of nodes. Currently it's believed that the added complexity is nod needed for expected usecases. 2. A flag for bind to allow falling back to other nodes. This confuses the notion of binding and is less flexible than the current solution. 3. Create flags or new modes that helps with some ordering. This offers both a friendlier API as well as a solution for more customized usage. It's unknown if it's worth the complexity to support this. 
Here is sample code for how this might work: > // Prefer specific nodes for some something wacky > set_mempolicy(MPOL_PREFER_MANY, 0x17c, 1024); > > // Default > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0); > // which is the same as > set_mempolicy(MPOL_DEFAULT, NULL, 0); > > // The Hare > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0); > > // The Tortoise > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0); > > // Prefer the fast memory of the first two sockets > set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2); > This patch (of 5): The NUMA APIs currently allow passing in a "preferred node" as a single bit set in a nodemask. If more than one bit it set, bits after the first are ignored. This single node is generally OK for location-based NUMA where memory being allocated will eventually be operated on by a single CPU. However, in systems with multiple memory types, folks want to target a *type* of memory instead of a location. For instance, someone might want some high-bandwidth memory but do not care about the CPU next to which it is allocated. Or, they want a cheap, high capacity allocation and want to target all NUMA nodes which have persistent memory in volatile mode. In both of these cases, the application wants to target a *set* of nodes, but does not want strict MPOL_BIND behavior as that could lead to OOM killer or SIGSEGV. So add MPOL_PREFERRED_MANY policy to support the multiple preferred nodes requirement. This is not a pie-in-the-sky dream for an API. This was a response to a specific ask of more than one group at Intel. Specifically: 1. There are existing libraries that target memory types such as https://github.com/memkind/memkind. These are known to suffer from SIGSEGV's when memory is low on targeted memory "kinds" that span more than one node. The MCDRAM on a Xeon Phi in "Cluster on Die" mode is an example of this. 2. Volatile-use persistent memory users want to have a memory policy which is targeted at either "cheap and slow" (PMEM) or "expensive and fast" (DRAM). However, they do not want to experience allocation failures when the targeted type is unavailable. 3. Allocate-then-run. Generally, we let the process scheduler decide on which physical CPU to run a task. That location provides a default allocation policy, and memory availability is not generally considered when placing tasks. For situations where memory is valuable and constrained, some users want to allocate memory first, *then* allocate close compute resources to the allocation. This is the reverse of the normal (CPU) model. Accelerators such as GPUs that operate on core-mm-managed memory are interested in this model. A check is added in sanitize_mpol_flags() to not permit 'prefer_many' policy to be used for now, and will be removed in later patch after all implementations for 'prefer_many' are ready, as suggested by Michal Hocko. 
[mhocko@kernel.org: suggest to refine policy_node/policy_nodemask handling] Link: https://lkml.kernel.org/r/1627970362-61305-1-git-send-email-feng.tang@intel.com Link: https://lore.kernel.org/r/20200630212517.308045-4-ben.widawsky@intel.com Link: https://lkml.kernel.org/r/1627970362-61305-2-git-send-email-feng.tang@intel.comCo-developed-by: NBen Widawsky <ben.widawsky@intel.com> Signed-off-by: NBen Widawsky <ben.widawsky@intel.com> Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Signed-off-by: NFeng Tang <feng.tang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com>b Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/mempolicy.c Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
-
Committed by Feng Tang
mainline inclusion from mainline-v5.14-rc1 commit 95837924 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=95837924587c60425f941dc8cbfba61cb964fcb5 -------------------------------- Currently the kernel_mbind() and kernel_set_mempolicy() do almost the same operation for parameter sanity check. Add a helper function to unify the code to reduce the redundancy, and make it easier for changing the sanity check code in future. [thanks to David Rientjes for suggesting using helper function instead of macro]. [feng.tang@intel.com: add comment] Link: https://lkml.kernel.org/r/1622560492-1294-4-git-send-email-feng.tang@intel.com Link: https://lkml.kernel.org/r/1622469956-82897-4-git-send-email-feng.tang@intel.comSigned-off-by: NFeng Tang <feng.tang@intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
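A hedged sketch of the unified sanity-check helper, simplified from memory of the upstream patch, so the exact checks and ordering may differ:

    /* Shared by kernel_mbind() and kernel_set_mempolicy(). */
    static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
    {
            *flags = *mode & MPOL_MODE_FLAGS;
            *mode &= ~MPOL_MODE_FLAGS;

            if ((unsigned int)(*mode) >= MPOL_MAX)
                    return -EINVAL;
            if ((*flags & MPOL_F_STATIC_NODES) && (*flags & MPOL_F_RELATIVE_NODES))
                    return -EINVAL;
            return 0;
    }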
-
Committed by Feng Tang
mainline inclusion from mainline-v5.14-rc1 commit 7858d7bc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7858d7bca7fbbbbd5b940d2ec371b2d060b21b84 -------------------------------- MPOL_LOCAL policy has been setup as a real policy, but it is still handled like a faked POL_PREFERRED policy with one internal MPOL_F_LOCAL flag bit set, and there are many places having to judge the real 'prefer' or the 'local' policy, which are quite confusing. In current code, there are 4 cases that MPOL_LOCAL are used: 1. user specifies 'local' policy 2. user specifies 'prefer' policy, but with empty nodemask 3. system 'default' policy is used 4. 'prefer' policy + valid 'preferred' node with MPOL_F_STATIC_NODES flag set, and when it is 'rebind' to a nodemask which doesn't contains the 'preferred' node, it will perform as 'local' policy So make 'local' a real policy instead of a fake 'prefer' one, and kill MPOL_F_LOCAL bit, which can greatly reduce the confusion for code reading. For case 4, the logic of mpol_rebind_preferred() is confusing, as Michal Hocko pointed out: : I do believe that rebinding preferred policy is just bogus and it should : be dropped altogether on the ground that a preference is a mere hint from : userspace where to start the allocation. Unless I am missing something : cpusets will be always authoritative for the final placement. The : preferred node just acts as a starting point and it should be really : preserved when cpusets changes. Otherwise we have a very subtle behavior : corner cases. So dump all the tricky transformation between 'prefer' and 'local', and just record the new nodemask of rebinding. [feng.tang@intel.com: fix a problem in mpol_set_nodemask(), per Michal Hocko] Link: https://lkml.kernel.org/r/1622560492-1294-3-git-send-email-feng.tang@intel.com [feng.tang@intel.com: refine code and comments of mpol_set_nodemask(), per Michal] Link: https://lkml.kernel.org/r/20210603081807.GE56979@shbuild999.sh.intel.com Link: https://lkml.kernel.org/r/1622469956-82897-3-git-send-email-feng.tang@intel.comSigned-off-by: NFeng Tang <feng.tang@intel.com> Suggested-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Ben Widawsky <ben.widawsky@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com>
-
Committed by openeuler-ci-bot
Merge Pull Request from: @chenke1978 [Description] In the ROH distributed scenario, the EID is allocated in DHCP mode. The driver needs to convert the original MAC address to EID format and update the destination MAC, chaddr and client id (if present) when transmitting DHCP packets. Meanwhile, the chaddr field should follow the source MAC address so that the DHCP server replies to the right client. Because the DHCP packet payload changes, the L4 checksum must be recalculated as well. [Testing] kernel options: CONFIG_ROH=m CONFIG_ROH_HNS=m Test passed with the steps below: 1. Load the NIC/RoCE/ROH driver normally on the ROH device. 2. Prepare the DHCP server and DHCP client applications. 3. Enable the DHCP service on the server node. 4. Execute the DHCP client application on the client node. 5. Check the DHCP process and wait until DHCP IP address allocation is complete. 6. Communication works normally with the new IP address. Link: https://gitee.com/openeuler/kernel/pulls/381 Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Ke Chen
driver inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6BSMN ----------------------------------------------------------------------- In the ROH distributed scenario, the EID is allocated in DHCP mode. The driver needs to convert the original MAC address to EID format and update the destination MAC, chaddr and client id (if present) when transmitting DHCP packets. Meanwhile, the chaddr field should follow the source MAC address so that the DHCP server replies to the right client. Because the DHCP packet payload changes, the L4 checksum must be recalculated as well. Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Ke Chen <chenke54@huawei.com>
-
Committed by openeuler-ci-bot
Merge Pull Request from: @stinft Support the driver getting num_xrcds and reserved_xrcds from firmware. #I6WAZI Link: https://gitee.com/openeuler/kernel/pulls/617 Reviewed-by: Chengchang Tang <tangchengchang@huawei.com> Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Luoyouming
driver inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I6WAZI --------------------------------------------------------------- Support the driver getting num_xrcds and reserved_xrcds from firmware. Signed-off-by: Luoyouming <luoyouming@huawei.com> Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
-
- 26 April 2023, 23 commits
-
-
Committed by openeuler-ci-bot
Merge Pull Request from: @zhangjialin11 Pull new CVEs: CVE-2023-1855, CVE-2023-2006, CVE-2023-30772, CVE-2023-1872; net bugfixes from Ziyang Xuan; mm cleanup from Ma Wupeng; timer bugfix from Yu Liao; xfs bugfixes from Guo Xuenan. Link: https://gitee.com/openeuler/kernel/pulls/633 Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
-
Committed by Ido Schimmel
mainline inclusion from mainline-v6.3 commit c484fcc0 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6WNGK CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c484fcc058bada604d7e4e5228d4affb646ddbc2 --------------------------- When a net device is put administratively up, its 'IFF_UP' flag is set (if not set already) and a 'NETDEV_UP' notification is emitted, which causes the 8021q driver to add VLAN ID 0 on the device. The reverse happens when a net device is put administratively down. When changing the type of a bond to Ethernet, its 'IFF_UP' flag is incorrectly cleared, resulting in the kernel skipping the above process and VLAN ID 0 being leaked [1]. Fix by restoring the flag when changing the type to Ethernet, in a similar fashion to the restoration of the 'IFF_SLAVE' flag. The issue can be reproduced using the script in [2], with example out before and after the fix in [3]. [1] unreferenced object 0xffff888103479900 (size 256): comm "ip", pid 329, jiffies 4294775225 (age 28.561s) hex dump (first 32 bytes): 00 a0 0c 15 81 88 ff ff 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff81a6051a>] kmalloc_trace+0x2a/0xe0 [<ffffffff8406426c>] vlan_vid_add+0x30c/0x790 [<ffffffff84068e21>] vlan_device_event+0x1491/0x21a0 [<ffffffff81440c8e>] notifier_call_chain+0xbe/0x1f0 [<ffffffff8372383a>] call_netdevice_notifiers_info+0xba/0x150 [<ffffffff837590f2>] __dev_notify_flags+0x132/0x2e0 [<ffffffff8375ad9f>] dev_change_flags+0x11f/0x180 [<ffffffff8379af36>] do_setlink+0xb96/0x4060 [<ffffffff837adf6a>] __rtnl_newlink+0xc0a/0x18a0 [<ffffffff837aec6c>] rtnl_newlink+0x6c/0xa0 [<ffffffff837ac64e>] rtnetlink_rcv_msg+0x43e/0xe00 [<ffffffff839a99e0>] netlink_rcv_skb+0x170/0x440 [<ffffffff839a738f>] netlink_unicast+0x53f/0x810 [<ffffffff839a7fcb>] netlink_sendmsg+0x96b/0xe90 [<ffffffff8369d12f>] ____sys_sendmsg+0x30f/0xa70 [<ffffffff836a6d7a>] ___sys_sendmsg+0x13a/0x1e0 unreferenced object 0xffff88810f6a83e0 (size 32): comm "ip", pid 329, jiffies 4294775225 (age 28.561s) hex dump (first 32 bytes): a0 99 47 03 81 88 ff ff a0 99 47 03 81 88 ff ff ..G.......G..... 81 00 00 00 01 00 00 00 cc cc cc cc cc cc cc cc ................ 
backtrace: [<ffffffff81a6051a>] kmalloc_trace+0x2a/0xe0 [<ffffffff84064369>] vlan_vid_add+0x409/0x790 [<ffffffff84068e21>] vlan_device_event+0x1491/0x21a0 [<ffffffff81440c8e>] notifier_call_chain+0xbe/0x1f0 [<ffffffff8372383a>] call_netdevice_notifiers_info+0xba/0x150 [<ffffffff837590f2>] __dev_notify_flags+0x132/0x2e0 [<ffffffff8375ad9f>] dev_change_flags+0x11f/0x180 [<ffffffff8379af36>] do_setlink+0xb96/0x4060 [<ffffffff837adf6a>] __rtnl_newlink+0xc0a/0x18a0 [<ffffffff837aec6c>] rtnl_newlink+0x6c/0xa0 [<ffffffff837ac64e>] rtnetlink_rcv_msg+0x43e/0xe00 [<ffffffff839a99e0>] netlink_rcv_skb+0x170/0x440 [<ffffffff839a738f>] netlink_unicast+0x53f/0x810 [<ffffffff839a7fcb>] netlink_sendmsg+0x96b/0xe90 [<ffffffff8369d12f>] ____sys_sendmsg+0x30f/0xa70 [<ffffffff836a6d7a>] ___sys_sendmsg+0x13a/0x1e0 [2] ip link add name t-nlmon type nlmon ip link add name t-dummy type dummy ip link add name t-bond type bond mode active-backup ip link set dev t-bond up ip link set dev t-nlmon master t-bond ip link set dev t-nlmon nomaster ip link show dev t-bond ip link set dev t-dummy master t-bond ip link show dev t-bond ip link del dev t-bond ip link del dev t-dummy ip link del dev t-nlmon [3] Before: 12: t-bond: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 link/netlink 12: t-bond: <BROADCAST,MULTICAST,MASTER,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 46:57:39:a4:46:a2 brd ff:ff:ff:ff:ff:ff After: 12: t-bond: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 link/netlink 12: t-bond: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 66:48:7b:74:b6:8a brd ff:ff:ff:ff:ff:ff Fixes: e36b9d16 ("bonding: clean muticast addresses when device changes type") Fixes: 75c78500 ("bonding: remap muticast addresses without using dev_close() and dev_open()") Fixes: 9ec7eb60 ("bonding: restore IFF_MASTER/SLAVE flags on bond enslave ether type change") Reported-by: NMirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr> Link: https://lore.kernel.org/netdev/78a8a03b-6070-3e6b-5042-f848dab16fb8@alu.unizg.hr/Tested-by: NMirsad Goran Todorovac <mirsad.todorovac@alu.unizg.hr> Signed-off-by: NIdo Schimmel <idosch@nvidia.com> Acked-by: NJay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NZiyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: NYue Haibing <yuehaibing@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
Committed by Nikolay Aleksandrov
mainline inclusion from mainline-v6.3-rc3 commit e667d469 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6WNGK CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e667d469098671261d558be0cd93dca4d285ce1e --------------------------- syzbot reported a warning[1] where the bond device itself is a slave and we try to enslave a non-ethernet device as the first slave which fails but then in the error path when ether_setup() restores the bond device it also clears all flags. In my previous fix[2] I restored the IFF_MASTER flag, but I didn't consider the case that the bond device itself might also be a slave with IFF_SLAVE set, so we need to restore that flag as well. Use the bond_ether_setup helper which does the right thing and restores the bond's flags properly. Steps to reproduce using a nlmon dev: $ ip l add nlmon0 type nlmon $ ip l add bond1 type bond $ ip l add bond2 type bond $ ip l set bond1 master bond2 $ ip l set dev nlmon0 master bond1 $ ip -d l sh dev bond1 22: bond1: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noqueue master bond2 state DOWN mode DEFAULT group default qlen 1000 (now bond1's IFF_SLAVE flag is gone and we'll hit a warning[3] if we try to delete it) [1] https://syzkaller.appspot.com/bug?id=391c7b1f6522182899efba27d891f1743e8eb3ef [2] commit 7d5cd2ce ("bonding: correctly handle bonding type change on enslave failure") [3] example warning: [ 27.008664] bond1: (slave nlmon0): The slave device specified does not support setting the MAC address [ 27.008692] bond1: (slave nlmon0): Error -95 calling set_mac_address [ 32.464639] bond1 (unregistering): Released all slaves [ 32.464685] ------------[ cut here ]------------ [ 32.464686] WARNING: CPU: 1 PID: 2004 at net/core/dev.c:10829 unregister_netdevice_many+0x72a/0x780 [ 32.464694] Modules linked in: br_netfilter bridge bonding virtio_net [ 32.464699] CPU: 1 PID: 2004 Comm: ip Kdump: loaded Not tainted 5.18.0-rc3+ #47 [ 32.464703] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.1-2.fc37 04/01/2014 [ 32.464704] RIP: 0010:unregister_netdevice_many+0x72a/0x780 [ 32.464707] Code: 99 fd ff ff ba 90 1a 00 00 48 c7 c6 f4 02 66 96 48 c7 c7 20 4d 35 96 c6 05 fa c7 2b 02 01 e8 be 6f 4a 00 0f 0b e9 73 fd ff ff <0f> 0b e9 5f fd ff ff 80 3d e3 c7 2b 02 00 0f 85 3b fd ff ff ba 59 [ 32.464710] RSP: 0018:ffffa006422d7820 EFLAGS: 00010206 [ 32.464712] RAX: ffff8f6e077140a0 RBX: ffffa006422d7888 RCX: 0000000000000000 [ 32.464714] RDX: ffff8f6e12edbe58 RSI: 0000000000000296 RDI: ffffffff96d4a520 [ 32.464716] RBP: ffff8f6e07714000 R08: ffffffff96d63600 R09: ffffa006422d7728 [ 32.464717] R10: 0000000000000ec0 R11: ffffffff9698c988 R12: ffff8f6e12edb140 [ 32.464719] R13: dead000000000122 R14: dead000000000100 R15: ffff8f6e12edb140 [ 32.464723] FS: 00007f297c2f1740(0000) GS:ffff8f6e5d900000(0000) knlGS:0000000000000000 [ 32.464725] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 32.464726] CR2: 00007f297bf1c800 CR3: 00000000115e8000 CR4: 0000000000350ee0 [ 32.464730] Call Trace: [ 32.464763] <TASK> [ 32.464767] rtnl_dellink+0x13e/0x380 [ 32.464776] ? cred_has_capability.isra.0+0x68/0x100 [ 32.464780] ? __rtnl_unlock+0x33/0x60 [ 32.464783] ? bpf_lsm_capset+0x10/0x10 [ 32.464786] ? security_capable+0x36/0x50 [ 32.464790] rtnetlink_rcv_msg+0x14e/0x3b0 [ 32.464792] ? _copy_to_iter+0xb1/0x790 [ 32.464796] ? post_alloc_hook+0xa0/0x160 [ 32.464799] ? 
rtnl_calcit.isra.0+0x110/0x110 [ 32.464802] netlink_rcv_skb+0x50/0xf0 [ 32.464806] netlink_unicast+0x216/0x340 [ 32.464809] netlink_sendmsg+0x23f/0x480 [ 32.464812] sock_sendmsg+0x5e/0x60 [ 32.464815] ____sys_sendmsg+0x22c/0x270 [ 32.464818] ? import_iovec+0x17/0x20 [ 32.464821] ? sendmsg_copy_msghdr+0x59/0x90 [ 32.464823] ? do_set_pte+0xa0/0xe0 [ 32.464828] ___sys_sendmsg+0x81/0xc0 [ 32.464832] ? mod_objcg_state+0xc6/0x300 [ 32.464835] ? refill_obj_stock+0xa9/0x160 [ 32.464838] ? memcg_slab_free_hook+0x1a5/0x1f0 [ 32.464842] __sys_sendmsg+0x49/0x80 [ 32.464847] do_syscall_64+0x3b/0x90 [ 32.464851] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 32.464865] RIP: 0033:0x7f297bf2e5e7 [ 32.464868] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10 [ 32.464869] RSP: 002b:00007ffd96c824c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 32.464872] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f297bf2e5e7 [ 32.464874] RDX: 0000000000000000 RSI: 00007ffd96c82540 RDI: 0000000000000003 [ 32.464875] RBP: 00000000640f19de R08: 0000000000000001 R09: 000000000000007c [ 32.464876] R10: 00007f297bffabe0 R11: 0000000000000246 R12: 0000000000000001 [ 32.464877] R13: 00007ffd96c82d20 R14: 00007ffd96c82610 R15: 000055bfe38a7020 [ 32.464881] </TASK> [ 32.464882] ---[ end trace 0000000000000000 ]--- Fixes: 7d5cd2ce ("bonding: correctly handle bonding type change on enslave failure") Reported-by: syzbot+9dfc3f3348729cc82277@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?id=391c7b1f6522182899efba27d891f1743e8eb3efSigned-off-by: NNikolay Aleksandrov <razor@blackwall.org> Reviewed-by: NMichal Kubiak <michal.kubiak@intel.com> Acked-by: NJonathan Toppins <jtoppins@redhat.com> Acked-by: NJay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NZiyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: NYue Haibing <yuehaibing@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
Committed by Nikolay Aleksandrov
mainline inclusion from mainline-v6.3-rc3 commit 9ec7eb60 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6WNGK CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9ec7eb60dcbcb6c41076defbc5df7bbd95ceaba5 --------------------------- Add bond_ether_setup helper which is used to fix ether_setup() calls in the bonding driver. It takes care of both IFF_MASTER and IFF_SLAVE flags, the former is always restored and the latter only if it was set. If the bond enslaves non-ARPHRD_ETHER device (changes its type), then releases it and enslaves ARPHRD_ETHER device (changes back) then we use ether_setup() to restore the bond device type but it also resets its flags and removes IFF_MASTER and IFF_SLAVE[1]. Use the bond_ether_setup helper to restore both after such transition. [1] reproduce (nlmon is non-ARPHRD_ETHER): $ ip l add nlmon0 type nlmon $ ip l add bond2 type bond mode active-backup $ ip l set nlmon0 master bond2 $ ip l set nlmon0 nomaster $ ip l add bond1 type bond (we use bond1 as ARPHRD_ETHER device to restore bond2's mode) $ ip l set bond1 master bond2 $ ip l sh dev bond2 37: bond2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether be:d7:c5:40:5b:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 1500 (notice bond2's IFF_MASTER is missing) Fixes: e36b9d16 ("bonding: clean muticast addresses when device changes type") Signed-off-by: NNikolay Aleksandrov <razor@blackwall.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Conflicts: drivers/net/bonding/bond_main.c Signed-off-by: NZiyang Xuan <william.xuanziyang@huawei.com> Reviewed-by: NYue Haibing <yuehaibing@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
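A hedged sketch of the helper described above, reconstructed from the commit text: ether_setup() wipes dev->flags, so IFF_MASTER is always restored and IFF_SLAVE only if it was set beforehand.

    static void bond_ether_setup(struct net_device *bond_dev)
    {
            unsigned int slave_flag = bond_dev->flags & IFF_SLAVE;

            ether_setup(bond_dev);                  /* resets type and flags */
            bond_dev->flags |= IFF_MASTER | slave_flag;
    }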
-
Committed by Zheng Wang
mainline inclusion from mainline-v6.3-rc3 commit cb090e64 category: bugfix bugzilla: 188657, https://gitee.com/src-openeuler/kernel/issues/I6T36A CVE: CVE-2023-1855 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cb090e64cf25602b9adaf32d5dfc9c8bec493cd1 -------------------------------- In xgene_hwmon_probe(), &ctx->workq is bound to xgene_hwmon_evt_work and then started. If we remove the driver, which calls xgene_hwmon_remove() to clean up, there may still be unfinished work. The possible sequence is: CPU0 runs xgene_hwmon_remove() and frees the FIFO with kfifo_free(&ctx->async_msg_fifo), while CPU1 is still in xgene_hwmon_evt_work() and reads the freed FIFO via kfifo_out_spinlocked(). Fix it by finishing the work before the cleanup in xgene_hwmon_remove(). Fixes: 2ca492e2 ("hwmon: (xgene) Fix crash when alarm occurs before driver probe") Signed-off-by: Zheng Wang <zyytlz.wz@163.com> Link: https://lore.kernel.org/r/20230310084007.1403388-1-zyytlz.wz@163.com Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Zhao Wenhui <zhaowenhui8@huawei.com> Reviewed-by: songping yu <yusongping@huawei.com> Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com> Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com> Reviewed-by: Chen Hui <judy.chenhui@huawei.com> Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
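A hedged sketch of the fix shape only (the real remove path does more teardown): make the work finish before the FIFO it consumes is freed. ctx->workq and ctx->async_msg_fifo come from the commit text; the other field names are assumptions.

    static int xgene_hwmon_remove(struct platform_device *pdev)
    {
            struct xgene_hwmon_dev *ctx = platform_get_drvdata(pdev);

            cancel_work_sync(&ctx->workq);          /* the missing step */
            hwmon_device_unregister(ctx->hwmon_dev);
            kfifo_free(&ctx->async_msg_fifo);       /* now safe: no work can touch it */
            /* remaining mailbox/channel teardown omitted for brevity */
            return 0;
    }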
-
Committed by David Howells
stable inclusion from stable-v5.10.157 commit 3535c632e6d16c98f76e615da8dc0cb2750c66cc category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6VK2H CVE: CVE-2023-2006 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3535c632e6d16c98f76e615da8dc0cb2750c66cc -------------------------------- [ Upstream commit 3bcd6c7e ] After rxrpc_unbundle_conn() has removed a connection from a bundle, it checks to see if there are any conns with available channels and, if not, removes and attempts to destroy the bundle. Whilst it does check after grabbing client_bundles_lock that there are no connections attached, this races with rxrpc_look_up_bundle() retrieving the bundle, but not attaching a connection for the connection to be attached later. There is therefore a window in which the bundle can get destroyed before we manage to attach a new connection to it. Fix this by adding an "active" counter to struct rxrpc_bundle: (1) rxrpc_connect_call() obtains an active count by prepping/looking up a bundle and ditches it before returning. (2) If, during rxrpc_connect_call(), a connection is added to the bundle, this obtains an active count, which is held until the connection is discarded. (3) rxrpc_deactivate_bundle() is created to drop an active count on a bundle and destroy it when the active count reaches 0. The active count is checked inside client_bundles_lock() to prevent a race with rxrpc_look_up_bundle(). (4) rxrpc_unbundle_conn() then calls rxrpc_deactivate_bundle(). Fixes: 245500d8 ("rxrpc: Rewrite the client connection manager") Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-15975 Signed-off-by: NDavid Howells <dhowells@redhat.com> Tested-by: zdi-disclosures@trendmicro.com cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: NDavid S. Miller <davem@davemloft.net> Conflicts: net/rxrpc/ar-internal.h net/rxrpc/conn_client.c Signed-off-by: NWang Yufen <wangyufen@huawei.com> Reviewed-by: NYue Haibing <yuehaibing@huawei.com> Reviewed-by: NWang Weiyang <wangweiyang2@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
Committed by Zheng Wang
stable inclusion from stable-v5.10.177 commit 75e2144291e847009fbc0350e10ec588ff96e05a category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6W80A CVE: CVE-2023-30772 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=75e2144291e847009fbc0350e10ec588ff96e05a -------------------------------- [ Upstream commit 06615d11 ] In da9150_charger_probe(), &charger->otg_work is bound to da9150_charger_otg_work, and da9150_charger_otg_ncb() may be called to start the work. If we remove the module, which calls da9150_charger_remove() to clean up, there may still be unfinished work. The possible sequence is: CPU0 runs da9150_charger_remove() -> power_supply_unregister() -> device_unregister() -> power_supply_dev_release() -> kfree(psy), while CPU1 is still in da9150_charger_otg_work() and calls power_supply_changed(charger->usb), using the freed power supply. Fix it by canceling the work before the cleanup in da9150_charger_remove(). Fixes: c1a281e3 ("power: Add support for DA9150 Charger") Signed-off-by: Zheng Wang <zyytlz.wz@163.com> Signed-off-by: Sebastian Reichel <sebastian.reichel@collabora.com> Signed-off-by: Sasha Levin <sashal@kernel.org> Signed-off-by: Guo Mengqi <guomengqi3@huawei.com> Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
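A hedged sketch of the same pattern here: cancel the OTG work before the power supply it references goes away. charger->otg_work and charger->usb come from the commit text; everything else is illustrative, not the verbatim driver code.

    static int da9150_charger_remove(struct platform_device *pdev)
    {
            struct da9150_charger *charger = platform_get_drvdata(pdev);

            cancel_work_sync(&charger->otg_work);   /* the missing step */
            power_supply_unregister(charger->usb);  /* now safe from the OTG work */
            /* remaining IRQ/IIO teardown omitted for brevity */
            return 0;
    }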
-
Committed by Ma Wupeng
hulk inclusion category: cleanup bugzilla: https://gitee.com/openeuler/kernel/issues/I6WKXZ CVE: NA -------------------------------- The blank space before "kB" is needed to align with the existing memory report style. Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Frederic Weisbecker
mainline inclusion from mainline-v5.16-rc4 commit 53e87e3c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6WCC1 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=53e87e3cdc155f20c3417b689df8d2ac88d79576 -------------------------------- When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU is guaranteed to stay online and to never stop its tick. Meanwhile on some rare case, the dedicated timekeeper may be running with interrupts disabled for a while, such as in stop_machine. If jiffies stop being updated, a nohz_full CPU may end up endlessly programming the next tick in the past, taking the last jiffies update monotonic timestamp as a stale base, resulting in an tick storm. Here is a scenario where it matters: 0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU. 1) A stop machine callback is queued to execute somewhere. 2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1 can still take IRQs. 3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward. 4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking last_jiffies_update as a base. But last_jiffies_update hasn't been updated for 2 jiffies since the timekeeper has interrupts disabled. 5) clockevents_program_event(), which relies on ktime_get(), observes that the expiration is in the past and therefore programs the min delta event on the clock. 6) The tick fires immediately, goto 3) 7) Tick storm, the nohz_full CPU is drown and takes ages to reach MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation. Solve this with unconditionally updating jiffies if the value is stale on nohz_full IRQ entry. IRQs and other disturbances are expected to be rare enough on nohz_full for the unconditional call to ktime_get() to actually matter. Reported-by: NPaul E. McKenney <paulmck@kernel.org> Signed-off-by: NFrederic Weisbecker <frederic@kernel.org> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NPaul E. McKenney <paulmck@kernel.org> Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org Conflicts: kernel/softirq.c Signed-off-by: NYu Liao <liaoyu15@huawei.com> Reviewed-by: NXiongfeng Wang <wangxiongfeng2@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
Committed by Darrick J. Wong
mainline inclusion from mainline-v5.18-rc2 commit a54f78de category: bugfix bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a54f78def73d847cb060b18c4e4a3d1d26c9ca6d -------------------------------- The recent patch to improve btree cycle checking caused a regression when I rebased the in-memory btree branch atop the 5.19 for-next branch, because in-memory short-pointer btrees do not have AG numbers. This produced the following complaint from kmemleak: unreferenced object 0xffff88803d47dde8 (size 264): comm "xfs_io", pid 4889, jiffies 4294906764 (age 24.072s) hex dump (first 32 bytes): 90 4d 0b 0f 80 88 ff ff 00 a0 bd 05 80 88 ff ff .M.............. e0 44 3a a0 ff ff ff ff 00 df 08 06 80 88 ff ff .D:............. backtrace: [<ffffffffa0388059>] xfbtree_dup_cursor+0x49/0xc0 [xfs] [<ffffffffa029887b>] xfs_btree_dup_cursor+0x3b/0x200 [xfs] [<ffffffffa029af5d>] __xfs_btree_split+0x6ad/0x820 [xfs] [<ffffffffa029b130>] xfs_btree_split+0x60/0x110 [xfs] [<ffffffffa029f6da>] xfs_btree_make_block_unfull+0x19a/0x1f0 [xfs] [<ffffffffa029fada>] xfs_btree_insrec+0x3aa/0x810 [xfs] [<ffffffffa029fff3>] xfs_btree_insert+0xb3/0x240 [xfs] [<ffffffffa02cb729>] xfs_rmap_insert+0x99/0x200 [xfs] [<ffffffffa02cf142>] xfs_rmap_map_shared+0x192/0x5f0 [xfs] [<ffffffffa02cf60b>] xfs_rmap_map_raw+0x6b/0x90 [xfs] [<ffffffffa0384a85>] xrep_rmap_stash+0xd5/0x1d0 [xfs] [<ffffffffa0384dc0>] xrep_rmap_visit_bmbt+0xa0/0xf0 [xfs] [<ffffffffa0384fb6>] xrep_rmap_scan_iext+0x56/0xa0 [xfs] [<ffffffffa03850d8>] xrep_rmap_scan_ifork+0xd8/0x160 [xfs] [<ffffffffa0385195>] xrep_rmap_scan_inode+0x35/0x80 [xfs] [<ffffffffa03852ee>] xrep_rmap_find_rmaps+0x10e/0x270 [xfs] I noticed that xfs_btree_insrec has a bunch of debug code that return out of the function immediately, without freeing the "new" btree cursor that can be returned when _make_block_unfull calls xfs_btree_split. Fix the error return in this function to free the btree cursor. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NDave Chinner <david@fromorbit.com> Signed-off-by: NGuo Xuenan <guoxuenan@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
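The fix itself is small; a hedged sketch of its shape, reconstructed from the changelog rather than copied from the diff: the DEBUG-only early return in xfs_btree_insrec() must release the new cursor that xfs_btree_make_block_unfull()/xfs_btree_split() may have handed back.

    #ifdef DEBUG
            error = xfs_btree_check_block(cur, block, level, bp);
            if (error) {
                    if (ncur)
                            xfs_btree_del_cursor(ncur, error);
                    goto error0;
            }
    #endif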
-
Submitted by Dave Chinner
mainline inclusion
from mainline-v5.18-rc2
commit 5672225e
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5672225e8f2a872a22b0cecedba7a6644af1fb84

--------------------------------

Commit dc04db2a has caused a small aim7 regression, showing a small increase in CPU usage in __xfs_btree_check_sblock() as a result of the extra checking. This is likely due to the endian conversion of the sibling pointers being unconditional instead of relying on the compiler to endian convert the NULL pointer at compile time and avoiding the runtime conversion for this common case.

Rework the checks so that endian conversion of the sibling pointers is only done if they are not null, as the original code did.

.... and these need to be "inline" because the compiler completely fails to inline them automatically like it should be doing.

$ size fs/xfs/libxfs/xfs_btree.o*
   text    data     bss     dec     hex  filename
  51874     240       0   52114    cb92  fs/xfs/libxfs/xfs_btree.o.orig
  51562     240       0   51802    ca5a  fs/xfs/libxfs/xfs_btree.o.inline

Just when you think the tools have advanced sufficiently we don't have to care about stuff like this anymore, along comes a reminder that *our tools still suck*.

Fixes: dc04db2a ("xfs: detect self referencing btree sibling pointers")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
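A sketch of the reworked check described above, simplified and not the actual __xfs_btree_check_sblock() code: the sibling pointer is only endian-converted when it is not null, so the common NULL case keeps its compile-time constant comparison, and the helper is explicitly marked inline.

  /*
   * Illustrative sketch: validate a short-form sibling pointer. Only
   * byte-swap it when it is not NULLAGBLOCK, and reject pointers that
   * reference the block itself or fall outside the AG. Forced inline
   * because the compiler does not inline it on its own.
   */
  static inline bool
  sibling_is_valid(__be32 sib, xfs_agblock_t agbno, xfs_agblock_t agblocks)
  {
          xfs_agblock_t   blkno;

          if (sib == cpu_to_be32(NULLAGBLOCK))
                  return true;            /* no sibling: nothing to check */

          blkno = be32_to_cpu(sib);       /* convert only when needed */

          /* A sibling must not point at ourselves and must be in bounds. */
          return blkno != agbno && blkno < agblocks;
  }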
-
Submitted by Dave Chinner
mainline inclusion
from mainline-v5.18-rc2
commit dc04db2a
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dc04db2aa7c9307e740d6d0e173085301c173b1a

--------------------------------

Detect self referencing btree sibling pointers, to catch the obvious graph cycle problem and hence potential endless looping.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Submitted by Dave Chinner
mainline inclusion
from mainline-v5.14-rc4
commit 04fcad80
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=04fcad80cd068731a779fb442f78234732683755

--------------------------------

Introduce a helper function xfs_buf_daddr() to extract the disk address of the buffer from the struct xfs_buf. This will replace direct accesses to bp->b_bn and bp->b_maps[0].bm_bn, as well as the XFS_BUF_ADDR() macro.

This patch introduces the helper function and replaces all uses of XFS_BUF_ADDR() as this is just a simple sed replacement.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
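A sketch of what such an accessor looks like; the body shown here (returning the block number of the first buffer map) is an assumption for illustration rather than a quote of the upstream header:

  /*
   * Sketch of the accessor: hide the internal representation of the
   * buffer's disk address behind one helper so callers stop poking at
   * bp->b_bn / bp->b_maps[0].bm_bn directly.
   */
  static inline xfs_daddr_t
  xfs_buf_daddr(struct xfs_buf *bp)
  {
          return bp->b_maps[0].bm_bn;
  }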
-
Submitted by Darrick J. Wong
mainline inclusion
from mainline-v5.10-rc5
commit 3945ae03
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3945ae03d822aa47584dd502ac024ae1e1eb9e2d

--------------------------------

A couple of the superblock validation checks apply only to the kernel, so move them to xfs_fc_fill_super before we add the needsrepair "feature", which will prevent the kernel (but not xfsprogs) from mounting the filesystem. This also reduces the diff between kernel and userspace libxfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Submitted by Dave Chinner
mainline inclusion
from mainline-v5.19-rc2
commit 7cf2b0f9
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7cf2b0f9611b9971d663e1fc3206eeda3b902922

--------------------------------

Currently inodegc work can sit queued on the per-cpu queue until the workqueue is either flushed or the queue reaches a depth that triggers work queuing (and later throttling). This means that we could queue work that waits for a long time for some other event to trigger flushing.

Hence instead of just queueing work at a specific depth, use a delayed work that queues the work after a bounded time. We can still schedule the work immediately at a given depth, but we no longer need to worry about leaving a number of items on the list that won't get processed until external events prevail.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
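The queueing policy can be pictured with the following sketch. The field names, the 1 millisecond delay and the depth threshold are assumptions for illustration, not the upstream values; only the "always armed with a short timer, expedited when the queue gets deep" behaviour mirrors the description above.

  /* Assumed threshold for illustration only. */
  #define INODEGC_EXPEDITE_DEPTH  32

  static void
  inodegc_queue_sketch(struct workqueue_struct *wq,
                       struct delayed_work *dwork,
                       unsigned int nr_queued)
  {
          unsigned long delay = msecs_to_jiffies(1);

          if (nr_queued >= INODEGC_EXPEDITE_DEPTH)
                  delay = 0;      /* deep queue: run as soon as possible */

          /* mod_delayed_work() shortens the timer if one is already armed. */
          mod_delayed_work(wq, dwork, delay);
  }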
-
Submitted by Dave Chinner
mainline inclusion
from mainline-v5.19-rc2
commit 5e672cd6
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5e672cd69f0a534a445df4372141fd0d1d00901d

--------------------------------

The current blocking mechanism for pushing the inodegc queue out to disk can result in systems becoming unusable when there is a long running inodegc operation. This is because the statfs() implementation currently issues a blocking flush of the inodegc queue and a significant number of common system utilities will call statfs() to discover something about the underlying filesystem.

This can result in userspace operations getting stuck on inodegc progress, and when trying to remove a heavily reflinked file on slow storage with a full journal, this can result in delays measuring in hours.

Avoid this problem by adding a "push" function that expedites the flushing of the inodegc queue, but doesn't wait for it to complete. Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this mechanism so they don't block but still ensure that queued operations are expedited.

Fixes: ab23a776 ("xfs: per-cpu deferred inode inactivation queues")
Reported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
[djwong: fix _getquota_next to use _inodegc_push too]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
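A simplified sketch of the push/flush split: "push" expedites the queued work without waiting, "flush" pushes and then waits for completion. The function bodies below are illustrative assumptions, not the upstream xfs_inodegc_push()/xfs_inodegc_flush() implementations.

  static void
  inodegc_push_sketch(struct workqueue_struct *wq, struct delayed_work *dwork)
  {
          /* Expedite: collapse any pending delay to zero, do not wait. */
          mod_delayed_work(wq, dwork, 0);
  }

  static void
  inodegc_flush_sketch(struct workqueue_struct *wq, struct delayed_work *dwork)
  {
          inodegc_push_sketch(wq, dwork);
          flush_workqueue(wq);    /* only the flush variant blocks */
  }

Callers like statfs() only need the non-blocking push variant, which is the point of the change.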
-
Submitted by Brian Foster
mainline inclusion
from mainline-v5.16-rc5
commit 6191cf3a
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6191cf3ad59fda5901160633fef8e41b064a5246

--------------------------------

The xfs_inodegc_stop() helper performs a high level flush of pending work on the percpu queues and then runs a cancel_work_sync() on each of the percpu work tasks to ensure all work has completed before returning. While cancel_work_sync() waits for wq tasks to complete, it does not guarantee work tasks have started. This means that the _stop() helper can queue and instantly cancel a wq task without having completed the associated work. This can be observed by tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>" test:

  xfs_destroy_inode: ... ino 0x83 ...
  xfs_inode_set_need_inactive: ... ino 0x83 ...
  xfs_inodegc_stop: ...
  ...
  xfs_inodegc_start: ...
  xfs_inodegc_worker: ...
  xfs_inode_inactivating: ... ino 0x83 ...

The first few lines show that the inode is removed and need inactive state set, but the inactivation work has not completed before the inodegc mechanism stops. The inactivation doesn't actually occur until the fs is unfrozen and the gc mechanism starts back up. Note that this test requires fsfreeze to reproduce because xfs_freeze indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().

When this occurs, the workqueue try_to_grab_pending() logic first tries to steal the pending bit, which does not succeed because the bit has been set by queue_work_on(). Subsequently, it checks for association of a pool workqueue from the work item under the pool lock. This association is set at the point a work item is queued and cleared when dequeued for processing. If the association exists, the work item is removed from the queue and cancel_work_sync() returns true. If the pwq association is cleared, the remove attempt assumes the task is busy and retries (eventually returning false to the caller after waiting for the work task to complete).

To avoid this race, we can flush each work item explicitly before cancel. However, since the _queue_all() already schedules each underlying work item, the workqueue level helpers are sufficient to achieve the same ordering effect. E.g., the inodegc enabled flag prevents scheduling any further work in the _stop() case. Use the drain_workqueue() helper in this particular case to make the intent a bit more self explanatory.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
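The ordering fix can be sketched roughly as follows. The container struct is a hypothetical stand-in for the relevant xfs_mount fields; only the "stop new queueing, then drain rather than cancel" pattern reflects the change described above.

  struct inodegc_ctx {                    /* hypothetical container */
          unsigned long            flags;
          struct workqueue_struct *wq;
  };
  #define INODEGC_ENABLED 0

  static void
  inodegc_stop_sketch(struct inodegc_ctx *gc)
  {
          clear_bit(INODEGC_ENABLED, &gc->flags); /* block new queueing */

          /*
           * drain_workqueue() waits for every already-queued item to run,
           * unlike cancel_work_sync(), which may cancel a queued item
           * before its worker has ever started.
           */
          drain_workqueue(gc->wq);
  }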
-
Submitted by Dave Chinner
mainline inclusion
from mainline-v5.17-rc6
commit 919edbad
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=919edbadebe17a67193533f531c2920c03e40fa4

--------------------------------

Jan Kara reported a performance regression in dbench that he bisected down to commit bad77c37 ("xfs: CIL checkpoint flushes caches unconditionally").

Whilst developing the journal flush/fua optimisations this cache flush was part of, it appeared to make a significant difference to performance. However, now that this patchset has settled and all the correctness issues fixed, there does not appear to be any significant performance benefit to asynchronous cache flushes.

In fact, the opposite is true on some storage types and workloads, where additional cache flushes that can occur from fsync heavy workloads have measurable and significant impact on overall throughput.

Local dbench testing shows little difference on dbench runs with sync vs async cache flushes on either fast or slow SSD storage, and no difference in streaming concurrent async transaction workloads like fs-mark.

Fast NVME storage. From `dbench -t 30`, CIL scale:

  clients   async            sync
            BW      Latency  BW      Latency
  1         935.18  0.855    915.64  0.903
  8        2404.51  6.873   2341.77  6.511
  16       3003.42  6.460   2931.57  6.529
  32       3697.23  7.939   3596.28  7.894
  128      7237.43 15.495   7217.74 11.588
  512      5079.24 90.587   5167.08 95.822

fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize:

          create   chown    unlink
  async   1m41s    1m16s    2m03s
  sync    1m40s    1m19s    1m54s

Slower SATA SSD storage. From `dbench -t 30`, CIL scale:

  clients   async             sync
            BW      Latency   BW      Latency
  1          78.59  15.792     83.78  10.729
  8         367.88  92.067    404.63  59.943
  16        564.51  72.524    602.71  76.089
  32        831.66 105.984    870.26 110.482
  128      1659.76 102.969   1624.73  91.356
  512      2135.91 223.054   2603.07 161.160

fsmark, 16 threads, create w/32k logbsize:

          create   unlink
  async   5m06s    4m15s
  sync    5m00s    4m22s

And on Jan's test machine:

                 5.18-rc8-vanilla     5.18-rc8-patched
  Amean  1        71.22 (  0.00%)      64.94 *  8.81%*
  Amean  2        93.03 (  0.00%)      84.80 *  8.85%*
  Amean  4       150.54 (  0.00%)     137.51 *  8.66%*
  Amean  8       252.53 (  0.00%)     242.24 *  4.08%*
  Amean  16      454.13 (  0.00%)     439.08 *  3.31%*
  Amean  32      835.24 (  0.00%)     829.74 *  0.66%*
  Amean  64     1740.59 (  0.00%)    1686.73 *  3.09%*

Performance and cache flush behaviour is restored to pre-regression levels.

As such, we can now consider the async cache flush mechanism an unnecessary exercise in premature optimisation and hence we can now remove it and the infrastructure it requires completely.

Fixes: bad77c37 ("xfs: CIL checkpoint flushes caches unconditionally")
Reported-and-tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Submitted by Dave Chinner
mainline inclusion
from mainline-v5.14-rc1
commit 9d110014
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9d110014205cb1129fa570d8de83d486fa199354

--------------------------------

From the department of "generic/482 keeps on giving", we bring you another tail update race condition:

  iclog:
      S1                      C1
      +-----------------------+-----------------------+
                                           S2        EOIC

Two checkpoints in a single iclog. One is complete, the other just contains the start record and overruns into a new iclog.

Timeline:

  Before S1:   Cache flush, log tail = X
  At S1:       Metadata stable, write start record and checkpoint
  At C1:       Write commit record, set NEED_FUA
               Single iclog checkpoint, so no need for NEED_FLUSH
               Log tail still = X, so no need for NEED_FLUSH
  After C1,
  Before S2:   Cache flush, log tail = X
  At S2:       Metadata stable, write start record and checkpoint
  After S2:    Log tail moves to X+1
  At EOIC:     End of iclog, more journal data to write
               Releases iclog
               Not a commit iclog, so no need for NEED_FLUSH
               Writes log tail X+1 into iclog.

At this point, the iclog has tail X+1 and NEED_FUA set. There has been no cache flush for the metadata between X and X+1, and the iclog writes the new tail permanently to the log. This is sufficient to violate on disk metadata/journal ordering.

We have two options here. The first is to detect this case in some manner and ensure that the partial checkpoint write sets NEED_FLUSH when the iclog is already marked NEED_FUA and the log tail changes. This seems somewhat fragile and quite complex to get right, and it doesn't actually make it obvious what underlying problem it is actually addressing from reading the code.

The second option seems much cleaner to me, because it is derived directly from the requirements of the C1 commit record in the iclog. That is, when we write this commit record to the iclog, we've guaranteed that the metadata/data ordering is correct for tail update purposes. Hence if we only write the log tail into the iclog for the *first* commit record rather than the log tail at the last release, we guarantee that the log tail does not move past where the first commit record in the log expects it to be.

IOWs, taking the first option means that replay of C1 becomes dependent on future operations doing the right thing, not just the C1 checkpoint itself doing the right thing. This makes log recovery almost impossible to reason about because now we have to take into account what might or might not have happened in the future when looking at checkpoints in the log rather than just having to reconstruct the past...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
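A rough sketch of the rule described above: record the log tail into the iclog only for the first commit record it carries, so later checkpoints sharing the same iclog cannot push the recorded tail past what the first checkpoint's cache flush covered. The struct and field names below are assumptions for illustration, not the upstream xlog data structures.

  struct iclog_sketch {
          xfs_lsn_t       tail_lsn;       /* 0 means "not yet recorded" */
  };

  static void
  iclog_record_commit_tail(struct iclog_sketch *iclog, xfs_lsn_t current_tail)
  {
          /* Only the first commit record in this iclog pins the tail. */
          if (!iclog->tail_lsn)
                  iclog->tail_lsn = current_tail;
  }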
-
Submitted by Dave Chinner
mainline inclusion
from mainline-v5.14-rc1
commit b2ae3a9e
category: bugfix
bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b2ae3a9ef91152931b99620c431cf3805daa1429

--------------------------------

Because I cannot tell if the NEED_FLUSH flag is being set correctly by the log force and CIL push machinery without it.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Submitted by Jens Axboe
stable inclusion
from stable-v5.10.172
commit da24142b1ef9fd5d36b76e36bab328a5b27523e8
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6V7V1
CVE: CVE-2023-1872
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=da24142b1ef9fd5d36b76e36bab328a5b27523e8

--------------------------------

We can't use 0 here, as io_init_req() is always invoked with the ctx uring_lock held. Newer kernels have IO_URING_F_UNLOCKED for this, but previously we used IO_URING_F_NONBLOCK to indicate this as well.

Fixes: 08681391b84d ("io_uring: add missing lock in io_get_file_fixed")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Submitted by Bing-Jhong Billy Jheng
stable inclusion
from stable-v5.10.171
commit 08681391b84da27133deefaaddefd0acfa90c2be
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6V7V1
CVE: CVE-2023-1872
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=08681391b84da27133deefaaddefd0acfa90c2be

--------------------------------

io_get_file_fixed will access io_uring's context. Lock it if it is invoked unlocked (e.g. via io-wq) to avoid a race condition with fixed files getting unregistered.

No single upstream patch exists for this issue, it was fixed as part of the file assignment changes that went into the 5.18 cycle.

Signed-off-by: Jheng, Bing-Jhong Billy <billy@starlabs.sg>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
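A simplified sketch of the locking rule from these two patches: the inline submission path already holds ctx->uring_lock (and historically signalled that with IO_URING_F_NONBLOCK), while the io-wq worker path does not and therefore must take the lock around the fixed-file lookup. The struct below is a hypothetical stand-in for io_ring_ctx and the lookup is reduced to an array index; only the lock/unlock pattern mirrors the described fix.

  struct io_ctx_sketch {                  /* hypothetical stand-in */
          struct mutex     uring_lock;
          struct file    **fixed_files;
          unsigned int     nr_fixed;
  };

  static struct file *
  fixed_file_get_sketch(struct io_ctx_sketch *ctx, unsigned int slot,
                        unsigned int issue_flags)
  {
          struct file *file = NULL;
          /* io-wq invocations do not pass IO_URING_F_NONBLOCK and do not
           * hold the lock, so take it here to serialise against fixed
           * file unregistration. */
          bool need_lock = !(issue_flags & IO_URING_F_NONBLOCK);

          if (need_lock)
                  mutex_lock(&ctx->uring_lock);
          if (slot < ctx->nr_fixed)
                  file = ctx->fixed_files[slot];
          if (need_lock)
                  mutex_unlock(&ctx->uring_lock);
          return file;
  }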
-
Submitted by openeuler-ci-bot
Merge Pull Request from: @xia-bing1

Resolve the following issues:

1. SATA devices on an expander may be removed and not found again when an I_T nexus reset and revalidation are processed simultaneously.

2. Currently the driver sets the port invalid if one phy in the port is not enabled, which may cause issues in the expander case. In the directly attached case, if phy up doesn't occur in time when refreshing the port id, the port is incorrectly set to invalid, which also causes disk loss.

3. When the host controller is suspended, enabling a local PHY just after disabling all local PHYs in an expander environment causes a hang.

4. An incorrect port id may be configured in hisi_sas_refresh_port_id(). As a result, all internal IOs fail and the disk is lost.

5. After a HUAWEI disk that supports DIF3 is converted to a common SAS disk in DIF format, an error message is displayed when the FIO command is executed.

Link: https://gitee.com/openeuler/kernel/pulls/618
Reviewed-by: Yihang Li <liyihang9@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-