1. 25 Oct 2021 (3 commits)
    • net/802/garp: fix memleak in garp_request_join() · 96a28d8d
      Yang Yingliang authored
      stable inclusion
      from linux-4.19.200
      commit e954107513e5e984821591b9b0ee4b002fcb63c6
      
      --------------------------------
      
      [ Upstream commit 42ca63f9 ]
      
      I got a kmemleak report when doing a fuzz test:
      
      BUG: memory leak
      unreferenced object 0xffff88810c909b80 (size 64):
        comm "syz", pid 957, jiffies 4295220394 (age 399.090s)
        hex dump (first 32 bytes):
          01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 08 00 00 00 01 02 00 04  ................
        backtrace:
          [<00000000ca1f2e2e>] garp_request_join+0x285/0x3d0
          [<00000000bf153351>] vlan_gvrp_request_join+0x15b/0x190
          [<0000000024005e72>] vlan_dev_open+0x706/0x980
          [<00000000dc20c4d4>] __dev_open+0x2bb/0x460
          [<0000000066573004>] __dev_change_flags+0x501/0x650
          [<0000000035b42f83>] rtnl_configure_link+0xee/0x280
          [<00000000a5e69de0>] __rtnl_newlink+0xed5/0x1550
          [<00000000a5258f4a>] rtnl_newlink+0x66/0x90
          [<00000000506568ee>] rtnetlink_rcv_msg+0x439/0xbd0
          [<00000000b7eaeae1>] netlink_rcv_skb+0x14d/0x420
          [<00000000c373ce66>] netlink_unicast+0x550/0x750
          [<00000000ec74ce74>] netlink_sendmsg+0x88b/0xda0
          [<00000000381ff246>] sock_sendmsg+0xc9/0x120
          [<000000008f6a2db3>] ____sys_sendmsg+0x6e8/0x820
          [<000000008d9c1735>] ___sys_sendmsg+0x145/0x1c0
          [<00000000aa39dd8b>] __sys_sendmsg+0xfe/0x1d0
      
      Calling garp_request_leave() after garp_request_join() sets attr->state
      to GARP_APPLICANT_VO, so garp_attr_destroy() won't be called in the last
      transmit event in garp_uninit_applicant() and the attr of the applicant
      will be leaked. To fix this leak, iterate over and free each attr of the
      applicant before returning from garp_uninit_applicant(), as sketched
      below.
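      The upstream change follows this idea; the helper name below is
      illustrative and details may differ from the actual patch:

        /* Sketch: destroy every attribute still tracked by the applicant. */
        static void garp_attr_destroy_all(struct garp_applicant *app)
        {
                struct rb_node *node, *next;
                struct garp_attr *attr;

                for (node = rb_first(&app->gid); node; node = next) {
                        next = rb_next(node);
                        attr = rb_entry(node, struct garp_attr, node);
                        garp_attr_destroy(app, attr);
                }
        }

      garp_uninit_applicant() would call this (under the applicant lock)
      right before freeing the applicant, so no attr can outlive it.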
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • net/802/mrp: fix memleak in mrp_request_join() · 4c2871d4
      Yang Yingliang authored
      stable inclusion
      from linux-4.19.200
      commit f9dd1e4e9d39e799fbe2be9ac7e6b43a9567ff8c
      
      --------------------------------
      
      [ Upstream commit 996af621 ]
      
      I got a kmemleak report when doing a fuzz test:
      
      BUG: memory leak
      unreferenced object 0xffff88810c239500 (size 64):
      comm "syz-executor940", pid 882, jiffies 4294712870 (age 14.631s)
      hex dump (first 32 bytes):
      01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      00 00 00 00 00 00 00 00 01 00 00 00 01 02 00 04 ................
      backtrace:
      [<00000000a323afa4>] slab_alloc_node mm/slub.c:2972 [inline]
      [<00000000a323afa4>] slab_alloc mm/slub.c:2980 [inline]
      [<00000000a323afa4>] __kmalloc+0x167/0x340 mm/slub.c:4130
      [<000000005034ca11>] kmalloc include/linux/slab.h:595 [inline]
      [<000000005034ca11>] mrp_attr_create net/802/mrp.c:276 [inline]
      [<000000005034ca11>] mrp_request_join+0x265/0x550 net/802/mrp.c:530
      [<00000000fcfd81f3>] vlan_mvrp_request_join+0x145/0x170 net/8021q/vlan_mvrp.c:40
      [<000000009258546e>] vlan_dev_open+0x477/0x890 net/8021q/vlan_dev.c:292
      [<0000000059acd82b>] __dev_open+0x281/0x410 net/core/dev.c:1609
      [<000000004e6dc695>] __dev_change_flags+0x424/0x560 net/core/dev.c:8767
      [<00000000471a09af>] rtnl_configure_link+0xd9/0x210 net/core/rtnetlink.c:3122
      [<0000000037a4672b>] __rtnl_newlink+0xe08/0x13e0 net/core/rtnetlink.c:3448
      [<000000008d5d0fda>] rtnl_newlink+0x64/0xa0 net/core/rtnetlink.c:3488
      [<000000004882fe39>] rtnetlink_rcv_msg+0x369/0xa10 net/core/rtnetlink.c:5552
      [<00000000907e6c54>] netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2504
      [<00000000e7d7a8c4>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
      [<00000000e7d7a8c4>] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1340
      [<00000000e0645d50>] netlink_sendmsg+0x78e/0xc90 net/netlink/af_netlink.c:1929
      [<00000000c24559b7>] sock_sendmsg_nosec net/socket.c:654 [inline]
      [<00000000c24559b7>] sock_sendmsg+0x139/0x170 net/socket.c:674
      [<00000000fc210bc2>] ____sys_sendmsg+0x658/0x7d0 net/socket.c:2350
      [<00000000be4577b5>] ___sys_sendmsg+0xf8/0x170 net/socket.c:2404
      
      Calling mrp_request_leave() after mrp_request_join() sets attr->state
      to MRP_APPLICANT_VO, so mrp_attr_destroy() won't be called in the last
      TX event in mrp_uninit_applicant() and the attr of the applicant will be
      leaked. To fix this leak, iterate over and free each attr of the
      applicant before returning from mrp_uninit_applicant(), as sketched
      below.
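      The fix mirrors the GARP one above: walk the applicant's attribute
      rbtree and destroy every entry before mrp_uninit_applicant() returns.
      A sketch (helper name illustrative):

        static void mrp_attr_destroy_all(struct mrp_applicant *app)
        {
                struct rb_node *node, *next;

                for (node = rb_first(&app->mad); node; node = next) {
                        next = rb_next(node);
                        mrp_attr_destroy(app, rb_entry(node, struct mrp_attr, node));
                }
        }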
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • af_unix: fix garbage collect vs MSG_PEEK · faa64eb4
      Miklos Szeredi authored
      stable inclusion
      from linux-4.19.200
      commit 1dabafa9f61118b1377fde424d9a94bf8dbf2813
      
      --------------------------------
      
      commit cbcf0112 upstream.
      
      unix_gc() assumes that candidate sockets can never gain an external
      reference (i.e.  be installed into an fd) while the unix_gc_lock is
      held.  Except for MSG_PEEK this is guaranteed by modifying inflight
      count under the unix_gc_lock.
      
      MSG_PEEK does not touch any variable protected by unix_gc_lock (file
      count is not), yet it needs to be serialized with garbage collection.
      Do this by locking/unlocking unix_gc_lock:
      
       1) increment file count
      
       2) lock/unlock barrier to make sure incremented file count is visible
          to garbage collection
      
       3) install file into fd
      
      This is a lock barrier (unlike smp_mb()) that ensures that garbage
      collection is run completely before or completely after the barrier.
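      A sketch of the MSG_PEEK receive path after the change (the helper name
      follows the upstream patch; the long explanatory comment is abridged):

        static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
        {
                scm->fp = scm_fp_dup(UNIXCB(skb).fp);   /* 1) bump file counts */

                /*
                 * 2) lock/unlock of unix_gc_lock acts as a barrier: any gc
                 * that saw the old file count has finished, and any later gc
                 * sees the incremented count and won't treat the socket as
                 * garbage.
                 */
                spin_lock(&unix_gc_lock);
                spin_unlock(&unix_gc_lock);
        }

      Step 3), installing the files into fds, then happens as before in the
      scm_detach_fds() path, outside unix_gc_lock.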
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  2. 22 Oct 2021 (12 commits)
  3. 21 Oct 2021 (1 commit)
  4. 20 Oct 2021 (9 commits)
    • uacce: misc fixes · de116c46
      Yu'an Wang authored
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      1. Add an input parameter check to the uacce_unregister API (sketched
         below).
      2. Make uacce_qfrt_str an internal interface, because it is only used
         in uacce.c.
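      A minimal sketch of fix 1; it assumes the out-of-tree signature
      uacce_unregister(struct uacce *uacce), so treat it as illustrative
      rather than the exact driver code:

        void uacce_unregister(struct uacce *uacce)
        {
                if (!uacce)     /* reject a NULL argument instead of crashing */
                        return;

                /* ... existing teardown of queues, cdev and sysfs nodes ... */
        }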
      Signed-off-by: Yu'an Wang <wangyuan46@huawei.com>
      Reviewed-by: Longfang Liu <liulongfang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc: place pages to tail in __free_pages_core() · 01702464
      David Hildenbrand authored
      mainline inclusion
      from mainline-v5.10-rc1
      commit 7fef431b
      category: feature
      bugzilla: 182882
      CVE: NA
      
      __free_pages_core() is used when exposing fresh memory to the buddy during
      system boot and when onlining memory in generic_online_page().
      
      generic_online_page() is used in two cases:
      
      1. Direct memory onlining in online_pages().
      2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV
         balloon and virtio-mem), when parts of a section are kept
         fake-offline to be fake-onlined later on.
      
      In 1, we already place pages to the tail of the freelist.  Pages will be
      freed to MIGRATE_ISOLATE lists first and moved to the tail of the
      freelists via undo_isolate_page_range().
      
      In 2, we currently don't implement a proper rule.  In case of virtio-mem,
      where we currently always online MAX_ORDER - 1 pages, the pages will be
      placed to the HEAD of the freelist - undesirable.  While the hyper-v
      balloon calls generic_online_page() with single pages, usually it will
      call it on successive single pages in a larger block.
      
      The pages are fresh, so place them to the tail of the freelist and avoid
      the PCP.  In __free_pages_core(), remove the now superfluous call to
      set_page_refcounted() and add a comment regarding page initialization and
      the refcount.
      
      Note: In 2.  we currently don't shuffle.  If ever relevant (page shuffling
      is usually of limited use in virtualized environments), we might want to
      shuffle after a sequence of generic_online_page() calls in the relevant
      callers.
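      The heart of the mainline change, as an abridged sketch (the 4.19
      backport may differ slightly after the noted conflict resolution):

        void __free_pages_core(struct page *page, unsigned int order)
        {
                /* ... clear PageReserved / reset refcounts of tail pages ... */

                /*
                 * Fresh pages: bypass the per-cpu lists and queue them at the
                 * tail of the freelist; set_page_refcounted() is no longer
                 * needed here.
                 */
                __free_pages_ok(page, order, FPI_TO_TAIL);
        }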
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Link: https://lkml.kernel.org/r/20201005121534.15649-5-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      [Peng Liu: adjust context]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc: move pages to tail in move_to_free_list() · 5be3d693
      David Hildenbrand authored
      mainline inclusion
      from mainline-v5.10-rc1
      commit 293ffa5e
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      Whenever we move pages between freelists via move_to_free_list()/
      move_freepages_block(), we don't actually touch the pages:
      1. Page isolation doesn't actually touch the pages, it simply isolates
         pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
         When undoing isolation, we move the pages back to the target list.
      2. Page stealing (steal_suitable_fallback()) moves free pages directly
         between lists without touching them.
      3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock() moves
         free pages directly between freelists without touching them.
      
      We already place pages to the tail of the freelists when undoing isolation
      via __putback_isolated_page(), let's do it in any case (e.g., if order <=
      pageblock_order) and document the behavior. To simplify, let's move the
      pages to the tail for all move_to_free_list()/move_freepages_block() users.
      
      In 2., the target list is empty, so there should be no change.  In 3., we
      might observe a change, however, highatomic is more concerned about
      allocations succeeding than cache hotness - if we ever realize this change
      degrades a workload, we can special-case this instance and add a proper
      comment.
      
      This change results in all pages getting onlined via online_pages() to be
      placed to the tail of the freelist.
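      The change itself is small; a sketch based on the mainline patch:

        /* Used for pages which are on another list, e.g. when undoing isolation. */
        static inline void move_to_free_list(struct page *page, struct zone *zone,
                                             unsigned int order, int migratetype)
        {
                struct free_area *area = &zone->free_area[order];

                /* list_move_tail() instead of list_move(): keep moved pages cold. */
                list_move_tail(&page->lru, &area->free_list[migratetype]);
        }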
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20201005121534.15649-4-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      [Peng Liu: cherry-pick from 293ffa5e]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc: place pages to tail in __putback_isolated_page() · 69e15bb4
      David Hildenbrand authored
      mainline inclusion
      from mainline-v5.10-rc1
      commit 47b6a24a
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      __putback_isolated_page() already documents that pages will be placed to
      the tail of the freelist - this is, however, not the case for "order >=
      MAX_ORDER - 2" (see buddy_merge_likely()) - which should be the case for
      all existing users.
      
      This change affects two users:
      - free page reporting
      - page isolation, when undoing the isolation (including memory onlining).
      
      This behavior is desirable for pages that haven't really been touched
      lately, so exactly the two users that don't actually read/write page
      content, but rather move untouched pages.
      
      The new behavior is especially desirable for memory onlining, where we
      allow allocation of newly onlined pages via undo_isolate_page_range() in
      online_pages().  Right now, we always place them to the head of the
      freelist, resulting in undesirable behavior: Assume we add individual
      memory chunks via add_memory() and online them right away to the NORMAL
      zone.  We create a dependency chain of unmovable allocations e.g., via the
      memmap.  The memmap of the next chunk will be placed onto previous chunks
      - if the last block cannot get offlined+removed, all dependent ones cannot
      get offlined+removed.  While this can already be observed with individual
      DIMMs, it's more of an issue for virtio-mem (and I suspect also ppc
      DLPAR).
      
      Document that this should only be used for optimizations, and no code
      should rely on this behavior for correctness (if the order of the freelists
      ever changes).
      
      We won't care about page shuffling: memory onlining already properly
      shuffles after onlining.  free page reporting doesn't care about
      physically contiguous ranges, and there are already cases where page
      isolation will simply move (physically close) free pages to (currently)
      the head of the freelists via move_freepages_block() instead of shuffling.
      If this becomes ever relevant, we should shuffle the whole zone when
      undoing isolation of larger ranges, and after free_contig_range().
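      A sketch of the resulting mainline function (the backport may differ
      slightly after the noted conflict resolution):

        void __putback_isolated_page(struct page *page, unsigned int order, int mt)
        {
                struct zone *zone = page_zone(page);

                /* zone lock should be held when this function is called */
                lockdep_assert_held(&zone->lock);

                /* Return isolated page to tail of freelist. */
                __free_one_page(page, page_to_pfn(page), zone, order, mt,
                                FPI_SKIP_REPORT_NOTIFY | FPI_TO_TAIL);
        }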
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20201005121534.15649-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      [Peng Liu: cherry-pick from 47b6a24a]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag · 22d4ccfa
      David Hildenbrand authored
      mainline inclusion
      from mainline-v5.10-rc1
      commit f04a5d5d
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      Patch series "mm: place pages to the freelist tail when onlining and undoing isolation", v2.
      
      When adding separate memory blocks via add_memory*() and onlining them
      immediately, the metadata (especially the memmap) of the next block will
      be placed onto one of the just added+onlined block.  This creates a chain
      of unmovable allocations: if the last memory block cannot get
      offlined+removed, neither can any of the dependent ones.  We directly
      have unmovable allocations all over the place.
      
      This can be observed quite easily using virtio-mem, however, it can also
      be observed when using DIMMs.  The freshly onlined pages will usually be
      placed to the head of the freelists, meaning they will be allocated next,
      turning the just-added memory usually immediately un-removable.  The fresh
      pages are cold, prefering to allocate others (that might be hot) also
      feels to be the natural thing to do.
      
      It also applies to the Hyper-V balloon, Xen balloon, and ppc64 dlpar: when
      adding separate, successive memory blocks, each memory block will have
      unmovable allocations on them - for example gigantic pages will fail to
      allocate.
      
      While the ZONE_NORMAL doesn't provide any guarantees that memory can get
      offlined+removed again (any kind of fragmentation with unmovable
      allocations is possible), there are many scenarios (hotplugging a lot of
      memory, running workload, hotunplug some memory/as much as possible) where
      we can offline+remove quite a lot with this patchset.
      
      a) To visualize the problem, a very simple example:
      
      Start a VM with 4GB and 8GB of virtio-mem memory:
      
       [root@localhost ~]# lsmem
       RANGE                                 SIZE  STATE REMOVABLE  BLOCK
       0x0000000000000000-0x00000000bfffffff   3G online       yes   0-23
       0x0000000100000000-0x000000033fffffff   9G online       yes 32-103
      
       Memory block size:       128M
       Total online memory:      12G
       Total offline memory:      0B
      
      Then try to unplug as much as possible using virtio-mem. Observe which
      memory blocks are still around. Without this patch set:
      
       [root@localhost ~]# lsmem
       RANGE                                  SIZE  STATE REMOVABLE   BLOCK
       0x0000000000000000-0x00000000bfffffff    3G online       yes    0-23
       0x0000000100000000-0x000000013fffffff    1G online       yes   32-39
       0x0000000148000000-0x000000014fffffff  128M online       yes      41
       0x0000000158000000-0x000000015fffffff  128M online       yes      43
       0x0000000168000000-0x000000016fffffff  128M online       yes      45
       0x0000000178000000-0x000000017fffffff  128M online       yes      47
       0x0000000188000000-0x0000000197ffffff  256M online       yes   49-50
       0x00000001a0000000-0x00000001a7ffffff  128M online       yes      52
       0x00000001b0000000-0x00000001b7ffffff  128M online       yes      54
       0x00000001c0000000-0x00000001c7ffffff  128M online       yes      56
       0x00000001d0000000-0x00000001d7ffffff  128M online       yes      58
       0x00000001e0000000-0x00000001e7ffffff  128M online       yes      60
       0x00000001f0000000-0x00000001f7ffffff  128M online       yes      62
       0x0000000200000000-0x0000000207ffffff  128M online       yes      64
       0x0000000210000000-0x0000000217ffffff  128M online       yes      66
       0x0000000220000000-0x0000000227ffffff  128M online       yes      68
       0x0000000230000000-0x0000000237ffffff  128M online       yes      70
       0x0000000240000000-0x0000000247ffffff  128M online       yes      72
       0x0000000250000000-0x0000000257ffffff  128M online       yes      74
       0x0000000260000000-0x0000000267ffffff  128M online       yes      76
       0x0000000270000000-0x0000000277ffffff  128M online       yes      78
       0x0000000280000000-0x0000000287ffffff  128M online       yes      80
       0x0000000290000000-0x0000000297ffffff  128M online       yes      82
       0x00000002a0000000-0x00000002a7ffffff  128M online       yes      84
       0x00000002b0000000-0x00000002b7ffffff  128M online       yes      86
       0x00000002c0000000-0x00000002c7ffffff  128M online       yes      88
       0x00000002d0000000-0x00000002d7ffffff  128M online       yes      90
       0x00000002e0000000-0x00000002e7ffffff  128M online       yes      92
       0x00000002f0000000-0x00000002f7ffffff  128M online       yes      94
       0x0000000300000000-0x0000000307ffffff  128M online       yes      96
       0x0000000310000000-0x0000000317ffffff  128M online       yes      98
       0x0000000320000000-0x0000000327ffffff  128M online       yes     100
       0x0000000330000000-0x000000033fffffff  256M online       yes 102-103
      
       Memory block size:       128M
       Total online memory:     8.1G
       Total offline memory:      0B
      
      With this patch set:
      
       [root@localhost ~]# lsmem
       RANGE                                 SIZE  STATE REMOVABLE BLOCK
       0x0000000000000000-0x00000000bfffffff   3G online       yes  0-23
       0x0000000100000000-0x000000013fffffff   1G online       yes 32-39
      
       Memory block size:       128M
       Total online memory:       4G
       Total offline memory:      0B
      
      All memory can get unplugged, all memory block can get removed.  Of
      course, no workload ran and the system was basically idle, but it
      highlights the issue - the fairly deterministic chain of unmovable
      allocations.  When a huge page for the 2MB memmap is needed, a
      just-onlined 4MB page will be split.  The remaining 2MB page will be used
      for the memmap of the next memory block.  So one memory block will hold
      the memmap of the two following memory blocks.  Finally the pages of the
      last-onlined memory block will get used for the next bigger allocations -
      if any allocation is unmovable, all dependent memory blocks cannot get
      unplugged and removed until that allocation is gone.
      
      Note that with bigger memory blocks (e.g., 256MB), *all* memory
      blocks are dependent and none can get unplugged again!
      
      b) Experiment with memory intensive workload
      
      I performed an experiment with an older version of this patch set (before
      we used undo_isolate_page_range() in online_pages()): Hotplug 56GB to a VM
      with an initial 4GB, onlining all memory to ZONE_NORMAL right from the
      kernel when adding it.  I then run various memory intensive workloads that
      consume most system memory for a total of 45 minutes.  Once finished, I
      try to unplug as much memory as possible.
      
      With this change, I am able to remove via virtio-mem (adding individual
      128MB memory blocks) 413 out of 448 added memory blocks.  Via individual
      (256MB) DIMMs 380 out of 448 added memory blocks.  (I don't have any
      numbers without this patchset, but looking at the above example, it's at
      most half of the 448 memory blocks for virtio-mem, and most probably none
      for DIMMs).
      
      Again, there are workloads that might behave very differently due to the
      nature of ZONE_NORMAL.
      
      This change also affects (besides memory onlining):
      - Other users of undo_isolate_page_range(): Pages are always placed to the
        tail.
      -- When memory offlining fails
      -- When memory isolation fails after having isolated some pageblocks
      -- When alloc_contig_range() either succeeds or fails
      - Other users of __putback_isolated_page(): Pages are always placed to the
        tail.
      -- Free page reporting
      - Other users of __free_pages_core()
      -- AFAIK, any memory that is getting exposed to the buddy during boot.
         IIUC we will now usually allocate memory from lower addresses within
         a zone first (especially during boot).
      - Other users of generic_online_page()
      -- Hyper-V balloon
      
      This patch (of 5):
      
      Let's prepare for additional flags and avoid long parameter lists of
      bools.  Follow-up patches will also make use of the flags in
      __free_pages_ok().
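      A sketch of the flag type this patch introduces in mm/page_alloc.c,
      based on the mainline commit (the backport may differ in detail):

        /*
         * Free Page Internal flags: for the internal, non-pcp variants of
         * free_pages().
         */
        typedef int __bitwise fpi_t;

        /* No special request */
        #define FPI_NONE                ((__force fpi_t)0)

        /*
         * Skip free page reporting notification for the (possibly merged)
         * page.  Replaces the old "bool report" parameter of __free_one_page().
         */
        #define FPI_SKIP_REPORT_NOTIFY  ((__force fpi_t)BIT(0))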
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: https://lkml.kernel.org/r/20201005121534.15649-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20201005121534.15649-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      [Peng Liu: cherry-pick from f04a5d5d]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: add function __putback_isolated_page · 91bac231
      Alexander Duyck authored
      mainline inclusion
      from mainline-v5.7-rc1
      commit 624f58d8
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      There are cases where we would benefit from avoiding having to go through
      the allocation and free cycle to return an isolated page.
      
      Examples for this might include page poisoning in which we isolate a page
      and then put it back in the free list without ever having actually
      allocated it.
      
      This will enable us to also avoid notifiers for the future free page
      reporting which will need to avoid retriggering page reporting when
      returning pages that have been reported on.
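      A sketch of the function as added by this patch (abridged; the exact
      __free_one_page() arguments depend on the surrounding series):

        void __putback_isolated_page(struct page *page, unsigned int order, int mt)
        {
                struct zone *zone = page_zone(page);

                /* zone lock should be held when this function is called */
                lockdep_assert_held(&zone->lock);

                /* Return isolated page to tail of freelist. */
                __free_one_page(page, page_to_pfn(page), zone, order, mt);
        }

      Callers such as the page-isolation undo path can then hand an isolated
      page straight back to the freelist instead of going through a full
      allocation/free cycle.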
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomain
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
              mm/internal.h
      [Peng Liu: cherry-pick from 624f58d8]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc.c: memory hotplug: free pages as higher order · c479b04b
      Arun KS authored
      mainline inclusion
      from mainline-v5.1-rc1
      commit a9cd410a
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      When pages are freed at a higher order, the time spent coalescing pages
      in the buddy allocator can be reduced.  With a section size of 256MB,
      the hot-add latency of a single section improves from 50-60 ms to less
      than 1 ms, improving the hot-add latency by about 60 times.  External
      providers of the online callback are modified to align with the change
      (see the sketch below).
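      An abridged sketch of the chunked onlining loop this patch introduces
      (simplified from the mainline change; alignment handling omitted):

        static unsigned long online_pages_blocks(unsigned long start,
                                                 unsigned long nr_pages)
        {
                unsigned long end = start + nr_pages, onlined = 0;
                int order;

                while (start < end) {
                        /* Free the largest buddy-sized chunk that still fits. */
                        order = min(MAX_ORDER - 1,
                                    get_order(PFN_PHYS(end) - PFN_PHYS(start)));
                        (*online_page_callback)(pfn_to_page(start), order);

                        onlined += 1UL << order;
                        start += 1UL << order;
                }
                return onlined;
        }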
      
      [arunks@codeaurora.org: v11]
        Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
      [akpm@linux-foundation.org: remove unused local, per Arun]
      [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
      [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
      [arunks@codeaurora.org: v8]
        Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
      [arunks@codeaurora.org: v9]
        Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
      Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
      Signed-off-by: Arun KS <arunks@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      	mm/memory_hotplug.c
      [Peng Liu: adjust context]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • raid1: ensure write behind bio has less than BIO_MAX_VECS sectors · a2e9fa04
      Guoqing Jiang authored
      mainline inclusion
      from mainline-v5.15-rc1
      commit 6607cd31
      category: bugfix
      bugzilla: 182883, https://gitee.com/openeuler/kernel/issues/I4ENHY
      CVE: NA
      
      -------------------------------------------------
      
      We can't split a write-behind bio with more than BIO_MAX_VECS sectors,
      otherwise the call trace below is triggered, because an oversized
      write-behind bio could be allocated later.
      
      [ 8.097936] bvec_alloc+0x90/0xc0
      [ 8.098934] bio_alloc_bioset+0x1b3/0x260
      [ 8.099959] raid1_make_request+0x9ce/0xc50 [raid1]
      [ 8.100988] ? __bio_clone_fast+0xa8/0xe0
      [ 8.102008] md_handle_request+0x158/0x1d0 [md_mod]
      [ 8.103050] md_submit_bio+0xcd/0x110 [md_mod]
      [ 8.104084] submit_bio_noacct+0x139/0x530
      [ 8.105127] submit_bio+0x78/0x1d0
      [ 8.106163] ext4_io_submit+0x48/0x60 [ext4]
      [ 8.107242] ext4_writepages+0x652/0x1170 [ext4]
      [ 8.108300] ? do_writepages+0x41/0x100
      [ 8.109338] ? __ext4_mark_inode_dirty+0x240/0x240 [ext4]
      [ 8.110406] do_writepages+0x41/0x100
      [ 8.111450] __filemap_fdatawrite_range+0xc5/0x100
      [ 8.112513] file_write_and_wait_range+0x61/0xb0
      [ 8.113564] ext4_sync_file+0x73/0x370 [ext4]
      [ 8.114607] __x64_sys_fsync+0x33/0x60
      [ 8.115635] do_syscall_64+0x33/0x40
      [ 8.116670] entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Thanks for the comment from Christoph.
      
      [1]. https://bugs.archlinux.org/task/70992
      
      Cc: stable@vger.kernel.org # v5.12+
      Reported-by: Jens Stutte <jens@chianterastutte.eu>
      Tested-by: Jens Stutte <jens@chianterastutte.eu>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      
      Conflict:
              drivers/md/raid1.c
              [ Mainline patch 6607cd31 ("raid1: ensure write behind bio has less
              than BIO_MAX_VECS sectors"), BIO_MAX_VECS is used directly, but the
              BIO_MAX_VECS was renamed previously and the corresponding patch
              a8affc03 ("block: rename BIO_MAX_PAGES to BIO_MAX_VECS") was not
              incorporated. So we modify BIO_MAX_VECS to the original BIO_MAX_PAGES.]
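      The core of the fix as applied here, as a sketch (per the conflict note
      the backport uses BIO_MAX_PAGES instead of BIO_MAX_VECS; write_behind is
      a local flag set earlier when any mirror is marked WriteMostly):

                /* In raid1_write_request(), before the bio is split: */
                if (write_behind && bitmap)
                        max_sectors = min_t(int, max_sectors,
                                            BIO_MAX_PAGES * (PAGE_SIZE >> 9));

      This caps a write-behind request at BIO_MAX_PAGES pages, so the behind
      bio allocated later can never exceed what bvec_alloc() supports.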
      Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
      Reviewed-by: Jason Yan <yanaijie@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • blk-wbt: fix IO hang due to negative inflight counter · 02512915
      Laibin Qiu authored
      hulk inclusion
      category: bugfix
      bugzilla: 182135, https://gitee.com/openeuler/kernel/issues/I4ENC8
      CVE: NA
      
      --------------------------
      
      A block test reported the following stack: some requests were waiting
      for a wakeup in wbt_wait(), and the vmcore showed that the wbt inflight
      counter was -1, so the requests could never be woken up.
      
      PID: 75416  TASK: ffff88836c098000  CPU: 2   COMMAND: "fsstress"
      [ffff8882e59a7608] __schedule at ffffffffb2d22a25
      [ffff8882e59a7720] schedule at ffffffffb2d2358f
      [ffff8882e59a7738] io_schedule at ffffffffb2d23bdc
      [ffff8882e59a7750] rq_qos_wait at ffffffffb2400fde
      [ffff8882e59a7878] wbt_wait at ffffffffb243a051
      [ffff8882e59a7910] __rq_qos_throttle at ffffffffb2400a20
      [ffff8882e59a7930] blk_mq_make_request at ffffffffb23de038
      [ffff8882e59a7a98] generic_make_request at ffffffffb23c393d
      [ffff8882e59a7b80] submit_bio at ffffffffb23c3db8
      [ffff8882e59a7c48] submit_bio_wait at ffffffffb23b3a5d
      [ffff8882e59a7cf0] blkdev_issue_flush at ffffffffb23c8f4c
      [ffff8882e59a7d20] ext4_sync_fs at ffffffffc06dd708 [ext4]
      [ffff8882e59a7dd0] sync_filesystem at ffffffffb21e8335
      [ffff8882e59a7df8] ovl_sync_fs at ffffffffc0fd853a [overlay]
      [ffff8882e59a7e10] sync_fs_one_sb at ffffffffb21e8221
      [ffff8882e59a7e30] iterate_supers at ffffffffb218401e
      [ffff8882e59a7e70] ksys_sync at ffffffffb21e8588
      [ffff8882e59a7f20] __x64_sys_sync at ffffffffb21e861f
      [ffff8882e59a7f28] do_syscall_64 at ffffffffb1c06bc8
      [ffff8882e59a7f50] entry_SYSCALL_64_after_hwframe at ffffffffb2e000ad
      RIP: 00007f479ab13347  RSP: 00007ffd4dda9fe8  RFLAGS: 00000202
      RAX: ffffffffffffffda  RBX: 0000000000000068  RCX: 00007f479ab13347
      RDX: 0000000000000000  RSI: 000000003e1b142d  RDI: 0000000000000068
      RBP: 0000000051eb851f   R8: 00007f479abd4034   R9: 00007f479abd40a0
      R10: 0000000000000000  R11: 0000000000000202  R12: 0000000000402c20
      R13: 0000000000000001  R14: 0000000000000000  R15: 7fffffffffffffff
      
      The ->inflight counter may become negative (-1) if:

      1) blk-wbt was disabled when the IO was issued, which will add to the
         inflight count.

      2) blk-wbt was enabled before this IO was tracked.

      3) the ->inflight counter is then decreased from 0 to -1 in endio().

      Fix the problem by freezing the queue while enabling wbt, so that no
      request is in flight at that point (see the sketch below).
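      The idea, roughly (a sketch only; this is an out-of-tree fix and the
      actual placement of the freeze may differ):

                /* When wbt is being (re)enabled, e.g. from the sysfs handler: */
                blk_mq_freeze_queue(q);     /* drain every in-flight request */
                wbt_enable_default(q);      /* enable accounting with inflight == 0 */
                blk_mq_unfreeze_queue(q);

      With the queue frozen, no request can be issued while wbt is disabled
      and complete after it is enabled, so the inflight counter can no longer
      be decremented below zero.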
      Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
      Reviewed-by: Hou Tao <houtao1@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  5. 19 Oct 2021 (7 commits)
  6. 18 Oct 2021 (8 commits)