Commits · 72a95efdb3b0494f78a070631f71db63c39ac14a · openeuler / Kernel

19 Jan, 2022 15 commits

mm/huge_memory: disable THP when dynamic hugetlb is enabled · 72a95efd

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

When THP is enabled, an order-0 page allocation may be converted to a
higher-order (order > 0) allocation. Such an allocation skips the
dhugetlb_pool, so THP has to be disabled for now whenever the dynamic
hugetlb feature is in use.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

mm/dynamic_hugetlb: add some tracepoints · 7278ccd6

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

Add tracepoints for dynamic_hugetlb to track the process of page split,
page merge, page migration, page allocation and page free.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

mm/dynamic_hugetlb: free huge pages to dhugetlb_pool · 6ede0f00

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

Add function to free huge page to dhugetlb_pool.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

mm/dynamic_hugetlb: alloc huge pages from dhugetlb_pool · 39eec758

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

Add function to alloc huge page from dhugetlb_pool.
When a process is bound to a mem_cgroup configured with dhugetlb_pool,
it is only allowed to alloc huge pages from dhugetlb_pool. If there are
no huge pages in dhugetlb_pool, mmap() will fail because of the reserve
count introduced in the previous patch.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

mm/dynamic_hugetlb: collects resv allocated for dhugetlb_pool · 5993c1d6

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

The dynamic hugetlb feature is based on hugetlb. Hugetlb keeps a reserve
count to determine at mmap() time whether there are enough free huge pages
to satisfy the request, which avoids SIGBUS at a later page fault. Add a
similar count for dhugetlb_pool to avoid the same problem.

References: Documentation/vm/hugetlbfs_reserv.rst

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
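The reserve-count idea can be illustrated with a small model. This is a minimal sketch in Python, not the kernel implementation: the class `HugePool` and its fields are hypothetical names chosen for illustration. It shows how checking a reservation at mmap() time lets a request fail early instead of hitting SIGBUS at a later page fault.

```python
class HugePool:
    """Toy model of a huge page pool with a reserve count (hypothetical names)."""

    def __init__(self, free_pages):
        self.free_pages = free_pages  # huge pages currently free in the pool
        self.resv_pages = 0           # free pages already promised to mappings

    def reserve(self, nr):
        """mmap()-time check: refuse if fewer than nr unreserved pages remain."""
        if self.free_pages - self.resv_pages < nr:
            return False
        self.resv_pages += nr
        return True

    def fault_in(self):
        """Page-fault-time allocation that consumes one earlier reservation."""
        if self.resv_pages == 0 or self.free_pages == 0:
            raise RuntimeError("no reserved page available")
        self.resv_pages -= 1
        self.free_pages -= 1
```

With two free pages, a reservation of two succeeds and a further reservation of one is refused up front, so the later faults of the first mapping cannot run out of pages.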

mm/dynamic_hugetlb: add interface to disable normal pages allocation · cf8510b3

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

Add new interface "dhugetlb.disable_normal_pages" to disable the allocation
of normal pages from a hpool. This makes dynamic hugetlb more flexible.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

mm/dynamic_hugetlb: free pages to dhugetlb_pool · 71197c63

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

Add function to free page to dhugetlb_pool.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

mm/dynamic_hugetlb: alloc page from dhugetlb_pool · 32d6d14f

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

Add function to alloc page from dhugetlb_pool.
When a process is bound to a mem_cgroup configured with dhugetlb_pool, alloc
pages from dhugetlb_pool first. If there is no page in dhugetlb_pool,
fall back to allocating pages from the buddy system.

As the process will alloc pages from dhugetlb_pool in the mem_cgroup,
the process is not allowed to migrate to another mem_cgroup.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
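The allocation order described in this commit (pool first, buddy system as fallback) can be sketched as follows. This is an illustrative Python model only; the function `alloc_page` and its counters are assumptions made for the example and do not mirror the kernel's data structures.

```python
def alloc_page(pool_free_pages, buddy_free_pages):
    """Allocate one 4K page, preferring the pool over the buddy system.

    Returns (source, pool_free_pages, buddy_free_pages) after the allocation,
    where source is "dhugetlb_pool", "buddy", or None if nothing is free.
    """
    if pool_free_pages > 0:
        return "dhugetlb_pool", pool_free_pages - 1, buddy_free_pages
    if buddy_free_pages > 0:
        # Fallback path: the pool is empty, so take the page from buddy.
        return "buddy", pool_free_pages, buddy_free_pages - 1
    return None, pool_free_pages, buddy_free_pages
```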

mm/dynamic_hugetlb: add migration function · cdbeee51

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

Sometimes, page merge may fail because some pages are still in use.
Add migration function to enhance the merge function.
This function relies on memory hotremove, so it only works when config
MEMORY_HOTREMOVE is selected.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

mm/dynamic_hugetlb: add merge page function · 29617b44

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

When destroying an hpool or allocating huge pages, pages that have been
split may need to be merged back into huge pages. Add functions to merge
pages in dhugetlb_pool. The information about split huge pages is recorded
in hugepage_splitlists, which can be traversed to merge huge pages.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

mm/dynamic_hugetlb: add split page function · 51715f71

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

Currently, dynamic hugetlb supports 1G/2M/4K pages. In the beginning,
there are only 1G pages in the hpool. Add function to split pages
in dhugetlb_pool. If 4K pages are insufficient, try to split 2M pages,
and if 2M pages are insufficient, try to split 1G pages.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
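The split cascade in this commit (4K from 2M, 2M from 1G) can be sketched with a toy model. The class `Pool` and its methods are hypothetical and only count pages; the real code manages page lists and locking that are not shown here.

```python
PAGES_4K_PER_2M = 512  # one 2M page splits into 512 4K pages
PAGES_2M_PER_1G = 512  # one 1G page splits into 512 2M pages


class Pool:
    """Toy model of a pool that starts with 1G pages only (hypothetical)."""

    def __init__(self, nr_1g):
        self.free = {"1G": nr_1g, "2M": 0, "4K": 0}

    def split_1g(self):
        if self.free["1G"] == 0:
            return False
        self.free["1G"] -= 1
        self.free["2M"] += PAGES_2M_PER_1G
        return True

    def split_2m(self):
        # If no 2M page is free, try to obtain some by splitting a 1G page.
        if self.free["2M"] == 0 and not self.split_1g():
            return False
        self.free["2M"] -= 1
        self.free["4K"] += PAGES_4K_PER_2M
        return True

    def alloc_4k(self):
        # If no 4K page is free, try to obtain some by splitting a 2M page.
        if self.free["4K"] == 0 and not self.split_2m():
            return False
        self.free["4K"] -= 1
        return True
```

Starting from a single 1G page, the first 4K allocation triggers both splits, leaving 511 free 2M pages and 511 free 4K pages in this model.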

mm/dynamic_hugetlb: add interface to configure the count of hugepages · 0c06a1c0

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

Add two interfaces in mem_cgroup to configure the count of 1G/2M hugepages
in dhugetlb_pool.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

mm/dynamic_hugetlb: establish the dynamic hugetlb feature framework · a8a836a3

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

Dynamic hugetlb is a self-developed feature based on hugetlb and memcontrol.
It supports splitting huge pages dynamically in a memory cgroup. There is a new structure
dhugetlb_pool in every mem_cgroup to manage the pages configured to the mem_cgroup.
For a mem_cgroup configured with dhugetlb_pool, processes in the mem_cgroup will
preferentially use the pages in dhugetlb_pool.

Dynamic hugetlb supports three types of pages: 1G and 2M huge pages, and 4K pages.
For a mem_cgroup configured with dhugetlb_pool, processes are limited to allocating
1G/2M huge pages only from dhugetlb_pool. There is no such constraint for 4K pages:
if there are insufficient 4K pages in the dhugetlb_pool, pages can also be allocated from
the buddy system. So before using dynamic hugetlb, users must know how many huge pages they
need.

Usage:
1. Add 'dynamic_hugetlb=on' in cmdline to enable dynamic hugetlb feature.
2. Prealloc some 1G hugepages through hugetlb.
3. Create a mem_cgroup and configure dhugetlb_pool to mem_cgroup.
4. Configure the count of 1G/2M hugepages, and the remaining pages in dhugetlb_pool will
be used as basic pages.
5. Bind a process to the mem_cgroup. The memory for it will then be allocated from dhugetlb_pool.

This patch adds the corresponding structure dhugetlb_pool for dynamic hugetlb feature,
the interface 'dhugetlb.nr_pages' in mem_cgroup to configure dhugetlb_pool and the cmdline
'dynamic_hugetlb=on' to enable dynamic hugetlb feature.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
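Step 4 of the usage list says that whatever is not configured as 1G/2M huge pages is used as basic (4K) pages. The arithmetic can be sketched as below. This is a rough byte-level estimate under stated assumptions: the function name `remaining_4k_pages` is hypothetical, and the calculation ignores any split granularity or bookkeeping overhead the real implementation may have.

```python
SIZE_4K = 4 << 10
SIZE_2M = 2 << 20
SIZE_1G = 1 << 30


def remaining_4k_pages(pool_1g_pages, nr_1g, nr_2m):
    """Estimate how many 4K pages are left after configuring huge page counts.

    pool_1g_pages: number of 1G pages handed to the pool
    nr_1g, nr_2m:  configured counts of 1G and 2M huge pages
    """
    total = pool_1g_pages * SIZE_1G
    configured = nr_1g * SIZE_1G + nr_2m * SIZE_2M
    if configured > total:
        raise ValueError("configured huge pages exceed the pool size")
    return (total - configured) // SIZE_4K
```

For example, a pool of four 1G pages configured with one 1G page and 512 2M pages (another 1G in total) leaves 2G, which is 524288 pages of 4K in this estimate.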

mm/hugetlb: add parameter hugetlbfs_inode_info to several functions · 5f53feed

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

In the next patches, struct hugetlbfs_inode_info will be used to check whether
a hugetlbfs file has memory in hpool, so add the parameter hugetlbfs_inode_info
to related functions, including hugetlb_acct_memory/hugepage_subpool_get_pages/
hugepage_subpool_put_pages.

No functional changes.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

mm: declare several functions · 98ecb3cd

Committed by Liu Shixin on Jan 18, 2022

hulk inclusion
category: feature
bugzilla: 46904, https://gitee.com/openeuler/kernel/issues/I4QSHG
CVE: NA

--------------------------------

There are several functions that will be used in next patches for
dynamic hugetlb feature. Declare them.

No functional changes.

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

17 Jan, 2022 3 commits

hugetlbfs: fix issue of preallocation of gigantic pages can't work · b4cd3518

Committed by Zhenguo Yao on Jan 17, 2022

mainline inclusion
from mainline-v5.16-rc5
commit 4178158e
category: bugfix
bugzilla: 186043, https://gitee.com/openeuler/kernel/issues/I4QSF4
CVE: NA

--------------------------------

Preallocation of gigantic pages can't work because of commit
b5389086 ("hugetlbfs: extend the definition of hugepages parameter
to support node allocation").  When nid is NUMA_NO_NODE(-1),
alloc_bootmem_huge_page will always return without doing allocation.
Fix this by adding more check.

Link: https://lkml.kernel.org/r/20211129133803.15653-1-yaozhenguo1@gmail.com
Fixes: b5389086 ("hugetlbfs: extend the definition of hugepages parameter to support node allocation")
Signed-off-by: Zhenguo Yao <yaozhenguo1@gmail.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Tested-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

hugetlbfs: extend the definition of hugepages parameter to support node allocation · f4ada01a

Committed by Zhenguo Yao on Jan 17, 2022

mainline inclusion
from mainline-v5.16-rc1
commit b5389086
category: feature
bugzilla: 186043, https://gitee.com/openeuler/kernel/issues/I4QSF4
CVE: NA

--------------------------------

We can specify the number of hugepages to allocate at boot.  But the
hugepages are balanced across all nodes at present.  In some scenarios, we
only need hugepages in one node.  For example: DPDK needs hugepages
which are in the same node as NIC.

If DPDK needs four hugepages of 1G size in node1 and the system has 16 numa
nodes, we must reserve 64 hugepages on the kernel cmdline.  But only four
hugepages are used.  The others should be freed after boot.  If the
system memory is low (for example: 64G), it will be an impossible task.

So extend the hugepages parameter to support specifying hugepages on a
specific node.  For example add following parameter:

  hugepagesz=1G hugepages=0:1,1:3

It will allocate 1 hugepage in node0 and 3 hugepages in node1.

Link: https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com
Signed-off-by: Zhenguo Yao <yaozhenguo1@gmail.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
	mm/hugetlb.c
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
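The extended `hugepages=` value format from this commit can be illustrated with a small parser. This is a sketch in Python for explanation only, not the kernel's parsing code; the function names are hypothetical and error handling is minimal.

```python
def parse_hugepages(value):
    """Parse a hugepages= value.

    "64"       -> 64 (plain count, balanced across nodes)
    "0:1,1:3"  -> {0: 1, 1: 3} (per-node counts, node:count pairs)
    """
    if ":" not in value:
        return int(value)
    per_node = {}
    for item in value.split(","):
        node, count = item.split(":")
        per_node[int(node)] = int(count)
    return per_node


def balanced_total_needed(pages_on_one_node, nr_nodes):
    """Pages to request with the old balanced scheme to get N on one node."""
    return pages_on_one_node * nr_nodes
```

The second helper reproduces the arithmetic from the commit message: getting four pages on one node of a 16-node system requires requesting 64 pages when allocation is balanced.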

arm64: Support execute-only permissions with Enhanced PAN · 83ca49da

Committed by Vladimir Murzin on Jan 17, 2022

mainline inclusion
from mainline-v5.13-rc1
commit 18107f8a
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4QUF2
CVE: NA

----------------------

Enhanced Privileged Access Never (EPAN) allows Privileged Access Never
to be used with Execute-only mappings.

Absence of such support was a reason for 24cecc37 ("arm64: Revert
support for execute-only user mappings"). Thus now it can be revisited
and re-enabled.

Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Acked-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/r/20210312173811.58284-2-vladimir.murzin@arm.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

 Conflicts:
	arch/arm64/Kconfig
	arch/arm64/include/asm/cpucaps.h
[wangxiongfeng: fix conflicts caused by context mismatch.]
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

15 Jan, 2022 1 commit

x86: hugepage: use nt copy hugepage to AEP in x86 · 50d5bf1b

Committed by Kemeng Shi on Jan 15, 2022

euleros inclusion
category: feature
feature: etmem
bugzilla: https://gitee.com/openeuler/kernel/issues/I4OODH?from=project-issue
CVE: NA

-------------------------------------------------

Add /proc/sys/vm/hugepage_nocache_copy switch. Set 1 to copy hugepage
with movnt SSE instruction if the CPU supports it. Set 0 to copy hugepage
as usual.

Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Reviewed-by: louhongxiang <louhongxiang@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

14 Jan, 2022 7 commits

mm: bdi: initialize bdi_min_ratio when bdi is unregistered · bff0fdb8

Committed by Manjong Lee on Jan 14, 2022

stable inclusion
from stable-v5.10.85
commit c581090228e3aeabb5081c1db8b2024ae8478f5b
bugzilla: 186032 https://gitee.com/openeuler/kernel/issues/I4QVI4

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c581090228e3aeabb5081c1db8b2024ae8478f5b

--------------------------------

commit 3c376dfa upstream.

Initialize min_ratio if it is set during bdi unregistration.  This can
prevent problems that may occur when a bdi is removed without resetting
min_ratio.

For example:
1) insert external sdcard
2) set external sdcard's min_ratio 70
3) remove external sdcard without setting min_ratio 0
4) insert external sdcard
5) set external sdcard's min_ratio 70 << error occurs (can't set)

Because when an sdcard is removed, the present bdi_min_ratio value will
remain.  Currently, the only way to reset bdi_min_ratio is to reboot.

[akpm@linux-foundation.org: tweak comment and coding style]

Link: https://lkml.kernel.org/r/20211021161942.5983-1-mj0123.lee@samsung.com
Signed-off-by: Manjong Lee <mj0123.lee@samsung.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Changheun Lee <nanich.lee@samsung.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: <seunghwan.hyun@samsung.com>
Cc: <sookwan7.kim@samsung.com>
Cc: <yt0928.kim@samsung.com>
Cc: <junho89.kim@samsung.com>
Cc: <jisoo2146.oh@samsung.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
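The sdcard scenario from this commit can be reproduced with a toy model of a global min_ratio budget. This is a sketch under assumptions: the class `BdiModel` is hypothetical, and it only models a global sum of per-device min_ratio values that must stay below 100, which is the behaviour the commit message implies.

```python
class BdiModel:
    """Toy model: per-device min_ratio values share a global budget below 100."""

    def __init__(self, reset_on_unregister):
        self.total_min_ratio = 0  # models the global bdi_min_ratio sum
        self.reset_on_unregister = reset_on_unregister

    def register(self):
        return {"min_ratio": 0}

    def set_min_ratio(self, bdi, ratio):
        delta = ratio - bdi["min_ratio"]
        if self.total_min_ratio + delta >= 100:
            return False  # budget exceeded: the value can't be set
        self.total_min_ratio += delta
        bdi["min_ratio"] = ratio
        return True

    def unregister(self, bdi):
        if self.reset_on_unregister:
            # The fix: give the device's share back to the global budget.
            self.total_min_ratio -= bdi["min_ratio"]
            bdi["min_ratio"] = 0


def replug_and_set(model):
    """Insert a card, set 70, remove it, insert again, try to set 70 again."""
    card = model.register()
    model.set_min_ratio(card, 70)
    model.unregister(card)
    card = model.register()
    return model.set_min_ratio(card, 70)
```

In this model, `replug_and_set` fails when the share is not returned on unregistration and succeeds when it is.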

mm: usercopy: Warn vmalloc/module address in check_heap_object() · e6b24cb6

Committed by Kefeng Wang on Jan 14, 2022

hulk inclusion
category: bugfix
bugzilla: 186017 https://gitee.com/openeuler/kernel/issues/I4DDEL

--------------------------------

virt_addr_valid() could be insufficient to validate the virt addr
on some architectures, which could lead to a potential BUG that has
been found on arm64/powerpc64.

Let's add WARN_ON to check whether a virt addr that passes virt_addr_valid()
is actually a vmalloc/module address.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yuanzheng Song <songyuanzheng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

mm/hwpoison: clear MF_COUNT_INCREASED before retrying get_any_page() · b822cf23

Committed by Liu Shixin on Jan 14, 2022

mainline inclusion
from mainline-v5.16-rc7
commit 2a57d83c
category: bugfix
bugzilla: 185855 https://gitee.com/openeuler/kernel/issues/I4DDEL

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2a57d83c78f889bf3f54eede908d0643c40d5418

--------------------------------

Hulk Robot reported a panic in put_page_testzero() when testing
madvise() with MADV_SOFT_OFFLINE.  The BUG() is triggered when retrying
get_any_page().  This is because we keep the MF_COUNT_INCREASED flag in
the second try but the refcnt is not increased.

    page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
    ------------[ cut here ]------------
    kernel BUG at include/linux/mm.h:737!
    invalid opcode: 0000 [#1] PREEMPT SMP
    CPU: 5 PID: 2135 Comm: sshd Tainted: G    B             5.16.0-rc6-dirty #373
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
    RIP: release_pages+0x53f/0x840
    Call Trace:
      free_pages_and_swap_cache+0x64/0x80
      tlb_flush_mmu+0x6f/0x220
      unmap_page_range+0xe6c/0x12c0
      unmap_single_vma+0x90/0x170
      unmap_vmas+0xc4/0x180
      exit_mmap+0xde/0x3a0
      mmput+0xa3/0x250
      do_exit+0x564/0x1470
      do_group_exit+0x3b/0x100
      __do_sys_exit_group+0x13/0x20
      __x64_sys_exit_group+0x16/0x20
      do_syscall_64+0x34/0x80
      entry_SYSCALL_64_after_hwframe+0x44/0xae
    Modules linked in:
    ---[ end trace e99579b570fe0649 ]---
    RIP: 0010:release_pages+0x53f/0x840

Link: https://lkml.kernel.org/r/20211221074908.3910286-1-liushixin2@huawei.com
Fixes: b94e0282 ("mm,hwpoison: try to narrow window race for free pages")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
	mm/memory-failure.c
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

hugetlb: address ref count racing in prep_compound_gigantic_page · df906dae

Committed by Mike Kravetz on Jan 14, 2022

mainline inclusion
from mainline-v5.14-rc1
commit 7118fc29
category: bugfix
bugzilla:171843

-----------------------------------------------

In [1], Jann Horn points out a possible race between
prep_compound_gigantic_page and __page_cache_add_speculative.  The root
cause of the possible race is prep_compound_gigantic_page unconditionally
setting the ref count of pages to zero.  It does this because
prep_compound_gigantic_page is handed a 'group' of pages from an allocator
and needs to convert that group of pages to a compound page.  The ref
count of each page in this 'group' is one as set by the allocator.
However, the ref count of compound page tail pages must be zero.

The potential race comes about when ref counted pages are returned from
the allocator.  When this happens, other mm code could also take a
reference on the page.  __page_cache_add_speculative is one such example.
Therefore, prep_compound_gigantic_page can not just set the ref count of
pages to zero as it does today.  Doing so would lose the reference taken
by any other code.  This would lead to BUGs in code checking ref counts
and could possibly even lead to memory corruption.

There are two possible ways to address this issue.

1) Make all allocators of gigantic groups of pages be able to return a
   properly constructed compound page.

2) Make prep_compound_gigantic_page be more careful when constructing a
   compound page.

This patch takes approach 2.

In prep_compound_gigantic_page, use cmpxchg to only set ref count to zero
if it is one.  If the cmpxchg fails, call synchronize_rcu() in the hope
that the extra ref count will be dropped during a rcu grace period.  This
is not a performance critical code path and the wait should be
acceptable.  If the ref count is still inflated after the grace period,
then undo any modifications made and return an error.

Currently prep_compound_gigantic_page is type void and does not return
errors.  Modify the two callers to check for and handle error returns.  On
error, the caller must free the 'group' of pages as they can not be used
to form a gigantic page.  After freeing pages, the runtime caller
(alloc_fresh_huge_page) will retry the allocation once.  Boot time
allocations can not be retried.

The routine prep_compound_page also unconditionally sets the ref count of
compound page tail pages to zero.  However, in this case the buddy
allocator is constructing a compound page from freshly allocated pages.
The ref count on those freshly allocated pages is already zero, so the
set_page_count(p, 0) is unnecessary and could lead to confusion.  Just
remove it.

[1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/

Link: https://lkml.kernel.org/r/20210622021423.154662-3-mike.kravetz@oracle.com
Fixes: 58a84aa9 ("thp: set compound tail page _count to zero")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Cc: Youquan Song <youquan.song@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Huang <chenhuang5@huawei.com>
Reviewed-by: Chen Wandun <chenwandun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
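The "only set the ref count to zero if it is one" approach can be sketched with a toy model. This Python code is an illustration under assumptions: pages are plain dicts, `ref_freeze` stands in for the compare-and-exchange, and the synchronize_rcu() wait and retry described in the commit message are deliberately left out.

```python
def ref_freeze(page, expected):
    """Model of a compare-and-exchange: zero the ref count only if it matches."""
    if page["refcount"] != expected:
        return False
    page["refcount"] = 0
    return True


def prep_gigantic(pages):
    """Try to turn a group of pages into a compound page.

    Tail pages (all but the first) must end up with ref count zero.
    On failure, undo the modifications already made and return False.
    """
    frozen = []
    for tail in pages[1:]:
        if not ref_freeze(tail, 1):
            # Someone else holds a reference: restore and report an error.
            for page in frozen:
                page["refcount"] = 1
            return False
        frozen.append(tail)
    return True
```

If any tail page carries an extra reference, the group is left as it was and the caller is expected to free it, matching the error path the commit message describes.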

memblock: ensure there is no overflow in memblock_overlaps_region() · c3c124b2

Committed by Mike Rapoport on Jan 14, 2022

mainline inclusion
from mainline-v5.14-rc1
commit 023accf5
category: bugfix
bugzilla: 172285 https://gitee.com/openeuler/kernel/issues/I4DDEL

-----------------------------------------------

There may be an overflow in memblock_overlaps_region() if it is called with
base and size such that

	base + size > PHYS_ADDR_MAX

Make sure that memblock_overlaps_region() caps the size to prevent such
overflow and remove now duplicated call to memblock_cap_size() from
memblock_is_region_reserved().

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Tested-by: Tony Lindgren <tony@atomide.com>
Signed-off-by: Chen Huang <chenhuang5@huawei.com>
Reviewed-by: Chen Wandun <chenwandun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
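The overflow and the effect of capping the size can be demonstrated with 64-bit wraparound emulated in Python. This is a sketch, assuming a 64-bit phys_addr_t; the function names are hypothetical and the overlap test is a simplified single-region check rather than the kernel's loop over regions.

```python
PHYS_ADDR_MAX = (1 << 64) - 1  # assumes a 64-bit physical address type


def cap_size(base, size):
    """Clamp size so that base + size cannot exceed PHYS_ADDR_MAX."""
    return min(size, PHYS_ADDR_MAX - base)


def overlaps_region(base, size, rbase, rsize, cap=True):
    """Check whether [base, base+size) overlaps [rbase, rbase+rsize)."""
    if cap:
        size = cap_size(base, size)
    end = (base + size) & PHYS_ADDR_MAX    # emulate unsigned 64-bit wraparound
    rend = (rbase + rsize) & PHYS_ADDR_MAX
    return base < rend and rbase < end
```

Without the cap, a range that starts near the top of the address space wraps around to a small end address and the overlap is missed; with the cap, the same query reports the overlap.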

mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag · a330dfbd

Committed by Rustam Kovhaev on Jan 14, 2022

stable inclusion
from stable-v5.10.82
commit b2e2fb64071a00df54d904858b591590de369108
bugzilla: 185877 https://gitee.com/openeuler/kernel/issues/I4QU6V

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b2e2fb64071a00df54d904858b591590de369108

--------------------------------

commit 34dbc3aa upstream.

When kmemleak is enabled for SLOB, system does not boot and does not
print anything to the console.  At the very early stage in the boot
process we hit infinite recursion from kmemleak_init() and eventually
kernel crashes.

kmemleak_init() specifies SLAB_NOLEAKTRACE for KMEM_CACHE(), but
kmem_cache_create_usercopy() removes it because CACHE_CREATE_MASK is not
valid for SLOB.

Let's fix CACHE_CREATE_MASK and make kmemleak work with SLOB.

Link: https://lkml.kernel.org/r/20211115020850.3154366-1-rkovhaev@gmail.com
Fixes: d8843922 ("slab: Ignore internal flags in cache creation")
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Glauber Costa <glommer@parallels.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
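How a creation mask can silently drop a requested flag is easy to see with plain bit operations. The sketch below is illustrative only: the bit values are made up for the example (the real definitions live in the kernel's slab headers), and `effective_flags` is a hypothetical helper standing in for the masking step the commit message attributes to kmem_cache_create_usercopy().

```python
# Illustrative bit values, NOT the kernel's real flag values.
SLAB_HWCACHE_ALIGN = 1 << 0
SLAB_NOLEAKTRACE = 1 << 1


def effective_flags(requested, cache_create_mask):
    """Keep only the requested flags that the creation mask allows."""
    return requested & cache_create_mask


requested = SLAB_HWCACHE_ALIGN | SLAB_NOLEAKTRACE
mask_without_noleaktrace = SLAB_HWCACHE_ALIGN
mask_with_noleaktrace = SLAB_HWCACHE_ALIGN | SLAB_NOLEAKTRACE
```

With a mask that lacks SLAB_NOLEAKTRACE the flag is stripped, so kmemleak would end up tracking its own cache; once the mask includes it, the flag survives.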

kfence: fix memory leak when cat kfence objects · da4cb67f

Committed by Baokun Li on Jan 14, 2022

hulk inclusion
category: bugfix
bugzilla: 185858 https://gitee.com/openeuler/kernel/issues/I4DDEL

-------------------------------------------------

Hulk robot reported a kmemleak problem:
-----------------------------------------------------------------------
unreferenced object 0xffff93d1d8cc02e8 (size 248):
  comm "cat", pid 23327, jiffies 4624670141 (age 495992.217s)
  hex dump (first 32 bytes):
    00 40 85 19 d4 93 ff ff 00 10 00 00 00 00 00 00  .@..............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000db5610b3>] seq_open+0x2a/0x80
    [<00000000d66ac99d>] full_proxy_open+0x167/0x1e0
    [<00000000d58ef917>] do_dentry_open+0x1e1/0x3a0
    [<0000000016c91867>] path_openat+0x961/0xa20
    [<00000000909c9564>] do_filp_open+0xae/0x120
    [<0000000059c761e6>] do_sys_openat2+0x216/0x2f0
    [<00000000b7a7b239>] do_sys_open+0x57/0x80
    [<00000000e559d671>] do_syscall_64+0x33/0x40
    [<000000000ea1fbfd>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
unreferenced object 0xffff93d419854000 (size 4096):
  comm "cat", pid 23327, jiffies 4624670141 (age 495992.217s)
  hex dump (first 32 bytes):
    6b 66 65 6e 63 65 2d 23 32 35 30 3a 20 30 78 30  kfence-#250: 0x0
    30 30 30 30 30 30 30 37 35 34 62 64 61 31 32 2d  0000000754bda12-
  backtrace:
    [<000000008162c6f2>] seq_read_iter+0x313/0x440
    [<0000000020b1b3e3>] seq_read+0x14b/0x1a0
    [<00000000af248fbc>] full_proxy_read+0x56/0x80
    [<00000000f97679d1>] vfs_read+0xa5/0x1b0
    [<000000000ed8a36f>] ksys_read+0xa0/0xf0
    [<00000000e559d671>] do_syscall_64+0x33/0x40
    [<000000000ea1fbfd>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
-----------------------------------------------------------------------

I find that we can easily reproduce this problem with the following
commands:
	`cat /sys/kernel/debug/kfence/objects`
	`echo scan > /sys/kernel/debug/kmemleak`
	`cat /sys/kernel/debug/kmemleak`

The leaked memory is allocated in the stack below:
----------------------------------
do_syscall_64
  do_sys_open
    do_dentry_open
      full_proxy_open
        seq_open            ---> alloc seq_file
  vfs_read
    full_proxy_read
      seq_read
        seq_read_iter
          traverse          ---> alloc seq_buf
----------------------------------

And it should have been released in the following process:
----------------------------------
do_syscall_64
  syscall_exit_to_user_mode
    exit_to_user_mode_prepare
      task_work_run
        ____fput
          __fput
            full_proxy_release  ---> free here
----------------------------------

However, the release function corresponding to file_operations is not
implemented in kfence. As a result, a memory leak occurs. Therefore,
the solution to this problem is to implement the corresponding
release function.

Link: https://lkml.kernel.org/r/20211206133628.2822545-1-libaokun1@huawei.com
Fixes: 0ce20dd8 ("mm: add Kernel Electric-Fence infrastructure")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Acked-by: Marco Elver <elver@google.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

08 Jan, 2022 2 commits

hugepage: add sysctl for hugepage alloc and mig · 80ed6b32

Committed by Kemeng Shi on Jan 08, 2022

euleros inclusion
category: feature
feature: etmem
bugzilla: https://gitee.com/openeuler/kernel/issues/I4OODH?from=project-issue
CVE: NA

-------------------------------------------------

Add /proc/sys/kernel/hugepage_pmem_allocall switch. Set 1 to allow all
memory in pmem to be allocated for hugepages. Set 0 (default) to keep
hugepage allocation limited by the zone watermark as usual.
Add /proc/sys/kernel/hugepage_mig_noalloc switch. Set 1 to forbid new
hugepage allocation during hugepage migration when hugepages in the dest
node run out. Set 0 (default) to allow hugepage allocation during hugepage
migration as usual.

Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Reviewed-by: louhongxiang <louhongxiang@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

acpi/numa: memorize node type from SRAT table · 1c76b8cf

Committed by Kemeng Shi on Jan 08, 2022

euleros inclusion
category: feature
feature: etmem
bugzilla: https://gitee.com/openeuler/kernel/issues/I4OODH?from=project-issue
CVE: NA

-------------------------------------------------

Driver dax_kmem will export pmem as a NUMA node. This patch records
which nodes consist of persistent memory for further use.

Signed-off-by: Kemeng Shi <shikemeng@huawei.com>
Reviewed-by: louhongxiang <louhongxiang@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

07 Jan, 2022 6 commits

memcg: Add static key for memcg kswapd · 84355bcc

Committed by Lu Jialin on Jan 07, 2022

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK?from=project-issue
CVE: NA

--------

This patch adds a default-false static key that keeps the memcg kswapd
feature disabled. Users can enable it by setting memcg_kswapd on the
kernel command line.
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: weiyang wang <wangweiyang2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
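
The behaviour described here (off by default, switched on by a command-line
parameter) can be modelled in user space as below. This is only a sketch: the
kernel uses a static key rather than a plain bool, and the sketch assumes
memcg_kswapd is passed as a bare flag without a value. The function names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Models the default-false key: the feature is off until requested. */
static bool memcg_kswapd_enabled = false;

/* Return true if the space-separated cmdline contains flag as a whole token. */
static bool cmdline_has_flag(const char *cmdline, const char *flag)
{
    size_t flen = strlen(flag);
    const char *p = cmdline;

    assert(flen > 0);
    while (*p) {
        const char *start;

        while (*p == ' ')           /* skip separators */
            p++;
        start = p;
        while (*p && *p != ' ')     /* advance to the end of the token */
            p++;
        if ((size_t)(p - start) == flen && strncmp(start, flag, flen) == 0)
            return true;
    }
    return false;
}

/* Enable the feature only when the flag is present on the command line. */
static void setup_from_cmdline(const char *cmdline)
{
    if (cmdline_has_flag(cmdline, "memcg_kswapd"))
        memcg_kswapd_enabled = true;
}
```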

84355bcc

memcg: make memcg kswapd deal with dirty · 70d020ae

Committed by Lu Jialin on Jan 07, 2022

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK?from=project-issue
CVE: NA

--------

The memcg kswapd can set the dirty state on a memcg if the current scan finds
that all pages in the memcg are unqueued dirty. kswapd then writes out the
dirty pages.
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: weiyang wang <wangweiyang2@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

70d020ae

memcg: support memcg sync reclaim work as kswapd · 1496d67c

Committed by Lu Jialin on Jan 07, 2022

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK?from=project-issue
CVE: NA

--------

Since memory.high reclaim is synchronous whether or not it is in interrupt,
it could do more work than direct reclaim, e.g. write out dirty pages.

So, add PF_KSWAPD flag, so that current_is_kswapd() would return true
for memcg kswapd.

Memcg kswapd should stop when the usage of the memcg meets the memcg kswapd
stop flag. When userland sets memcg->memory.max, the stop_flag is
(memcg->memory.high - memcg->memory.max * 10 / 1000), which is similar
to global kswapd. Otherwise, the stop_flag is (memcg->memory.high -
memcg->memory.high / 6), which is similar to the largest difference between
watermark_low and watermark_high.

And, memcg kswapd should not break memory.low protection for now.
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: weiyang wang <wangweiyang2@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
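
The stop_flag rule from this commit message can be worked through with a
small user-space function. This is a sketch under assumptions, not the kernel
code: values are taken to be page counts, the max_is_set parameter stands in
for however the kernel detects that userland set memory.max, and clamping the
result at 0 is an addition made here to avoid unsigned underflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Usage level at which memcg kswapd stops, following the two formulas in
 * the commit message:
 *   memory.max set:   high - max * 10 / 1000
 *   memory.max unset: high - high / 6
 */
static uint64_t memcg_kswapd_stop_flag(uint64_t high, uint64_t max,
                                       bool max_is_set)
{
    uint64_t gap;

    /* Keep max * 10 from overflowing in this model. */
    assert(!max_is_set || max <= UINT64_MAX / 10);

    gap = max_is_set ? max * 10 / 1000 : high / 6;

    /* Assumption of this sketch: never go below zero. */
    return gap < high ? high - gap : 0;
}
```

For example, with memory.high at 600000 pages and memory.max at 1000000
pages, the gap is 1000000 * 10 / 1000 = 10000 pages, so kswapd stops at
590000 pages. With memory.max unset, the gap is 600000 / 6 = 100000 pages and
the stop point is 500000 pages.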

1496d67c

memcg: Export memcg.high from cgroupv2 to cgroupv1 · 6a7b3e98

Committed by Lu Jialin on Jan 07, 2022

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK?from=project-issue
CVE: NA

--------

Export memory.high from cgroupv2 to cgroupv1. Therefore, when the usage
of the memcg is larger than memory.high, some pages will be reclaimed
before returning to userland, which will throttle the process.

Only export the memory.high number in mem_cgroup_legacy_files and move the
related functions in front of mem_cgroup_legacy_files. No other changes
are needed.
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: weiyang wang <wangweiyang2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

6a7b3e98

memcg: Export memcg.{min/low} from cgroupv2 to cgroupv1 · 27c047f4

Committed by Lu Jialin on Jan 07, 2022

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4IMAK?from=project-issue
CVE: NA

--------

Export memcg.min and memcg.low from cgroupv2 to cgroupv1, in order to reduce
the negative impact between cgroups when the system memory is insufficient.

Only export the memory.{min/low} numbers in mem_cgroup_legacy_files and move
the related functions in front of mem_cgroup_legacy_files. No other changes
are needed.
Signed-off-by: Lu Jialin <lujialin4@huawei.com>
Reviewed-by: weiyang wang <wangweiyang2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

27c047f4

arm64: Request resources for reserved memory via memmap · 374db2be

Committed by Peng Liu on Jan 07, 2022

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4NYPZ
CVE: NA

-------------------------------------------------

A new flag MEMBLOCK_MEMMAP is added to memblock_flags, which is
used to identify memory reserved for memmap. This flag is limited
to arm64. When memmap memory is reserved by memblock_reserve, it
is subsequently marked with flag MEMBLOCK_MEMMAP. Therefore,
for_each_mem_region can find memmap memory and request resources
for it.
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
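
The mark-then-scan pattern described in this commit can be illustrated with a
user-space model. It is only a sketch: struct region, REGION_MEMMAP,
mark_memmap and collect_memmap are hypothetical stand-ins for the memblock
region type, MEMBLOCK_MEMMAP, the marking step and the for_each_mem_region
walk, and the bit value of the flag is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for memblock_flags; the bit value of the memmap flag is assumed. */
enum region_flags {
    REGION_NONE   = 0x0,
    REGION_MEMMAP = 0x100, /* plays the role of MEMBLOCK_MEMMAP */
};

struct region {
    uint64_t base;
    uint64_t size;
    unsigned int flags;
};

/* Tag every region that lies fully inside [base, base + size). */
static void mark_memmap(struct region *regs, size_t n,
                        uint64_t base, uint64_t size)
{
    for (size_t i = 0; i < n; i++) {
        if (regs[i].base >= base &&
            regs[i].base + regs[i].size <= base + size)
            regs[i].flags |= REGION_MEMMAP;
    }
}

/* Walk all regions, collect the tagged ones, return how many were stored. */
static size_t collect_memmap(const struct region *regs, size_t n,
                             const struct region **out, size_t cap)
{
    size_t found = 0;

    assert(out != NULL);
    for (size_t i = 0; i < n && found < cap; i++) {
        if (regs[i].flags & REGION_MEMMAP)
            out[found++] = &regs[i];
    }
    return found;
}
```

In the kernel, the step that corresponds to collect_memmap would request a
resource for each tagged region instead of storing a pointer to it.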

374db2be

30 Dec, 2021 6 commits

share_pool: Use sharepool_no_page to alloc hugepage · c533562a