- 09 Feb 2022, 40 commits
-
Submitted by Guo Mengqi
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4SPNL CVE: NA ----------------------------------- In sp_mmap(), if the offset is computed as va - MMAP_BASE/DVPP_BASE, a normal sp_alloc pgoff may end up with the same value as a DVPP pgoff, causing the DVPP and sp_alloc mappings to unexpectedly overlap in the same part of the file. To fix the problem, pass the VA value itself as the mmap offset, since in this scenario VA values within one task address space are never the same. Signed-off-by: Guo Mengqi <guomengqi3@huawei.com> Reviewed-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
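A minimal sketch of the offset change described above, assuming hypothetical helper names (the real sp_mmap() lives in the out-of-tree share_pool code):

#include <linux/mm.h>

/* Old scheme: offsets measured from different region bases can collide. */
static unsigned long sp_pgoff_old(unsigned long va, unsigned long region_base)
{
	return (va - region_base) >> PAGE_SHIFT;
}

/* New scheme: the VA itself is unique within one task address space. */
static unsigned long sp_pgoff_new(unsigned long va)
{
	return va >> PAGE_SHIFT;
}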
-
Submitted by Yuan Can
ascend inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4S786 CVE: NA ------------------------------------------------------- Signed-off-by: Yuan Can <yuancan@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Wang Wensheng
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4SON8 CVE: NA ------------------------------------------------- Use device_id to select the correct DVPP vspace range when the SP_DVPP flag is specified. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Wang Wensheng
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4SON8 CVE: NA ------------------------------------------------- Device_id is used by DVPP to select the correct virtual address space, and node_id specifies the node from which we want to allocate physical memory. In theory the two do not have to be the same. In practice the process always runs on the NUMA nodes corresponding to the device it uses, and the node with the same id as the device always belongs to that device, so using device_id as node_id to allocate memory works. However, the number of NUMA nodes belonging to a given device is not always one, and with that scheme the device's other NUMA nodes cannot be used. Here we introduce a new flag SP_SPEC_NODE_ID and add a bit-region in sp_flags for callers who want to use other nodes belonging to a device. That is, to specify the node_id, both the new flag and the node_id must be added to sp_flags when calling sp_alloc() or sp_make_share_k2u(); otherwise the node with the same id as the device is used. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
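A sketch of how a node_id bit-region could be carved out of sp_flags; SP_SPEC_NODE_ID is named in the commit, while the bit positions and widths below are assumptions for illustration only:

#define SP_SPEC_NODE_ID		(1UL << 6)	/* assumed flag bit */
#define SP_NODE_ID_SHIFT	32		/* assumed start of the node_id bit-region */
#define SP_NODE_ID_MASK		(0xfffUL << SP_NODE_ID_SHIFT)

/* Callers pack the node_id into sp_flags together with the new flag. */
static inline unsigned long sp_flags_set_node(unsigned long sp_flags, int node_id)
{
	return sp_flags | SP_SPEC_NODE_ID |
	       (((unsigned long)node_id << SP_NODE_ID_SHIFT) & SP_NODE_ID_MASK);
}

/* sp_alloc()/sp_make_share_k2u() fall back to the device's own node otherwise. */
static inline int sp_flags_get_node(unsigned long sp_flags, int device_id)
{
	if (sp_flags & SP_SPEC_NODE_ID)
		return (sp_flags & SP_NODE_ID_MASK) >> SP_NODE_ID_SHIFT;
	return device_id;
}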
-
Submitted by Wang Wensheng
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4SON8 CVE: NA ------------------------------------------------- The maximum number of devices supported by share_pool is static. Here we make it extendable for later use. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Wang Wensheng
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4SON8 CVE: NA ------------------------------------------------- MAP_SHARE_POOL and MAP_FIXED_NOREPLACE have the same value. Redefine MAP_SHARE_POOL to fix the collision. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
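To illustrate the collision: MAP_FIXED_NOREPLACE is 0x100000 in the mainline uapi headers, so the share-pool flag has to move to an otherwise unused bit (the old and new MAP_SHARE_POOL values below are assumptions, not the actual openEuler ones):

#define MAP_FIXED_NOREPLACE	0x100000	/* mainline mmap flag */

/* before the fix (assumed): MAP_SHARE_POOL aliased MAP_FIXED_NOREPLACE */
/* #define MAP_SHARE_POOL	0x100000 */

/* after the fix (assumed value): a distinct, otherwise unused bit */
#define MAP_SHARE_POOL		0x200000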
-
Submitted by Yang Yingliang
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Enable CONFIG_MEMORY_RELIABLE, CONFIG_CLEAR_FREELIST_PAGE and CONFIG_EFI_FAKE_MEMMAP for testing. Reviewed-by: Chen Wandun <chenwandun@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Yu Liao
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- This patch adds a sysctl to clear the pages on the free lists of each NUMA node. For each NUMA node, every page on its free lists is cleared, and the work is scheduled on a random CPU of that node. When KASAN is enabled and pages are free, their shadow memory is filled with 0xFF, so writing to these free pages would trigger a use-after-free report; KASAN is therefore disabled for the clear-freelist code. Signed-off-by: Yu Liao <liaoyu15@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
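A minimal sketch of clearing one zone's buddy free lists on a 5.10-era kernel, assuming the file is compiled with KASAN instrumentation disabled (KASAN_SANITIZE := n); the function name is an assumption, and a real implementation would drop the zone lock periodically rather than hold it across large memsets:

#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/string.h>

static void clear_zone_freelist(struct zone *zone)
{
	unsigned int order, type;
	unsigned long flags;
	struct page *page;

	spin_lock_irqsave(&zone->lock, flags);
	for_each_migratetype_order(order, type) {
		/* buddy pages sit on free_list via page->lru on 5.10; assumes no highmem */
		list_for_each_entry(page, &zone->free_area[order].free_list[type], lru)
			memset(page_address(page), 0, PAGE_SIZE << order);
	}
	spin_unlock_irqrestore(&zone->lock, flags);
}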
-
Submitted by Alexander Duyck
stable inclusion from stable-5.10.88 commit 8204e0c1 bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S -------------------------------- Provide a new function, queue_work_node, which is meant to schedule work on a "random" CPU of the requested NUMA node. The main motivation for this is to help assist asynchronous init to better improve boot times for devices that are local to a specific node. For now we just default to the first CPU that is in the intersection of the cpumask of the node and the online cpumask. The only exception is if the CPU is local to the node we will just use the current CPU. This should work for our purposes as we are currently only using this for unbound work so the CPU will be translated to a node anyway instead of being directly used. As we are only using the first CPU to represent the NUMA node for now I am limiting the scope of the function so that it can only be used with unbound workqueues. Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Yu Liao <liaoyu15@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
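queue_work_node() is a stock workqueue API, so a usage example can be given with some confidence; only the work body below is illustrative:

#include <linux/workqueue.h>
#include <linux/smp.h>
#include <linux/topology.h>
#include <linux/printk.h>

static void node_local_fn(struct work_struct *work)
{
	pr_info("running on CPU %d, node %d\n",
		raw_smp_processor_id(), numa_node_id());
}

static DECLARE_WORK(node_local_work, node_local_fn);

static void kick_work_on_node(int node)
{
	/* picks a CPU from the node's online cpumask; unbound workqueues only */
	queue_work_node(node, system_unbound_wq, &node_local_work);
}

Because the workqueue is unbound, the chosen CPU is only used to derive the node preference, as the commit message notes.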
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- The function shrink_shepherd queues work on each CPU to shrink the page cache and is called periodically, but without a page_cache_over_limit check before shrinking it triggers periodic memory reclamation even when the number of page cache pages is below the limit, so add this basic check before shrinking the page cache. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Add two proc interfaces to set the page cache limit. Both vm_cache_limit_mbytes and vm_cache_limit_ratio are updated when writing either of the two interfaces. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
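A sketch of how the two handlers could keep the knobs consistent; the handler and variable names come from these commits, while the conversion math is an assumption:

#include <linux/mm.h>
#include <linux/sysctl.h>

unsigned long vm_cache_limit_mbytes;
int vm_cache_limit_ratio;

int cache_limit_mbytes_sysctl_handler(struct ctl_table *table, int write,
				      void *buffer, size_t *lenp, loff_t *ppos)
{
	unsigned long total_mb = totalram_pages() >> (20 - PAGE_SHIFT);
	int ret = proc_doulongvec_minmax(table, write, buffer, lenp, ppos);

	if (!ret && write && total_mb)
		vm_cache_limit_ratio = vm_cache_limit_mbytes * 100 / total_mb;
	return ret;
}

int cache_limit_ratio_sysctl_handler(struct ctl_table *table, int write,
				     void *buffer, size_t *lenp, loff_t *ppos)
{
	unsigned long total_mb = totalram_pages() >> (20 - PAGE_SHIFT);
	int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);

	if (!ret && write)
		vm_cache_limit_mbytes = total_mb * vm_cache_limit_ratio / 100;
	return ret;
}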
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- If the page cache is over its limit, page cache reclaim is triggered and only page cache should be reclaimed, but shrink_node reclaims slab by default, so disable shrink_slab by adding a control parameter to scan_control. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- The number of page cache pages should be kept within a limit when CONFIG_MEMORY_RELIABLE is enabled, so only page cache, instead of both file and anon pages, should be reclaimed during page cache reclamation. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- The reasons for disabling shrink_page_cache in add_to_page_cache are: 1. Synchronous memory reclamation would affect performance. 2. add_to_page_cache does not increase the LRU size in the HugeTLB case, so shrink_page_cache would not be triggered anyway. Now that add_to_page_cache in mm/filemap.c and include/linux/pagemap.h are the same, don't delete the copy in mm/filemap.c; just keep the interface for KABI. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ------------------------------------------------------- In cache_limit_ratio_sysctl_handler and cache_limit_mbytes_sysctl_handler, the page cache is shrunk even when vm_cache_reclaim_enable is false, which is unexpected. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- If vmcache_reclaim_s > 120, shrink_page_cache_work is rescheduled after 120 seconds even when shrinking is hard; that is shorter than vmcache_reclaim_s and deviates from the original intention of extending the interval. To solve this, shrink_page_cache_work should be scheduled after vmcache_reclaim_s + 120 seconds instead. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
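A condensed sketch of the rescheduling rule using stock delayed-work APIs; the variable names approximate the commit text and the work body is illustrative:

#include <linux/workqueue.h>
#include <linux/timer.h>

extern int vm_cache_reclaim_s;		/* reclaim interval sysctl, in seconds */

static void shrink_page_cache_fn(struct work_struct *w);
static DECLARE_DELAYED_WORK(shrink_page_cache_work, shrink_page_cache_fn);

static void requeue_shrink_work(bool shrink_was_hard)
{
	/* when shrinking struggled, extend to interval + 120s instead of a flat 120s */
	unsigned long delay = shrink_was_hard ? (vm_cache_reclaim_s + 120) * HZ
					      : vm_cache_reclaim_s * HZ;

	schedule_delayed_work(&shrink_page_cache_work, round_jiffies_relative(delay));
}

static void shrink_page_cache_fn(struct work_struct *w)
{
	bool hard = false;
	/* ... shrink the page cache here, set 'hard' if little progress was made ... */
	requeue_shrink_work(hard);
}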
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- The "FileCache" item in /proc/meminfo shows the number of page cache pages on the LRU lists (active + inactive). Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
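The counter can be derived from the standard file LRU statistics; a sketch of the corresponding hunk inside meminfo_proc_show() in fs/proc/meminfo.c (show_val_kb() is the local helper in that file, and the exact openEuler hunk may differ):

	show_val_kb(m, "FileCache:      ",
		    global_node_page_state(NR_ACTIVE_FILE) +
		    global_node_page_state(NR_INACTIVE_FILE));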
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Add a page cache fallback statistic. The counter will overflow after a period of use and simply wraps to zero, with no negative effect. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Add a cmdline option to control the reliable memory usage of page cache. Page cache will not use reliable memory when option "P" is passed to reliable_debug in the cmdline. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- __page_cache_alloc is used to allocate page cache pages in most file systems, such as ext4 and f2fs, so add the ___GFP_RELIABILITY flag there to support CONFIG_MEMORY_RELIABLE when allocating pages. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
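A sketch of the non-NUMA variant of __page_cache_alloc() in include/linux/pagemap.h with the flag applied; ___GFP_RELIABILITY is the openEuler-specific gfp bit named in these commits, and the exact hunk is an assumption:

static inline struct page *__page_cache_alloc(gfp_t gfp)
{
	/* steer page cache allocations toward the mirrored (reliable) region */
	return alloc_pages(gfp | ___GFP_RELIABILITY, 0);
}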
-
Submitted by Zhou Guanghui
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ------------------------------------------ Add ReliableShmem in /proc/meminfo to show the reliable memory used by shmem. - ReliableShmem: reliable memory used by shmem Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Zhou Guanghui
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ------------------------------------------ This feature depends on the overall memory reliable feature. When the shared memory reliable feature is enabled, the pages used by shared memory are allocated from the mirrored region by default. If the mirrored region is insufficient, pages can be allocated from the non-mirrored region. Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Introduce a fallback mechanism for memory reliable. The following users fall back to the non-mirrored region if their allocation from the mirrored region fails: - user tasks with the reliable flag - THP collapse pages - init tasks - pagecache - tmpfs. To achieve this, the buddy system falls back to the non-mirrored region in the following situations: - if __GFP_THISNODE is set in gfp_mask and the destination nodes do not have any zones available - high_zoneidx is set to ZONE_MOVABLE to allocate memory before OOM. This mechanism is enabled by default and can be disabled by adding "reliable_debug=F" to the kernel parameters. It relies on CONFIG_MEMORY_RELIABLE and needs "kernelcore=reliable" in the kernel parameters. Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Peng Wu
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ---------------------------------------------- There is an upper limit on the memory allocation of special user tasks, where a special user task means a user task with the reliable flag. Init tasks will allocate memory from the non-mirrored region if their allocation hits the limit. The limit can be set or read via /proc/sys/vm/task_reliable_limit and its default value is ULONG_MAX. Signed-off-by: Peng Wu <wupeng58@huawei.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
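The sysctl itself follows the standard ctl_table pattern; a sketch (the variable name and handler choice are assumptions, and registration of the table is omitted):

#include <linux/sysctl.h>
#include <linux/limits.h>

unsigned long task_reliable_limit = ULONG_MAX;	/* default: effectively no limit */

static struct ctl_table reliable_ctl_table[] = {
	{
		.procname	= "task_reliable_limit",
		.data		= &task_reliable_limit,
		.maxlen		= sizeof(task_reliable_limit),
		.mode		= 0644,
		.proc_handler	= proc_doulongvec_minmax,
	},
	{ }
};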
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- When khugepaged collapses pages into a huge page, the huge page should come from the same kind of memory region as the pages being collapsed. Khugepaged therefore checks whether there are any reliable pages in the area to be collapsed. If the area contains any reliable pages, khugepaged allocates the huge page from the mirrored region; otherwise it allocates from the non-mirrored region. Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Peng Wu
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ---------------------------------------------- Count the reliable memory allocated by reliable user tasks. The counting policy is based on RSS statistics: anywhere an mm counter is updated, reliable pages need to be counted as well. A page identified by page_reliable() updates the reliable page counter by calling reliable_page_counter(). Updating the reliable page count should be considered wherever the following logic is involved: - add_mm_counter - dec_mm_counter - inc_mm_counter_fast - dec_mm_counter_fast - rss[mm_counter(page)] Signed-off-by: Peng Wu <wupeng58@huawei.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
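A sketch of the helper named in the commit; page_reliable() comes from the commit text, while the counter field and its type are assumptions (the companion patch below reserves an mm_struct slot for it):

static inline void reliable_page_counter(struct page *page,
					 struct mm_struct *mm, int val)
{
	if (page_reliable(page))
		atomic_long_add(val, &mm->reliable_nr_page);	/* assumed field name */
}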
-
Submitted by Peng Wu
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ------------------------------------------------- Add a variable in mm_struct for accounting the amount of reliable memory allocated by reliable user tasks. Use KABI_RESERVE(3) in mm_struct to avoid any KABI changes. Signed-off-by: Peng Wu <wupeng58@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Peng Wu
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ------------------------------------------ Add a reliable flag for user tasks. A user task with the reliable flag can only allocate memory from the mirrored region. PF_RELIABLE is added to represent the task's reliable flag. - The init task is regarded as a special task that allocates memory from the mirrored region. - For normal user tasks, the reliable flag can be set via the procfs interface shown below and is inherited via fork(). A user can change a task's reliable flag with $ echo [0/1] > /proc/<pid>/reliable and check it with $ cat /proc/<pid>/reliable Note: the global init task's reliable file cannot be accessed. Signed-off-by: Peng Wu <wupeng58@huawei.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
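A sketch of how the flag could steer allocations: PF_RELIABLE and ___GFP_RELIABILITY are named in these commits, while the helper, the flag's bit value and the exact policy check are assumptions for illustration:

#include <linux/sched.h>
#include <linux/gfp.h>

#define PF_RELIABLE	0x08000000	/* placeholder value; must not clash with real PF_* bits */

static inline gfp_t reliable_modify_gfp(gfp_t gfp)
{
	if ((current->flags & PF_RELIABLE) || is_global_init(current))
		gfp |= ___GFP_RELIABILITY;	/* allocate from the mirrored region */
	return gfp;
}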
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Add ReliMemTotal & ReliMemUsed in /proc/meminfo to show information about reliable memory. - ReliableTotal: total reliable RAM - ReliableUsed: the amount of reliable memory used by the kernel Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Introduction ============ The memory reliable feature is a memory tiering mechanism. It is based on the kernel mirror feature, which splits memory into two separate regions, a mirrored (reliable) region and a non-mirrored (non-reliable) region. For the kernel mirror feature: - kernel memory is allocated from the mirrored region by default - user memory is allocated from the non-mirrored region by default, and the non-mirrored region is arranged into ZONE_MOVABLE. The memory reliable feature has the following additional behavior: - normal user tasks never allocate memory from the mirrored region through userspace APIs (malloc, mmap, etc.) - special user tasks allocate memory from the mirrored region by default - tmpfs/pagecache allocate memory from the mirrored region by default - there is an upper limit on the mirrored region allocated for user tasks, tmpfs and pagecache. A reliable fallback mechanism is supported which allows special user tasks, tmpfs and pagecache to fall back to the non-mirrored region; this is the default setting. To fulfil these goals: - the ___GFP_RELIABILITY flag is added for allocating memory from the mirrored region - the high_zoneidx for special user tasks/tmpfs/pagecache is set to ZONE_NORMAL - normal user tasks can only allocate from ZONE_MOVABLE. This patch is just the main framework; memory reliable support for special user tasks, pagecache and tmpfs has its own patches. To enable this function, mirrored (reliable) memory is needed and "kernelcore=reliable" should be added to the kernel parameters. Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Mirrored memory can be used on HiSilicon's arm64 SoCs, so efi_find_mirror() is added to efi_init() so that the system can get memblock information about any mirrored ranges. Co-developed-by: Jing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Commit b05b9f5f ("x86, mirror: x86 enabling - find mirrored memory ranges") introduced the efi_find_mirror function on x86. In order to reuse the API, make it public in preparation for arm64 support of mirrored memory. Co-developed-by: Jing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- The fake memory map is used for faking memory attribute values. Commit 0f96a99d ("efi: Add "efi_fake_mem" boot option") introduced the efi_fake_mem function; with this patch it can also be used on arm64. For example, you can mark 0-6G of memory as EFI_MEMORY_MORE_RELIABLE by adding efi_fake_mem=6G@0:0x10000 to the bootargs. You can find more info about the fake memmap in kernel-parameters.txt. Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Make efi_print_memmap() public in preparation for adding fake memory support to architectures with EFI support, e.g., arm64. Co-developed-by: Jing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by David Hildenbrand
mainline inclusion from linux-5.7-rc1 commit 5f47adf7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- For now, distributions implement advanced udev rules to essentially - Don't online any hotplugged memory (s390x) - Online all memory to ZONE_NORMAL (e.g., most virt environments like hyperv) - Online all memory to ZONE_MOVABLE in case the zone imbalance is taken care of (e.g., bare metal, special virt environments) In summary: All memory is usually onlined the same way, however, the kernel always has to ask user space to come up with the same answer. E.g., Hyper-V always waits for a memory block to get onlined before continuing, otherwise it might end up adding memory faster than onlining it, which can result in strange OOM situations. This waiting slows down adding of a bigger amount of memory. Let's allow to specify a default online_type, not just "online" and "offline". This allows distributions to configure the default online_type when booting up and be done with it. We can now specify "offline", "online", "online_movable" and "online_kernel" via - "memhp_default_state=" on the kernel cmdline - /sys/devices/system/memory/auto_online_blocks just like we are able to specify for a single memory block via /sys/devices/system/memory/memoryX/state Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NWei Yang <richard.weiyang@gmail.com> Reviewed-by: NBaoquan He <bhe@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Yumei Huang <yuhuang@redhat.com> Link: http://lkml.kernel.org/r/20200317104942.11178-9-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
Submitted by David Hildenbrand
mainline inclusion from linux-5.7-rc1 commit 862919e5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- ... and rename it to memhp_default_online_type. This is a preparation for more detailed default online behavior. Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NWei Yang <richard.weiyang@gmail.com> Reviewed-by: NBaoquan He <bhe@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Yumei Huang <yuhuang@redhat.com> Link: http://lkml.kernel.org/r/20200317104942.11178-8-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> [Wupeng: keep memhp_auto_online for kabi] Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
Submitted by David Hildenbrand
mainline inclusion from linux-5.7-rc1 commit bc58ebd5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- We get the MEM_ONLINE notifier call if memory is added right from the kernel via add_memory() or later from user space. Let's get rid of the "ha_waiting" flag - the wait event has an inbuilt mechanism (->done) for that. Initialize the wait event only once and reinitialize before adding memory. Unconditionally call complete() and wait_for_completion_timeout(). If there are no waiters, complete() will only increment ->done - which will be reset by reinit_completion(). If complete() has already been called, wait_for_completion_timeout() will not wait. There is still the chance for a small race between concurrent reinit_completion() and complete(). If complete() wins, we would not wait - which is tolerable (and the race exists in current code as well). Note: We only wait for "some" memory to get onlined, which seems to be good enough for now. [akpm@linux-foundation.org: register_memory_notifier() after init_completion(), per David] Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: NBaoquan He <bhe@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Yumei Huang <yuhuang@redhat.com> Link: http://lkml.kernel.org/r/20200317104942.11178-6-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
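The wait/complete pattern described here uses only standard completion primitives; a condensed sketch with illustrative names (not the actual hv_balloon code):

#include <linux/completion.h>
#include <linux/jiffies.h>
#include <linux/memory.h>
#include <linux/notifier.h>

static DECLARE_COMPLETION(ol_waitevent);	/* initialized exactly once */

static void add_next_memory_block(void)
{
	reinit_completion(&ol_waitevent);	/* re-arm before adding memory */
	/* ... add_memory() for the next block ... */
	wait_for_completion_timeout(&ol_waitevent, 5 * HZ);
}

static int mem_notifier_cb(struct notifier_block *nb, unsigned long action, void *data)
{
	if (action == MEM_ONLINE)
		complete(&ol_waitevent);	/* harmless if nobody is waiting */
	return NOTIFY_OK;
}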
-
Submitted by David Hildenbrand
mainline inclusion from linux-5.7-rc1 commit 4dc8207b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Let's use a simple array which we can reuse soon. While at it, move the string->mmop conversion out of the device hotplug lock. Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NWei Yang <richard.weiyang@gmail.com> Reviewed-by: NBaoquan He <bhe@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Yumei Huang <yuhuang@redhat.com> Link: http://lkml.kernel.org/r/20200317104942.11178-4-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
Submitted by David Hildenbrand
mainline inclusion from linux-5.7-rc1 commit efc978ad category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Historically, we used the value -1. Just treat 0 as the special case now. Clarify a comment (which was wrong, when we come via device_online() the first time, the online_type would have been 0 / MEM_ONLINE). The default is now always MMOP_OFFLINE. This removes the last user of the manual "-1", which didn't use the enum value. This is a preparation to use the online_type as an array index. Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NWei Yang <richard.weiyang@gmail.com> Reviewed-by: NBaoquan He <bhe@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Yumei Huang <yuhuang@redhat.com> Link: http://lkml.kernel.org/r/20200317104942.11178-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
Submitted by David Hildenbrand
mainline inclusion from linux-5.7-rc1 commit 956f8b44 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Patch series "mm/memory_hotplug: allow to specify a default online_type", v3. Distributions nowadays use udev rules ([1] [2]) to specify if and how to online hotplugged memory. The rules seem to get more complex with many special cases. Due to the various special cases, CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used. All memory hotplug is handled via udev rules. Every time we hotplug memory, the udev rule will come to the same conclusion. Especially Hyper-V (but also soon virtio-mem) add a lot of memory in separate memory blocks and wait for memory to get onlined by user space before continuing to add more memory blocks (to not add memory faster than it is getting onlined). This of course slows down the whole memory hotplug process. To make the job of distributions easier and to avoid udev rules that get more and more complicated, let's extend the mechanism provided by - /sys/devices/system/memory/auto_online_blocks - "memhp_default_state=" on the kernel cmdline to be able to specify also "online_movable" as well as "online_kernel" Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> === Example /usr/libexec/config-memhotplug === #!/bin/bash VIRT=`systemd-detect-virt --vm` ARCH=`uname -p` sense_virtio_mem() { if [ -d "/sys/bus/virtio/drivers/virtio_mem/" ]; then DEVICES=`find /sys/bus/virtio/drivers/virtio_mem/ -maxdepth 1 -type l | wc -l` if [ $DEVICES != "0" ]; then return 0 fi fi return 1 } if [ ! -e "/sys/devices/system/memory/auto_online_blocks" ]; then echo "Memory hotplug configuration support missing in the kernel" exit 1 fi if grep "memhp_default_state=" /proc/cmdline > /dev/null; then echo "Memory hotplug configuration overridden in kernel cmdline (memhp_default_state=)" exit 1 fi if [ $VIRT == "microsoft" ]; then echo "Detected Hyper-V on $ARCH" # Hyper-V wants all memory in ZONE_NORMAL ONLINE_TYPE="online_kernel" elif sense_virtio_mem; then echo "Detected virtio-mem on $ARCH" # virtio-mem wants all memory in ZONE_NORMAL ONLINE_TYPE="online_kernel" elif [ $ARCH == "s390x" ] || [ $ARCH == "s390" ]; then echo "Detected $ARCH" # standby memory should not be onlined automatically ONLINE_TYPE="offline" elif [ $ARCH == "ppc64" ] || [ $ARCH == "ppc64le" ]; then echo "Detected" $ARCH # PPC64 onlines all hotplugged memory right from the kernel ONLINE_TYPE="offline" elif [ $VIRT == "none" ]; then echo "Detected bare-metal on $ARCH" # Bare metal users expect hotplugged memory to be unpluggable. We assume # that ZONE imbalances on such enterpise servers cannot happen and is # properly documented ONLINE_TYPE="online_movable" else # TODO: Hypervisors that want to unplug DIMMs and can guarantee that ZONE # imbalances won't happen echo "Detected $VIRT on $ARCH" # Usually, ballooning is used in virtual environments, so memory should go to # ZONE_NORMAL. However, sometimes "movable_node" is relevant. ONLINE_TYPE="online" fi echo "Selected online_type:" $ONLINE_TYPE # Configure what to do with memory that will be hotplugged in the future echo $ONLINE_TYPE 2>/dev/null > /sys/devices/system/memory/auto_online_blocks if [ $? 
!= "0" ]; then echo "Memory hotplug cannot be configured (e.g., old kernel or missing permissions)" # A backup udev rule should handle old kernels if necessary exit 1 fi # Process all already pluggedd blocks (e.g., DIMMs, but also Hyper-V or virtio-mem) if [ $ONLINE_TYPE != "offline" ]; then for MEMORY in /sys/devices/system/memory/memory*; do STATE=`cat $MEMORY/state` if [ $STATE == "offline" ]; then echo $ONLINE_TYPE > $MEMORY/state fi done fi === Example /usr/lib/systemd/system/config-memhotplug.service === [Unit] Description=Configure memory hotplug behavior DefaultDependencies=no Conflicts=shutdown.target Before=sysinit.target shutdown.target After=systemd-modules-load.service ConditionPathExists=|/sys/devices/system/memory/auto_online_blocks [Service] ExecStart=/usr/libexec/config-memhotplug Type=oneshot TimeoutSec=0 RemainAfterExit=yes [Install] WantedBy=sysinit.target === Example modification to the 40-redhat.rules [2] === : diff --git a/40-redhat.rules b/40-redhat.rules-new : index 2c690e5..168fd03 100644 : --- a/40-redhat.rules : +++ b/40-redhat.rules-new : @@ -6,6 +6,9 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online} : # Memory hotadd request : SUBSYSTEM!="memory", GOTO="memory_hotplug_end" : ACTION!="add", GOTO="memory_hotplug_end" : +# memory hotplug behavior configured : +PROGRAM=="grep online /sys/devices/system/memory/auto_online_blocks", GOTO="memory_hotplug_end" : + : PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end" : : ENV{.state}="online" === [1] https://github.com/lnykryn/systemd-rhel/pull/281 [2] https://github.com/lnykryn/systemd-rhel/blob/staging/rules/40-redhat.rules This patch (of 8): The name is misleading and it's not really clear what is "kept". Let's just name it like the online_type name we expose to user space ("online"). Add some documentation to the types. Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NWei Yang <richard.weiyang@gmail.com> Reviewed-by: NBaoquan He <bhe@redhat.com> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Yumei Huang <yuhuang@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Paul Mackerras <paulus@samba.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Link: http://lkml.kernel.org/r/20200319131221.14044-1-david@redhat.com Link: http://lkml.kernel.org/r/20200317104942.11178-2-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-