- 09 Feb 2022, 40 commits
-
Submitted by Guo Mengqi
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4SPNL CVE: NA ----------------------------------- In sp_mmap(), if the offset is computed as va - MMAP_BASE/DVPP_BASE, a normal sp_alloc pgoff may end up with the same value as a DVPP pgoff, causing the DVPP and sp_alloc mappings to unexpectedly overlap in the same part of the file. To fix the problem, pass the VA value itself as the mmap offset, since in this scenario VA values within one task address space are never the same. Signed-off-by: Guo Mengqi <guomengqi3@huawei.com> Reviewed-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
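A minimal sketch of the offset change described above, assuming hypothetical helper names (the real sp_mmap() lives in the out-of-tree share_pool code):

#include <linux/mm.h>

/* Old scheme: offsets measured from different region bases can collide. */
static unsigned long sp_pgoff_old(unsigned long va, unsigned long region_base)
{
	return (va - region_base) >> PAGE_SHIFT;
}

/* New scheme: the VA itself is unique within one task address space. */
static unsigned long sp_pgoff_new(unsigned long va)
{
	return va >> PAGE_SHIFT;
}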
-
Submitted by Yuan Can
ascend inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4S786 CVE: NA ------------------------------------------------------- Signed-off-by: Yuan Can <yuancan@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Wang Wensheng
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4SON8 CVE: NA ------------------------------------------------- Use device_id to select the correct DVPP vspace range when the SP_DVPP flag is specified. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Wang Wensheng
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4SON8 CVE: NA ------------------------------------------------- Device_id is used by DVPP to select the correct virtual address space, and node_id specifies the node from which we want to allocate physical memory. In theory the two do not have to be the same. In practice the process always runs on the NUMA nodes corresponding to the device it uses, and the node with the same id as the device always belongs to that device, so using device_id as node_id to allocate memory works. However, the number of NUMA nodes belonging to a given device is not always one, and with that scheme the device's other NUMA nodes cannot be used. Here we introduce a new flag SP_SPEC_NODE_ID and add a bit-region in sp_flags for callers who want to use other nodes belonging to a device. That is, to specify the node_id, both the new flag and the node_id must be added to sp_flags when calling sp_alloc() or sp_make_share_k2u(); otherwise the node with the same id as the device is used. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
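A sketch of how a node_id bit-region could be carved out of sp_flags; SP_SPEC_NODE_ID is named in the commit, while the bit positions and widths below are assumptions for illustration only:

#define SP_SPEC_NODE_ID		(1UL << 6)	/* assumed flag bit */
#define SP_NODE_ID_SHIFT	32		/* assumed start of the node_id bit-region */
#define SP_NODE_ID_MASK		(0xfffUL << SP_NODE_ID_SHIFT)

/* Callers pack the node_id into sp_flags together with the new flag. */
static inline unsigned long sp_flags_set_node(unsigned long sp_flags, int node_id)
{
	return sp_flags | SP_SPEC_NODE_ID |
	       (((unsigned long)node_id << SP_NODE_ID_SHIFT) & SP_NODE_ID_MASK);
}

/* sp_alloc()/sp_make_share_k2u() fall back to the device's own node otherwise. */
static inline int sp_flags_get_node(unsigned long sp_flags, int device_id)
{
	if (sp_flags & SP_SPEC_NODE_ID)
		return (sp_flags & SP_NODE_ID_MASK) >> SP_NODE_ID_SHIFT;
	return device_id;
}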
-
Submitted by Wang Wensheng
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4SON8 CVE: NA ------------------------------------------------- The maximum number of devices supported by share_pool is static. Here we make it extendable for later use. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Wang Wensheng
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4SON8 CVE: NA ------------------------------------------------- MAP_SHARE_POOL and MAP_FIXED_NOREPLACE have the same value. Redefine MAP_SHARE_POOL to fix the collision. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
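To illustrate the collision: MAP_FIXED_NOREPLACE is 0x100000 in the mainline uapi headers, so the share-pool flag has to move to an otherwise unused bit (the old and new MAP_SHARE_POOL values below are assumptions, not the actual openEuler ones):

#define MAP_FIXED_NOREPLACE	0x100000	/* mainline mmap flag */

/* before the fix (assumed): MAP_SHARE_POOL aliased MAP_FIXED_NOREPLACE */
/* #define MAP_SHARE_POOL	0x100000 */

/* after the fix (assumed value): a distinct, otherwise unused bit */
#define MAP_SHARE_POOL		0x200000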
-
Submitted by Yang Yingliang
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Enable CONFIG_MEMORY_RELIABLE, CONFIG_CLEAR_FREELIST_PAGE and CONFIG_EFI_FAKE_MEMMAP for testing. Reviewed-by: Chen Wandun <chenwandun@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Yu Liao
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- This patch adds a sysctl to clear the pages on the free lists of each NUMA node. For each NUMA node, every page on its free lists is cleared, and the work is scheduled on a random CPU of that node. When KASAN is enabled and pages are free, their shadow memory is filled with 0xFF, so writing to these free pages would trigger a use-after-free report; KASAN is therefore disabled for the clear-freelist code. Signed-off-by: Yu Liao <liaoyu15@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
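A minimal sketch of clearing one zone's buddy free lists on a 5.10-era kernel, assuming the file is compiled with KASAN instrumentation disabled (KASAN_SANITIZE := n); the function name is an assumption, and a real implementation would drop the zone lock periodically rather than hold it across large memsets:

#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/string.h>

static void clear_zone_freelist(struct zone *zone)
{
	unsigned int order, type;
	unsigned long flags;
	struct page *page;

	spin_lock_irqsave(&zone->lock, flags);
	for_each_migratetype_order(order, type) {
		/* buddy pages sit on free_list via page->lru on 5.10; assumes no highmem */
		list_for_each_entry(page, &zone->free_area[order].free_list[type], lru)
			memset(page_address(page), 0, PAGE_SIZE << order);
	}
	spin_unlock_irqrestore(&zone->lock, flags);
}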
-
Submitted by Alexander Duyck
stable inclusion from stable-5.10.88 commit 8204e0c1 bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S -------------------------------- Provide a new function, queue_work_node, which is meant to schedule work on a "random" CPU of the requested NUMA node. The main motivation for this is to help assist asynchronous init to better improve boot times for devices that are local to a specific node. For now we just default to the first CPU that is in the intersection of the cpumask of the node and the online cpumask. The only exception is if the CPU is local to the node we will just use the current CPU. This should work for our purposes as we are currently only using this for unbound work so the CPU will be translated to a node anyway instead of being directly used. As we are only using the first CPU to represent the NUMA node for now I am limiting the scope of the function so that it can only be used with unbound workqueues. Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Yu Liao <liaoyu15@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
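queue_work_node() is a stock workqueue API, so a usage example can be given with some confidence; only the work body below is illustrative:

#include <linux/workqueue.h>
#include <linux/smp.h>
#include <linux/topology.h>
#include <linux/printk.h>

static void node_local_fn(struct work_struct *work)
{
	pr_info("running on CPU %d, node %d\n",
		raw_smp_processor_id(), numa_node_id());
}

static DECLARE_WORK(node_local_work, node_local_fn);

static void kick_work_on_node(int node)
{
	/* picks a CPU from the node's online cpumask; unbound workqueues only */
	queue_work_node(node, system_unbound_wq, &node_local_work);
}

Because the workqueue is unbound, the chosen CPU is only used to derive the node preference, as the commit message notes.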
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- The function shrink_shepherd queues work on each CPU to shrink the page cache and is called periodically, but without a page_cache_over_limit check before shrinking it triggers periodic memory reclamation even when the number of page cache pages is below the limit, so add this basic check before shrinking the page cache. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Add two proc interfaces to set the page cache limit. Both vm_cache_limit_mbytes and vm_cache_limit_ratio are updated when writing either of the two interfaces. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
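A sketch of how the two handlers could keep the knobs consistent; the handler and variable names come from these commits, while the conversion math is an assumption:

#include <linux/mm.h>
#include <linux/sysctl.h>

unsigned long vm_cache_limit_mbytes;
int vm_cache_limit_ratio;

int cache_limit_mbytes_sysctl_handler(struct ctl_table *table, int write,
				      void *buffer, size_t *lenp, loff_t *ppos)
{
	unsigned long total_mb = totalram_pages() >> (20 - PAGE_SHIFT);
	int ret = proc_doulongvec_minmax(table, write, buffer, lenp, ppos);

	if (!ret && write && total_mb)
		vm_cache_limit_ratio = vm_cache_limit_mbytes * 100 / total_mb;
	return ret;
}

int cache_limit_ratio_sysctl_handler(struct ctl_table *table, int write,
				     void *buffer, size_t *lenp, loff_t *ppos)
{
	unsigned long total_mb = totalram_pages() >> (20 - PAGE_SHIFT);
	int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);

	if (!ret && write)
		vm_cache_limit_mbytes = total_mb * vm_cache_limit_ratio / 100;
	return ret;
}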
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- If the page cache is over its limit, page cache reclaim is triggered and only page cache should be reclaimed, but shrink_node reclaims slab by default, so disable shrink_slab by adding a control parameter to scan_control. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- The number of page cache pages should be kept within a limit when CONFIG_MEMORY_RELIABLE is enabled, so only page cache, instead of both file and anon pages, should be reclaimed during page cache reclamation. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- The reasons for disabling shrink_page_cache in add_to_page_cache are: 1. Synchronous memory reclamation would affect performance. 2. add_to_page_cache does not increase the LRU size in the HugeTLB case, so shrink_page_cache would not be triggered anyway. Now that add_to_page_cache in mm/filemap.c and include/linux/pagemap.h are the same, don't delete the copy in mm/filemap.c; just keep the interface for KABI. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ------------------------------------------------------- In cache_limit_ratio_sysctl_handler and cache_limit_mbytes_sysctl_handler, the page cache is shrunk even when vm_cache_reclaim_enable is false, which is unexpected. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- If vmcache_reclaim_s > 120, shrink_page_cache_work is rescheduled after 120 seconds even when shrinking is hard; that is shorter than vmcache_reclaim_s and deviates from the original intention of extending the interval. To solve this, shrink_page_cache_work should be scheduled after vmcache_reclaim_s + 120 seconds instead. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
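A condensed sketch of the rescheduling rule using stock delayed-work APIs; the variable names approximate the commit text and the work body is illustrative:

#include <linux/workqueue.h>
#include <linux/timer.h>

extern int vm_cache_reclaim_s;		/* reclaim interval sysctl, in seconds */

static void shrink_page_cache_fn(struct work_struct *w);
static DECLARE_DELAYED_WORK(shrink_page_cache_work, shrink_page_cache_fn);

static void requeue_shrink_work(bool shrink_was_hard)
{
	/* when shrinking struggled, extend to interval + 120s instead of a flat 120s */
	unsigned long delay = shrink_was_hard ? (vm_cache_reclaim_s + 120) * HZ
					      : vm_cache_reclaim_s * HZ;

	schedule_delayed_work(&shrink_page_cache_work, round_jiffies_relative(delay));
}

static void shrink_page_cache_fn(struct work_struct *w)
{
	bool hard = false;
	/* ... shrink the page cache here, set 'hard' if little progress was made ... */
	requeue_shrink_work(hard);
}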
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- The "FileCache" item in /proc/meminfo shows the number of page cache pages on the LRU lists (active + inactive). Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
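The counter can be derived from the standard file LRU statistics; a sketch of the corresponding hunk inside meminfo_proc_show() in fs/proc/meminfo.c (show_val_kb() is the local helper in that file, and the exact openEuler hunk may differ):

	show_val_kb(m, "FileCache:      ",
		    global_node_page_state(NR_ACTIVE_FILE) +
		    global_node_page_state(NR_INACTIVE_FILE));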
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Add a page cache fallback statistic. The counter will overflow after a period of use and simply wraps to zero, with no negative effect. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Add a cmdline option to control the reliable memory usage of page cache. Page cache will not use reliable memory when option "P" is passed to reliable_debug in the cmdline. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Chen Wandun
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- __page_cache_alloc is used to allocate page cache pages in most file systems, such as ext4 and f2fs, so add the ___GFP_RELIABILITY flag there to support CONFIG_MEMORY_RELIABLE when allocating pages. Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
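A sketch of the non-NUMA variant of __page_cache_alloc() in include/linux/pagemap.h with the flag applied; ___GFP_RELIABILITY is the openEuler-specific gfp bit named in these commits, and the exact hunk is an assumption:

static inline struct page *__page_cache_alloc(gfp_t gfp)
{
	/* steer page cache allocations toward the mirrored (reliable) region */
	return alloc_pages(gfp | ___GFP_RELIABILITY, 0);
}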
-
Submitted by Zhou Guanghui
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ------------------------------------------ Add ReliableShmem in /proc/meminfo to show the reliable memory used by shmem. - ReliableShmem: reliable memory used by shmem Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Zhou Guanghui
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ------------------------------------------ This feature depends on the overall memory reliable feature. When the shared memory reliable feature is enabled, the pages used by shared memory are allocated from the mirrored region by default. If the mirrored region is insufficient, pages can be allocated from the non-mirrored region. Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Introduce a fallback mechanism for memory reliable. The following users fall back to the non-mirrored region if their allocation from the mirrored region fails: - user tasks with the reliable flag - THP collapse pages - init tasks - pagecache - tmpfs. To achieve this, the buddy system falls back to the non-mirrored region in the following situations: - if __GFP_THISNODE is set in gfp_mask and the destination nodes do not have any zones available - high_zoneidx is set to ZONE_MOVABLE to allocate memory before OOM. This mechanism is enabled by default and can be disabled by adding "reliable_debug=F" to the kernel parameters. It relies on CONFIG_MEMORY_RELIABLE and needs "kernelcore=reliable" in the kernel parameters. Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Peng Wu
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ---------------------------------------------- There is an upper limit on the memory allocation of special user tasks, where a special user task means a user task with the reliable flag. Init tasks will allocate memory from the non-mirrored region if their allocation hits the limit. The limit can be set or read via /proc/sys/vm/task_reliable_limit and its default value is ULONG_MAX. Signed-off-by: Peng Wu <wupeng58@huawei.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
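The sysctl itself follows the standard ctl_table pattern; a sketch (the variable name and handler choice are assumptions, and registration of the table is omitted):

#include <linux/sysctl.h>
#include <linux/limits.h>

unsigned long task_reliable_limit = ULONG_MAX;	/* default: effectively no limit */

static struct ctl_table reliable_ctl_table[] = {
	{
		.procname	= "task_reliable_limit",
		.data		= &task_reliable_limit,
		.maxlen		= sizeof(task_reliable_limit),
		.mode		= 0644,
		.proc_handler	= proc_doulongvec_minmax,
	},
	{ }
};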
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- When khugepaged collapses pages into a huge page, the huge page should come from the same kind of memory region as the pages being collapsed. Khugepaged therefore checks whether there are any reliable pages in the area to be collapsed. If the area contains any reliable pages, khugepaged allocates the huge page from the mirrored region; otherwise it allocates from the non-mirrored region. Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Peng Wu
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ---------------------------------------------- Count the reliable memory allocated by reliable user tasks. The counting policy is based on RSS statistics: anywhere an mm counter is updated, reliable pages need to be counted as well. A page identified by page_reliable() updates the reliable page counter by calling reliable_page_counter(). Updating the reliable page count should be considered wherever the following logic is involved: - add_mm_counter - dec_mm_counter - inc_mm_counter_fast - dec_mm_counter_fast - rss[mm_counter(page)] Signed-off-by: Peng Wu <wupeng58@huawei.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
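A sketch of the helper named in the commit; page_reliable() comes from the commit text, while the counter field and its type are assumptions (the companion patch below reserves an mm_struct slot for it):

static inline void reliable_page_counter(struct page *page,
					 struct mm_struct *mm, int val)
{
	if (page_reliable(page))
		atomic_long_add(val, &mm->reliable_nr_page);	/* assumed field name */
}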
-
Submitted by Peng Wu
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ------------------------------------------------- Add a variable in mm_struct for accounting the amount of reliable memory allocated by reliable user tasks. Use KABI_RESERVE(3) in mm_struct to avoid any KABI changes. Signed-off-by: Peng Wu <wupeng58@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Peng Wu
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA ------------------------------------------ Add a reliable flag for user tasks. A user task with the reliable flag can only allocate memory from the mirrored region. PF_RELIABLE is added to represent the task's reliable flag. - The init task is regarded as a special task that allocates memory from the mirrored region. - For normal user tasks, the reliable flag can be set via the procfs interface shown below and is inherited via fork(). A user can change a task's reliable flag with $ echo [0/1] > /proc/<pid>/reliable and check it with $ cat /proc/<pid>/reliable Note: the global init task's reliable file cannot be accessed. Signed-off-by: Peng Wu <wupeng58@huawei.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
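A sketch of how the flag could steer allocations: PF_RELIABLE and ___GFP_RELIABILITY are named in these commits, while the helper, the flag's bit value and the exact policy check are assumptions for illustration:

#include <linux/sched.h>
#include <linux/gfp.h>

#define PF_RELIABLE	0x08000000	/* placeholder value; must not clash with real PF_* bits */

static inline gfp_t reliable_modify_gfp(gfp_t gfp)
{
	if ((current->flags & PF_RELIABLE) || is_global_init(current))
		gfp |= ___GFP_RELIABILITY;	/* allocate from the mirrored region */
	return gfp;
}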
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Add ReliMemTotal & ReliMemUsed in /proc/meminfo to show information about reliable memory. - ReliableTotal: total reliable RAM - ReliableUsed: the amount of reliable memory used by the kernel Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Introduction ============ The memory reliable feature is a memory tiering mechanism. It is based on the kernel mirror feature, which splits memory into two separate regions, a mirrored (reliable) region and a non-mirrored (non-reliable) region. For the kernel mirror feature: - kernel memory is allocated from the mirrored region by default - user memory is allocated from the non-mirrored region by default, and the non-mirrored region is arranged into ZONE_MOVABLE. The memory reliable feature has the following additional behavior: - normal user tasks never allocate memory from the mirrored region through userspace APIs (malloc, mmap, etc.) - special user tasks allocate memory from the mirrored region by default - tmpfs/pagecache allocate memory from the mirrored region by default - there is an upper limit on the mirrored region allocated for user tasks, tmpfs and pagecache. A reliable fallback mechanism is supported which allows special user tasks, tmpfs and pagecache to fall back to the non-mirrored region; this is the default setting. To fulfil these goals: - the ___GFP_RELIABILITY flag is added for allocating memory from the mirrored region - the high_zoneidx for special user tasks/tmpfs/pagecache is set to ZONE_NORMAL - normal user tasks can only allocate from ZONE_MOVABLE. This patch is just the main framework; memory reliable support for special user tasks, pagecache and tmpfs has its own patches. To enable this function, mirrored (reliable) memory is needed and "kernelcore=reliable" should be added to the kernel parameters. Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Mirrored memory can be used on HiSilicon's arm64 SoCs, so efi_find_mirror() is added to efi_init() so that the system can get memblock information about any mirrored ranges. Co-developed-by: Jing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Commit b05b9f5f ("x86, mirror: x86 enabling - find mirrored memory ranges") introduced the efi_find_mirror function on x86. In order to reuse the API, make it public in preparation for arm64 support of mirrored memory. Co-developed-by: Jing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- The fake memory map is used for faking memory attribute values. Commit 0f96a99d ("efi: Add "efi_fake_mem" boot option") introduced the efi_fake_mem function; with this patch it can also be used on arm64. For example, you can mark 0-6G of memory as EFI_MEMORY_MORE_RELIABLE by adding efi_fake_mem=6G@0:0x10000 to the bootargs. You can find more info about the fake memmap in kernel-parameters.txt. Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ma Wupeng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Make efi_print_memmap() public in preparation for adding fake memory support to architectures with EFI support, e.g., arm64. Co-developed-by: Jing Xiangfeng <jingxiangfeng@huawei.com> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by David Hildenbrand
mainline inclusion from linux-5.7-rc1 commit 5f47adf7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- For now, distributions implement advanced udev rules to essentially - Don't online any hotplugged memory (s390x) - Online all memory to ZONE_NORMAL (e.g., most virt environments like hyperv) - Online all memory to ZONE_MOVABLE in case the zone imbalance is taken care of (e.g., bare metal, special virt environments) In summary: All memory is usually onlined the same way, however, the kernel always has to ask user space to come up with the same answer. E.g., Hyper-V always waits for a memory block to get onlined before continuing, otherwise it might end up adding memory faster than onlining it, which can result in strange OOM situations. This waiting slows down adding of a bigger amount of memory. Let's allow to specify a default online_type, not just "online" and "offline". This allows distributions to configure the default online_type when booting up and be done with it. We can now specify "offline", "online", "online_movable" and "online_kernel" via - "memhp_default_state=" on the kernel cmdline - /sys/devices/system/memory/auto_online_blocks just like we are able to specify for a single memory block via /sys/devices/system/memory/memoryX/state Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NWei Yang <richard.weiyang@gmail.com> Reviewed-by: NBaoquan He <bhe@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Yumei Huang <yuhuang@redhat.com> Link: http://lkml.kernel.org/r/20200317104942.11178-9-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
Submitted by David Hildenbrand
mainline inclusion from linux-5.7-rc1 commit 862919e5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- ... and rename it to memhp_default_online_type. This is a preparation for more detailed default online behavior. Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NWei Yang <richard.weiyang@gmail.com> Reviewed-by: NBaoquan He <bhe@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Yumei Huang <yuhuang@redhat.com> Link: http://lkml.kernel.org/r/20200317104942.11178-8-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> [Wupeng: keep memhp_auto_online for kabi] Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
Submitted by David Hildenbrand
mainline inclusion from linux-5.7-rc1 commit bc58ebd5 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- We get the MEM_ONLINE notifier call if memory is added right from the kernel via add_memory() or later from user space. Let's get rid of the "ha_waiting" flag - the wait event has an inbuilt mechanism (->done) for that. Initialize the wait event only once and reinitialize before adding memory. Unconditionally call complete() and wait_for_completion_timeout(). If there are no waiters, complete() will only increment ->done - which will be reset by reinit_completion(). If complete() has already been called, wait_for_completion_timeout() will not wait. There is still the chance for a small race between concurrent reinit_completion() and complete(). If complete() wins, we would not wait - which is tolerable (and the race exists in current code as well). Note: We only wait for "some" memory to get onlined, which seems to be good enough for now. [akpm@linux-foundation.org: register_memory_notifier() after init_completion(), per David] Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: NBaoquan He <bhe@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Yumei Huang <yuhuang@redhat.com> Link: http://lkml.kernel.org/r/20200317104942.11178-6-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
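The wait/complete pattern described here uses only standard completion primitives; a condensed sketch with illustrative names (not the actual hv_balloon code):

#include <linux/completion.h>
#include <linux/jiffies.h>
#include <linux/memory.h>
#include <linux/notifier.h>

static DECLARE_COMPLETION(ol_waitevent);	/* initialized exactly once */

static void add_next_memory_block(void)
{
	reinit_completion(&ol_waitevent);	/* re-arm before adding memory */
	/* ... add_memory() for the next block ... */
	wait_for_completion_timeout(&ol_waitevent, 5 * HZ);
}

static int mem_notifier_cb(struct notifier_block *nb, unsigned long action, void *data)
{
	if (action == MEM_ONLINE)
		complete(&ol_waitevent);	/* harmless if nobody is waiting */
	return NOTIFY_OK;
}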
-
Submitted by David Hildenbrand
mainline inclusion from linux-5.7-rc1 commit 4dc8207b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Let's use a simple array which we can reuse soon. While at it, move the string->mmop conversion out of the device hotplug lock. Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NWei Yang <richard.weiyang@gmail.com> Reviewed-by: NBaoquan He <bhe@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Yumei Huang <yuhuang@redhat.com> Link: http://lkml.kernel.org/r/20200317104942.11178-4-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
Submitted by David Hildenbrand
mainline inclusion from linux-5.7-rc1 commit efc978ad category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Historically, we used the value -1. Just treat 0 as the special case now. Clarify a comment (which was wrong, when we come via device_online() the first time, the online_type would have been 0 / MEM_ONLINE). The default is now always MMOP_OFFLINE. This removes the last user of the manual "-1", which didn't use the enum value. This is a preparation to use the online_type as an array index. Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NWei Yang <richard.weiyang@gmail.com> Reviewed-by: NBaoquan He <bhe@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Yumei Huang <yuhuang@redhat.com> Link: http://lkml.kernel.org/r/20200317104942.11178-3-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
Submitted by David Hildenbrand
mainline inclusion from linux-5.7-rc1 commit 956f8b44 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S CVE: NA -------------------------------- Patch series "mm/memory_hotplug: allow to specify a default online_type", v3. Distributions nowadays use udev rules ([1] [2]) to specify if and how to online hotplugged memory. The rules seem to get more complex with many special cases. Due to the various special cases, CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used. All memory hotplug is handled via udev rules. Every time we hotplug memory, the udev rule will come to the same conclusion. Especially Hyper-V (but also soon virtio-mem) add a lot of memory in separate memory blocks and wait for memory to get onlined by user space before continuing to add more memory blocks (to not add memory faster than it is getting onlined). This of course slows down the whole memory hotplug process. To make the job of distributions easier and to avoid udev rules that get more and more complicated, let's extend the mechanism provided by - /sys/devices/system/memory/auto_online_blocks - "memhp_default_state=" on the kernel cmdline to be able to specify also "online_movable" as well as "online_kernel" Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> === Example /usr/libexec/config-memhotplug === #!/bin/bash VIRT=`systemd-detect-virt --vm` ARCH=`uname -p` sense_virtio_mem() { if [ -d "/sys/bus/virtio/drivers/virtio_mem/" ]; then DEVICES=`find /sys/bus/virtio/drivers/virtio_mem/ -maxdepth 1 -type l | wc -l` if [ $DEVICES != "0" ]; then return 0 fi fi return 1 } if [ ! -e "/sys/devices/system/memory/auto_online_blocks" ]; then echo "Memory hotplug configuration support missing in the kernel" exit 1 fi if grep "memhp_default_state=" /proc/cmdline > /dev/null; then echo "Memory hotplug configuration overridden in kernel cmdline (memhp_default_state=)" exit 1 fi if [ $VIRT == "microsoft" ]; then echo "Detected Hyper-V on $ARCH" # Hyper-V wants all memory in ZONE_NORMAL ONLINE_TYPE="online_kernel" elif sense_virtio_mem; then echo "Detected virtio-mem on $ARCH" # virtio-mem wants all memory in ZONE_NORMAL ONLINE_TYPE="online_kernel" elif [ $ARCH == "s390x" ] || [ $ARCH == "s390" ]; then echo "Detected $ARCH" # standby memory should not be onlined automatically ONLINE_TYPE="offline" elif [ $ARCH == "ppc64" ] || [ $ARCH == "ppc64le" ]; then echo "Detected" $ARCH # PPC64 onlines all hotplugged memory right from the kernel ONLINE_TYPE="offline" elif [ $VIRT == "none" ]; then echo "Detected bare-metal on $ARCH" # Bare metal users expect hotplugged memory to be unpluggable. We assume # that ZONE imbalances on such enterpise servers cannot happen and is # properly documented ONLINE_TYPE="online_movable" else # TODO: Hypervisors that want to unplug DIMMs and can guarantee that ZONE # imbalances won't happen echo "Detected $VIRT on $ARCH" # Usually, ballooning is used in virtual environments, so memory should go to # ZONE_NORMAL. However, sometimes "movable_node" is relevant. ONLINE_TYPE="online" fi echo "Selected online_type:" $ONLINE_TYPE # Configure what to do with memory that will be hotplugged in the future echo $ONLINE_TYPE 2>/dev/null > /sys/devices/system/memory/auto_online_blocks if [ $? 
!= "0" ]; then echo "Memory hotplug cannot be configured (e.g., old kernel or missing permissions)" # A backup udev rule should handle old kernels if necessary exit 1 fi # Process all already pluggedd blocks (e.g., DIMMs, but also Hyper-V or virtio-mem) if [ $ONLINE_TYPE != "offline" ]; then for MEMORY in /sys/devices/system/memory/memory*; do STATE=`cat $MEMORY/state` if [ $STATE == "offline" ]; then echo $ONLINE_TYPE > $MEMORY/state fi done fi === Example /usr/lib/systemd/system/config-memhotplug.service === [Unit] Description=Configure memory hotplug behavior DefaultDependencies=no Conflicts=shutdown.target Before=sysinit.target shutdown.target After=systemd-modules-load.service ConditionPathExists=|/sys/devices/system/memory/auto_online_blocks [Service] ExecStart=/usr/libexec/config-memhotplug Type=oneshot TimeoutSec=0 RemainAfterExit=yes [Install] WantedBy=sysinit.target === Example modification to the 40-redhat.rules [2] === : diff --git a/40-redhat.rules b/40-redhat.rules-new : index 2c690e5..168fd03 100644 : --- a/40-redhat.rules : +++ b/40-redhat.rules-new : @@ -6,6 +6,9 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online} : # Memory hotadd request : SUBSYSTEM!="memory", GOTO="memory_hotplug_end" : ACTION!="add", GOTO="memory_hotplug_end" : +# memory hotplug behavior configured : +PROGRAM=="grep online /sys/devices/system/memory/auto_online_blocks", GOTO="memory_hotplug_end" : + : PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end" : : ENV{.state}="online" === [1] https://github.com/lnykryn/systemd-rhel/pull/281 [2] https://github.com/lnykryn/systemd-rhel/blob/staging/rules/40-redhat.rules This patch (of 8): The name is misleading and it's not really clear what is "kept". Let's just name it like the online_type name we expose to user space ("online"). Add some documentation to the types. Signed-off-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NWei Yang <richard.weiyang@gmail.com> Reviewed-by: NBaoquan He <bhe@redhat.com> Acked-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Yumei Huang <yuhuang@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Paul Mackerras <paulus@samba.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Link: http://lkml.kernel.org/r/20200319131221.14044-1-david@redhat.com Link: http://lkml.kernel.org/r/20200317104942.11178-2-david@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-