- 18 11月, 2021 40 次提交
-
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit ee801b7d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- DAMON-based operation schemes need to be manually turned on and off. In some use cases, however, the condition for turning a scheme on and off would depend on the system's situation. For example, schemes for proactive pages reclamation would need to be turned on when some memory pressure is detected, and turned off when the system has enough free memory. For easier control of schemes activation based on the system situation, this introduces a watermarks-based mechanism. The client can describe the watermark metric (e.g., amount of free memory in the system), watermark check interval, and three watermarks, namely high, mid, and low. If the scheme is deactivated, it only gets the metric and compare that to the three watermarks for every check interval. If the metric is higher than the high watermark, the scheme is deactivated. If the metric is between the mid watermark and the low watermark, the scheme is activated. If the metric is lower than the low watermark, the scheme is deactivated again. This is to allow users fall back to traditional page-granularity mechanisms. Link: https://lkml.kernel.org/r/20211019150731.16699-12-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit ee801b7d) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 5a0d6a08 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This updates the DAMON selftests for 'schemes' debugfs file, as the file format is updated. Link: https://lkml.kernel.org/r/20211019150731.16699-11-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 5a0d6a08) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit f4a68b4a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This allows DAMON debugfs interface users set the prioritization weights by putting three more numbers to the 'schemes' file. Link: https://lkml.kernel.org/r/20211019150731.16699-10-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit f4a68b4a) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 198f0f4c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This makes the default monitoring primitives for virtual address spaces and the physical address sapce to support memory regions prioritization for 'PAGEOUT' DAMOS action. It calculates hotness of each region as weighted sum of 'nr_accesses' and 'age' of the region and get the priority score as reverse of the hotness, so that cold regions can be paged out first. Link: https://lkml.kernel.org/r/20211019150731.16699-9-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 198f0f4c) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 38683e00 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This makes DAMON apply schemes to regions having higher priority first, if it cannot apply schemes to all regions due to the quotas. The prioritization function should be implemented in the monitoring primitives. Those would commonly calculate the priority of the region using attributes of regions, namely 'size', 'nr_accesses', and 'age'. For example, some primitive would calculate the priority of each region using a weighted sum of 'nr_accesses' and 'age' of the region. The optimal weights would depend on give environments, so this makes those customizable. Nevertheless, the score calculation functions are only encouraged to respect the weights, not mandated. Link: https://lkml.kernel.org/r/20211019150731.16699-8-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 38683e00) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit a2cb4dd0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This updates DAMON selftests to support updated schemes debugfs file format for the quotas. Link: https://lkml.kernel.org/r/20211019150731.16699-7-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit a2cb4dd0) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit d7d0ec85 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This makes the debugfs interface of DAMON support the scheme quotas by chaning the format of the input for the schemes file. Link: https://lkml.kernel.org/r/20211019150731.16699-6-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit d7d0ec85) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 1cd24303 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- The size quota feature of DAMOS is useful for IO resource-critical systems, but not so intuitive for CPU time-critical systems. Systems using zram or zswap-like swap device would be examples. To provide another intuitive ways for such systems, this implements time-based quota for DAMON-based Operation Schemes. If the quota is set, DAMOS tries to use only up to the user-defined quota of CPU time within a given time window. Link: https://lkml.kernel.org/r/20211019150731.16699-5-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 1cd24303) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 50585192 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- If DAMOS has stopped applying action in the middle of a group of memory regions due to its size quota, it starts the work again from the beginning of the address space in the next charge window. If there is a huge memory region at the beginning of the address space and it fulfills the scheme's target data access pattern always, the action will applied to only the region. This mitigates the case by skipping memory regions that charged in current charge window at the beginning of next charge window. Link: https://lkml.kernel.org/r/20211019150731.16699-4-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 50585192) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 2b8a248d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- There could be arbitrarily large memory regions fulfilling the target data access pattern of a DAMON-based operation scheme. In the case, applying the action of the scheme could incur too high overhead. To provide an intuitive way for avoiding it, this implements a feature called size quota. If the quota is set, DAMON tries to apply the action only up to the given amount of memory regions within a given time window. Link: https://lkml.kernel.org/r/20211019150731.16699-3-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 2b8a248d) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 57223ac2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- Introduction Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> ============ This patchset 1) makes the engine for general data access pattern-oriented memory management (DAMOS) be more useful for production environments, and 2) implements a static kernel module for lightweight proactive reclamation using the engine. Proactive Reclamation --------------------- On general memory over-committed systems, proactively reclaiming cold pages helps saving memory and reducing latency spikes that incurred by the direct reclaim or the CPU consumption of kswapd, while incurring only minimal performance degradation[2]. A Free Pages Reporting[8] based memory over-commit virtualization system would be one more specific use case. In the system, the guest VMs reports their free memory to host, and the host reallocates the reported memory to other guests. As a result, the system's memory utilization can be maximized. However, the guests could be not so memory-frugal, because some kernel subsystems and user-space applications are designed to use as much memory as available. Then, guests would report only small amount of free memory to host, results in poor memory utilization. Running the proactive reclamation in such guests could help mitigating this problem. Google has also implemented this idea and using it in their data center. They further proposed upstreaming it in LSFMM'19, and "the general consensus was that, while this sort of proactive reclaim would be useful for a number of users, the cost of this particular solution was too high to consider merging it upstream"[3]. The cost mainly comes from the coldness tracking. Roughly speaking, the implementation periodically scans the 'Accessed' bit of each page. For the reason, the overhead linearly increases as the size of the memory and the scanning frequency grows. As a result, Google is known to dedicating one CPU for the work. That's a reasonable option to someone like Google, but it wouldn't be so to some others. DAMON and DAMOS: An engine for data access pattern-oriented memory management ----------------------------------------------------------------------------- DAMON[4] is a framework for general data access monitoring. Its adaptive monitoring overhead control feature minimizes its monitoring overhead. It also let the upper-bound of the overhead be configurable by clients, regardless of the size of the monitoring target memory. While monitoring 70 GiB memory of a production system every 5 milliseconds, it consumes less than 1% single CPU time. For this, it could sacrify some of the quality of the monitoring results. Nevertheless, the lower-bound of the quality is configurable, and it uses a best-effort algorithm for better quality. Our test results[5] show the quality is practical enough. From the production system monitoring, we were able to find a 4 KiB region in the 70 GiB memory that shows highest access frequency. We normally don't monitor the data access pattern just for fun but to improve something like memory management. Proactive reclamation is one such usage. For such general cases, DAMON provides a feature called DAMon-based Operation Schemes (DAMOS)[6]. It makes DAMON an engine for general data access pattern oriented memory management. Using this, clients can ask DAMON to find memory regions of specific data access pattern and apply some memory management action (e.g., page out, move to head of the LRU list, use huge page, ...). We call the request 'scheme'. Proactive Reclamation on top of DAMON/DAMOS ------------------------------------------- Therefore, by using DAMON for the cold pages detection, the proactive reclamation's monitoring overhead issue can be solved. Actually, we previously implemented a version of proactive reclamation using DAMOS and achieved noticeable improvements with our evaluation setup[5]. Nevertheless, it more for a proof-of-concept, rather than production uses. It supports only virtual address spaces of processes, and require additional tuning efforts for given workloads and the hardware. For the tuning, we introduced a simple auto-tuning user space tool[8]. Google is also known to using a ML-based similar approach for their fleets[2]. But, making it just works with intuitive knobs in the kernel would be helpful for general users. To this end, this patchset improves DAMOS to be ready for such production usages, and implements another version of the proactive reclamation, namely DAMON_RECLAIM, on top of it. DAMOS Improvements: Aggressiveness Control, Prioritization, and Watermarks -------------------------------------------------------------------------- First of all, the current version of DAMOS supports only virtual address spaces. This patchset makes it supports the physical address space for the page out action. Next major problem of the current version of DAMOS is the lack of the aggressiveness control, which can results in arbitrary overhead. For example, if huge memory regions having the data access pattern of interest are found, applying the requested action to all of the regions could incur significant overhead. It can be controlled by tuning the target data access pattern with manual or automated approaches[2,7]. But, some people would prefer the kernel to just work with only intuitive tuning or default values. For such cases, this patchset implements a safeguard, namely time/size quota. Using this, the clients can specify up to how much time can be used for applying the action, and/or up to how much memory regions the action can be applied within a user-specified time duration. A followup question is, to which memory regions should the action applied within the limits? We implement a simple regions prioritization mechanism for each action and make DAMOS to apply the action to high priority regions first. It also allows clients tune the prioritization mechanism to use different weights for size, access frequency, and age of memory regions. This means we could use not only LRU but also LFU or some fancy algorithms like CAR[9] with lightweight overhead. Though DAMON is lightweight, someone would want to remove even the cold pages monitoring overhead when it is unnecessary. Currently, it should manually turned on and off by clients, but some clients would simply want to turn it on and off based on some metrics like free memory ratio or memory fragmentation. For such cases, this patchset implements a watermarks-based automatic activation feature. It allows the clients configure the metric of their interest, and three watermarks of the metric. If the metric is higher than the high watermark or lower than the low watermark, the scheme is deactivated. If the metric is lower than the mid watermark but higher than the low watermark, the scheme is activated. DAMON-based Reclaim ------------------- Using the improved version of DAMOS, this patchset implements a static kernel module called 'damon_reclaim'. It finds memory regions that didn't accessed for specific time duration and page out. Consuming too much CPU for the paging out operations, or doing pageout too frequently can be critical for systems configuring their swap devices with software-defined in-memory block devices like zram/zswap or total number of writes limited devices like SSDs, respectively. To avoid the problems, the time/size quotas can be configured. Under the quotas, it pages out memory regions that didn't accessed longer first. Also, to remove the monitoring overhead under peaceful situation, and to fall back to the LRU-list based page granularity reclamation when it doesn't make progress, the three watermarks based activation mechanism is used, with the free memory ratio as the watermark metric. For convenient configurations, it provides several module parameters. Using these, sysadmins can enable/disable it, and tune its parameters including the coldness identification time threshold, the time/size quotas and the three watermarks. Evaluation ========== In short, DAMON_RECLAIM with 50ms/s time quota and regions prioritization on v5.15-rc5 Linux kernel with ZRAM swap device achieves 38.58% memory saving with only 1.94% runtime overhead. For this, DAMON_RECLAIM consumes only 4.97% of single CPU time. Setup ----- We evaluate DAMON_RECLAIM to show how each of the DAMOS improvements make effect. For this, we measure DAMON_RECLAIM's CPU consumption, entire system memory footprint, total number of major page faults, and runtime of 24 realistic workloads in PARSEC3 and SPLASH-2X benchmark suites on my QEMU/KVM based virtual machine. The virtual machine runs on an i3.metal AWS instance, has 130GiB memory, and runs a linux kernel built on latest -mm tree[1] plus this patchset. It also utilizes a 4 GiB ZRAM swap device. We repeats the measurement 5 times and use averages. [1] https://github.com/hnaz/linux-mm/tree/v5.15-rc5-mmots-2021-10-13-19-55 Detailed Results ---------------- The results are summarized in the below table. With coldness identification threshold of 5 seconds, DAMON_RECLAIM without the time quota-based speed limit achieves 47.21% memory saving, but incur 4.59% runtime slowdown to the workloads on average. For this, DAMON_RECLAIM consumes about 11.28% single CPU time. Applying time quotas of 200ms/s, 50ms/s, and 10ms/s without the regions prioritization reduces the slowdown to 4.89%, 2.65%, and 1.5%, respectively. Time quota of 200ms/s (20%) makes no real change compared to the quota unapplied version, because the quota unapplied version consumes only 11.28% CPU time. DAMON_RECLAIM's CPU utilization also similarly reduced: 11.24%, 5.51%, and 2.01% of single CPU time. That is, the overhead is proportional to the speed limit. Nevertheless, it also reduces the memory saving because it becomes less aggressive. In detail, the three variants show 48.76%, 37.83%, and 7.85% memory saving, respectively. Applying the regions prioritization (page out regions that not accessed longer first within the time quota) further reduces the performance degradation. Runtime slowdowns and total number of major page faults increase has been 4.89%/218,690% -> 4.39%/166,136% (200ms/s), 2.65%/111,886% -> 1.94%/59,053% (50ms/s), and 1.5%/34,973.40% -> 2.08%/8,781.75% (10ms/s). The runtime under 10ms/s time quota has increased with prioritization, but apparently that's under the margin of error. time quota prioritization memory_saving cpu_util slowdown pgmajfaults overhead N N 47.21% 11.28% 4.59% 194,802% 200ms/s N 48.76% 11.24% 4.89% 218,690% 50ms/s N 37.83% 5.51% 2.65% 111,886% 10ms/s N 7.85% 2.01% 1.5% 34,793.40% 200ms/s Y 50.08% 10.38% 4.39% 166,136% 50ms/s Y 38.58% 4.97% 1.94% 59,053% 10ms/s Y 3.63% 1.73% 2.08% 8,781.75% Baseline and Complete Git Trees =============================== The patches are based on the latest -mm tree (v5.15-rc5-mmots-2021-10-13-19-55). You can also clone the complete git tree from: $ git clone git://github.com/sjp38/linux -b damon_reclaim/patches/v1 The web is also available: https://git.kernel.org/pub/scm/linux/kernel/git/sj/linux.git/tag/?h=damon_reclaim/patches/v1 Sequence Of Patches =================== The first patch makes DAMOS support the physical address space for the page out action. Following five patches (patches 2-6) implement the time/size quotas. Next four patches (patches 7-10) implement the memory regions prioritization within the limit. Then, three following patches (patches 11-13) implement the watermarks-based schemes activation. Finally, the last two patches (patches 14-15) implement and document the DAMON-based reclamation using the advanced DAMOS. [1] https://www.kernel.org/doc/html/v5.15-rc1/vm/damon/index.html [2] https://research.google/pubs/pub48551/ [3] https://lwn.net/Articles/787611/ [4] https://damonitor.github.io [5] https://damonitor.github.io/doc/html/latest/vm/damon/eval.html [6] https://lore.kernel.org/linux-mm/20211001125604.29660-1-sj@kernel.org/ [7] https://github.com/awslabs/damoos [8] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html [9] https://www.usenix.org/conference/fast-04/car-clock-adaptive-replacement This patch (of 15): This makes the DAMON primitives for physical address space support the pageout action for DAMON-based Operation Schemes. With this commit, hence, users can easily implement system-level data access-aware reclamations using DAMOS. [sj@kernel.org: fix missing-prototype build warning] Link: https://lkml.kernel.org/r/20211025064220.13904-1-sj@kernel.org Link: https://lkml.kernel.org/r/20211019150731.16699-1-sj@kernel.org Link: https://lkml.kernel.org/r/20211019150731.16699-2-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Marco Elver <elver@google.com> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Greg Thelen <gthelen@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: David Rientjes <rientjes@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 57223ac2) Signed-off-by: NYue Zou <zouyue3@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Rongwei Wang 提交于
mainline inclusion from mainline-5.16 commit 9210622a category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- In some functions, it's unnecessary to declare 'err' and 'ret' variables at the same time. This patch mainly to simplify the issue of such declarations by reusing one variable. Link: https://lkml.kernel.org/r/20211014073014.35754-1-sj@kernel.orgSigned-off-by: NRongwei Wang <rongwei.wang@linux.alibaba.com> Signed-off-by: NSeongJae Park <sj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 9210622a) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Rikard Falkeborn 提交于
mainline inclusion from mainline-5.16 commit 199b50f4 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- The only usage of these structs is to pass their addresses to walk_page_range(), which takes a pointer to const mm_walk_ops as argument. Make them const to allow the compiler to put them in read-only memory. Link: https://lkml.kernel.org/r/20211014075042.17174-2-rikard.falkeborn@gmail.comSigned-off-by: NRikard Falkeborn <rikard.falkeborn@gmail.com> Reviewed-by: NSeongJae Park <sj@kernel.org> Reviewed-by: NAnshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 199b50f4) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit c6380721 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This updates the DAMON documents for the physical memory address space monitoring support. Link: https://lkml.kernel.org/r/20211012205711.29216-8-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rienjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit c6380721) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit c026291a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This makes the 'damon-dbgfs' to support the physical memory monitoring, in addition to the virtual memory monitoring. Users can do the physical memory monitoring by writing a special keyword, 'paddr' to the 'target_ids' debugfs file. Then, DAMON will check the special keyword and configure the monitoring context to run with the primitives for the physical address space. Unlike the virtual memory monitoring, the monitoring target region will not be automatically set. Therefore, users should also set the monitoring target address region using the 'init_regions' debugfs file. Also, note that the physical memory monitoring will not automatically terminated. The user should explicitly turn off the monitoring by writing 'off' to the 'monitor_on' debugfs file. Link: https://lkml.kernel.org/r/20211012205711.29216-7-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rienjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit c026291a) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit a28397be category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This implements the monitoring primitives for the physical memory address space. Internally, it uses the PTE Accessed bit, similar to that of the virtual address spaces monitoring primitives. It supports only user memory pages, as idle pages tracking does. If the monitoring target physical memory address range contains non-user memory pages, access check of the pages will do nothing but simply treat the pages as not accessed. Link: https://lkml.kernel.org/r/20211012205711.29216-6-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rienjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit a28397be) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 46c3a0ac category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This moves functions in the default virtual address spaces monitoring primitives that commonly usable from other address spaces like physical address space into a header file. Those will be reused by the physical address space monitoring primitives which will be implemented by the following commit. [sj@kernel.org: include 'highmem.h' to fix a build failure] Link: https://lkml.kernel.org/r/20211014110848.5204-1-sj@kernel.org Link: https://lkml.kernel.org/r/20211012205711.29216-5-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rienjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 46c3a0ac) Conflicts: mm/damon/vaddr.c (need to include <linux/sched/mm.h> for get_task_mm/mmput function) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit c2fe4987 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This adds description of the 'init_regions' feature in the DAMON usage document. Link: https://lkml.kernel.org/r/20211012205711.29216-4-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rienjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit c2fe4987) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 1c2e11bf category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This adds another test case for the new feature, 'init_regions'. Link: https://lkml.kernel.org/r/20211012205711.29216-3-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Reviewed-by: NBrendan Higgins <brendanhiggins@google.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rienjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 1c2e11bf) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 90bebce9 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- Patch series "DAMON: Support Physical Memory Address Space Monitoring:. DAMON currently supports only virtual address spaces monitoring. It can be easily extended for various use cases and address spaces by configuring its monitoring primitives layer to use appropriate primitives implementations, though. This patchset implements monitoring primitives for the physical address space monitoring using the structure. The first 3 patches allow the user space users manually set the monitoring regions. The 1st patch implements the feature in the 'damon-dbgfs'. Then, patches for adding a unit tests (the 2nd patch) and updating the documentation (the 3rd patch) follow. Following 4 patches implement the physical address space monitoring primitives. The 4th patch makes some primitive functions for the virtual address spaces primitives reusable. The 5th patch implements the physical address space monitoring primitives. The 6th patch links the primitives to the 'damon-dbgfs'. Finally, 7th patch documents this new features. This patch (of 7): Some 'damon-dbgfs' users would want to monitor only a part of the entire virtual memory address space. The program interface users in the kernel space could use '->before_start()' callback or set the regions inside the context struct as they want, but 'damon-dbgfs' users cannot. For that reason, this introduces a new debugfs file called 'init_region'. 'damon-dbgfs' users can specify which initial monitoring target address regions they want by writing special input to the file. The input should describe each region in each line in the below form: <pid> <start address> <end address> Note that the regions will be updated to cover entire memory mapped regions after a 'regions update interval' is passed. If you want the regions to not be updated after the initial setting, you could set the interval as a very long time, say, a few decades. Link: https://lkml.kernel.org/r/20211012205711.29216-1-sj@kernel.org Link: https://lkml.kernel.org/r/20211012205711.29216-2-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Marco Elver <elver@google.com> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Greg Thelen <gthelen@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: David Rienjes <rientjes@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 90bebce9) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 68536f8e category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This adds the description of DAMON-based operation schemes in the DAMON documents. Link: https://lkml.kernel.org/r/20211001125604.29660-8-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rienjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 68536f8e) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 8d5d4c63 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This adds simple selftets for 'schemes' debugfs file of DAMON. Link: https://lkml.kernel.org/r/20211001125604.29660-7-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rienjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 8d5d4c63) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 2f0b548c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- To tune the DAMON-based operation schemes, knowing how many and how large regions are affected by each of the schemes will be helful. Those stats could be used for not only the tuning, but also monitoring of the working set size and the number of regions, if the scheme does not change the program behavior too much. For the reason, this implements the statistics for the schemes. The total number and size of the regions that each scheme is applied are exported to users via '->stat_count' and '->stat_sz' of 'struct damos'. Admins can also check the number by reading 'schemes' debugfs file. The last two integers now represents the stats. To allow collecting the stats without changing the program behavior, this also adds new scheme action, 'DAMOS_STAT'. Note that 'DAMOS_STAT' is not only making no memory operation actions, but also does not reset the age of regions. Link: https://lkml.kernel.org/r/20211001125604.29660-6-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rienjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 2f0b548c) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit af122dd8 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This makes 'damon-dbgfs' to support the data access monitoring oriented memory management schemes. Users can read and update the schemes using ``<debugfs>/damon/schemes`` file. The format is:: <min/max size> <min/max access frequency> <min/max age> <action> Link: https://lkml.kernel.org/r/20211001125604.29660-5-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rienjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit af122dd8) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 6dea8add category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This makes DAMON's default primitives for virtual address spaces to support DAMON-based Operation Schemes (DAMOS) by implementing actions application functions and registering it to the monitoring context. The implementation simply links 'madvise()' for related DAMOS actions. That is, 'madvise(MADV_WILLNEED)' is called for 'WILLNEED' DAMOS action and similar for other actions ('COLD', 'PAGEOUT', 'HUGEPAGE', 'NOHUGEPAGE'). So, the kernel space DAMON users can now use the DAMON-based optimizations with only small amount of code. Link: https://lkml.kernel.org/r/20211001125604.29660-4-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rienjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 6dea8add) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 1f366e42 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- In many cases, users might use DAMON for simple data access aware memory management optimizations such as applying an operation scheme to a memory region of a specific size having a specific access frequency for a specific time. For example, "page out a memory region larger than 100 MiB but having a low access frequency more than 10 minutes", or "Use THP for a memory region larger than 2 MiB having a high access frequency for more than 2 seconds". Most simple form of the solution would be doing offline data access pattern profiling using DAMON and modifying the application source code or system configuration based on the profiling results. Or, developing a daemon constructed with two modules (one for access monitoring and the other for applying memory management actions via mlock(), madvise(), sysctl, etc) is imaginable. To avoid users spending their time for implementation of such simple data access monitoring-based operation schemes, this makes DAMON to handle such schemes directly. With this change, users can simply specify their desired schemes to DAMON. Then, DAMON will automatically apply the schemes to the user-specified target processes. Each of the schemes is composed with conditions for filtering of the target memory regions and desired memory management action for the target. Specifically, the format is:: <min/max size> <min/max access frequency> <min/max age> <action> The filtering conditions are size of memory region, number of accesses to the region monitored by DAMON, and the age of the region. The age of region is incremented periodically but reset when its addresses or access frequency has significantly changed or the action of a scheme was applied. For the action, current implementation supports a few of madvise()-like hints, ``WILLNEED``, ``COLD``, ``PAGEOUT``, ``HUGEPAGE``, and ``NOHUGEPAGE``. Because DAMON supports various address spaces and application of the actions to a monitoring target region is dependent to the type of the target address space, the application code should be implemented by each primitives and registered to the framework. Note that this only implements the framework part. Following commit will implement the action applications for virtual address spaces primitives. Link: https://lkml.kernel.org/r/20211001125604.29660-3-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Hildenbrand <david@redhat.com> Cc: David Rienjes <rientjes@google.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Greg Thelen <gthelen@google.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Marco Elver <elver@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 1f366e42) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit fda504fa category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- Patch series "Implement Data Access Monitoring-based Memory Operation Schemes". Introduction Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> ============ DAMON[1] can be used as a primitive for data access aware memory management optimizations. For that, users who want such optimizations should run DAMON, read the monitoring results, analyze it, plan a new memory management scheme, and apply the new scheme by themselves. Such efforts will be inevitable for some complicated optimizations. However, in many other cases, the users would simply want the system to apply a memory management action to a memory region of a specific size having a specific access frequency for a specific time. For example, "page out a memory region larger than 100 MiB keeping only rare accesses more than 2 minutes", or "Do not use THP for a memory region larger than 2 MiB rarely accessed for more than 1 seconds". To make the works easier and non-redundant, this patchset implements a new feature of DAMON, which is called Data Access Monitoring-based Operation Schemes (DAMOS). Using the feature, users can describe the normal schemes in a simple way and ask DAMON to execute those on its own. [1] https://damonitor.github.io Evaluations =========== DAMOS is accurate and useful for memory management optimizations. An experimental DAMON-based operation scheme for THP, 'ethp', removes 76.15% of THP memory overheads while preserving 51.25% of THP speedup. Another experimental DAMON-based 'proactive reclamation' implementation, 'prcl', reduces 93.38% of residential sets and 23.63% of system memory footprint while incurring only 1.22% runtime overhead in the best case (parsec3/freqmine). NOTE that the experimental THP optimization and proactive reclamation are not for production but only for proof of concepts. Please refer to the showcase web site's evaluation document[1] for detailed evaluation setup and results. [1] https://damonitor.github.io/doc/html/v34/vm/damon/eval.html Long-term Support Trees ----------------------- For people who want to test DAMON but using LTS kernels, there are another couple of trees based on two latest LTS kernels respectively and containing the 'damon/master' backports. - For v5.4.y: https://git.kernel.org/sj/h/damon/for-v5.4.y - For v5.10.y: https://git.kernel.org/sj/h/damon/for-v5.10.y Sequence Of Patches =================== The 1st patch accounts age of each region. The 2nd patch implements the core of the DAMON-based operation schemes feature. The 3rd patch makes the default monitoring primitives for virtual address spaces to support the schemes. From this point, the kernel space users can use DAMOS. The 4th patch exports the feature to the user space via the debugfs interface. The 5th patch implements schemes statistics feature for easier tuning of the schemes and runtime access pattern analysis, and the 6th patch adds selftests for these changes. Finally, the 7th patch documents this new feature. This patch (of 7): DAMON can be used for data access pattern aware memory management optimizations. For that, users should run DAMON, read the monitoring results, analyze it, plan a new memory management scheme, and apply the new scheme by themselves. It would not be too hard, but still require some level of effort. For complicated cases, this effort is inevitable. That said, in many cases, users would simply want to apply an actions to a memory region of a specific size having a specific access frequency for a specific time. For example, "page out a memory region larger than 100 MiB but having a low access frequency more than 10 minutes", or "Use THP for a memory region larger than 2 MiB having a high access frequency for more than 2 seconds". For such optimizations, users will need to first account the age of each region themselves. To reduce such efforts, this implements a simple age account of each region in DAMON. For each aggregation step, DAMON compares the access frequency with that from last aggregation and reset the age of the region if the change is significant. Else, the age is incremented. Also, in case of the merge of regions, the region size-weighted average of the ages is set as the age of merged new region. Link: https://lkml.kernel.org/r/20211001125604.29660-1-sj@kernel.org Link: https://lkml.kernel.org/r/20211001125604.29660-2-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Cc: Amit Shah <amit@kernel.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: David Hildenbrand <david@redhat.com> Cc: David Woodhouse <dwmw@amazon.com> Cc: Marco Elver <elver@google.com> Cc: Leonard Foerster <foersleo@amazon.de> Cc: Greg Thelen <gthelen@google.com> Cc: Markus Boehme <markubo@amazon.de> Cc: David Rienjes <rientjes@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit fda504fa) Signed-off-by: NYue Zou <zouyue3@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Colin Ian King 提交于
mainline inclusion from mainline-5.16 commit 7ec1992b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- Currently a plain integer is being used to nullify the pointer ctx->kdamond. Use NULL instead. Cleans up sparse warning: mm/damon/core.c:317:40: warning: Using plain integer as NULL pointer Link: https://lkml.kernel.org/r/20210925215908.181226-1-colin.king@canonical.comSigned-off-by: NColin Ian King <colin.king@canonical.com> Reviewed-by: NSeongJae Park <sj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 7ec1992b) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Changbin Du 提交于
mainline inclusion from mainline-5.16 commit 42e4cef5 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- Just get the pid by 'current->pid'. Meanwhile, to be symmetrical make the 'starts' and 'finishes' logs both use debug level. Link: https://lkml.kernel.org/r/20210927232432.17750-1-changbin.du@gmail.comSigned-off-by: NChangbin Du <changbin.du@gmail.com> Reviewed-by: NSeongJae Park <sj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 42e4cef5) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Changbin Du 提交于
mainline inclusion from mainline-5.16 commit 5f7fe2b9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- Just return from the kthread function. Link: https://lkml.kernel.org/r/20210927232421.17694-1-changbin.du@gmail.comSigned-off-by: NChangbin Du <changbin.du@gmail.com> Cc: SeongJae Park <sjpark@amazon.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 5f7fe2b9) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 704571f9 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- Logging of kdamond startup is using 'pr_info()' unnecessarily. This makes it to use 'pr_debug()' instead. Link: https://lkml.kernel.org/r/20210917123958.3819-6-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: SeongJae Park <sjpark@amazon.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 704571f9) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit d2f272b3 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- A few Kernel-doc comments in 'damon.h' are broken. This fixes them. Link: https://lkml.kernel.org/r/20210917123958.3819-5-sj@kernel.orgSigned-off-by: NSeongJae Park <sjpark@amazon.de> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit d2f272b3) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit 876d0aac category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- Building DAMON documents warns for a reference to nonexisting doc, as below: $ time make htmldocs [...] Documentation/vm/damon/index.rst:24: WARNING: toctree contains reference to nonexisting document 'vm/damon/plans' This fixes the warning by removing the wrong reference. Link: https://lkml.kernel.org/r/20210917123958.3819-4-sj@kernel.orgSigned-off-by: NSeongJae Park <sjpark@amazon.de> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 876d0aac) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit f9803a99 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- This updates SeongJae's email address in MAINTAINERS file to his preferred one. Link: https://lkml.kernel.org/r/20210917123958.3819-3-sj@kernel.orgSigned-off-by: NSeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: SeongJae Park <sjpark@amazon.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit f9803a99) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.16 commit ad782c48 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- Most memory management user guide documents are in 'admin-guide/mm/', but two of those are in 'vm/'. This moves the two docs into 'admin-guide/mm' for easier documents finding. Link: https://lkml.kernel.org/r/20210917123958.3819-2-sj@kernel.orgSigned-off-by: NSeongJae Park <sjpark@amazon.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit ad782c48) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Geert Uytterhoeven 提交于
mainline inclusion from mainline-5.16 commit f24b0626 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- Correct a singular versus plural grammar mistake in the help text for the DAMON_VADDR config symbol. Link: https://lkml.kernel.org/r/20210914073451.3883834-1-geert@linux-m68k.org Fixes: 3f49584b ("mm/damon: implement primitives for the virtual memory address spaces") Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org> Reviewed-by: NSeongJae Park <sjpark@amazon.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit f24b0626) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 SeongJae Park 提交于
mainline inclusion from mainline-5.15 commit 2e014660 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- Kunit test cases for 'damon_split_regions_of()' expects the number of regions after calling the function will be same to their request ('nr_sub'). However, the requested number is just an upper-limit, because the function randomly decides the size of each sub-region. This fixes the wrong expectation. Link: https://lkml.kernel.org/r/20211028090628.14948-1-sj@kernel.org Fixes: 17ccae8b ("mm/damon: add kunit tests") Signed-off-by: NSeongJae Park <sj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 2e014660) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Adam Borowski 提交于
mainline inclusion from mainline-5.15-rc3 commit 892ab4bb category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK CVE: NA ------------------------------------------------- gcc knows the true length too, and rightfully complains. Link: https://lkml.kernel.org/r/20210912204447.10427-1-kilobyte@angband.plSigned-off-by: NAdam Borowski <kilobyte@angband.pl> Cc: SeongJae Park <sj38.park@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 892ab4bb) Signed-off-by: NYue Zou <zouyue3@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Barry Song 提交于
mainline inclusion from tip/sched/core for v5.16 commit: 778c558f category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4H33U CVE: NA Reference: https://lore.kernel.org/lkml/20210924085104.44806-1-21cnbao@gmail.com/ ------------------------------------------------------------------------ This patch adds scheduler level for clusters and automatically enables the load balance among clusters. It will directly benefit a lot of workload which loves more resources such as memory bandwidth, caches. Testing has widely been done in two different hardware configurations of Kunpeng920: 24 cores in one NUMA(6 clusters in each NUMA node); 32 cores in one NUMA(8 clusters in each NUMA node) Workload is running on either one NUMA node or four NUMA nodes, thus, this can estimate the effect of cluster spreading w/ and w/o NUMA load balance. * Stream benchmark: 4threads stream (on 1NUMA * 24cores = 24cores) stream stream w/o patch w/ patch MB/sec copy 29929.64 ( 0.00%) 32932.68 ( 10.03%) MB/sec scale 29861.10 ( 0.00%) 32710.58 ( 9.54%) MB/sec add 27034.42 ( 0.00%) 32400.68 ( 19.85%) MB/sec triad 27225.26 ( 0.00%) 31965.36 ( 17.41%) 6threads stream (on 1NUMA * 24cores = 24cores) stream stream w/o patch w/ patch MB/sec copy 40330.24 ( 0.00%) 42377.68 ( 5.08%) MB/sec scale 40196.42 ( 0.00%) 42197.90 ( 4.98%) MB/sec add 37427.00 ( 0.00%) 41960.78 ( 12.11%) MB/sec triad 37841.36 ( 0.00%) 42513.64 ( 12.35%) 12threads stream (on 1NUMA * 24cores = 24cores) stream stream w/o patch w/ patch MB/sec copy 52639.82 ( 0.00%) 53818.04 ( 2.24%) MB/sec scale 52350.30 ( 0.00%) 53253.38 ( 1.73%) MB/sec add 53607.68 ( 0.00%) 55198.82 ( 2.97%) MB/sec triad 54776.66 ( 0.00%) 56360.40 ( 2.89%) Thus, it could help memory-bound workload especially under medium load. Similar improvement is also seen in lkp-pbzip2: * lkp-pbzip2 benchmark 2-96 threads (on 4NUMA * 24cores = 96cores) lkp-pbzip2 lkp-pbzip2 w/o patch w/ patch Hmean tput-2 11062841.57 ( 0.00%) 11341817.51 * 2.52%* Hmean tput-5 26815503.70 ( 0.00%) 27412872.65 * 2.23%* Hmean tput-8 41873782.21 ( 0.00%) 43326212.92 * 3.47%* Hmean tput-12 61875980.48 ( 0.00%) 64578337.51 * 4.37%* Hmean tput-21 105814963.07 ( 0.00%) 111381851.01 * 5.26%* Hmean tput-30 150349470.98 ( 0.00%) 156507070.73 * 4.10%* Hmean tput-48 237195937.69 ( 0.00%) 242353597.17 * 2.17%* Hmean tput-79 360252509.37 ( 0.00%) 362635169.23 * 0.66%* Hmean tput-96 394571737.90 ( 0.00%) 400952978.48 * 1.62%* 2-24 threads (on 1NUMA * 24cores = 24cores) lkp-pbzip2 lkp-pbzip2 w/o patch w/ patch Hmean tput-2 11071705.49 ( 0.00%) 11296869.10 * 2.03%* Hmean tput-4 20782165.19 ( 0.00%) 21949232.15 * 5.62%* Hmean tput-6 30489565.14 ( 0.00%) 33023026.96 * 8.31%* Hmean tput-8 40376495.80 ( 0.00%) 42779286.27 * 5.95%* Hmean tput-12 61264033.85 ( 0.00%) 62995632.78 * 2.83%* Hmean tput-18 86697139.39 ( 0.00%) 86461545.74 ( -0.27%) Hmean tput-24 104854637.04 ( 0.00%) 104522649.46 * -0.32%* In the case of 6 threads and 8 threads, we see the greatest performance improvement. Similar improvement can be seen on lkp-pixz though the improvement is smaller: * lkp-pixz benchmark 2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores) lkp-pixz lkp-pixz w/o patch w/ patch Hmean tput-2 6486981.16 ( 0.00%) 6561515.98 * 1.15%* Hmean tput-4 11645766.38 ( 0.00%) 11614628.43 ( -0.27%) Hmean tput-6 15429943.96 ( 0.00%) 15957350.76 * 3.42%* Hmean tput-8 19974087.63 ( 0.00%) 20413746.98 * 2.20%* Hmean tput-12 28172068.18 ( 0.00%) 28751997.06 * 2.06%* Hmean tput-18 39413409.54 ( 0.00%) 39896830.55 * 1.23%* Hmean tput-24 49101815.85 ( 0.00%) 49418141.47 * 0.64%* * SPECrate benchmark 4,8,16 copies mcf_r(on 1NUMA * 32cores = 32cores) Base Base Run Time Rate ------- --------- 4 Copies w/o 580 (w/ 570) w/o 11.1 (w/ 11.3) 8 Copies w/o 647 (w/ 605) w/o 20.0 (w/ 21.4, +7%) 16 Copies w/o 844 (w/ 844) w/o 30.6 (w/ 30.6) 32 Copies(on 4NUMA * 32 cores = 128cores) [w/o patch] Base Base Base Benchmarks Copies Run Time Rate --------------- ------- --------- --------- 500.perlbench_r 32 584 87.2 * 502.gcc_r 32 503 90.2 * 505.mcf_r 32 745 69.4 * 520.omnetpp_r 32 1031 40.7 * 523.xalancbmk_r 32 597 56.6 * 525.x264_r 1 -- CE 531.deepsjeng_r 32 336 109 * 541.leela_r 32 556 95.4 * 548.exchange2_r 32 513 163 * 557.xz_r 32 530 65.2 * Est. SPECrate2017_int_base 80.3 [w/ patch] Base Base Base Benchmarks Copies Run Time Rate --------------- ------- --------- --------- 500.perlbench_r 32 580 87.8 (+0.688%) * 502.gcc_r 32 477 95.1 (+5.432%) * 505.mcf_r 32 644 80.3 (+13.574%) * 520.omnetpp_r 32 942 44.6 (+9.58%) * 523.xalancbmk_r 32 560 60.4 (+6.714%%) * 525.x264_r 1 -- CE 531.deepsjeng_r 32 337 109 (+0.000%) * 541.leela_r 32 554 95.6 (+0.210%) * 548.exchange2_r 32 515 163 (+0.000%) * 557.xz_r 32 524 66.0 (+1.227%) * Est. SPECrate2017_int_base 83.7 (+4.062%) On the other hand, it is slightly helpful to CPU-bound tasks like kernbench: * 24-96 threads kernbench (on 4NUMA * 24cores = 96cores) kernbench kernbench w/o cluster w/ cluster Min user-24 12054.67 ( 0.00%) 12024.19 ( 0.25%) Min syst-24 1751.51 ( 0.00%) 1731.68 ( 1.13%) Min elsp-24 600.46 ( 0.00%) 598.64 ( 0.30%) Min user-48 12361.93 ( 0.00%) 12315.32 ( 0.38%) Min syst-48 1917.66 ( 0.00%) 1892.73 ( 1.30%) Min elsp-48 333.96 ( 0.00%) 332.57 ( 0.42%) Min user-96 12922.40 ( 0.00%) 12921.17 ( 0.01%) Min syst-96 2143.94 ( 0.00%) 2110.39 ( 1.56%) Min elsp-96 211.22 ( 0.00%) 210.47 ( 0.36%) Amean user-24 12063.99 ( 0.00%) 12030.78 * 0.28%* Amean syst-24 1755.20 ( 0.00%) 1735.53 * 1.12%* Amean elsp-24 601.60 ( 0.00%) 600.19 ( 0.23%) Amean user-48 12362.62 ( 0.00%) 12315.56 * 0.38%* Amean syst-48 1921.59 ( 0.00%) 1894.95 * 1.39%* Amean elsp-48 334.10 ( 0.00%) 332.82 * 0.38%* Amean user-96 12925.27 ( 0.00%) 12922.63 ( 0.02%) Amean syst-96 2146.66 ( 0.00%) 2122.20 * 1.14%* Amean elsp-96 211.96 ( 0.00%) 211.79 ( 0.08%) Note this patch isn't an universal win, it might hurt those workload which can benefit from packing. Though tasks which want to take advantages of lower communication latency of one cluster won't necessarily been packed in one cluster while kernel is not aware of clusters, they have some chance to be randomly packed. But this patch will make them more likely spread. Signed-off-by: NBarry Song <song.bao.hua@hisilicon.com> Tested-by: NYicong Yang <yangyicong@hisilicon.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NYicong Yang <yangyicong@hisilicon.com> Reviewed-by: Ntao zeng <prime.zeng@hisilicon.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Jonathan Cameron 提交于
mainline inclusion from tip/sched/core for v5.15-release commit: c5e22fef category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4GEZS CVE: NA Reference: https://lore.kernel.org/lkml/20210924085104.44806-1-21cnbao@gmail.com/ ------------------------------------------------------------------------ Both ACPI and DT provide the ability to describe additional layers of topology between that of individual cores and higher level constructs such as the level at which the last level cache is shared. In ACPI this can be represented in PPTT as a Processor Hierarchy Node Structure [1] that is the parent of the CPU cores and in turn has a parent Processor Hierarchy Nodes Structure representing a higher level of topology. For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cpus. All clusters share L3 cache data, but each cluster has local L3 tag. On the other hand, each clusters will share some internal system bus. +-----------------------------------+ +---------+ | +------+ +------+ +--------------------------+ | | | CPU0 | | cpu1 | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ cluster | | tag | | | | | CPU2 | | CPU3 | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +----+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | L3 | | data | +-----------------------------------+ | | | +------+ +------+ | +-----------+ | | | | | | | | | | | | | +------+ +------+ +----+ L3 | | | | | | tag | | | | +------+ +------+ | | | | | | | | | | | +-----------+ | | | +------+ +------+ +--------------------------+ | +-----------------------------------| | | +-----------------------------------| | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | +----+ L3 | | | | +------+ +------+ | | tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +---+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | | | +-----------------------------------+ | | +-----------------------------------+ | | | +------+ +------+ +--------------------------+ | | | | | | | +-----------+ | | | +------+ +------+ | | | | | | | | L3 | | | | +------+ +------+ +--+ tag | | | | | | | | | | | | | | +------+ +------+ | +-----------+ | | | | +---------+ +-----------------------------------+ That means spreading tasks among clusters will bring more bandwidth while packing tasks within one cluster will lead to smaller cache synchronization latency. So both kernel and userspace will have a chance to leverage this topology to deploy tasks accordingly to achieve either smaller cache latency within one cluster or an even distribution of load among clusters for higher throughput. This patch exposes cluster topology to both kernel and userspace. Libraried like hwloc will know cluster by cluster_cpus and related sysfs attributes. PoC of HWLOC support at [2]. Note this patch only handle the ACPI case. Special consideration is needed for SMT processors, where it is necessary to move 2 levels up the hierarchy from the leaf nodes (thus skipping the processor core level). Note that arm64 / ACPI does not provide any means of identifying a die level in the topology but that may be unrelate to the cluster level. [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node structure (Type 0) [2] https://github.com/hisilicon/hwloc/tree/linux-clusterSigned-off-by: NJonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: NTian Tao <tiantao6@hisilicon.com> Signed-off-by: NBarry Song <song.bao.hua@hisilicon.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.comSigned-off-by: NYicong Yang <yangyicong@hisilicon.com> Reviewed-by: Ntao zeng <prime.zeng@hisilicon.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-