- 30 April 2020, 4 commits
-
-
Committed by Pankaj Gupta

fix #27138800

commit 32de1484648a837db5dea0a7007fe7136804e392 upstream.

This patch introduces the 'daxdev_mapping_supported' helper, which checks
whether 'MAP_SYNC' is supported with a filesystem mapping. It also checks
whether the corresponding dax_device is synchronous. The virtio pmem device
is asynchronous and does not support VM_SYNC.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
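
For reference, a minimal sketch of what such a helper looks like, based on the
description above (simplified; not necessarily the exact upstream code):

    static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
                                                struct dax_device *dax_dev)
    {
            /* Mappings without MAP_SYNC need no synchronous-flush guarantee. */
            if (!(vma->vm_flags & VM_SYNC))
                    return true;
            /* MAP_SYNC only makes sense on a DAX file... */
            if (!IS_DAX(file_inode(vma->vm_file)))
                    return false;
            /* ...backed by a synchronous dax_device (virtio-pmem is not). */
            return dax_synchronous(dax_dev);
    }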
-
Committed by Pankaj Gupta

fix #27138800

commit fefc1d97fa4b5e016bbe15447dc3edcd9e1bcb9f upstream.

This patch adds the 'DAXDEV_SYNC' flag, which is set for an nd_region that
performs synchronous flush. It is later used to disable the MAP_SYNC
functionality of the ext4 and xfs filesystems on devices that don't support
synchronous flush.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Pankaj Gupta

fix #27138800

commit 6e84200c0a2994b991259d19450eee561029bf70 upstream.

This patch adds a virtio-pmem driver for KVM guests. The guest reads the
persistent memory range information from Qemu over VIRTIO and registers it
on the nvdimm_bus. It also creates an nd_region object with the persistent
memory range information so that the existing 'nvdimm/pmem' driver can
reserve this range in the system memory map. This way the 'virtio-pmem'
driver reuses the existing functionality of the pmem driver to register
persistent memory compatible with DAX-capable filesystems. It also provides
a function to perform a guest flush over VIRTIO from the 'pmem' driver when
userspace performs a flush on a DAX memory range.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jakub Staron <jstaron@google.com>
Tested-by: Jakub Staron <jstaron@google.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Pankaj Gupta

fix #27138800

commit c5d4355d10d414a96ca870b731756b89d068d57a upstream.

This patch adds functionality to perform a flush from the guest to the host
over VIRTIO. We register a flush callback based on the 'nd_region' type: the
virtio_pmem driver requires its special flush function, while for the rest of
the region types we register the existing flush function. An error returned
by a host fsync failure is reported to userspace.

Signed-off-by: Pankaj Gupta <pagupta@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
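
A rough sketch of the dispatch this describes (simplified; the callback wiring
follows the commit text, the exact upstream error handling may differ):

    /* Called when userspace flushes a DAX range through the pmem driver. */
    int nvdimm_flush(struct nd_region *nd_region, struct bio *bio)
    {
            /* virtio-pmem regions install their own callback (guest->host flush) */
            if (nd_region->flush)
                    return nd_region->flush(nd_region, bio);

            /* all other region types keep the existing WPQ flush path */
            return generic_nvdimm_flush(nd_region);
    }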
-
- 29 April 2020, 2 commits
-
-
Committed by Al Viro

fix #27211210

commit 6c2d4798a8d16cf4f3a28c3cd4af4f1dcbbb4d04 upstream.

Most of the callers of lookup_one_len_unlocked() treat negatives as
ERR_PTR(-ENOENT). Provide a helper that does just that. Note that a pinned
positive dentry remains positive - its ->d_inode is stable, etc.; a pinned
_negative_ dentry can become positive at any point as long as you are not
holding its parent at least shared. So using lookup_one_len_unlocked() needs
to be careful; lookup_positive_unlocked() is safer, and it is what the
callers end up open-coding anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
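
A sketch of the helper's shape, based on the description above (close to, but
not guaranteed to match, the upstream code):

    struct dentry *lookup_positive_unlocked(const char *name,
                                            struct dentry *base, int len)
    {
            struct dentry *ret = lookup_one_len_unlocked(name, base, len);

            /* Map a (possibly unstable) negative dentry to ERR_PTR(-ENOENT). */
            if (!IS_ERR(ret) && d_is_negative(ret)) {
                    dput(ret);
                    ret = ERR_PTR(-ENOENT);
            }
            return ret;
    }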
-
Committed by Al Viro

fix #27211210

commit d41efb522e902364ab09c782d511c1bedc388ddd upstream.

There are 4 callers; two proceed to check if the result is positive and fail
with ENOENT if it isn't; one (in handle_lookup_down()) is guaranteed to yield
a positive dentry, and one (in lookup_fast()) is _preceded_ by a positivity
check. However, follow_managed() on a negative dentry is a (fairly cheap)
no-op on anything other than autofs. And negative autofs dentries are never
hashed, so lookup_fast() is not going to run into one of those. Moreover, a
successful follow_managed() on a _positive_ dentry never yields a negative
one (and we significantly rely upon that in the callers of lookup_fast()).
In other words, we can easily transpose the positivity check and the call of
follow_managed() in lookup_fast(). And that allows us to fold the positivity
check *into* follow_managed(), simplifying life for the code downstream of
its calls.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
- 28 April 2020, 1 commit
-
-
Committed by Babu Moger

to #26613714

commit 6fe07ce35e8ad870ba1cf82e0481e0fc0f526eff upstream.

The resource control feature is supported by both Intel and AMD. So, rename
CONFIG_INTEL_RDT to the vendor-neutral CONFIG_RESCTRL. Now CONFIG_RESCTRL
will be used for both Intel and AMD to enable Resource Control support.
Update the config texts and conditions accordingly.

[ bp: Simplify Kconfig text. ]

Signed-off-by: Babu Moger <babu.moger@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Dmitry Safonov <dima@arista.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kate Stewart <kstewart@linuxfoundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: <linux-doc@vger.kernel.org>
Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Philippe Ombredanne <pombredanne@nexb.com>
Cc: Pu Wen <puwen@hygon.cn>
Cc: <qianyue.zj@alibaba-inc.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Reinette Chatre <reinette.chatre@intel.com>
Cc: Rian Hunter <rian@alum.mit.edu>
Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: <xiaochen.shen@intel.com>
Link: https://lkml.kernel.org/r/20181121202811.4492-9-babu.moger@amd.com
[ Shile: fixed conflict in arch/x86/Kconfig ]
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Tested-by: WANG Siyuan <Siyuan.Wang@amd.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
- 26 April 2020, 1 commit
-
-
Committed by zhongjiang-ali

to #24843736

Excessive memcg OOM report output can trigger a softlockup. Currently the
same ratelimit state (oom_rs) is shared between system OOM and memcg OOM to
throttle the printed messages, but memcg OOMs hit the limit far more often,
so a large amount of output can still be produced and is likely to trigger
the softlockup. This patch uses separate ratelimits for memcg OOM and system
OOM. With the default ratelimit values applied to memcg, the issue no longer
reproduces in our tests.

[xuyu: adjust corresponding sysctl indexes]

Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
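
A minimal sketch of the split described here, using the usual ratelimit
helpers (the interval/burst values and the hook point are illustrative, not
necessarily what the patch uses):

    #include <linux/ratelimit.h>

    /* one ratelimit state for system-wide OOMs, a separate one for memcg OOMs */
    static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
                                  DEFAULT_RATELIMIT_BURST);
    static DEFINE_RATELIMIT_STATE(memcg_oom_rs, DEFAULT_RATELIMIT_INTERVAL,
                                  DEFAULT_RATELIMIT_BURST);

    static void dump_header(struct oom_control *oc, struct task_struct *p)
    {
            struct ratelimit_state *rs = is_memcg_oom(oc) ? &memcg_oom_rs : &oom_rs;

            if (!__ratelimit(rs))
                    return;
            /* ... emit the usual OOM report ... */
    }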
-
- 24 April 2020, 5 commits
-
-
Committed by Yihao Wu

to #26424323

We account iowait when the cgroup's se is idle and there is a blocked task
somewhere in the hierarchy of se->my_q. To achieve this, we also add
cg_nr_running to track the hierarchical number of blocked tasks; it is
updated when a blocked task wakes up or when a task becomes blocked.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
Committed by Yihao Wu

to #26424323

From the previous patch we know there are 4 possible states. Since the steal
state's transitions are complex, we choose to account it as the complement of
the others:

    steal = elapse - idle - sum_exec_raw - ineffective

where elapse is the time since the cgroup was created, sum_exec_raw is the
running time including IRQ time, and ineffective is the total time during
which the cpuset bound to the cpuacct does not allow this cpu for the cgroup.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
Committed by Yihao Wu

to #26424323

Since we are concerned with idle, let's take idle as the central state and
omit transitions between the other states. Below is the state transition
graph:

    sleep->deque +-----------+  cpumask  +-------+  exit->deque +-------+
                 |ineffective|-----------| idle  | <------------|running|
                 +-----------+           +-------+              +-------+
                                           ^   |
                   unthrtl child -> deque  |   |  wake -> deque
                   thrtl child -> enque    |   |  migrate -> deque
                   migrate -> enque        |   |
                                           |   v
                                         +-------+
                                         | steal |
                                         +-------+

We conclude the idle state condition as: !se->on_rq && !my_q->throttled &&
cpu allowed. From this graph and condition, we can hook (de|en)queue_task_fair,
update_cpumasks_hier, and (un|)throttle_cfs_rq to account the idle state. In
the hooked functions we also check the conditions, to avoid accounting
unwanted cpu clocks.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
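
Expressed as code, the idle condition above is roughly the following predicate
(a sketch for a cgroup's group se; cfs_rq_throttled()/group_cfs_rq() are the
usual fair.c helpers, and the 'allowed' mask stands in for the cpuacct-bound
cpuset check):

    static bool cg_cpu_is_idle(struct sched_entity *se, int cpu,
                               const struct cpumask *allowed)
    {
            /* idle := not on the runqueue, not throttled, and cpu allowed */
            return !se->on_rq &&
                   !cfs_rq_throttled(group_cfs_rq(se)) &&
                   cpumask_test_cpu(cpu, allowed);
    }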
-
Committed by Xunlei Pang

to #26424323

Add the cgroup file "cpuacct.proc_stat"; we'll export per-cgroup cpu usages
and some other scheduler statistics through this interface.

Reviewed-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
-
Committed by Xunlei Pang

to #26424323

It's relatively easy to maintain nr_uninterruptible in the scheduler compared
to doing it in cpuacct. We assume that "cpu,cpuacct" are bound together, so
that it can be used for per-cgroup load. This will be needed to calculate the
per-cgroup load average later.

Reviewed-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
-
- 23 April 2020, 3 commits
-
-
Committed by Mel Gorman

to #26255339

commit 5e1f0f098b4649fad53011246bcaeff011ffdf5d upstream

Compaction is inherently race-prone as a suitable page freed during
compaction can be allocated by any parallel task. This patch uses a
capture_control structure to isolate a page immediately when it is freed by
a direct compactor in the slow path of the page allocator. The intent is to
avoid redundant scanning.

                                       5.0.0-rc1              5.0.0-rc1
                                 selective-v3r17          capture-v3r19
    Amean  fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
    Amean  fault-both-3      2582.11 (   0.00%)     2563.68 (   0.71%)
    Amean  fault-both-5      4500.26 (   0.00%)     4233.52 (   5.93%)
    Amean  fault-both-7      5819.53 (   0.00%)     6333.65 (  -8.83%)
    Amean  fault-both-12     9321.18 (   0.00%)     9759.38 (  -4.70%)
    Amean  fault-both-18     9782.76 (   0.00%)    10338.76 (  -5.68%)
    Amean  fault-both-24    15272.81 (   0.00%)    13379.55 *  12.40%*
    Amean  fault-both-30    15121.34 (   0.00%)    16158.25 (  -6.86%)
    Amean  fault-both-32    18466.67 (   0.00%)    18971.21 (  -2.73%)

Latency is only moderately affected but the devil is in the details. A closer
examination indicates that base page fault latency is reduced but latency of
huge pages is increased as it takes greater care to succeed. Part of the
"problem" is that allocation success rates are close to 100% even when under
pressure and compaction gets harder:

                                     5.0.0-rc1              5.0.0-rc1
                               selective-v3r17          capture-v3r19
    Percentage huge-3       96.70 (   0.00%)       98.23 (   1.58%)
    Percentage huge-5       96.99 (   0.00%)       95.30 (  -1.75%)
    Percentage huge-7       94.19 (   0.00%)       97.24 (   3.24%)
    Percentage huge-12      94.95 (   0.00%)       97.35 (   2.53%)
    Percentage huge-18      96.74 (   0.00%)       97.30 (   0.58%)
    Percentage huge-24      97.07 (   0.00%)       97.55 (   0.50%)
    Percentage huge-30      95.69 (   0.00%)       98.50 (   2.95%)
    Percentage huge-32      96.70 (   0.00%)       99.27 (   2.65%)

And scan rates are reduced as expected by 6% for the migration scanner and
29% for the free scanner, indicating that there is less redundant work:

    Compaction migrate scanned    20815362    19573286
    Compaction free scanned       16352612    11510663

[mgorman@techsingularity.net: remove redundant check]
Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
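
For orientation, the capture mechanism is built around a small control
structure along the following lines (a sketch based on the commit
description):

    /* Passed down from the direct compactor into the free path so that a
     * suitable freed page can be handed straight back instead of rescanned. */
    struct capture_control {
            struct compact_control *cc;
            struct page *page;      /* the captured free page, if any */
    };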
-
Committed by Mel Gorman

to #26255339

commit e332f741a8dd1ec9a6dc8aa997296ecbfe64323e upstream

Pageblock hints are cleared when compaction restarts or kswapd makes enough
progress that it can sleep, but this is over-eager in that the bit is cleared
for migration sources with no LRU pages and migration targets with no free
pages. As pageblock skip hint flushes are relatively rare and out-of-band
with respect to kswapd, this patch makes a few more expensive checks to see
if it's appropriate to even clear the bit. Every pageblock that is not
cleared will avoid 512 pages being scanned unnecessarily on x86-64.

The impact is variable, with different workloads showing small differences
in latency, success rates and scan rates. This is expected as clearing the
hints is not that common, but doing a small amount of work out-of-band to
avoid a large amount of work in-band later is generally a good thing.

Link: http://lkml.kernel.org/r/20190118175136.31341-22-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: YueHaibing <yuehaibing@huawei.com>
[cai@lca.pw: no stuck in __reset_isolation_pfn()]
Link: http://lkml.kernel.org/r/20190206034732.75687-1-cai@lca.pw
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Mel Gorman
to #26255339

commit a921444382b49cc7fdeca3fba3e278bc09484a27 upstream

This is a preparation patch only, no functional change.

Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
- 22 April 2020, 2 commits
-
-
Committed by Suthikulpanit, Suravee

fix #26319040

commit b9c6ff94e43a0ee053e0c1d983fba1ac4953b762 upstream.

Re-factor the logic for activating/deactivating guest virtual APIC mode
(GAM) into helper functions, and export them for other drivers (e.g. SVM),
to support run-time activation/deactivation of SVM AVIC.

Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: tianyi <fujunkang@linux.alibaba.com>
Reviewed-by: zhangliguang <zhangliguang@linux.alibaba.com>
Acked-by: zhangliguang <zhangliguang@linux.alibaba.com>
-
Committed by Jeremy Linton

to #25688970

commit 56855a99f3d0d1e9f1f4e24f5851f9bf14c83296 upstream

ACPI 6.3 adds a flag to indicate that child nodes are all identical cores.
This is useful to authoritatively determine whether a set of (possibly
offline) cores are identical. Since the flag doesn't give us a unique id, we
can generate one and use it to create bitmaps of sibling nodes, or simply use
it in a loop to determine if a subset of cores are identical.

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Hanjun Guo <hanjun.guo@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
Acked-by: zou cao <zoucao@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
- 17 April 2020, 2 commits
-
-
Committed by Xunlei Pang

to #26782094

Pin the code sections of a process in memory for the corresponding VMAs,
like mlock does.

Usage:
- pin process "PID"
    echo PID > /proc/unevictable/add_pid
- unpin it
    echo PID > /proc/unevictable/del_pid
- show all pinned process pids
    cat /proc/unevictable/add_pid

For easy maintenance, we place it in the kernel because it has no side effect
if it is not used.

Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Xu Yu

to #26424368

This reworks the memsli "start", "end", "update" interfaces to make them
clearer and symmetrical, by merging the "update" action into "end", just like
what psi_memstall_{enter,leave} does. Now the latency probe pattern of memsli
is as follows:

    memcg_lat_stat_start(&start);
    /* kernel code being probed */
    memcg_lat_stat_end(MEM_LAT_XXX, start);

This also formats the code and fixes the warning(s) produced when
CONFIG_MEMSLI is not set.

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
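
As an illustration, wrapping a memcg direct-reclaim path with this pattern
would look roughly like the following (the MEM_LAT_* constant is a
placeholder; the real names are defined by the memsli patches):

    u64 start;
    unsigned long nr_reclaimed;

    memcg_lat_stat_start(&start);
    nr_reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages,
                                                GFP_KERNEL, true);
    memcg_lat_stat_end(MEM_LAT_MEMCG_DIRECT_RECLAIM, start);  /* placeholder name */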
-
- 16 April 2020, 6 commits
-
-
Committed by Xu Yu

to #26424368

This introduces the new bool kconfig option MEMSLI, determining whether the
memsli (memory latency histogram) feature should be built in or not.

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Xu Yu

to #26424368

Since memsli also records latency histograms for swapout and swapin, which
are NOT in the slow memory path, the overhead of memsli could be
non-negligible in some specific scenarios. For example, in scenarios with
frequent swapping out and in, memsli could introduce an overhead of ~1% of
the total run time of the synthetic testcase.

This adds a procfs interface for the memsli switch. The memsli feature is
enabled by default, and you can now disable it by:

    $ echo 0 > /proc/memsli/enabled

Apparently, you can check the current memsli switch status by:

    $ cat /proc/memsli/enabled

Note that disabling memsli at runtime will NOT clear the existing latency
histograms. You still need to manually reset the specified latency
histogram(s) by echoing 0 into the corresponding cgroup control file(s).

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Xu Yu

to #26424368

Probe and calculate the latency of global swapout, memcg swapout and swapin
respectively, and then group it into the latency histogram in struct
mem_cgroup. Note that the latency in each memcg is aggregated from all child
memcgs.

Usage:

    $ cat memory.direct_swapout_global_latency
    0-1ms: 98313
    1-5ms: 0
    5-10ms: 0
    10-100ms: 0
    100-500ms: 0
    500-1000ms: 0
    >=1000ms: 0
    total(ms): 52

Each line is the count of global swapouts within the corresponding latency
range.

To clear the latency histogram:

    $ echo 0 > memory.direct_swapout_global_latency
    $ cat memory.direct_swapout_global_latency
    0-1ms: 0
    1-5ms: 0
    5-10ms: 0
    10-100ms: 0
    100-500ms: 0
    500-1000ms: 0
    >=1000ms: 0
    total(ms): 0

The usage of memory.direct_swapout_memcg_latency and
memory.direct_swapin_latency is the same as
memory.direct_swapout_global_latency.

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Xu Yu

to #26424368

There is some duplicated code in the original implementation of the memory
latency histograms, such as {x,y,z}_show and {x,y,z}_write, where x, y, z
represent the various types of memory latency. This reworks the common code
of the memory latency histograms to make it easier to add more types of
memory latency later.

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Xu Yu

to #26424368

Probe and calculate the latency of direct compaction, and then group it into
the latency histogram in struct mem_cgroup. Note that the latency in each
memcg is aggregated from all child memcgs.

Usage:

    $ cat memory.direct_compact_latency
    0-1ms: 1176
    1-5ms: 259
    5-10ms: 17
    10-100ms: 10
    100-500ms: 0
    500-1000ms: 0
    >=1000ms: 0
    total(ms): 921

Each line is the count of direct compactions within the corresponding latency
range.

To clear the latency histogram:

    $ echo 0 > memory.direct_compact_latency
    $ cat memory.direct_compact_latency
    0-1ms: 0
    1-5ms: 0
    5-10ms: 0
    10-100ms: 0
    100-500ms: 0
    500-1000ms: 0
    >=1000ms: 0
    total(ms): 0

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Xu Yu

to #26424368

Probe and calculate the latency of global direct reclaim and memcg direct
reclaim, respectively, and then group it into the latency histogram in struct
mem_cgroup. Besides, the total latency is accumulated each time the histogram
is updated. Note that the latency in each memcg is aggregated from all child
memcgs.

Usage:

    $ cat memory.direct_reclaim_global_latency
    0-1ms: 228
    1-5ms: 283
    5-10ms: 0
    10-100ms: 0
    100-500ms: 0
    500-1000ms: 0
    >=1000ms: 0
    total(ms): 539

Each line is the count of global direct reclaims within the corresponding
latency range.

To clear the latency histogram:

    $ echo 0 > memory.direct_reclaim_global_latency
    $ cat memory.direct_reclaim_global_latency
    0-1ms: 0
    1-5ms: 0
    5-10ms: 0
    10-100ms: 0
    100-500ms: 0
    500-1000ms: 0
    >=1000ms: 0
    total(ms): 0

The usage of memory.direct_reclaim_memcg_latency is the same as
memory.direct_reclaim_global_latency.

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
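
A sketch of how a measured latency maps onto these buckets (the enum and
function names here are illustrative, not the exact identifiers used by the
patches):

    enum mem_lat_bucket {       /* illustrative names */
            LAT_0_1, LAT_1_5, LAT_5_10, LAT_10_100,
            LAT_100_500, LAT_500_1000, LAT_1000_INF, LAT_NR_BUCKETS,
    };

    static enum mem_lat_bucket mem_lat_bucket_of(u64 lat_ms)
    {
            if (lat_ms < 1)    return LAT_0_1;
            if (lat_ms < 5)    return LAT_1_5;
            if (lat_ms < 10)   return LAT_5_10;
            if (lat_ms < 100)  return LAT_10_100;
            if (lat_ms < 500)  return LAT_100_500;
            if (lat_ms < 1000) return LAT_500_1000;
            return LAT_1000_INF;
    }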
-
- 13 April 2020, 8 commits
-
-
Committed by Alexander Duyck

to #26589565

In order to keep ourselves from reporting pages that are just going to be
reused again in the case of heavy churn, we can put a limit on how many total
pages we will process per pass. Doing this will allow the worker thread to go
idle much more quickly, so that we avoid competing with other threads that
might be allocating or freeing pages.

The logic added here will limit the worker thread to no more than one
sixteenth of the total free pages in a given area per list. Once that limit
is reached it will update the state so that at the end of the pass we will
reschedule the worker to try again in 2 seconds, when the memory churn has
hopefully settled down.

Again this optimization doesn't show much of a benefit in the standard case,
as the memory churn is minimal. However with page allocator shuffling enabled
the gain is quite noticeable. Below are the results with a THP enabled
version of the will-it-scale page_fault1 test showing the improvement in
iterations for 16 processes or threads.

    Without:
    tasks   processes    processes_idle   threads      threads_idle
    16      8283274.75   0.17             5594261.00   38.15

    With:
    tasks   processes    processes_idle   threads      threads_idle
    16      8767010.50   0.21             5791312.75   36.98

Link: http://lkml.kernel.org/r/20200211224719.29318.72113.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
-
Committed by Alexander Duyck

to #26589565

Add support for the page reporting feature provided by virtio-balloon.
Reporting differs from the regular balloon functionality in that it is much
less durable than a standard memory balloon. Instead of creating a list of
pages that cannot be accessed, the pages are only inaccessible while they are
being indicated to the virtio interface. Once the interface has acknowledged
them they are placed back into their respective free lists and are once again
accessible by the guest system.

Unlike a standard balloon we don't inflate and deflate the pages. Instead we
perform the reporting, and once the reporting is completed it is assumed that
the page has been dropped from the guest and will be faulted back in the next
time the page is accessed.

Link: http://lkml.kernel.org/r/20200211224657.29318.68624.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
-
Committed by Alexander Duyck

to #26589565

In order to pave the way for free page reporting in virtualized environments
we will need a way to get pages out of the free lists and identify those
pages after they have been returned. To accomplish this, this patch adds the
concept of a Reported Buddy, which is essentially meant to just be the
Uptodate flag used in conjunction with the Buddy page type.

To prevent the reported pages from leaking outside of the buddy lists I added
a check to clear the PageReported bit in the del_page_from_free_list
function. As a result any reported page that is split, merged, or allocated
will have the flag cleared prior to the PageBuddy value being cleared.

The process for reporting pages is fairly simple. Once we free a page that
meets the minimum order for page reporting, we will schedule a worker thread
to start 2s or more in the future. That worker thread will begin working from
the lowest supported page reporting order up to MAX_ORDER - 1, pulling
unreported pages from the free list and storing them in the scatterlist.

When processing each individual free list it is necessary for the worker
thread to release the zone lock when it needs to stop and report the full
scatterlist of pages. To reduce the work of the next iteration the worker
thread will rotate the free list so that the first unreported page in the
free list becomes the first entry in the list.

It will then call a reporting function providing information on how many
entries are in the scatterlist. Once the function completes it will return
the pages to the free area from which they were allocated and start over,
pulling more pages from the free areas until there are no longer enough pages
to report on to keep the worker busy, or we have processed as many pages as
were contained in the free area when we started processing the list.

The worker thread will work in a round-robin fashion making its way through
each zone requesting reporting, and through each reportable free list within
that zone. Once all free areas within the zone have been processed it will
check to see if there have been any requests for reporting while it was
processing. If so it will reschedule the worker thread to start up again in
roughly 2s and exit.

Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
-
Committed by Alexander Duyck

to #26589565

In order to enable the use of the zone from the list manipulator functions I
will need access to the zone pointer. As it turns out most of the accessors
were always just being directly passed &zone->free_area[order] anyway, so it
would make sense to just fold that into the function itself and pass the zone
and order as arguments instead of the free area.

In order to be able to reference the zone we need to move the declaration of
the functions down so that we have the zone defined before we define the list
manipulation functions. Since the functions are only used in the file
mm/page_alloc.c we can just move them there to reduce noise in the header.

Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomain
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
-
Committed by Dan Williams
to #26589565

commit b03641af680959df57c275a80ff0dc116627c7ae upstream

In preparation for runtime randomization of the zone lists, take all (well,
most of) the list_*() functions in the buddy allocator and put them in helper
functions. Provide a common control point for injecting additional behavior
when freeing pages.

[dan.j.williams@intel.com: fix buddy list helpers]
Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
[vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Robert Elliott <elliott@hpe.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
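
To give a flavor of the helpers this refers to, the add/del pair ends up
looking roughly like this (a sketch; the upstream set also covers tail-add
and migratetype moves):

    static inline void add_to_free_area(struct page *page, struct free_area *area,
                                        int migratetype)
    {
            list_add(&page->lru, &area->free_list[migratetype]);
            area->nr_free++;
    }

    static inline void del_page_from_free_area(struct page *page,
                                               struct free_area *area)
    {
            list_del(&page->lru);
            __ClearPageBuddy(page);
            set_page_private(page, 0);
            area->nr_free--;
    }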
-
Committed by Yang Shi

to #26589565

The list_is_first() and list_rotate_to_front() functions are needed by free
page hinting. list_is_first() was used by the i915 driver, so move it to the
common header file. Backport list_rotate_to_front() from commit a16b53849913
("list: add function list_rotate_to_front()"). Not the whole patch is
backported, just the implementation of the function.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
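
For reference, both helpers are tiny; they look essentially like this in
include/linux/list.h (sketch based on the upstream versions):

    /* list_is_first - tests whether @list is the first entry in list @head */
    static inline int list_is_first(const struct list_head *list,
                                    const struct list_head *head)
    {
            return list->prev == head;
    }

    /* list_rotate_to_front - rotate the list so that @list becomes the new front */
    static inline void list_rotate_to_front(struct list_head *list,
                                            struct list_head *head)
    {
            /* Moving @head behind @list effectively puts @list at the front. */
            list_move_tail(head, list);
    }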
-
Committed by Wei Wang

to #26589565

commit 2e991629bcf55a43681aec1ee096eeb03cf81709 upstream

The VIRTIO_BALLOON_F_PAGE_POISON feature bit is used to indicate whether the
guest is using page poisoning. The guest writes to the poison_val config
field to tell the host about the page poisoning value that is in use.

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
-
Committed by Wei Wang

to #26589565

commit 86a559787e6f5cf662c081363f64a20cad654195 upstream

Negotiation of the VIRTIO_BALLOON_F_FREE_PAGE_HINT feature indicates support
for reporting hints of guest free pages to the host via virtio-balloon.
Currently, only free page blocks of MAX_ORDER - 1 are reported. They are
obtained one by one from the mm free list via the regular allocation
function.

The host requests the guest to report free page hints by sending a new cmd id
to the guest via the free_page_report_cmd_id configuration register. When the
guest starts to report, it first sends a start cmd to the host via the free
page vq, which acks to the host the cmd id received. When the guest finishes
reporting free pages, a stop cmd is sent to the host via the vq. The host may
also send a stop cmd id to the guest to stop the reporting.

VIRTIO_BALLOON_CMD_ID_STOP: Host sends this cmd to stop the guest reporting.
VIRTIO_BALLOON_CMD_ID_DONE: Host sends this cmd to tell the guest that the
reported pages are ready to be freed.

Why does the guest free the reported pages only when the host tells it they
are ready to be freed? This is because freeing pages appears to be expensive
for live migration. free_pages() dirties memory very quickly and makes the
live migration not converge in some cases. So it is good to delay the
free_page operation until the migration is done, and the host sends a command
to the guest about that.

Why do we need the new VIRTIO_BALLOON_CMD_ID_DONE, instead of reusing
VIRTIO_BALLOON_CMD_ID_STOP? This is because live migration is usually done in
several rounds. At the end of each round, the host needs to send a
VIRTIO_BALLOON_CMD_ID_STOP cmd to the guest to stop (or say pause) the
reporting. The guest resumes the reporting when it receives a new command id
at the beginning of the next round. So we need a new cmd id to distinguish
between "stop reporting" and "ready to free the reported pages".

TODO:
- Add a batch page allocation API to amortize the allocation overhead.

Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
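
The guest-visible pieces referenced by this commit and the PAGE_POISON one
above boil down to a config layout and two command ids, roughly as follows
(a sketch modelled on include/uapi/linux/virtio_balloon.h; treat the exact
field order as an assumption):

    struct virtio_balloon_config {
            __u32 num_pages;                /* pages the host wants the guest to give up */
            __u32 actual;                   /* pages actually in the balloon */
            __u32 free_page_report_cmd_id;  /* free page hint cmd id, read-only for the guest */
            __u32 poison_val;               /* page poison value, written by the guest */
    };

    #define VIRTIO_BALLOON_CMD_ID_STOP  0   /* stop/pause reporting */
    #define VIRTIO_BALLOON_CMD_ID_DONE  1   /* reported pages may now be freed */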
-
- 02 April 2020, 1 commit
-
-
Committed by Jiri Slaby

fix #25967152

commit dce05aa6eec977f1472abed95ccd71276b9a3864 upstream

Avoid global variables (namely sel_cons) by introducing vc_is_sel. It checks
whether the parameter is the current selection console. This will help
putting sel_cons into a struct later.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Link: https://lore.kernel.org/r/20200219073951.16151-1-jslaby@suse.cz
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
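
The helper itself is a one-liner along these lines (sketch; the upstream
version lives next to the selection code in drivers/tty/vt):

    /* Is this console the one the current selection was made on? */
    bool vc_is_sel(struct vc_data *vc)
    {
            return vc == sel_cons;
    }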
-
- 25 March 2020, 1 commit
-
-
Committed by Stephen Rothwell

fix #25924252

commit 7dcddef6f769d7e60691c732eb6d09cdb1d9df76 upstream

An x86_64 allmodconfig build produces these errors:

    x86_64-linux-gnu-ld: kernel/sched/core.o: in function `cpuidle_poll_time':
    core.c:(.text+0x230): multiple definition of `cpuidle_poll_time';
    arch/x86/kernel/process.o:process.c:(.text+0xc0): first defined here

(and more)

Fixes: 259231a04561 ("cpuidle: add poll_limit_ns to cpuidle_device structure")
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
- 18 March 2020, 4 commits
-
-
Committed by Xu Yu

Since commit e0205ae40f12 ("mm: memcontrol: use CSS_TASK_ITER_PROCS at
mem_cgroup_scan_tasks()") made mem_cgroup_scan_tasks() check only one thread
from each thread group, we can make cgroup_subsys_state::nr_tasks record only
the thread group leaders, i.e., processes, instead of threads. Furthermore,
this renames cgroup_subsys_state::nr_tasks to cgroup_subsys_state::nr_procs.

Fixes: f061cd88 ("alinux: kernel: cgroup: account number of tasks in the css and its descendants")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
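
A minimal sketch of the counting rule this implies (the nr_procs field name is
from the commit; its type and the surrounding hook are assumptions made here
for illustration):

    /* Only thread-group leaders (i.e. processes) bump the per-css counter. */
    static void css_account_task(struct cgroup_subsys_state *css,
                                 struct task_struct *task, long delta)
    {
            if (thread_group_leader(task))
                    atomic_long_add(delta, &css->nr_procs);  /* type assumed */
    }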
-
Committed by Xu Yu

The memcg background async page reclaim, a.k.a. memcg kswapd, is currently
implemented with a dedicated unbound workqueue. However, memcg kswapd can run
too frequently, resulting in high overhead, page cache thrashing, frequent
dirty page writeback, etc., due to an improper memcg memory.wmark_ratio, an
unreasonable memcg memory capacity, or even abnormal memcg memory usage. We
need to find out the problematic memcg(s) where memcg kswapd introduces
significant overhead.

This records the latency of each run of the memcg kswapd work, and then
aggregates it into the per-memcg exstat.

Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Marc Zyngier
commit 5bd90b0989731520f2cdcfbbe467f1271f3cc803 upstream.

In order to find out whether a vcpu is likely to be the target of VLPIs (and
to further optimize the way we deal with those), let's track the number of
VLPIs a vcpu can receive. This gets implemented with an atomic variable that
gets incremented or decremented on map, unmap and move of a VLPI.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Christoffer Dall <christoffer.dall@arm.com>
Link: https://lore.kernel.org/r/20191107160412.30301-2-maz@kernel.org
Signed-off-by: Shannon Zhao <shannon.zhao@linux.alibaba.com>
Acked-by: Zou Cao <zoucao@linux.alibaba.com>
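
In concrete terms, this amounts to an atomic counter carried alongside the
per-vcpu vPE state and bumped on (un)map/move, something like the following
(a sketch; which structure the counter actually lands in is taken from the
commit description, not verified here):

    struct its_vpe {
            /* ... existing GICv4 vPE state ... */
            atomic_t vlpi_count;        /* VLPIs currently mapped to this vPE */
    };

    /* on mapping a VLPI:   atomic_inc(&vpe->vlpi_count);  */
    /* on unmapping a VLPI: atomic_dec(&vpe->vlpi_count);  */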
-
Committed by Marc Zyngier

commit 8e01d9a396e6db153d94a6004e6473d9ff251a6a upstream.

When the VHE code was reworked, a lot of the vgic stuff was moved around, but
the GICv4 residency code stayed untouched, meaning that we come in and out of
residency on each flush/sync, which is obviously suboptimal.

To address this, let's move things around a bit:

- Residency entry (flush) moves to vcpu_load
- Residency exit (sync) moves to vcpu_put
- On blocking (entry to WFI), we "put"
- On unblocking (exit from WFI), we "load"

Because these can nest (load/block/put/load/unblock/put, for example), we now
have per-VPE tracking of the residency state.

Additionally, vgic_v4_put gains a "need doorbell" parameter, which only gets
set to true when blocking because of a WFI. This allows finer control of the
doorbell, which now also gets disabled as soon as it gets signaled.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20191027144234.8395-2-maz@kernel.org
Signed-off-by: Shannon Zhao <shannon.zhao@linux.alibaba.com>
Acked-by: Zou Cao <zoucao@linux.alibaba.com>
-