- 02 Sep, 2020 21 commits
-
-
Authored by Olof Johansson
task #29600094 commit 35a0b2378c199d4f26e458b2ca38ea56aaf2d9b8 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support. Prior to eed85ff4 ("PCI/DPC: Enable DPC only if AER is available"), Linux handled DPC events regardless of whether firmware had granted it ownership of AER or DPC, e.g., via _OSC. PCIe r5.0, sec 6.2.10, recommends that the OS link control of DPC to control of AER, so after eed85ff4, Linux handles DPC events only if it has control of AER. On platforms that do not grant OS control of AER via _OSC, Linux DPC handling worked before eed85ff4 but not after. To make Linux DPC handling work on those platforms the same way it did before, add a "pcie_ports=dpc-native" kernel parameter that makes Linux handle DPC events regardless of whether it has control of AER. [bhelgaas: commit log, move pcie_ports_dpc_native to drivers/pci/] Link: https://lore.kernel.org/r/20191023192205.97024-1-olof@lixom.net Signed-off-by: Olof Johansson <olof@lixom.net> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> (cherry picked from commit 35a0b2378c199d4f26e458b2ca38ea56aaf2d9b8) Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com> Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
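The rule described in this commit message can be sketched as a small decision function. This is an illustrative model only, not the kernel's code: the function and parameter names (`dpc_service_enabled`, `has_dpc_capability`, `os_controls_aer`, `dpc_native`) are hypothetical and stand in for the conditions named in the text.

```python
def dpc_service_enabled(has_dpc_capability: bool,
                        os_controls_aer: bool,
                        dpc_native: bool) -> bool:
    """Model of when Linux handles DPC events after this backport.

    has_dpc_capability: the port exposes the DPC capability.
    os_controls_aer:    firmware granted the OS control of AER (e.g. via _OSC).
    dpc_native:         "pcie_ports=dpc-native" was given on the command line.
    """
    if not has_dpc_capability:
        return False
    # After eed85ff4, DPC handling is tied to AER control; the new
    # parameter restores handling even without AER control.
    return os_controls_aer or dpc_native
```

With `os_controls_aer=False`, the function returns `True` only when `dpc_native` is set, which is the case the new parameter is meant to cover.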
-
Authored by Keith Busch
task #29600094 commit bdb5ac85777de67c909c9ad4327f03f7648b543f upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support. We don't need to be paranoid about the topology changing while handling an error. If the device has changed in a hotplug capable slot, we can rely on the presence detection handling to react to a changing topology. Restore the fatal error handling behavior that existed before merging DPC with AER with 7e9084b3 ("PCI/AER: Handle ERR_FATAL with removal and re-enumeration of devices"). Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Reviewed-by: Sinan Kaya <okaya@kernel.org> (cherry picked from commit bdb5ac85777de67c909c9ad4327f03f7648b543f) Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com> Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Authored by Masahiro Yamada
task #29499913 commit d198b34f3855eee2571dda03eea75a09c7c31480 upstream Add SPDX License Identifier to all .gitignore files. erwei: Because the locations of the .gitignore files upstream differ from those in our kernel, I add the line "# SPDX-License-Identifier: GPL-2.0-only" to our own .gitignore files. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Erwei Deng <erwei@linux.alibaba.com> Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Authored by Rafael J. Wysocki
task #29239886 commit a3299182216397a0b943d2549d1997f4eba2bdd2 upstream Add an admin-guide document for the intel_idle driver to describe how it works: how it enumerates idle states, what happens during its initialization, how it can be controlled via the kernel command line and so on. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: yjia <yingbao.jia@intel.com> Signed-off-by: Erwei Deng <erwei@linux.alibaba.com> Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Authored by Rafael J. Wysocki
task #29239886 commit 75a80267410e38ab76c4ceb39753f96d72113781 upstream In certain situations it may be useful to prevent some idle states from being used by default while allowing user space to enable them later on. For this purpose, introduce a new state flag, CPUIDLE_FLAG_OFF, to mark idle states that should be disabled by default, make the core set CPUIDLE_STATE_DISABLED_BY_USER for those states at the initialization time and add a new state attribute in sysfs, "default_status", to inform user space of the initial status of the given idle state ("disabled" if CPUIDLE_FLAG_OFF is set for it, "enabled" otherwise). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: yjia <yingbao.jia@intel.com> Signed-off-by: Erwei Deng <erwei@linux.alibaba.com> Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
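A user-space reader for the new "default_status" attribute might look like the sketch below. It assumes the usual per-CPU cpuidle layout (`stateN` directories containing `name` and `disable` files next to `default_status`); the function name `idle_state_status` is hypothetical, and the directory is passed in so the sketch can be tried against any tree.

```python
from pathlib import Path


def idle_state_status(cpuidle_dir):
    """Return {state name: (default_status, currently_disabled)}.

    default_status is None when the attribute is absent, for example on a
    kernel that does not carry this commit.
    """
    result = {}
    for state in sorted(Path(cpuidle_dir).glob("state[0-9]*")):
        name = (state / "name").read_text().strip()
        default_file = state / "default_status"
        default = default_file.read_text().strip() if default_file.exists() else None
        # "disable" reads as "0" when the state is usable.
        disabled = (state / "disable").read_text().strip() != "0"
        result[name] = (default, disabled)
    return result
```

On a live system the argument would typically be something like `/sys/devices/system/cpu/cpu0/cpuidle`.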
-
Authored by Rafael J. Wysocki
task #29239886 commit aa5eee355b466cb33f97f79bed9740a472c4ab73 upstream Important information is missing from user/admin cpuidle documentation available today, so add a new user/admin document for cpuidle containing current and comprehensive information to admin-guide and drop the old .txt documents it is replacing. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: yjia <yingbao.jia@intel.com> Signed-off-by: Erwei Deng <erwei@linux.alibaba.com> Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
-
Authored by Shile Zhang
fix #29056122 Commit fbb2f06e ("pvpanic: add crash loaded event") introduces a new pvpanic event, 'PVPANIC_CRASH_LOADED', which lets QEMU on the host tell whether the guest has already handled the panic via kdump. However, when kdump is enabled in the guest, the panic event is not posted by default unless the 'crash_kexec_post_notifiers' parameter is given. So it is better to set the default value of this parameter to true, so that the event is not missed when kdump is enabled. To disable the event notification, pass 'crash_kexec_post_notifiers=N'. Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
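The effect of the new default can be illustrated by resolving the parameter from a command-line string. This is a simplified sketch, not the kernel's parser: the helper name `post_notifiers_enabled` is hypothetical, and only a few common boolean spellings are recognised.

```python
def post_notifiers_enabled(cmdline: str, default: bool = True) -> bool:
    """Resolve crash_kexec_post_notifiers from a kernel command line.

    `default` models the built-in value, which this commit changes to True.
    A later occurrence on the command line overrides an earlier one.
    """
    value = default
    for token in cmdline.split():
        key, sep, arg = token.partition("=")
        if key != "crash_kexec_post_notifiers":
            continue
        if not sep:
            value = True                      # bare flag means "on"
        elif arg.lower() in ("1", "y", "yes", "on"):
            value = True
        elif arg.lower() in ("0", "n", "no", "off"):
            value = False
        # Unrecognised spellings leave the current value untouched.
    return value
```

Passing the contents of `/proc/cmdline` as `cmdline` would show the setting a running guest booted with, under the assumptions above.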
-
Authored by Stefan Hajnoczi
task #28910367 commit 2d1d25d0a224dcd2021004d52342fc1727ccd85f upstream Add information about the new "virtiofs" file system. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Authored by David Hildenbrand
task #29077503 commit ca215086b14b89a0e70fc211314944aa6ce50020 upstream pages inflated in virtio-balloon. Nowadays, it is only a marker that a page is part of virtio-balloon and therefore logically offline. We also want to make use of this flag in other balloon drivers - for inflated pages or when onlining a section but keeping some pages offline (e.g. used right now by XEN and Hyper-V via set_online_page_callback()). We are going to expose this flag to dump tools like makedumpfile. But instead of exposing PG_balloon, let's generalize the concept of marking pages as logically offline, so it can be reused for other purposes later on. Rename PG_balloon to PG_offline. This is an indicator that the page is logically offline, the content stale and that it should not be touched (e.g. a hypervisor would have to allocate backing storage in order for the guest to dump an unused page). We can then e.g. exclude such pages from dumps. We replace and reuse KPF_BALLOON (23), as this shouldn't really harm (and for now the semantics stay the same). In following patches, we will make use of this bit also in other balloon drivers. While at it, document PGTABLE. [akpm@linux-foundation.org: fix comment text, per David] Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Konstantin Khlebnikov <koct9i@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Pankaj gupta <pagupta@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Christian Hansen <chansen3@cisco.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dave Young <dyoung@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Juergen Gross <jgross@suse.com> Cc: Julien Freche <jfreche@vmware.com> Cc: Kairui Song <kasong@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Len Brown <len.brown@intel.com> Cc: Lianbo Jiang <lijiang@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xavier Deguillard <xdeguillard@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit ca215086b14b89a0e70fc211314944aa6ce50020) Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Authored by YangYuxi
fix #29256237 commit a01a9445c00eca3e37523eb6b0d87f494eceeb4b TencentOS-kernel Since commit f719e375 ("ipvs: drop first packet to redirect conntrack"), when a new TCP connection meets the conditions that require rescheduling, the first SYN packet is dropped. This causes one second of latency for the new connection. More discussion of this problem is easy to find online, for example: 1) One second connection delay in masque https://marc.info/?t=151683118100004&r=1&w=2 2) IPVS low throughput #70747 https://github.com/kubernetes/kubernetes/issues/70747 3) Apache Bench can fill up ipvs service proxy in seconds #544 https://github.com/cloudnativelabs/kube-router/issues/544 4) Additional 1s latency in `host -> service IP -> pod` https://github.com/kubernetes/kubernetes/issues/90854 5) kube-proxy ipvs conn_reuse_mode setting causes errors with high load from single client https://github.com/kubernetes/kubernetes/issues/81775 The root cause is that when the old session expires, the conntrack related to the session is dropped by ip_vs_conn_drop_conntrack. The code is as follows:
```
static void ip_vs_conn_expire(struct timer_list *t)
{
	...
	if ((cp->flags & IP_VS_CONN_F_NFCT) &&
	    !(cp->flags & IP_VS_CONN_F_ONE_PACKET)) {
		/* Do not access conntracks during subsys cleanup
		 * because nf_conntrack_find_get can not be used after
		 * conntrack cleanup for the net.
		 */
		smp_rmb();
		if (ipvs->enable)
			ip_vs_conn_drop_conntrack(cp);
	}
	...
}
```
As shown in the code, ip_vs_conn_drop_conntrack is called only when the condition (cp->flags & IP_VS_CONN_F_NFCT) is true. So we optimize this with the following steps (administrators can opt in to this optimization by setting net.ipv4.vs.conn_reuse_old_conntrack=1): 1) erase the IP_VS_CONN_F_NFCT flag (this is safe because no packets will use the old session) 2) call ip_vs_conn_expire_now to release the old session, so the related conntrack will not be dropped 3) ipvs then does not need to drop the first SYN packet; it simply passes the SYN packet on to the next stage and creates a new ipvs session, which is related to the old conntrack (reopened by conntrack as a new one); everything afterwards proceeds as if the old session had never existed. The above processing has no problems except for passive FTP; in the passive FTP case, ipvs can detect it from the conditions (atomic_read(&cp->n_control)) and (cp->control). So for the other cases (i.e. not FTP), ipvs should let users choose: they can select the higher-performance processing logic by setting net.ipv4.vs.conn_reuse_old_conntrack=1. This is necessary because most business scenarios (such as Kubernetes) are very sensitive to TCP short-connection latency. This patch has been verified on thousands of our Kubernetes node servers at Tencent Inc. Signed-off-by: YangYuxi <yx.atom1@gmail.com> [Tony: add the missing sysctl knob and disable it by default] Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Acked-by: Dust Li <dust.li@linux.alibaba.com>
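The three steps in this commit message can be modelled as a toy state machine. This is only a sketch of the described decision, not the kernel's ipvs code: the `Conn` class, the `handle_reused_syn` helper and the numeric value of `NFCT` are illustrative assumptions, and the FTP check mirrors the two conditions the message names (`n_control` and `control`).

```python
from dataclasses import dataclass

NFCT = 1 << 16  # stand-in for IP_VS_CONN_F_NFCT; the value is illustrative


@dataclass
class Conn:
    """Toy stand-in for an ipvs connection entry (hypothetical fields)."""
    flags: int
    n_control: int = 0       # number of connections this one controls
    control: object = None   # controlling connection, if any
    expired: bool = False


def handle_reused_syn(conn: Conn, reuse_old_conntrack: bool) -> str:
    """Decide what happens to the first SYN that hits an old session.

    Returns "pass" when the SYN can continue to normal scheduling and
    "drop" when the pre-patch behaviour (drop the first packet) applies.
    """
    ftp_related = conn.n_control > 0 or conn.control is not None
    if reuse_old_conntrack and not ftp_related:
        conn.flags &= ~NFCT   # step 1: expiry will no longer drop the conntrack
        conn.expired = True   # step 2: release the old session immediately
        return "pass"         # step 3: the SYN creates a fresh session
    return "drop"
```

The FTP-related case falls through to "drop" even when the knob is set, matching the exception the commit message makes for passive FTP.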
-
Authored by Xuan Zhuo
to #27804112 Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Ya Zhao <zhaoya123@linux.alibaba.com> Reviewed-by: Cambda Zhu <cambda@linux.alibaba.com>
-
Authored by Dave Jiang
to #27305291 commit 1f4883f300da4f4d9d31eaa80f7debf6ce74843b upstream. Add theory of operation for the security support that's going into libnvdimm. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Reviewed-by: Jing Lin <jing.lin@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Authored by Dave Jiang
to #27305291 commit 9db67581b91d9e9e05c35570ac3f93872e6c84ca upstream. Add an nvdimm key format type to encrypted keys in order to limit the size of the key to 32 bytes. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Acked-by: Mimi Zohar <zohar@linux.ibm.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Authored by Mel Gorman
to #28825456 commit 24512228b7a3f412b5a51f189df302616b021c33 upstream. Mikulas Patocka reported that commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") "broke" memory management on parisc. The machine is not NUMA but the DISCONTIG model creates three pgdats even though it's a UMA machine for the following ranges 0) Start 0x0000000000000000 End 0x000000003fffffff Size 1024 MB 1) Start 0x0000000100000000 End 0x00000001bfdfffff Size 3070 MB 2) Start 0x0000004040000000 End 0x00000040ffffffff Size 3072 MB Mikulas reported: With the patch 1c30844d2, the kernel will incorrectly reclaim the first zone when it fills up, ignoring the fact that there are two completely free zones. Basically, it limits cache size to 1GiB. For example, if I run: # dd if=/dev/sda of=/dev/null bs=1M count=2048 - with the proper kernel, there should be "Buffers - 2GiB" when this command finishes. With the patch 1c30844d2, buffers will consume just 1GiB or slightly more, because the kernel was incorrectly reclaiming them. The page allocator and reclaim make assumptions that pgdats really represent NUMA nodes and zones represent ranges and make decisions on that basis. Watermark boosting for small pgdats leads to unexpected results even though this would have behaved reasonably on SPARSEMEM. DISCONTIG is essentially deprecated and even parisc plans to move to SPARSEMEM so there is no need to be fancy, this patch simply disables watermark boosting by default on DISCONTIGMEM. Link: http://lkml.kernel.org/r/20190419094335.GJ18914@techsingularity.net Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Authored by Mel Gorman
to #28825456 commit 1c30844d2dfe272d58c8fc000960b835d13aa2ac upstream. An external fragmentation event was previously described as When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future. The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep. This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing events occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken if the calling context allows to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory. This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation". 
1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------
4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)
4.20-rc3+patch1-4:                    18421 (98% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)

Note that external fragmentation causing events are massively reduced by this patch whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------
4.20-rc3 extfrag events < order 9:   291392
4.20-rc3+patch:                      191187 (34% reduction)
4.20-rc3+patch1-4:                    13464 (95% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)

As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------
4.20-rc3 extfrag events < order 9:   215698
4.20-rc3+patch:                      200210 (7% reduction)
4.20-rc3+patch1-4:                    14263 (93% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)

There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and allocation success rate is higher.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------
4.20-rc3 extfrag events < order 9:   166352
4.20-rc3+patch:                      147463 (11% reduction)
4.20-rc3+patch1-4:                    11095 (93% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)

There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher.
Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
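How a zone's boost might grow and be capped relative to the high watermark can be sketched as follows. This is a rough model, not the patch itself. It assumes, based on my understanding of the upstream sysctl documentation, that watermark_boost_factor is expressed in fractions of 10,000, that each fragmentation event adds one pageblock to the boost, and that a pageblock is 512 pages (order-9 with 4 KiB pages). The helper name `boost_watermark` and its parameters are illustrative.

```python
def boost_watermark(current_boost: int,
                    high_wmark_pages: int,
                    boost_factor: int,
                    pageblock_pages: int = 512) -> int:
    """Return the zone's new boost (in pages) after one fragmentation event.

    boost_factor is taken to be in units of 1/10000 of the high watermark;
    0 disables boosting.
    """
    if boost_factor == 0:
        return current_boost
    # Upper bound on the boost, relative to the high watermark.
    max_boost = high_wmark_pages * boost_factor // 10000
    # Never cap below a single pageblock.
    max_boost = max(pageblock_pages, max_boost)
    # Each event raises the boost by one pageblock, up to the cap.
    return min(current_boost + pageblock_pages, max_boost)
```

With a high watermark of 10000 pages and a factor of 15000, repeated events raise the boost in 512-page steps until it reaches 15000 pages.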
-
Authored by Julien Thierry
task #25552995 commit bc3c03ccb4641fb940b27a0d369431876923a8fe upstream Add a build option and a command line parameter to build and enable the support of pseudo-NMIs. Signed-off-by: Julien Thierry <julien.thierry@arm.com> Suggested-by: Daniel Thompson <daniel.thompson@linaro.org> Cc: Will Deacon <will.deacon@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Zou Cao <zoucao@linux.alibaba.com> Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
-
Authored by Julien Thierry
task #25552995 commit d98d0a990ca1446d3c0ca8f0b9ac127a66e40cdf upstream The values non-secure EL1 needs to use for the PMR and RPR registers depend on the value of SCR_EL3.FIQ. The values non-secure EL1 sees from the distributor and redistributor depend on whether security is enabled for the GIC or not. To avoid having to deal with two sets of values for PMR masking/unmasking, only enable pseudo-NMIs when GIC has non-secure view of priorities. Also, add firmware requirements related to SCR_EL3. Signed-off-by: Julien Thierry <julien.thierry@arm.com> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Zou Cao <zoucao@linux.alibaba.com> Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
-
Authored by James Morse
task #28924046 [ Upstream commit 3276cc248964 ] Neoverse-N1 affected by #1349291 may report an Uncontained RAS Error as Unrecoverable. The kernel's architecture code already considers Unrecoverable errors as fatal as without kernel-first support no further error-handling is possible. Now that KVM attributes SError to the host/guest more precisely the host's architecture code will always handle host errors that become pending during world-switch. Errors misclassified by this errata that affected the guest will be re-injected to the guest as an implementation-defined SError, which can be uncontained. Until kernel-first support is implemented, no workaround is needed for this issue. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Bin Yu <jkchen@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: zou cao <zoucao@linux.alibaba.com>
-
Authored by Marc Zyngier
task #28924046 [ Upstream commit a5325089bd05 ] We already mitigate erratum 1188873 affecting Cortex-A76 and Neoverse-N1 r0p0 to r2p0. It turns out that revisions r0p0 to r3p1 of the same cores are affected by erratum 1418040, which has the same workaround as 1188873. Let's expand the range of affected revisions to match 1418040, and repaint all occurrences of 1188873 to 1418040. Whilst we're there, do a bit of reformatting in silicon-errata.txt and drop a now unnecessary dependency on ARM_ARCH_TIMER_OOL_WORKAROUND. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Bin Yu <jkchen@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: zou cao <zoucao@linux.alibaba.com>
-
Authored by Marc Zyngier
task #28924046 [ Upstream commit 6989303a3b2d864fd8e17d3fa3365d3e9649a598 ] Neoverse-N1 is also affected by ARM64_ERRATUM_1188873, so let's add it to the list of affected CPUs. Signed-off-by: Marc Zyngier <marc.zyngier@arm.com> [will: Update silicon-errata.txt] Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Bin Yu <jkchen@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: zou cao <zoucao@linux.alibaba.com>
-
Authored by chenxiangzuo
fix #27418285 Introduce a boot parameter, 'deferred_meminit', for the deferred page init feature. It is disabled by default; pass 'deferred_meminit' on the kernel command line to enable it. Signed-off-by: chenxiangzuo <cxz18821786681@linux.alibaba.com> Reviewed-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Shile Zhang <shile.zhang@linux.alibaba.com>
-
- 23 Jun, 2020 1 commit
-
-
Authored by Kirill A. Shutemov
task #27327988 commit 71a2c112a0f6da497e1b44e18e97b1716c240518 upstream 'max_ptes_shared' specifies how many pages can be shared across multiple processes. Exceeding the number would block the collapse:: /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared A higher value may increase memory footprint for some workloads. By default, at least half of the pages must not be shared. [colin.king@canonical.com: fix several spelling mistakes] Link: http://lkml.kernel.org/r/20200420084241.65433-1-colin.king@canonical.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Zi Yan <ziy@nvidia.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Yang Shi <yang.shi@linux.alibaba.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Link: http://lkml.kernel.org/r/20200416160026.16538-9-kirill.shutemov@linux.intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
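The threshold described here can be written out as a one-line check. This is a sketch under assumptions: `HPAGE_PMD_NR = 512` presumes 4 KiB base pages with 2 MiB huge pages, and the default of half that value follows from the message's "at least half of the pages must not be shared". The function name `collapse_blocked_by_sharing` is hypothetical.

```python
HPAGE_PMD_NR = 512  # PTEs per huge page; assumes 4 KiB pages and 2 MiB THP


def collapse_blocked_by_sharing(shared_ptes: int,
                                max_ptes_shared: int = HPAGE_PMD_NR // 2) -> bool:
    """True when khugepaged would refuse to collapse the range.

    shared_ptes:     pages in the candidate range that are mapped by more
                     than one process.
    max_ptes_shared: tunable limit; exceeding it blocks the collapse.
    """
    return shared_ptes > max_ptes_shared
```

Under these assumptions, a range with 256 shared pages is still eligible for collapse and one with 257 is not; raising the limit trades a larger memory footprint for more collapses.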
-
- 28 Apr, 2020 1 commit
-
-
Authored by Babu Moger
to #26613714 commit a6f771c9bf4eea2da1516e70c283ede61a7d666f upstream. Rename intel_rdt_ui.txt to generic resctrl_ui.txt and update the documentation for AMD. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: "Chang S. Bae" <chang.seok.bae@intel.com> Cc: David Miller <davem@davemloft.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Dmitry Safonov <dima@arista.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <linux-doc@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Pu Wen <puwen@hygon.cn> Cc: <qianyue.zj@alibaba-inc.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Rian Hunter <rian@alum.mit.edu> Cc: Sherry Hurwitz <sherry.hurwitz@amd.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Lendacky <Thomas.Lendacky@amd.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: <xiaochen.shen@intel.com> Link: https://lkml.kernel.org/r/20181121202811.4492-13-babu.moger@amd.com Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Tested-by: WANG Siyuan <Siyuan.Wang@amd.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
- 13 Apr, 2020 1 commit
-
-
Authored by Alexander Duyck
to #26589565 Add documentation for free page reporting. Currently the only consumer is virtio-balloon; however, other drivers might make use of this, so it is best to add a bit of documentation explaining at a high level how to use the API. Link: http://lkml.kernel.org/r/20200211224730.29318.43815.stgit@localhost.localdomain Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
-
- 18 Mar, 2020 12 commits
-
-
Authored by Xu Yu
The memcg background async page reclaim, a.k.a. memcg kswapd, is currently implemented with a dedicated unbound workqueue. However, memcg kswapd may run too frequently, resulting in high overhead, page cache thrashing, frequent dirty page writeback, etc., due to an improper memcg memory.wmark_ratio, unreasonable memcg memory capacity, or even abnormal memcg memory usage. We need to find the problematic memcg(s) where memcg kswapd introduces significant overhead. This patch records the latency of each run of the memcg kswapd work and aggregates it into the per-memcg exstat. Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Authored by Marcelo Tosatti
commit 2d5ba19bdfef4dd06add144eb04287ee98409f75 upstream Add an MSR which allows the guest to disable host polling (specifically, cpuidle-haltpoll disables host-side polling when it performs polling in the guest). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com> Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
Authored by Xiaoguang Wang
For some workloads whose I/O activity is mostly random, the context readahead feature can introduce unnecessary read operations, which hurt application performance. Context readahead's algorithm is straightforward and not that smart. This patch adds "/proc/sys/vm/enable_context_readahead" to control whether this feature is enabled. Context readahead is currently enabled by default; users can echo 0 to /proc/sys/vm/enable_context_readahead to disable it. We also tested MongoDB's performance in the 'random point select' case: With context readahead enabled: mongodb eps 12409 With context readahead disabled: mongodb eps 14443 About 16% performance improvement. Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
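The quoted improvement and the toggle can be reproduced with a short sketch. The sysctl path is the one named in the commit message; the helper names `relative_gain` and `set_context_readahead` are hypothetical, and writing to the real path requires root on a kernel that carries this patch.

```python
def relative_gain(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


def set_context_readahead(enabled: bool,
                          path: str = "/proc/sys/vm/enable_context_readahead") -> None:
    """Write 1 or 0 to the knob, as `echo 0 > ...` would."""
    with open(path, "w") as knob:
        knob.write("1\n" if enabled else "0\n")
```

Calling `relative_gain(12409, 14443)` gives roughly 16.4, consistent with the "about 16%" figure reported for the MongoDB run.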
-
Committed by Florian Westphal
commit 294304e4c522d797b7ea8200ab74354843fa68e9 upstream
We have no explicit signal when a UDP stream has terminated; peers just stop sending. For suspected stream connections, a timeout of two minutes is sane to keep the NAT mapping alive a while longer. It matches the default 'timewait' timeout of TCP conntrack.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Dust Li <dust.li@linux.alibaba.com>
-
Committed by Marcelo Tosatti
commit 2cffe9f6b96fece065ee8522673c90e92ef2085d upstream
The cpuidle_haltpoll governor, in conjunction with the haltpoll cpuidle driver, allows guest vcpus to poll for a specified amount of time before halting. This provides the following benefits over host-side polling:
1) The POLL flag is set while polling is performed, which allows a remote vCPU to avoid sending an IPI (and the associated cost of handling the IPI) when performing a wakeup.
2) The VM-exit cost can be avoided.
The downside of guest-side polling is that polling is performed even with other runnable tasks in the host.

Results comparing halt_poll_ns and a server/client application where a small packet is ping-ponged:

host                                        --> 31.33
halt_poll_ns=300000 / no guest busy spin    --> 33.40 (93.8%)
halt_poll_ns=0 / guest_halt_poll_ns=300000  --> 32.73 (95.7%)

For the SAP HANA benchmarks (where idle_spin is a parameter of the previous version of the patch, results should be the same):

hpns == halt_poll_ns

                           idle_spin=0/   idle_spin=800/  idle_spin=0/
                           hpns=200000    hpns=0          hpns=800000
DeleteC06T03 (100 thread)  1.76           1.71 (-3%)      1.78 (+1%)
InsertC16T02 (100 thread)  2.14           2.07 (-3%)      2.18 (+1.8%)
DeleteC00T01 (1 thread)    1.34           1.28 (-4.5%)    1.29 (-3.7%)
UpdateC00T03 (1 thread)    4.72           4.18 (-12%)     4.53 (-5%)

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
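The percentages in the ping-pong results are not explained in the message. A plausible reading, and only an assumption here, is that each one is the host figure divided by the guest figure. The sketch below checks that this reading reproduces the quoted numbers; the dictionary keys are labels chosen for illustration.

```python
# Figures quoted in the commit message for the ping-pong test.
host = 31.33
guest_results = {
    "halt_poll_ns=300000 / no guest busy spin": 33.40,
    "halt_poll_ns=0 / guest_halt_poll_ns=300000": 32.73,
}

# Assumption: the percentage is host / guest, i.e. how close the guest
# configuration gets to the bare-host figure.
ratios = {name: round(host / value * 100.0, 1) for name, value in guest_results.items()}

for name, pct in ratios.items():
    print(f"{name}: {pct}%")
```

Under that assumption the computed values agree with the 93.8% and 95.7% shown in the results.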
-
Committed by Rafael J. Wysocki
commit 61cb5758d3c46bc1ba87694fefc0d9653613ce6b upstream
Add cpuidle.governor= command line parameter to allow the default cpuidle governor to be replaced. That is useful, for example, if someone running a tickful kernel wants to use the menu governor on it.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
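To confirm which governor was requested at boot, the command line can be inspected from user space. The following is a minimal sketch, not kernel code: `cmdline_param` is a hypothetical helper, it ignores quoted values, and it assumes that the last occurrence of a repeated parameter is the one that counts.

```python
def cmdline_param(cmdline: str, key: str):
    """Return the value of the last `key=value` token, or None if absent."""
    value = None
    for token in cmdline.split():
        name, sep, val = token.partition("=")
        if sep and name == key:
            value = val  # assumption of this sketch: a later occurrence wins
    return value


# Illustrative command line; on a live system read /proc/cmdline instead.
sample = "root=/dev/sda1 ro cpuidle.governor=menu quiet"
governor = cmdline_param(sample, "cpuidle.governor")
print(governor)
```

On a running machine the same helper could be fed `open("/proc/cmdline").read()`.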
-
Committed by Caspar Zhang
Cloud Kernel is the official name of our project. This patch unifies the project names used in docs and comments.
Signed-off-by: Caspar Zhang <caspar@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Wenwei Tao
Add "memory.priority" and "memory.use_priority_oom" descriptions.
Signed-off-by: Wenwei Tao <wenwei.tao@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Xunlei Pang
This file collects all the interfaces specific to Alibaba Cloud Kernel. Add "memory.wmark_min_adj", "memory.exstat", and "zombie memcgs reaper" descriptions.
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Shameer Kolothum
commit 24062fe85860debfdae0eeaa495f27c9971ec163 upstream
HiSilicon erratum 162001800 describes a limitation of the SMMUv3 PMCG implementation on HiSilicon Hip08 platforms. On these platforms, the PMCG event counter registers (SMMU_PMCG_EVCNTRn) are read-only, so it is not possible to set the initial counter period value on event monitor start. To work around this, the current value of the counter is read and used for delta calculations. OEM information from the ACPI header is used to identify the affected hardware platforms.
Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
[will: update silicon-errata.txt and add reason string to acpi match]
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Roman Gushchin
commit 7a1adfddaf0d11a39fdcaf6e82a88e9c0586e08b upstream.
It was reported that on some of our machines containers were restarted with OOM symptoms without an obvious reason. Although there was almost no memory pressure and plenty of page cache, the MEMCG_OOM event was raised occasionally, causing the container management software to think that an OOM had happened. However, no tasks were killed.
The subsequent investigation showed that the problem is caused by a failing attempt to charge a high-order page. In such a case, the OOM killer is never invoked. As shown below, it can happen under conditions which are very far from a real OOM: e.g. there is plenty of clean page cache and no memory pressure.
There is no sense in raising an OOM event in this case, as it might confuse a user and lead to wrong and excessive actions (e.g. restarting the workload, as in my case).
Let's look at the charging path in try_charge(). If the memory usage is about memory.max, which is absolutely natural for most memory cgroups, we try to reclaim some pages. Even if we were able to reclaim enough memory for the allocation, the following check can fail due to a race with another concurrent allocation:

    if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
        goto retry;

For regular pages the following condition will save us from triggering the OOM:

    if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
        goto retry;

But for a high-order allocation this condition will intentionally fail. The reason is that we'll likely fall back to regular pages anyway, so it's OK and even preferred to return ENOMEM.
In this case the idea of raising MEMCG_OOM looks dubious.
Fix this by moving the MEMCG_OOM raising to mem_cgroup_oom(), after the allocation order check, so that the event won't be raised for high-order allocations. This change doesn't affect regular page allocation and charging.
Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
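The decision flow described in this message can be illustrated with a toy model. This is a sketch in Python, not the kernel's try_charge(): it omits retry counters, GFP flags and stock draining, and the function name `charge_outcome` and its `fixed` switch are invented for illustration. The only kernel constant used is PAGE_ALLOC_COSTLY_ORDER, which is 3 in mainline.

```python
PAGE_ALLOC_COSTLY_ORDER = 3  # mainline value; costly means more than 8 pages


def charge_outcome(margin: int, nr_pages: int, nr_reclaimed: int, fixed: bool):
    """Return (action, memcg_oom_event_raised) for one failed charge attempt.

    `fixed=False` models raising MEMCG_OOM before the order check,
    `fixed=True` models raising it only after the order check.
    """
    # A concurrent allocation may have consumed the reclaimed margin.
    if margin >= nr_pages:
        return ("retry", False)
    # Regular (non-costly) charges keep retrying while reclaim makes progress.
    if nr_reclaimed and nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER):
        return ("retry", False)
    costly = nr_pages > (1 << PAGE_ALLOC_COSTLY_ORDER)
    if costly:
        # High-order charge: fail with ENOMEM, the caller falls back to
        # regular pages. Only the unfixed ordering reports an OOM event.
        return ("ENOMEM", not fixed)
    # Regular charge with no reclaim progress: a genuine memcg OOM.
    return ("OOM", True)


before = charge_outcome(margin=0, nr_pages=512, nr_reclaimed=32, fixed=False)
after = charge_outcome(margin=0, nr_pages=512, nr_reclaimed=32, fixed=True)
print(before, after)
```

In this model a 512-page charge that loses the race fails with ENOMEM in both orderings, but only the pre-fix ordering reports a MEMCG_OOM event; regular-page behavior is unchanged.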
-
Committed by Shakeel Butt
commit 1e577f970f66a53d429cbee37b36177c9712f488 upstream.
The memory controller in cgroup v2 exposes a memory.events file for each memcg which shows the number of times events like low, high, max, oom and oom_kill have happened for the whole tree rooted at that memcg. Users can also poll or register for notification to monitor changes in that file. Any event at any level of the tree rooted at a memcg will notify all the listeners along the path up to root_mem_cgroup. There are existing users which depend on this behavior.
However, there are users which are only interested in the events happening at a specific level of the memcg tree and not in the events in the underlying tree rooted at that memcg. One such use case is a centralized resource monitor which can dynamically adjust the limits of the jobs running on a system. The jobs can create their own sub-hierarchy for their sub-tasks. The centralized monitor is only interested in the events at the top-level memcgs of the jobs, as it can then act and adjust the limits of the jobs. Using the current memory.events for such a centralized monitor is very inconvenient. The monitor will keep receiving events in which it is not interested, and to find out whether a received event is interesting, it has to read the memory.events files of the next level and compare them with the top-level one. So, let's introduce memory.events.local to the memcg, which shows and notifies only the events at that memcg's own level.
Now, do memory.stat and memory.pressure need their local versions? IMHO no, due to the no-internal-process constraint of cgroup v2. The memory.stat file of the top-level memcg of a job shows the stats and vmevents of the whole tree. The local stats or vmevents of the top-level memcg will only change if there is a process running in that memcg, but v2 does not allow that. Similarly for memory.pressure, there will not be any process in the internal nodes and thus no chance of local pressure.
Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
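The relationship between the two files can be sketched from user space. The snippet assumes the flat "name count" line format of memory.events and that the hierarchical counters include the memcg's own local events; the sample contents are invented for illustration, and the helper names are hypothetical.

```python
def parse_events(text: str) -> dict:
    """Parse flat "name count" lines as found in memory.events files."""
    events = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, count = line.split()
        events[name] = int(count)
    return events


def descendant_events(hierarchical: dict, local: dict) -> dict:
    """Events that happened below this memcg: hierarchical minus local."""
    return {name: hierarchical[name] - local.get(name, 0) for name in hierarchical}


# Illustrative sample contents, not taken from a real system.
sample_events = "low 0\nhigh 12\nmax 5\noom 1\noom_kill 1\n"        # memory.events
sample_local = "low 0\nhigh 2\nmax 5\noom 0\noom_kill 0\n"          # memory.events.local

below = descendant_events(parse_events(sample_events), parse_events(sample_local))
print(below)
```

A monitor that only cares about a job's top-level memcg would read memory.events.local directly and skip this subtraction, which is the inconvenience the commit removes.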
-
- 17 Jan, 2020 3 commits
-
-
Committed by Pu Wen
commit 24beb83ad289c68bce7c01351cb90465bbb1940a upstream.
The Hygon Dhyana CPU has an SMBus device with PCI device ID 0x790b, which is the same as the AMD CZ SMBus device. So add Hygon Dhyana support to the i2c-piix4 driver by using the AMD code path.
Signed-off-by: Pu Wen <puwen@hygon.cn>
Reviewed-by: Jean Delvare <jdelvare@suse.de>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Christoph Hellwig
commit fb7e160019f4abb4082740bfeb27a38f6389c745 upstream.
This new method is used to explicitly poll for I/O completion for an iocb. It must be called for any iocb submitted asynchronously (that is, with a non-null ki_complete) which has the IOCB_HIPRI flag set. The method is assisted by a new ki_cookie field in struct iocb to store the polling cookie.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Joseph Qi
Instead of using the static kconfig option CONFIG_PSI_CGROUP_V1, introduce a boot parameter, psi_v1, to enable psi cgroup v1 support. It is disabled by default, which means that when only the psi=1 boot parameter is passed, only cgroup v2 is supported. This keeps it consistent with other cgroup v1 features such as cgroup writeback v1 (cgwb_v1).
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
- 15 Jan, 2020 1 commit
-
-
Committed by Gavin Shan
This enables scanning pages at a fixed interval to determine their access frequency (hot/cold). The result is exported to user land per memory cgroup through "memory.idle_page_stats". The design is highlighted below:

* A kernel thread is spawned when this feature is enabled by writing a non-zero value to "/sys/kernel/mm/kidled/scan_period_in_seconds". The thread sequentially scans the nodes and their pages that have been chained up in the LRU lists.
* For each page, its corresponding age information is stored in the page flags or in an array in the node. The age represents the number of scanning intervals in which the page hasn't been accessed. Also, the page flag (PG_idle) is leveraged. The page's age is increased by one if the idle flag isn't cleared between two consecutive scans. Otherwise, the page's age is cleared. Also, the page's age information is cleared when it's freed so that stale age information won't be fetched when it's allocated again.
* Initially, the flag is set, while the access bit in its PTE is cleared by the thread. In the next scanning period, its PTE access bit is synchronized with the page flag: the flag is cleared if the access bit is set, and kept otherwise. For unmapped pages, the flag is cleared when the page is accessed.
* Eventually, the page's aging information is added to the unstable bucket of its corresponding memory cgroup as statistics. The unstable bucket (statistics) is copied to the stable bucket once all pages in all nodes have been scanned. The stable bucket (statistics) is exported to user land through "memory.idle_page_stats".

TESTING
=======

* cgroup1, unmapped pagecache

    # dd if=/dev/zero of=/ext4/test.data oflag=direct bs=1M count=128
    #
    # echo 1 > /sys/kernel/mm/kidled/use_hierarchy
    # echo 15 > /sys/kernel/mm/kidled/scan_period_in_seconds
    # mkdir -p /cgroup/memory
    # mount -tcgroup -o memory /cgroup/memory
    # echo 1 > /cgroup/memory/memory.use_hierarchy
    # mkdir -p /cgroup/memory/test
    # echo 1 > /cgroup/memory/test/memory.use_hierarchy
    #
    # echo $$ > /cgroup/memory/test/cgroup.procs
    # dd if=/ext4/test.data of=/dev/null bs=1M count=128
    # < wait a few minutes >
    # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
    # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
    cfei 0 0 0 134217728 0 0 0 0
    # cat /cgroup/memory/memory.idle_page_stats | grep cfei
    cfei 0 0 0 134217728 0 0 0 0

* cgroup1, mapped pagecache

    # < create same file and memory cgroups as above >
    #
    # echo $$ > /cgroup/memory/test/cgroup.procs
    # < run program to mmap the whole created file and access the area >
    # < wait a few minutes >
    # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
    cfei 0 134217728 0 0 0 0 0 0
    # cat /cgroup/memory/memory.idle_page_stats | grep cfei
    cfei 0 134217728 0 0 0 0 0 0

* cgroup1, mapped and locked pagecache

    # < create same file and memory cgroups as above >
    #
    # echo $$ > /cgroup/memory/test/cgroup.procs
    # < run program to mmap the whole created file and mlock the area >
    # < wait a few minutes >
    # cat /cgroup/memory/test/memory.idle_page_stats | grep cfui
    cfui 0 134217728 0 0 0 0 0 0
    # cat /cgroup/memory/memory.idle_page_stats | grep cfui
    cfui 0 134217728 0 0 0 0 0 0

* cgroup1, anonymous and locked area

    # < create memory cgroups as above >
    #
    # echo $$ > /cgroup/memory/test/cgroup.procs
    # < run program to mmap anonymous area and mlock it >
    # < wait a few minutes >
    # cat /cgroup/memory/test/memory.idle_page_stats | grep csui
    csui 0 0 134217728 0 0 0 0 0
    # cat /cgroup/memory/memory.idle_page_stats | grep csui
    csui 0 0 134217728 0 0 0 0 0

* Rerun the above test cases in cgroup2; the results show nothing exceptional. However, the cgroups are populated in a different way, as below:

    # mkdir -p /cgroup
    # mount -tcgroup2 none /cgroup
    # echo "+memory" > /cgroup/cgroup.subtree_control
    # mkdir -p /cgroup/test

Signed-off-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
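The test output above can be sanity-checked with a small parser. This is a sketch under stated assumptions: it treats each matching line of "memory.idle_page_stats" as a key followed by integer buckets, as in the quoted output, and it makes no claim about what each bucket's age range means. The helper name `parse_idle_page_stats` is hypothetical.

```python
def parse_idle_page_stats(text: str) -> dict:
    """Map each row key (e.g. "cfei") to its list of integer bucket values."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comments and anything that is not "key int int ...".
        if len(fields) < 2 or not all(f.isdigit() for f in fields[1:]):
            continue
        rows[fields[0]] = [int(f) for f in fields[1:]]
    return rows


# Line quoted from the "cgroup1, unmapped pagecache" test case.
sample = "cfei 0 0 0 134217728 0 0 0 0\n"

rows = parse_idle_page_stats(sample)
total = sum(rows["cfei"])
bucket = rows["cfei"].index(total)
print(total // (1024 * 1024), "MiB, all in bucket index", bucket)
```

The total equals 128 * 1024 * 1024, which is consistent with a byte count for the 128 MiB file created by dd with bs=1M count=128.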
-