- 25 2月, 2017 24 次提交
-
-
由 Mel Gorman 提交于
As suggested by Vlastimil Babka and Tejun Heo, this patch uses a static work_struct to co-ordinate the draining of per-cpu pages on the workqueue. Only one task can drain at a time but this is better than the previous scheme that allowed multiple tasks to send IPIs at a time. One consideration is whether parallel requests should synchronise against each other. This patch does not synchronise for a global drain as the common case for such callers is expected to be multiple parallel direct reclaimers competing for pages when the watermark is close to min. Draining the per-cpu list is unlikely to make much progress and serialising the drain is of dubious merit. Drains are synchonrised for callers such as memory hotplug and CMA that care about the drain being complete when the function returns. Link: http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net> Suggested-by: NTejun Heo <tj@kernel.org> Suggested-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
Since commit 682a3385 ("mm, page_alloc: inline the fast path of the zonelist iterator") we replace a NULL nodemask with cpuset_current_mems_allowed in the fast path, so that get_page_from_freelist() filters nodes allowed by the cpuset via for_next_zone_zonelist_nodemask(). In that case it's pointless to additionaly check __cpuset_zone_allowed() in each iteration, which we can avoid by not adding ALLOC_CPUSET to alloc_flags in that scenario. This saves some cycles in the allocator fast path on systems with one or more non-root cpuset configured. In the slow path, ALLOC_CPUSET is reset according to __alloc_pages_slowpath(). Without configured cpusets, this code is disabled by a static key. Link: http://lkml.kernel.org/r/20170124150511.5710-2-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NAnshuman Khandual <khandual@linux.vnet.ibm.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
The allocation fast path contains two similar checks for zoneref->zone being NULL, where zoneref points either to the first zone in the zonelist, or to the preferred zone. These can be NULL either due to empty zonelist, or no zone being compatible with given nodemask or task's cpuset. These checks are unnecessary, because the zonelist walks in first_zones_zonelist() and get_page_from_freelist() handle a NULL starting zoneref->zone or preferred_zoneref->zone safely. It's safe to fallback to __alloc_pages_slowpath() where we also have the check early enough. Link: http://lkml.kernel.org/r/20170124150511.5710-1-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@kernel.org> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 seokhoon.yoon 提交于
mmap_init() is no longer associated with VMA slab. So fix it. Link: http://lkml.kernel.org/r/1485182601-9294-1-git-send-email-iamyooon@gmail.comSigned-off-by: Nseokhoon.yoon <iamyooon@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Jiang 提交于
->fault(), ->page_mkwrite(), and ->pfn_mkwrite() calls do not need to take a vma and vmf parameter when the vma already resides in vmf. Remove the vma parameter to simplify things. [arnd@arndb.de: fix ARM build] Link: http://lkml.kernel.org/r/20170125223558.1451224-1-arnd@arndb.de Link: http://lkml.kernel.org/r/148521301778.19116.10840599906674778980.stgit@djiang5-desk3.ch.intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com> Signed-off-by: NArnd Bergmann <arnd@arndb.de> Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Matthew Wilcox <mawilcox@microsoft.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jan Kara <jack@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Many workloads that allocate pages are not handling an interrupt at a time. As allocation requests may be from IRQ context, it's necessary to disable/enable IRQs for every page allocation. This cost is the bulk of the free path but also a significant percentage of the allocation path. This patch alters the locking and checks such that only irq-safe allocation requests use the per-cpu allocator. All others acquire the irq-safe zone->lock and allocate from the buddy allocator. It relies on disabling preemption to safely access the per-cpu structures. It could be slightly modified to avoid soft IRQs using it but it's not clear it's worthwhile. This modification may slow allocations from IRQ context slightly but the main gain from the per-cpu allocator is that it scales better for allocations from multiple contexts. There is an implicit assumption that intensive allocations from IRQ contexts on multiple CPUs from a single NUMA node are rare and that the fast majority of scaling issues are encountered in !IRQ contexts such as page faulting. It's worth noting that this patch is not required for a bulk page allocator but it significantly reduces the overhead. The following is results from a page allocator micro-benchmark. Only order-0 is interesting as higher orders do not use the per-cpu allocator 4.10.0-rc2 4.10.0-rc2 vanilla irqsafe-v1r5 Amean alloc-odr0-1 287.15 ( 0.00%) 219.00 ( 23.73%) Amean alloc-odr0-2 221.23 ( 0.00%) 183.23 ( 17.18%) Amean alloc-odr0-4 187.00 ( 0.00%) 151.38 ( 19.05%) Amean alloc-odr0-8 167.54 ( 0.00%) 132.77 ( 20.75%) Amean alloc-odr0-16 156.00 ( 0.00%) 123.00 ( 21.15%) Amean alloc-odr0-32 149.00 ( 0.00%) 118.31 ( 20.60%) Amean alloc-odr0-64 138.77 ( 0.00%) 116.00 ( 16.41%) Amean alloc-odr0-128 145.00 ( 0.00%) 118.00 ( 18.62%) Amean alloc-odr0-256 136.15 ( 0.00%) 125.00 ( 8.19%) Amean alloc-odr0-512 147.92 ( 0.00%) 121.77 ( 17.68%) Amean alloc-odr0-1024 147.23 ( 0.00%) 126.15 ( 14.32%) Amean alloc-odr0-2048 155.15 ( 0.00%) 129.92 ( 16.26%) Amean alloc-odr0-4096 164.00 ( 0.00%) 136.77 ( 16.60%) Amean alloc-odr0-8192 166.92 ( 0.00%) 138.08 ( 17.28%) Amean alloc-odr0-16384 159.00 ( 0.00%) 138.00 ( 13.21%) Amean free-odr0-1 165.00 ( 0.00%) 89.00 ( 46.06%) Amean free-odr0-2 113.00 ( 0.00%) 63.00 ( 44.25%) Amean free-odr0-4 99.00 ( 0.00%) 54.00 ( 45.45%) Amean free-odr0-8 88.00 ( 0.00%) 47.38 ( 46.15%) Amean free-odr0-16 83.00 ( 0.00%) 46.00 ( 44.58%) Amean free-odr0-32 80.00 ( 0.00%) 44.38 ( 44.52%) Amean free-odr0-64 72.62 ( 0.00%) 43.00 ( 40.78%) Amean free-odr0-128 78.00 ( 0.00%) 42.00 ( 46.15%) Amean free-odr0-256 80.46 ( 0.00%) 57.00 ( 29.16%) Amean free-odr0-512 96.38 ( 0.00%) 64.69 ( 32.88%) Amean free-odr0-1024 107.31 ( 0.00%) 72.54 ( 32.40%) Amean free-odr0-2048 108.92 ( 0.00%) 78.08 ( 28.32%) Amean free-odr0-4096 113.38 ( 0.00%) 82.23 ( 27.48%) Amean free-odr0-8192 112.08 ( 0.00%) 82.85 ( 26.08%) Amean free-odr0-16384 110.38 ( 0.00%) 81.92 ( 25.78%) Amean total-odr0-1 452.15 ( 0.00%) 308.00 ( 31.88%) Amean total-odr0-2 334.23 ( 0.00%) 246.23 ( 26.33%) Amean total-odr0-4 286.00 ( 0.00%) 205.38 ( 28.19%) Amean total-odr0-8 255.54 ( 0.00%) 180.15 ( 29.50%) Amean total-odr0-16 239.00 ( 0.00%) 169.00 ( 29.29%) Amean total-odr0-32 229.00 ( 0.00%) 162.69 ( 28.96%) Amean total-odr0-64 211.38 ( 0.00%) 159.00 ( 24.78%) Amean total-odr0-128 223.00 ( 0.00%) 160.00 ( 28.25%) Amean total-odr0-256 216.62 ( 0.00%) 182.00 ( 15.98%) Amean total-odr0-512 244.31 ( 0.00%) 186.46 ( 23.68%) Amean total-odr0-1024 254.54 ( 0.00%) 198.69 ( 21.94%) Amean total-odr0-2048 264.08 ( 0.00%) 208.00 ( 21.24%) Amean total-odr0-4096 277.38 ( 0.00%) 219.00 ( 21.05%) Amean total-odr0-8192 279.00 ( 0.00%) 220.92 ( 20.82%) Amean total-odr0-16384 269.38 ( 0.00%) 219.92 ( 18.36%) This is the alloc, free and total overhead of allocating order-0 pages in batches of 1 page up to 16384 pages. Avoiding disabling/enabling overhead massively reduces overhead. Alloc overhead is roughly reduced by 14-20% in most cases. The free path is reduced by 26-46% and the total reduction is significant. Many users require zeroing of pages from the page allocator which is the vast cost of allocation. Hence, the impact on a basic page faulting benchmark is not that significant 4.10.0-rc2 4.10.0-rc2 vanilla irqsafe-v1r5 Hmean page_test 656632.98 ( 0.00%) 675536.13 ( 2.88%) Hmean brk_test 3845502.67 ( 0.00%) 3867186.94 ( 0.56%) Stddev page_test 10543.29 ( 0.00%) 4104.07 ( 61.07%) Stddev brk_test 33472.36 ( 0.00%) 15538.39 ( 53.58%) CoeffVar page_test 1.61 ( 0.00%) 0.61 ( 62.15%) CoeffVar brk_test 0.87 ( 0.00%) 0.40 ( 53.84%) Max page_test 666513.33 ( 0.00%) 678640.00 ( 1.82%) Max brk_test 3882800.00 ( 0.00%) 3887008.66 ( 0.11%) This is from aim9 and the most notable outcome is that fault variability is reduced by the patch. The headline improvement is small as the overall fault cost, zeroing, page table insertion etc dominate relative to disabling/enabling IRQs in the per-cpu allocator. Similarly, little benefit was seen on networking benchmarks both localhost and between physical server/clients where other costs dominate. It's possible that this will only be noticable on very high speed networks. Jesper Dangaard Brouer independently tested this with a separate microbenchmark from https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench Micro-benchmarked with [1] page_bench02: modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \ rmmod page_bench02 ; dmesg --notime | tail -n 4 Compared to baseline: 213 cycles(tsc) 53.417 ns - against this : 184 cycles(tsc) 46.056 ns - Saving : -29 cycles - Very close to expected 27 cycles saving [see below [2]] Micro benchmarking via time_bench_sample[3], we get the cost of these operations: time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.232 ns (step:0) time_bench: Type:spin_lock_unlock Per elem: 33 cycles(tsc) 8.334 ns (step:0) time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0) time_bench: Type:irqsave_before_lock Per elem: 57 cycles(tsc) 14.344 ns (step:0) time_bench: Type:spin_lock_unlock_irq Per elem: 34 cycles(tsc) 8.560 ns (step:0) time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0) time_bench: Type:local_BH_disable_enable Per elem: 19 cycles(tsc) 4.920 ns (step:0) time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0) time_bench: Type:local_irq_save_restore Per elem: 38 cycles(tsc) 9.665 ns (step:0) [Mel's patch removes a ^^^^^^^^^^^^^^^^] ^^^^^^^^^ expected saving - preempt cost time_bench: Type:preempt_disable_enable Per elem: 11 cycles(tsc) 2.794 ns (step:0) [adds a preempt ^^^^^^^^^^^^^^^^^^^^^^] ^^^^^^^^^ adds this cost time_bench: Type:funcion_call_cost Per elem: 6 cycles(tsc) 1.689 ns (step:0) time_bench: Type:func_ptr_call_cost Per elem: 11 cycles(tsc) 2.767 ns (step:0) time_bench: Type:page_alloc_put Per elem: 211 cycles(tsc) 52.803 ns (step:0) Thus, expected improvement is: 38-11 = 27 cycles. [mgorman@techsingularity.net: s/preempt_enable_no_resched/preempt_enable/] Link: http://lkml.kernel.org/r/20170208143128.25ahymqlyspjcixu@techsingularity.net Link: http://lkml.kernel.org/r/20170123153906.3122-5-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Acked-by: NJesper Dangaard Brouer <brouer@redhat.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Dmitry has reported the following lockdep splat lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753 __mutex_lock_common kernel/locking/mutex.c:521 [inline] mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621 pcpu_alloc+0xbda/0x1280 mm/percpu.c:896 __alloc_percpu+0x24/0x30 mm/percpu.c:1075 smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:44 cpuhp_invoke_callback+0x254/0x1480 kernel/cpu.c:136 cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:493 _cpu_up+0x1e3/0x2a0 kernel/cpu.c:1057 do_cpu_up+0x73/0xa0 kernel/cpu.c:1087 cpu_up+0x18/0x20 kernel/cpu.c:1095 smp_init+0xe9/0xee kernel/smp.c:564 kernel_init_freeable+0x439/0x690 init/main.c:1010 kernel_init+0x13/0x180 init/main.c:941 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433 cpu_hotplug_begin cpu_hotplug.lock pcpu_alloc pcpu_alloc_mutex get_online_cpus+0x62/0x90 kernel/cpu.c:248 drain_all_pages+0xf8/0x710 mm/page_alloc.c:2385 __alloc_pages_direct_reclaim mm/page_alloc.c:3440 [inline] __alloc_pages_slowpath+0x8fd/0x2370 mm/page_alloc.c:3778 __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3980 __alloc_pages include/linux/gfp.h:426 [inline] __alloc_pages_node include/linux/gfp.h:439 [inline] alloc_pages_node include/linux/gfp.h:453 [inline] pcpu_alloc_pages mm/percpu-vm.c:93 [inline] pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282 pcpu_alloc+0xe01/0x1280 mm/percpu.c:998 __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1062 bpf_array_alloc_percpu kernel/bpf/arraymap.c:34 [inline] array_map_alloc+0x532/0x710 kernel/bpf/arraymap.c:99 find_and_alloc_map kernel/bpf/syscall.c:34 [inline] map_create kernel/bpf/syscall.c:188 [inline] SYSC_bpf kernel/bpf/syscall.c:870 [inline] SyS_bpf+0xd64/0x2500 kernel/bpf/syscall.c:827 entry_SYSCALL_64_fastpath+0x1f/0xc2 pcpu_alloc pcpu_alloc_mutex drain_all_pages get_online_cpus cpu_hotplug.lock cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:304 _cpu_up+0xca/0x2a0 kernel/cpu.c:1011 do_cpu_up+0x73/0xa0 kernel/cpu.c:1087 cpu_up+0x18/0x20 kernel/cpu.c:1095 smp_init+0xe9/0xee kernel/smp.c:564 kernel_init_freeable+0x439/0x690 init/main.c:1010 kernel_init+0x13/0x180 init/main.c:941 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433 cpu_hotplug_begin cpu_hotplug.lock Pulling cpu hotplug locks inside the page allocator is just too dangerous. Let's remove the dependency by dropping get_online_cpus() from drain_all_pages. This is not so simple though because now we do not have a protection against cpu hotplug which means 2 things: - the work item might be executed on a different cpu in worker from unbound pool so it doesn't run on pinned on the cpu - we have to make sure that we do not race with page_alloc_cpu_dead calling drain_pages_zone Disabling preemption in drain_local_pages_wq will solve the first problem drain_local_pages will determine its local CPU from the WQ context which will be stable after that point, page_alloc_cpu_dead is pinned to the CPU already. The later condition is achieved by disabling IRQs in drain_pages_zone. Fixes: mm, page_alloc: drain per-cpu pages from workqueue context Link: http://lkml.kernel.org/r/20170207201950.20482-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Reported-by: NDmitry Vyukov <dvyukov@google.com> Acked-by: NTejun Heo <tj@kernel.org> Acked-by: NMel Gorman <mgorman@techsingularity.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
The per-cpu page allocator can be drained immediately via drain_all_pages() which sends IPIs to every CPU. In the next patch, the per-cpu allocator will only be used for interrupt-safe allocations which prevents draining it from IPI context. This patch uses workqueues to drain the per-cpu lists instead. This is slower but no slowdown during intensive reclaim was measured and the paths that use drain_all_pages() are not that sensitive to performance. This is particularly true as the path would only be triggered when reclaim is failing. It also makes a some sense to avoid storming a machine with IPIs when it's under memory pressure. Arguably, it should be further adjusted so that only one caller at a time is draining pages but it's beyond the scope of the current patch. Link: http://lkml.kernel.org/r/20170123153906.3122-4-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
alloc_pages_nodemask does a number of preperation steps that determine what zones can be used for the allocation depending on a variety of factors. This is fine but a hypothetical caller that wanted multiple order-0 pages has to do the preparation steps multiple times. This patch structures __alloc_pages_nodemask such that it's relatively easy to build a bulk order-0 page allocator. There is no functional change. Link: http://lkml.kernel.org/r/20170123153906.3122-3-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Patch series "Use per-cpu allocator for !irq requests and prepare for a bulk allocator", v5. This series is motivated by a conversation led by Jesper Dangaard Brouer at the last LSF/MM proposing a generic page pool for DMA-coherent pages. Part of his motivation was due to the overhead of allocating multiple order-0 that led some drivers to use high-order allocations and splitting them. This is very slow in some cases. The first two patches in this series restructure the page allocator such that it is relatively easy to introduce an order-0 bulk page allocator. A patch exists to do that and has been handed over to Jesper until an in-kernel users is created. The third patch prevents the per-cpu allocator being drained from IPI context as that can potentially corrupt the list after patch four is merged. The final patch alters the per-cpu alloctor to make it exclusive to !irq requests. This cuts allocation/free overhead by roughly 30%. Performance tests from both Jesper and me are included in the patch. This patch (of 4): buffered_rmqueue removes a page from a given zone and uses the per-cpu list for order-0. This is fine but a hypothetical caller that wanted multiple order-0 pages has to disable/reenable interrupts multiple times. This patch structures buffere_rmqueue such that it's relatively easy to build a bulk order-0 page allocator. There is no functional change. [mgorman@techsingularity.net: failed per-cpu refill may blow up] Link: http://lkml.kernel.org/r/20170124112723.mshmgwq2ihxku2um@techsingularity.net Link: http://lkml.kernel.org/r/20170123153906.3122-2-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
We noticed a performance regression when moving hadoop workloads from 3.10 kernels to 4.0 and 4.6. This is accompanied by increased pageout activity initiated by kswapd as well as frequent bursts of allocation stalls and direct reclaim scans. Even lowering the dirty ratios to the equivalent of less than 1% of memory would not eliminate the issue, suggesting that dirty pages concentrate where the scanner is looking. This can be traced back to recent efforts of thrash avoidance. Where 3.10 would not detect refaulting pages and continuously supply clean cache to the inactive list, a thrashing workload on 4.0+ will detect and activate refaulting pages right away, distilling used-once pages on the inactive list much more effectively. This is by design, and it makes sense for clean cache. But for the most part our workload's cache faults are refaults and its use-once cache is from streaming writes. We end up with most of the inactive list dirty, and we don't go after the active cache as long as we have use-once pages around. But waiting for writes to avoid reclaiming clean cache that *might* refault is a bad trade-off. Even if the refaults happen, reads are faster than writes. Before getting bogged down on writeback, reclaim should first look at *all* cache in the system, even active cache. To accomplish this, activate pages that are dirty or under writeback when they reach the end of the inactive LRU. The pages are marked for immediate reclaim, meaning they'll get moved back to the inactive LRU tail as soon as they're written back and become reclaimable. But in the meantime, by reducing the inactive list to only immediately reclaimable pages, we allow the scanner to deactivate and refill the inactive list with clean cache from the active list tail to guarantee forward progress. [hannes@cmpxchg.org: update comment] Link: http://lkml.kernel.org/r/20170202191957.22872-8-hannes@cmpxchg.org Link: http://lkml.kernel.org/r/20170123181641.23938-6-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMinchan Kim <minchan@kernel.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Dirty pages can easily reach the end of the LRU while there are still clean pages to reclaim around. Don't let kswapd write them back just because there are a lot of them. It costs more CPU to find the clean pages, but that's almost certainly better than to disrupt writeback from the flushers with LRU-order single-page writes from reclaim. And the flushers have been woken up by that point, so we spend IO capacity on flushing and CPU capacity on finding the clean cache. Only start writing dirty pages if they have cycled around the LRU twice now and STILL haven't been queued on the IO device. It's possible that the dirty pages are so sparsely distributed across different bdis, inodes, memory cgroups, that the flushers take forever to get to the ones we want reclaimed. Once we see them twice on the LRU, we know that's the quicker way to find them, so do LRU writeback. Link: http://lkml.kernel.org/r/20170123181641.23938-5-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMinchan Kim <minchan@kernel.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Direct reclaim has been replaced by kswapd reclaim in pretty much all common memory pressure situations, so this code most likely doesn't accomplish the described effect anymore. The previous patch wakes up flushers for all reclaimers when we encounter dirty pages at the tail end of the LRU. Remove the crufty old direct reclaim invocation. Link: http://lkml.kernel.org/r/20170123181641.23938-4-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMinchan Kim <minchan@kernel.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Memory pressure can put dirty pages at the end of the LRU without anybody running into dirty limits. Don't start writing individual pages from kswapd while the flushers might be asleep. Unlike the old direct reclaim flusher wakeup (removed in the next patch) that flushes the number of pages just scanned, this patch wakes the flushers for all outstanding dirty pages. That seemed to perform better in a synthetic test that pushes dirty pages to the end of the LRU and into reclaim, because we know LRU aging outstrips writeback already, and this way we give younger dirty pages a headstart rather than wait until reclaim runs into them as well. It also means less plugging and risk of exhausting the struct request pool from reclaim. There is a concern that this will cause temporary files that used to get dirtied and truncated before writeback to now get written to disk under memory pressure. If this turns out to be a real problem, we'll have to revisit this and tame the reclaim flusher wakeups. [hannes@cmpxchg.org: mention dirty expiration as a condition] Link: http://lkml.kernel.org/r/20170126174739.GA30636@cmpxchg.org Link: http://lkml.kernel.org/r/20170123181641.23938-3-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMinchan Kim <minchan@kernel.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Patch series "mm: vmscan: fix kswapd writeback regression". We noticed a regression on multiple hadoop workloads when moving from 3.10 to 4.0 and 4.6, which involves kswapd getting tangled up in page writeout, causing direct reclaim herds that also don't make progress. I tracked it down to the thrash avoidance efforts after 3.10 that make the kernel better at keeping use-once cache and use-many cache sorted on the inactive and active list, with more aggressive protection of the active list as long as there is inactive cache. Unfortunately, our workload's use-once cache is mostly from streaming writes. Waiting for writes to avoid potential reloads in the future is not a good tradeoff. These patches do the following: 1. Wake the flushers when kswapd sees a lump of dirty pages. It's possible to be below the dirty background limit and still have cache velocity push them through the LRU. So start a-flushin'. 2. Let kswapd only write pages that have been rotated twice. This makes sure we really tried to get all the clean pages on the inactive list before resorting to horrible LRU-order writeback. 3. Move rotating dirty pages off the inactive list. Instead of churning or waiting on page writeback, we'll go after clean active cache. This might lead to thrashing, but in this state memory demand outstrips IO speed anyway, and reads are faster than writes. Mel backported the series to 4.10-rc5 with one minor conflict and ran a couple of tests on it. Mix of read/write random workload didn't show anything interesting. Write-only database didn't show much difference in performance but there were slight reductions in IO -- probably in the noise. simoop did show big differences although not as big as Mel expected. This is Chris Mason's workload that similate the VM activity of hadoop. Mel won't go through the full details but over the samples measured during an hour it reported 4.10.0-rc5 4.10.0-rc5 vanilla johannes-v1r1 Amean p50-Read 21346531.56 ( 0.00%) 21697513.24 ( -1.64%) Amean p95-Read 24700518.40 ( 0.00%) 25743268.98 ( -4.22%) Amean p99-Read 27959842.13 ( 0.00%) 28963271.11 ( -3.59%) Amean p50-Write 1138.04 ( 0.00%) 989.82 ( 13.02%) Amean p95-Write 1106643.48 ( 0.00%) 12104.00 ( 98.91%) Amean p99-Write 1569213.22 ( 0.00%) 36343.38 ( 97.68%) Amean p50-Allocation 85159.82 ( 0.00%) 79120.70 ( 7.09%) Amean p95-Allocation 204222.58 ( 0.00%) 129018.43 ( 36.82%) Amean p99-Allocation 278070.04 ( 0.00%) 183354.43 ( 34.06%) Amean final-p50-Read 21266432.00 ( 0.00%) 21921792.00 ( -3.08%) Amean final-p95-Read 24870912.00 ( 0.00%) 26116096.00 ( -5.01%) Amean final-p99-Read 28147712.00 ( 0.00%) 29523968.00 ( -4.89%) Amean final-p50-Write 1130.00 ( 0.00%) 977.00 ( 13.54%) Amean final-p95-Write 1033216.00 ( 0.00%) 2980.00 ( 99.71%) Amean final-p99-Write 1517568.00 ( 0.00%) 32672.00 ( 97.85%) Amean final-p50-Allocation 86656.00 ( 0.00%) 78464.00 ( 9.45%) Amean final-p95-Allocation 211712.00 ( 0.00%) 116608.00 ( 44.92%) Amean final-p99-Allocation 287232.00 ( 0.00%) 168704.00 ( 41.27%) The latencies are actually completely horrific in comparison to 4.4 (and 4.10-rc5 is worse than 4.9 according to historical data for reasons Mel hasn't analysed yet). Still, 95% of write latency (p95-write) is halved by the series and allocation latency is way down. Direct reclaim activity is one fifth of what it was according to vmstats. Kswapd activity is higher but this is not necessarily surprising. Kswapd efficiency is unchanged at 99% (99% of pages scanned were reclaimed) but direct reclaim efficiency went from 77% to 99% In the vanilla kernel, 627MB of data was written back from reclaim context. With the series, no data was written back. With or without the patch, pages are being immediately reclaimed after writeback completes. However, with the patch, only 1/8th of the pages are reclaimed like this. This patch (of 5): We have an elaborate dirty/writeback throttling mechanism inside the reclaim scanner, but for that to work the pages have to go through shrink_page_list() and get counted for what they are. Otherwise, we mess up the LRU order and don't match reclaim speed to writeback. Especially during deactivation, there is never a reason to skip dirty pages; nothing is even trying to write them out from there. Don't mess up the LRU order for nothing, shuffle these pages along. Link: http://lkml.kernel.org/r/20170123181641.23938-2-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMinchan Kim <minchan@kernel.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mike Rapoport 提交于
When a page is removed from a shared mapping, the uffd reader should be notified, so that it won't attempt to handle #PF events for the removed pages. We can reuse the UFFD_EVENT_REMOVE because from the uffd monitor point of view, the semantices of madvise(MADV_DONTNEED) and madvise(MADV_REMOVE) is exactly the same. Link: http://lkml.kernel.org/r/1484814154-1557-3-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Acked-by: NPavel Emelyanov <xemul@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mike Rapoport 提交于
Patch series "userfaultfd: non-cooperative: add madvise() event for MADV_REMOVE request". These patches add notification of madvise(MADV_REMOVE) event to non-cooperative userfaultfd monitor. The first pacth renames EVENT_MADVDONTNEED to EVENT_REMOVE along with relevant functions and structures. Using _REMOVE instead of _MADVDONTNEED describes the event semantics more clearly and I hope it's not too late for such change in the ABI. This patch (of 3): The UFFD_EVENT_MADVDONTNEED purpose is to notify uffd monitor about removal of certain range from address space tracked by userfaultfd. Hence, UFFD_EVENT_REMOVE seems to better reflect the operation semantics. Respectively, 'madv_dn' field of uffd_msg is renamed to 'remove' and the madvise_userfault_dontneed callback is renamed to userfaultfd_remove. Link: http://lkml.kernel.org/r/1484814154-1557-2-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Heiko Carstens 提交于
Provide the name of each memblock type with struct memblock_type. This allows to get rid of the function memblock_type_name() and duplicating the type names in __memblock_dump_all(). The only memblock_type usage out of mm/memblock.c seems to be arch/s390/kernel/crash_dump.c. While at it, give it a name. Link: http://lkml.kernel.org/r/20170120123456.46508-4-heiko.carstens@de.ibm.comSigned-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Heiko Carstens 提交于
Since commit 70210ed9 ("mm/memblock: add physical memory list") the memblock structure knows about a physical memory list. The physical memory list should also be dumped if memblock_dump_all() is called in case memblock_debug is switched on. This makes debugging a bit easier. Link: http://lkml.kernel.org/r/20170120123456.46508-3-heiko.carstens@de.ibm.comSigned-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Heiko Carstens 提交于
Since commit 70210ed9 ("mm/memblock: add physical memory list") the memblock structure knows about a physical memory list. memblock_type_name() should return "physmem" instead of "unknown" if the name of the physmem memblock_type is being asked for. Link: http://lkml.kernel.org/r/20170120123456.46508-2-heiko.carstens@de.ibm.comSigned-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Cc: Philipp Hachtmann <phacht@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
It has no modular callers. Cc: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dan Williams 提交于
mem_hotplug_begin() assumes that it can set mem_hotplug.active_writer and run the hotplug process without racing another thread. Validate this assumption with a lockdep assertion. Link: http://lkml.kernel.org/r/148693886229.16345.1770484669403334689.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com> Reported-by: NBen Hutchings <ben@decadent.org.uk> Cc: Michal Hocko <mhocko@suse.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
Commit 82e7d3ab ("oom: print nodemask in the oom report") implicitly sets the allocation nodemask to cpuset_current_mems_allowed when there is no effective mempolicy. cpuset_current_mems_allowed is only effective when cpusets are enabled, which is also printed by dump_header(), so setting the nodemask to cpuset_current_mems_allowed is redundant and prevents debugging issues where ac->nodemask is not set properly in the page allocator. This provides better debugging output since cpuset_print_current_mems_allowed() is already provided. [rientjes@google.com: newline per Hillf] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701200158300.88321@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701191454470.2381@chino.kir.corp.google.comSigned-off-by: NDavid Rientjes <rientjes@google.com> Suggested-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Claudio Imbrenda 提交于
Some architectures have a set of zero pages (coloured zero pages) instead of only one zero page, in order to improve the cache performance. In those cases, the kernel samepage merger (KSM) would merge all the allocated pages that happen to be filled with zeroes to the same deduplicated page, thus losing all the advantages of coloured zero pages. This behaviour is noticeable when a process accesses large arrays of allocated pages containing zeroes. A test I conducted on s390 shows that there is a speed penalty when KSM merges such pages, compared to not merging them or using actual zero pages from the start without breaking the COW. This patch fixes this behaviour. When coloured zero pages are present, the checksum of a zero page is calculated during initialisation, and compared with the checksum of the current canditate during merging. In case of a match, the normal merging routine is used to merge the page with the correct coloured zero page, which ensures the candidate page is checked to be equal to the target zero page. A sysfs entry is also added to toggle this behaviour, since it can potentially introduce performance regressions, especially on architectures without coloured zero pages. The default value is disabled, for backwards compatibility. With this patch, the performance with KSM is the same as with non COW-broken actual zero pages, which is also the same as without KSM. [akpm@linux-foundation.org: make zero_checksum and ksm_use_zero_pages __read_mostly, per Andrea] [imbrenda@linux.vnet.ibm.com: documentation for coloured zero pages deduplication] Link: http://lkml.kernel.org/r/1484927522-1964-1-git-send-email-imbrenda@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/1484850953-23941-1-git-send-email-imbrenda@linux.vnet.ibm.comSigned-off-by: NClaudio Imbrenda <imbrenda@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 2月, 2017 16 次提交
-
-
由 zhong jiang 提交于
At present, Tying the first_num size to NCHUNKS_ORDER is confusing. the number of chunks is completely unrelated to the number of buddies. The patch limits the first_num to actual range of possible buddy indexes. and that is more reasonable and obvious without functional change. Link: http://lkml.kernel.org/r/1476776569-29504-1-git-send-email-zhongjiang@huawei.comSigned-off-by: Nzhong jiang <zhongjiang@huawei.com> Suggested-by: NDan Streetman <ddstreet@ieee.org> Acked-by: NDan Streetman <ddstreet@ieee.org> Acked-by: NVitaly Wool <vitalywool@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miles Chen 提交于
There is no variable named flags in memblock_add() and memblock_reserve() so remove it from the log messages. This patch also cleans up the type casting for phys_addr_t by using %pa to print them. Link: http://lkml.kernel.org/r/1484720165-25403-1-git-send-email-miles.chen@mediatek.comSigned-off-by: NMiles Chen <miles.chen@mediatek.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Logic on whether we can reap pages from the VMA should match what we have in madvise_dontneed(). In particular, we should skip, VM_PFNMAP VMAs, but we don't now. Let's just extract condition on which we can shoot down pagesi from a VMA with MADV_DONTNEED into separate function and use it in both places. Link: http://lkml.kernel.org/r/20170118122429.43661-4-kirill.shutemov@linux.intel.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
There's no users of zap_page_range() who wants non-NULL 'details'. Let's drop it. Link: http://lkml.kernel.org/r/20170118122429.43661-3-kirill.shutemov@linux.intel.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
detail == NULL would give the same functionality as .check_swap_entries==true. Link: http://lkml.kernel.org/r/20170118122429.43661-2-kirill.shutemov@linux.intel.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
The only user of ignore_dirty is oom-reaper. But it doesn't really use it. ignore_dirty only has effect on file pages mapped with dirty pte. But oom-repear skips shared VMAs, so there's no way we can dirty file pte in them. Link: http://lkml.kernel.org/r/20170118122429.43661-1-kirill.shutemov@linux.intel.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
The patch "mm, page_alloc: warn_alloc print nodemask" implicitly sets the allocation nodemask to cpuset_current_mems_allowed when there is no effective mempolicy. cpuset_current_mems_allowed is only effective when cpusets are enabled, which is also printed by warn_alloc(), so setting the nodemask to cpuset_current_mems_allowed is redundant and prevents debugging issues where ac->nodemask is not set properly in the page allocator. This provides better debugging output since cpuset_print_current_mems_allowed() is already provided. Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701181347320.142399@chino.kir.corp.google.comSigned-off-by: NDavid Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Now that __GFP_NOFAIL doesn't override decisions to skip the oom killer we are left with requests which require to loop inside the allocator without invoking the oom killer (e.g. GFP_NOFS|__GFP_NOFAIL used by fs code) and so they might, in very unlikely situations, loop for ever - e.g. other parallel request could starve them. This patch tries to limit the likelihood of such a lockup by giving these __GFP_NOFAIL requests a chance to move on by consuming a small part of memory reserves. We are using ALLOC_HARDER which should be enough to prevent from the starvation by regular allocation requests, yet it shouldn't consume enough from the reserves to disrupt high priority requests (ALLOC_HIGH). While we are at it, let's introduce a helper __alloc_pages_cpuset_fallback which enforces the cpusets but allows to fallback to ignore them if the first attempt fails. __GFP_NOFAIL requests can be considered important enough to allow cpuset runaway in order for the system to move on. It is highly unlikely that any of these will be GFP_USER anyway. Link: http://lkml.kernel.org/r/20161220134904.21023-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
__alloc_pages_may_oom makes sure to skip the OOM killer depending on the allocation request. This includes lowmem requests, costly high order requests and others. For a long time __GFP_NOFAIL acted as an override for all those rules. This is not documented and it can be quite surprising as well. E.g. GFP_NOFS requests are not invoking the OOM killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of the existing open coded loops around allocator to nofail request (and we have done that in the past) then such a change would have a non trivial side effect which is far from obvious. Note that the primary motivation for skipping the OOM killer is to prevent from pre-mature invocation. The exception has been added by commit 82553a93 ("oom: invoke oom killer for __GFP_NOFAIL"). The changelog points out that the oom killer has to be invoked otherwise the request would be looping for ever. But this argument is rather weak because the OOM killer doesn't really guarantee a forward progress for those exceptional cases: - it will hardly help to form costly order which in turn can result in the system panic because of no oom killable task in the end - I believe we certainly do not want to put the system down just because there is a nasty driver asking for order-9 page with GFP_NOFAIL not realizing all the consequences. It is much better this request would loop for ever than the massive system disruption - lowmem is also highly unlikely to be freed during OOM killer - GFP_NOFS request could trigger while there is still a lot of memory pinned by filesystems. This patch simply removes the __GFP_NOFAIL special case in order to have a more clear semantic without surprising side effects. Signed-off-by: NMichal Hocko <mhocko@suse.com> Reported-by: NNils Holland <nholland@tisys.org> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Tetsuo Handa has pointed out that commit 0a0337e0 ("mm, oom: rework oom detection") has subtly changed semantic for costly high order requests with __GFP_NOFAIL and withtout __GFP_REPEAT and those can fail right now. My code inspection didn't reveal any such users in the tree but it is true that this might lead to unexpected allocation failures and subsequent OOPs. __alloc_pages_slowpath wrt. GFP_NOFAIL is hard to follow currently. There are few special cases but we are lacking a catch all place to be sure we will not miss any case where the non failing allocation might fail. This patch reorganizes the code a bit and puts all those special cases under nopage label which is the generic go-to-fail path. Non failing allocations are retried or those that cannot retry like non-sleeping allocation go to the failure point directly. This should make the code flow much easier to follow and make it less error prone for future changes. While we are there we have to move the stall check up to catch potentially looping non-failing allocations. [akpm@linux-foundation.org: fix alloc_flags may-be-used-uninitalized] Link: http://lkml.kernel.org/r/20161220134904.21023-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
show_mem() allows to filter out node specific data which is irrelevant to the allocation request via SHOW_MEM_FILTER_NODES. The filtering is done in skip_free_areas_node which skips all nodes which are not in the mems_allowed of the current process. This works most of the time as expected because the nodemask shouldn't be outside of the allocating task but there are some exceptions. E.g. memory hotplug might want to request allocations from outside of the allowed nodes (see new_node_page). Get rid of this hardcoded behavior and push the allocation mask down the show_mem path and use it instead of cpuset_current_mems_allowed. NULL nodemask is interpreted as cpuset_current_mems_allowed. [akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
warn_alloc is currently used for to report an allocation failure or an allocation stall. We print some details of the allocation request like the gfp mask and the request order. We do not print the allocation nodemask which is important when debugging the reason for the allocation failure as well. We alreaddy print the nodemask in the OOM report. Add nodemask to warn_alloc and print it in warn_alloc as well. Link: http://lkml.kernel.org/r/20170117091543.25850-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Patch series "show_mem updates", v2. This is a mixture of one bug fix (patch 1), an enhancement (patch 2) and cleanups (the rest of the series). First two patches should be really straightforward. Patch 3 removes some arch specific show_mem implementations because I think they are quite outdated and do not really serve any useful purpose anymore. I think we should really strive to have a consistent show_mem output regardless of the architecture. If some architecture is really special and wants to dump something additional we should do that via an arch specific hook. The last patch adds nodemask parameter so that we do not rely on the hardcoded mems_allowed of the current task when doing the node filtering. I consider this more a cleanup than a fix because basically all users use a nodemask which is a subset of mems_allowed. There is only one call path in the memory hotplug which doesn't comply with this but that is hardly something to worry about. This patch (of 4): Commit 599d0c95 ("mm, vmscan: move LRU lists to node") has added per numa node statistics to show_mem but it forgot to add skip_free_areas_node to filter out nodes which are outside of the allocating task numa policy. Add this check to not pollute the output with the pointless information. Link: http://lkml.kernel.org/r/20170117091543.25850-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
This reverts commit 91dcade4. inactive_reclaimable_pages shouldn't be needed anymore since that get_scan_count is aware of the eligble zones ("mm, vmscan: consider eligible zones in get_scan_count"). Link: http://lkml.kernel.org/r/20170117103702.28542-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Acked-by: NMinchan Kim <minchan@kernel.org> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NJohannes Weiner <hannes@cmpchxg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
get_scan_count() considers the whole node LRU size when - doing SCAN_FILE due to many page cache inactive pages - calculating the number of pages to scan In both cases this might lead to unexpected behavior especially on 32b systems where we can expect lowmem memory pressure very often. A large highmem zone can easily distort SCAN_FILE heuristic because there might be only few file pages from the eligible zones on the node lru and we would still enforce file lru scanning which can lead to trashing while we could still scan anonymous pages. The later use of lruvec_lru_size can be problematic as well. Especially when there are not many pages from the eligible zones. We would have to skip over many pages to find anything to reclaim but shrink_node_memcg would only reduce the remaining number to scan by SWAP_CLUSTER_MAX at maximum. Therefore we can end up going over a large LRU many times without actually having chance to reclaim much if anything at all. The closer we are out of memory on lowmem zone the worse the problem will be. Fix this by filtering out all the ineligible zones when calculating the lru size for both paths and consider only sc->reclaim_idx zones. The patch would need to be tweaked a bit to apply to 4.10 and older but I will do that as soon as it hits the Linus tree in the next merge window. Link: http://lkml.kernel.org/r/20170117103702.28542-3-mhocko@kernel.org Fixes: b2e18757 ("mm, vmscan: begin reclaiming pages on a per-node basis") Signed-off-by: NMichal Hocko <mhocko@suse.com> Tested-by: NTrevor Cordes <trevor@tecnopolis.ca> Acked-by: NMinchan Kim <minchan@kernel.org> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [4.8+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
lruvec_lru_size returns the full size of the LRU list while we sometimes need a value reduced only to eligible zones (e.g. for lowmem requests). inactive_list_is_low is one such user. Later patches will add more of them. Add a new parameter to lruvec_lru_size and allow it filter out zones which are not eligible for the given context. Link: http://lkml.kernel.org/r/20170117103702.28542-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com> Acked-by: NMinchan Kim <minchan@kernel.org> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-