- 18 3月, 2016 5 次提交
-
-
由 Johannes Weiner 提交于
In machines with 140G of memory and enterprise flash storage, we have seen read and write bursts routinely exceed the kswapd watermarks and cause thundering herds in direct reclaim. Unfortunately, the only way to tune kswapd aggressiveness is through adjusting min_free_kbytes - the system's emergency reserves - which is entirely unrelated to the system's latency requirements. In order to get kswapd to maintain a 250M buffer of free memory, the emergency reserves need to be set to 1G. That is a lot of memory wasted for no good reason. On the other hand, it's reasonable to assume that allocation bursts and overall allocation concurrency scale with memory capacity, so it makes sense to make kswapd aggressiveness a function of that as well. Change the kswapd watermark scale factor from the currently fixed 25% of the tunable emergency reserve to a tunable 0.1% of memory. Beyond 1G of memory, this will produce bigger watermark steps than the current formula in default settings. Ensure that the new formula never chooses steps smaller than that, i.e. 25% of the emergency reserve. On a 140G machine, this raises the default watermark steps - the distance between min and low, and low and high - from 16M to 143M. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Igor Redko 提交于
Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory statistics protocol, corresponding to 'Available' in /proc/meminfo. It indicates to the hypervisor how big the balloon can be inflated without pushing the guest system to swap. This metric would be very useful in VM orchestration software to improve memory management of different VMs under overcommit. This patch (of 2): Factor out calculation of the available memory counter into a separate exportable function, in order to be able to use it in other parts of the kernel. In particular, it appears a relevant metric to report to the hypervisor via virtio-balloon statistics interface (in a followup patch). Signed-off-by: NIgor Redko <redkoi@virtuozzo.com> Signed-off-by: NDenis V. Lunev <den@openvz.org> Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
Memory compaction can be currently performed in several contexts: - kswapd balancing a zone after a high-order allocation failure - direct compaction to satisfy a high-order allocation, including THP page fault attemps - khugepaged trying to collapse a hugepage - manually from /proc The purpose of compaction is two-fold. The obvious purpose is to satisfy a (pending or future) high-order allocation, and is easy to evaluate. The other purpose is to keep overal memory fragmentation low and help the anti-fragmentation mechanism. The success wrt the latter purpose is more The current situation wrt the purposes has a few drawbacks: - compaction is invoked only when a high-order page or hugepage is not available (or manually). This might be too late for the purposes of keeping memory fragmentation low. - direct compaction increases latency of allocations. Again, it would be better if compaction was performed asynchronously to keep fragmentation low, before the allocation itself comes. - (a special case of the previous) the cost of compaction during THP page faults can easily offset the benefits of THP. - kswapd compaction appears to be complex, fragile and not working in some scenarios. It could also end up compacting for a high-order allocation request when it should be reclaiming memory for a later order-0 request. To improve the situation, we should be able to benefit from an equivalent of kswapd, but for compaction - i.e. a background thread which responds to fragmentation and the need for high-order allocations (including hugepages) somewhat proactively. One possibility is to extend the responsibilities of kswapd, which could however complicate its design too much. It should be better to let kswapd handle reclaim, as order-0 allocations are often more critical than high-order ones. Another possibility is to extend khugepaged, but this kthread is a single instance and tied to THP configs. This patch goes with the option of a new set of per-node kthreads called kcompactd, and lays the foundations, without introducing any new tunables. The lifecycle mimics kswapd kthreads, including the memory hotplug hooks. For compaction, kcompactd uses the standard compaction_suitable() and ompact_finished() criteria and the deferred compaction functionality. Unlike direct compaction, it uses only sync compaction, as there's no allocation latency to minimize. This patch doesn't yet add a call to wakeup_kcompactd. The kswapd compact/reclaim loop for high-order pages will be replaced by waking up kcompactd in the next patch with the description of what's wrong with the old approach. Waking up of the kcompactd threads is also tied to kswapd activity and follows these rules: - we don't want to affect any fastpaths, so wake up kcompactd only from the slowpath, as it's done for kswapd - if kswapd is doing reclaim, it's more important than compaction, so don't invoke kcompactd until kswapd goes to sleep - the target order used for kswapd is passed to kcompactd Future possible future uses for kcompactd include the ability to wake up kcompactd on demand in special situations, such as when hugepages are not available (currently not done due to __GFP_NO_KSWAPD) or when a fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also possible to perform periodic compaction with kcompactd. [arnd@arndb.de: fix build errors with kcompactd] [paul.gortmaker@windriver.com: don't use modular references for non modular code] Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
We can disable debug_pagealloc processing even if the code is compiled with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query whether it is enabled or not in runtime. [akpm@linux-foundation.org: export _debug_pagealloc_enabled to modules] Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NTakashi Iwai <tiwai@suse.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Currently /proc/kpageflags returns nothing for "tail" buddy pages, which is inconvenient when grasping how free pages are distributed. This patch sets KPF_BUDDY for such pages. With this patch: $ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy MemFree: 3134992 kB flags page-count MB symbolic-flags long-symbolic-flags 0x0000000000000400 779272 3044 __________B_______________________________ buddy 0x0000000000000c00 4385 17 __________BM______________________________ buddy,mmap total 783657 3061 783657 pages is 3134628 kB (roughly consistent with the global counter,) so it's OK. [akpm@linux-foundation.org: update comment, per Naoya] Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com>> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 3月, 2016 11 次提交
-
-
由 Joonsoo Kim 提交于
There is a performance drop report due to hugepage allocation and in there half of cpu time are spent on pageblock_pfn_to_page() in compaction [1]. In that workload, compaction is triggered to make hugepage but most of pageblocks are un-available for compaction due to pageblock type and skip bit so compaction usually fails. Most costly operations in this case is to find valid pageblock while scanning whole zone range. To check if pageblock is valid to compact, valid pfn within pageblock is required and we can obtain it by calling pageblock_pfn_to_page(). This function checks whether pageblock is in a single zone and return valid pfn if possible. Problem is that we need to check it every time before scanning pageblock even if we re-visit it and this turns out to be very expensive in this workload. Although we have no way to skip this pageblock check in the system where hole exists at arbitrary position, we can use cached value for zone continuity and just do pfn_to_page() in the system where hole doesn't exist. This optimization considerably speeds up in above workload. Before vs After Max: 1096 MB/s vs 1325 MB/s Min: 635 MB/s 1015 MB/s Avg: 899 MB/s 1194 MB/s Avg is improved by roughly 30% [2]. [1]: http://www.spinics.net/lists/linux-mm/msg97378.html [2]: https://lkml.org/lkml/2015/12/9/23 [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil] Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Reported-by: NAaron Lu <aaron.lu@intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Tested-by: NAaron Lu <aaron.lu@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Laura Abbott 提交于
By default, page poisoning uses a poison value (0xaa) on free. If this is changed to 0, the page is not only sanitized but zeroing on alloc with __GFP_ZERO can be skipped as well. The tradeoff is that detecting corruption from the poisoning is harder to detect. This feature also cannot be used with hibernation since pages are not guaranteed to be zeroed after hibernation. Credit to Grsecurity/PaX team for inspiring this work Signed-off-by: NLaura Abbott <labbott@fedoraproject.org> Acked-by: NRafael J. Wysocki <rjw@rjwysocki.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mathias Krause <minipli@googlemail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Laura Abbott 提交于
Page poisoning is currently set up as a feature if architectures don't have architecture debug page_alloc to allow unmapping of pages. It has uses apart from that though. Clearing of the pages on free provides an increase in security as it helps to limit the risk of information leaks. Allow page poisoning to be enabled as a separate option independent of kernel_map pages since the two features do separate work. Because of how hiberanation is implemented, the checks on alloc cannot occur if hibernation is enabled. The runtime alloc checks can also be enabled with an option when !HIBERNATION. Credit to Grsecurity/PaX team for inspiring this work Signed-off-by: NLaura Abbott <labbott@fedoraproject.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mathias Krause <minipli@googlemail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Jianyu Zhan <nasa4836@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
Since bad_page() is the only user of the badflags parameter of dump_page_badflags(), we can move the code to bad_page() and simplify a bit. The dump_page_badflags() function is renamed to __dump_page() and can still be called separately from dump_page() for temporary debug prints where page_owner info is not desired. The only user-visible change is that page->mem_cgroup is printed before the bad flags. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
The page_owner mechanism is useful for dealing with memory leaks. By reading /sys/kernel/debug/page_owner one can determine the stack traces leading to allocations of all pages, and find e.g. a buggy driver. This information might be also potentially useful for debugging, such as the VM_BUG_ON_PAGE() calls to dump_page(). So let's print the stored info from dump_page(). Example output: page:ffffea000292f1c0 count:1 mapcount:0 mapping:ffff8800b2f6cc18 index:0x91d flags: 0x1fffff8001002c(referenced|uptodate|lru|mappedtodisk) page dumped because: VM_BUG_ON_PAGE(1) page->mem_cgroup:ffff8801392c5000 page allocated via order 0, migratetype Movable, gfp_mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY) [<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230 [<ffffffff811b40c8>] alloc_pages_current+0x88/0x120 [<ffffffff8115e386>] __page_cache_alloc+0xe6/0x120 [<ffffffff8116ba6c>] __do_page_cache_readahead+0xdc/0x240 [<ffffffff8116bd05>] ondemand_readahead+0x135/0x260 [<ffffffff8116be9c>] page_cache_async_readahead+0x6c/0x70 [<ffffffff811604c2>] generic_file_read_iter+0x3f2/0x760 [<ffffffff811e0dc7>] __vfs_read+0xa7/0xd0 page has been migrated, last migrate reason: compaction Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
The information in /sys/kernel/debug/page_owner includes the migratetype of the pageblock the page belongs to. This is also checked against the page's migratetype (as declared by gfp_flags during its allocation), and the page is reported as Fallback if its migratetype differs from the pageblock's one. t This is somewhat misleading because in fact fallback allocation is not the only reason why these two can differ. It also doesn't direcly provide the page's migratetype, although it's possible to derive that from the gfp_flags. It's arguably better to print both page and pageblock's migratetype and leave the interpretation to the consumer than to suggest fallback allocation as the only possible reason. While at it, we can print the migratetypes as string the same way as /proc/pagetypeinfo does, as some of the numeric values depend on kernel configuration. For that, this patch moves the migratetype_names array from #ifdef CONFIG_PROC_FS part of mm/vmstat.c to mm/page_alloc.c and exports it. With the new format strings for flags, we can now also provide symbolic page and gfp flags in the /sys/kernel/debug/page_owner file. This replaces the positional printing of page flags as single letters, which might have looked nicer, but was limited to a subset of flags, and required the user to remember the letters. Example page_owner entry after the patch: Page allocated via order 0, mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY) PFN 520 type Movable Block 1 type Movable Flags 0xfffff8001006c(referenced|uptodate|lru|active|mappedtodisk) [<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230 [<ffffffff811b4058>] alloc_pages_current+0x88/0x120 [<ffffffff8115e386>] __page_cache_alloc+0xe6/0x120 [<ffffffff8116ba6c>] __do_page_cache_readahead+0xdc/0x240 [<ffffffff8116bd05>] ondemand_readahead+0x135/0x260 [<ffffffff8116bfb1>] page_cache_sync_readahead+0x31/0x50 [<ffffffff81160523>] generic_file_read_iter+0x453/0x760 [<ffffffff811e0d57>] __vfs_read+0xa7/0xd0 Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vlastimil Babka 提交于
It would be useful to translate gfp_flags into string representation when printing in case of an allocation failure, especially as the flags have been undergoing some changes recently and the script ./scripts/gfp-translate needs a matching source version to be accurate. Example output: stapio: page allocation failure: order:9, mode:0x2080020(GFP_ATOMIC) Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christian Borntraeger 提交于
Since commit 031bc574 ("mm/debug-pagealloc: make debug-pagealloc boottime configurable") CONFIG_DEBUG_PAGEALLOC is by default not adding any page debugging. This resulted in several unnoticed bugs, e.g. https://lkml.kernel.org/g/<569F5E29.3090107@de.ibm.com> or https://lkml.kernel.org/g/<56A20F30.4050705@de.ibm.com> as this behaviour change was not even documented in Kconfig. Let's provide a new Kconfig symbol that allows to change the default back to enabled, e.g. for debug kernels. This also makes the change obvious to kernel packagers. Let's also change the Kconfig description for CONFIG_DEBUG_PAGEALLOC, to indicate that there are two stages of overhead. Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
This function is getting full of weird tricks to avoid word-wrapping. Use a goto to eliminate a tab stop then use the new space Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Taku Izumi 提交于
This patch extends existing "kernelcore" option and introduces kernelcore=mirror option. By specifying "mirror" instead of specifying the amount of memory, non-mirrored (non-reliable) region will be arranged into ZONE_MOVABLE. [akpm@linux-foundation.org: fix build with CONFIG_HAVE_MEMBLOCK_NODE_MAP=n] Signed-off-by: NTaku Izumi <izumi.taku@jp.fujitsu.com> Tested-by: NSudeep Holla <sudeep.holla@arm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Steve Capper <steve.capper@linaro.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Taku Izumi 提交于
Xeon E7 v3 based systems supports Address Range Mirroring and UEFI BIOS complied with UEFI spec 2.5 can notify which ranges are mirrored (reliable) via EFI memory map. Now Linux kernel utilize its information and allocates boot time memory from reliable region. My requirement is: - allocate kernel memory from mirrored region - allocate user memory from non-mirrored region In order to meet my requirement, ZONE_MOVABLE is useful. By arranging non-mirrored range into ZONE_MOVABLE, mirrored memory is used for kernel allocations. My idea is to extend existing "kernelcore" option and introduces kernelcore=mirror option. By specifying "mirror" instead of specifying the amount of memory, non-mirrored region will be arranged into ZONE_MOVABLE. Earlier discussions are at: https://lkml.org/lkml/2015/10/9/24 https://lkml.org/lkml/2015/10/15/9 https://lkml.org/lkml/2015/11/27/18 https://lkml.org/lkml/2015/12/8/836 For example, suppose 2-nodes system with the following memory range: node 0 [mem 0x0000000000001000-0x000000109fffffff] node 1 [mem 0x00000010a0000000-0x000000209fffffff] and the following ranges are marked as reliable (mirrored): [0x0000000000000000-0x0000000100000000] [0x0000000100000000-0x0000000180000000] [0x0000000800000000-0x0000000880000000] [0x00000010a0000000-0x0000001120000000] [0x00000017a0000000-0x0000001820000000] If you specify kernelcore=mirror, ZONE_NORMAL and ZONE_MOVABLE are arranged like bellow: - node 0: ZONE_NORMAL : [0x0000000100000000-0x00000010a0000000] ZONE_MOVABLE: [0x0000000180000000-0x00000010a0000000] - node 1: ZONE_NORMAL : [0x00000010a0000000-0x00000020a0000000] ZONE_MOVABLE: [0x0000001120000000-0x00000020a0000000] In overlapped range, pages to be ZONE_MOVABLE in ZONE_NORMAL are treated as absent pages, and vice versa. This patch (of 2): Currently each zone's zone_start_pfn is calculated at free_area_init_core(). However zone's range is fixed at the time when invoking zone_spanned_pages_in_node(). This patch changes how each zone->zone_start_pfn is calculated in zone_spanned_pages_in_node(). Signed-off-by: NTaku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Xishi Qiu <qiuxishi@huawei.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Steve Capper <steve.capper@linaro.org> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 2月, 2016 1 次提交
-
-
由 Vlastimil Babka 提交于
Commit 944d9fec ("hugetlb: add support for gigantic page allocation at runtime") has added the runtime gigantic page allocation via alloc_contig_range(), making this support available only when CONFIG_CMA is enabled. Because it doesn't depend on MIGRATE_CMA pageblocks and the associated infrastructure, it is possible with few simple adjustments to require only CONFIG_MEMORY_ISOLATION instead of full CONFIG_CMA. After this patch, alloc_contig_range() and related functions are available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION enabled. Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION. This allows supporting runtime gigantic pages without the CMA-specific checks in page allocator fastpaths. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 2月, 2016 1 次提交
-
-
由 Kirill A. Shutemov 提交于
Andrea Arcangeli suggested to make split queue per-node to improve scalability. Let's do it. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Suggested-by: NAndrea Arcangeli <aarcange@redhat.com> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 1月, 2016 5 次提交
-
-
由 Alexander Kuleshov 提交于
Remove unused struct zone *z variable which appeared in 86051ca5 ("mm: fix usemap initialization"). Signed-off-by: NAlexander Kuleshov <kuleshovmail@gmail.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dan Williams 提交于
In support of providing struct page for large persistent memory capacities, use struct vmem_altmap to change the default policy for allocating memory for the memmap array. The default vmemmap_populate() allocates page table storage area from the page allocator. Given persistent memory capacities relative to DRAM it may not be feasible to store the memmap in 'System Memory'. Instead vmem_altmap represents pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf() requests. Signed-off-by: NDan Williams <dan.j.williams@intel.com> Reported-by: Nkbuild test robot <lkp@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Currently we don't split huge page on partial unmap. It's not an ideal situation. It can lead to memory overhead. Furtunately, we can detect partial unmap on page_remove_rmap(). But we cannot call split_huge_page() from there due to locking context. It's also counterproductive to do directly from munmap() codepath: in many cases we will hit this from exit(2) and splitting the huge page just to free it up in small pages is not what we really want. The patch introduce deferred_split_huge_page() which put the huge page into queue for splitting. The splitting itself will happen when we get memory pressure via shrinker interface. The page will be dropped from list on freeing through compound page destructor. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: NSasha Levin <sasha.levin@oracle.com> Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NJerome Marchand <jmarchan@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
We're going to allow mapping of individual 4k pages of THP compound. It means we need to track mapcount on per small page basis. Straight-forward approach is to use ->_mapcount in all subpages to track how many time this subpage is mapped with PMDs or PTEs combined. But this is rather expensive: mapping or unmapping of a THP page with PMD would require HPAGE_PMD_NR atomic operations instead of single we have now. The idea is to store separately how many times the page was mapped as whole -- compound_mapcount. This frees up ->_mapcount in subpages to track PTE mapcount. We use the same approach as with compound page destructor and compound order to store compound_mapcount: use space in first tail page, ->mapping this time. Any time we map/unmap whole compound page (THP or hugetlb) -- we increment/decrement compound_mapcount. When we map part of compound page with PTE we operate on ->_mapcount of the subpage. page_mapcount() counts both: PTE and PMD mappings of the page. Basically, we have mapcount for a subpage spread over two counters. It makes tricky to detect when last mapcount for a page goes away. We introduced PageDoubleMap() for this. When we split THP PMD for the first time and there's other PMD mapping left we offset up ->_mapcount in all subpages by one and set PG_double_map on the compound page. These additional references go away with last compound_mapcount. This approach provides a way to detect when last mapcount goes away on per small page basis without introducing new overhead for most common cases. [akpm@linux-foundation.org: fix typo in comment] [mhocko@suse.com: ignore partial THP when moving task] Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: NJerome Marchand <jmarchan@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
We don't define meaning of page->mapping for tail pages. Currently it's always NULL, which can be inconsistent with head page and potentially lead to problems. Let's poison the pointer to catch all illigal uses. page_rmapping(), page_mapping() and page_anon_vma() are changed to look on head page. The only illegal use I've caught so far is __GPF_COMP pages from sound subsystem, mapped with PTEs. do_shared_fault() is changed to use page_rmapping() instead of direct access to fault_page->mapping. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NJérôme Glisse <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Steve Capper <steve.capper@linaro.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Jerome Marchand <jmarchan@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 1月, 2016 10 次提交
-
-
由 Michal Hocko 提交于
__GFP_NOFAIL is a big hammer used to ensure that the allocation request can never fail. This is a strong requirement and as such it also deserves a special treatment when the system is OOM. The primary problem here is that the allocation request might have come with some locks held and the oom victim might be blocked on the same locks. This is basically an OOM deadlock situation. This patch tries to reduce the risk of such a deadlocks by giving __GFP_NOFAIL allocations a special treatment and let them dive into memory reserves after oom killer invocation. This should help them to make a progress and release resources they are holding. The OOM victim should compensate for the reserves consumption. Signed-off-by: NMichal Hocko <mhocko@suse.com> Suggested-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Geliang Tang 提交于
Use list_for_each_entry instead of list_for_each + list_entry to simplify the code. Signed-off-by: NGeliang Tang <geliangtang@163.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Geliang Tang 提交于
To make the intention clearer, use list_{first,last}_entry instead of list_entry. Signed-off-by: NGeliang Tang <geliangtang@163.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Commit 0aaa29a5 ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand") added an unnecessary and unused parameter to __rmqueue. It was a parameter that was used in an earlier version of the patch and then left behind. This patch cleans it up. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The dirty balance reserve that dirty throttling has to consider is merely memory not available to userspace allocations. There is nothing writeback-specific about it. Generalize the name so that it's reusable outside of that context. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
__alloc_pages_slowpath is looping over ALLOC_NO_WATERMARKS requests if __GFP_NOFAIL is requested. This is fragile because we are basically relying on somebody else to make the reclaim (be it the direct reclaim or OOM killer) for us. The caller might be holding resources (e.g. locks) which block other other reclaimers from making any progress for example. Remove the retry loop and rely on __alloc_pages_slowpath to invoke all allowed reclaim steps and retry logic. We have to be careful about __GFP_NOFAIL allocations from the PF_MEMALLOC context even though this is a very bad idea to begin with because no progress can be gurateed at all. We shouldn't break the __GFP_NOFAIL semantic here though. It could be argued that this is essentially GFP_NOWAIT context which we do not support but PF_MEMALLOC is much harder to check for existing users because they might happen deep down the code path performed much later after setting the flag so we cannot really rule out there is no kernel path triggering this combination. Signed-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
__alloc_pages_high_priority doesn't do anything special other than it calls get_page_from_freelist and loops around GFP_NOFAIL allocation until it succeeds. It would be better if the first part was done in __alloc_pages_slowpath where we modify the zonelist because this would be easier to read and understand. Opencoding the function into its only caller allows to simplify it a bit as well. This patch doesn't introduce any functional changes. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yaowei Bai 提交于
Hardcoding index to zonelists array in gfp_zonelist() is not a good idea, let's enumerate it to improve readability. No functional change. [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix CONFIG_NUMA=n build] [n-horiguchi@ah.jp.nec.com: fix warning in comparing enumerator] Signed-off-by: NYaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Joonsoo Kim 提交于
Now, we have tracepoint in test_pages_isolated() to notify pfn which cannot be isolated. But, in alloc_contig_range(), some error path doesn't call test_pages_isolated() so it's still hard to know exact pfn that causes allocation failure. This patch change this situation by calling test_pages_isolated() in almost error path. In allocation failure case, some overhead is added by this change, but, allocation failure is really rare event so it would not matter. In fatal signal pending case, we don't call test_pages_isolated() because this failure is intentional one. There was a bogus outer_start problem due to unchecked buddy order and this patch also fix it. Before this patch, it didn't matter, because end result is same thing. But, after this patch, tracepoint will report failed pfn so it should be accurate. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Nazarewicz <mina86@mina86.com> Cc: David Rientjes <rientjes@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be fragile and difficult to maintain, because there seem to be many more allocations that should not be accounted than those that should be. Besides, false accounting an allocation might result in much worse consequences than not accounting at all, namely increased memory consumption due to pinned dead kmem caches. So this patch switches kmem accounting to the white-policy: now only those kmem allocations that are marked as __GFP_ACCOUNT are accounted to memcg. Currently, no kmem allocations are marked like this. The following patches will mark several kmem allocations that are known to be easily triggered from userspace and therefore should be accounted to memcg. Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 12月, 2015 1 次提交
-
-
由 Vlastimil Babka 提交于
Commit 016c13da ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types") has swapped MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE in the enum definition. However, migratetype_names wasn't updated to reflect that. As a result, the file /proc/pagetypeinfo shows the counts for Movable as Reclaimable and vice versa. Additionally, commit 0aaa29a5 ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand") introduced MIGRATE_HIGHATOMIC, but did not add a letter to distinguish it into show_migration_types(), so it doesn't appear in the listing of free areas during page alloc failures or oom kills. This patch fixes both problems. The atomic reserves will show with a letter 'H' in the free areas listings. Fixes: 016c13da ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types") Fixes: 0aaa29a5 ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand") Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 11月, 2015 1 次提交
-
-
由 Tony Luck 提交于
In commit a1c34a3b ("mm: Don't offset memmap for flatmem") Laura fixed a problem for Srinivas relating to the bottom 2MB of RAM on an ARM IFC6410 board. One small wrinkle on ia64 is that it allocates the node_mem_map earlier in arch code, so it skips the block of code where "offset" is initialized. Move initialization of start and offset before the check for the node_mem_map so that they will always be available in the latter part of the function. Tested-by: NLaura Abbott <laura@labbott.name> Fixes: a1c34a3b (mm: Don't offset memmap for flatmem) Signed-off-by: NTony Luck <tony.luck@intel.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 11月, 2015 5 次提交
-
-
由 Kirill A. Shutemov 提交于
Let's try to be consistent about data type of page order. [sfr@canb.auug.org.au: fix build (type of pageblock_order)] [hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types] Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Hugh has pointed that compound_head() call can be unsafe in some context. There's one example: CPU0 CPU1 isolate_migratepages_block() page_count() compound_head() !!PageTail() == true put_page() tail->first_page = NULL head = tail->first_page alloc_pages(__GFP_COMP) prep_compound_page() tail->first_page = head __SetPageTail(p); !!PageTail() == true <head == NULL dereferencing> The race is pure theoretical. I don't it's possible to trigger it in practice. But who knows. We can fix the race by changing how encode PageTail() and compound_head() within struct page to be able to update them in one shot. The patch introduces page->compound_head into third double word block in front of compound_dtor and compound_order. Bit 0 encodes PageTail() and the rest bits are pointer to head page if bit zero is set. The patch moves page->pmd_huge_pte out of word, just in case if an architecture defines pgtable_t into something what can have the bit 0 set. hugetlb_cgroup uses page->lru.next in the second tail page to store pointer struct hugetlb_cgroup. The patch switch it to use page->private in the second tail page instead. The space is free since ->first_page is removed from the union. The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER limitation, since there's now space in first tail page to store struct hugetlb_cgroup pointer. But that's out of scope of the patch. That means page->compound_head shares storage space with: - page->lru.next; - page->next; - page->rcu_head.next; That's too long list to be absolutely sure, but looks like nobody uses bit 0 of the word. page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future call_rcu_lazy() is not allowed as it makes use of the bit and we can get false positive PageTail(). [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
The patch halves space occupied by compound_dtor and compound_order in struct page. For compound_order, it's trivial long -> short conversion. For get_compound_page_dtor(), we now use hardcoded table for destructor lookup and store its index in the struct page instead of direct pointer to destructor. It shouldn't be a big trouble to maintain the table: we have only two destructor and NULL currently. This patch free up one word in tail pages for reuse. This is preparation for the next patch. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NMichal Hocko <mhocko@suse.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
The primary purpose of watermarks is to ensure that reclaim can always make forward progress in PF_MEMALLOC context (kswapd and direct reclaim). These assume that order-0 allocations are all that is necessary for forward progress. High-order watermarks serve a different purpose. Kswapd had no high-order awareness before they were introduced (https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au). This was particularly important when there were high-order atomic requests. The watermarks both gave kswapd awareness and made a reserve for those atomic requests. There are two important side-effects of this. The most important is that a non-atomic high-order request can fail even though free pages are available and the order-0 watermarks are ok. The second is that high-order watermark checks are expensive as the free list counts up to the requested order must be examined. With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to have high-order watermarks. Kswapd and compaction still need high-order awareness which is handled by checking that at least one suitable high-order page is free. With the patch applied, there was little difference in the allocation failure rates as the atomic reserves are small relative to the number of allocation attempts. The expected impact is that there will never be an allocation failure report that shows suitable pages on the free lists. The one potential side-effect of this is that in a vanilla kernel, the watermark checks may have kept a free page for an atomic allocation. Now, we are 100% relying on the HighAtomic reserves and an early allocation to have allocated them. If the first high-order atomic allocation is after the system is already heavily fragmented then it'll fail. [akpm@linux-foundation.org: simplify __zone_watermark_ok(), per Vlastimil] Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
High-order watermark checking exists for two reasons -- kswapd high-order awareness and protection for high-order atomic requests. Historically the kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order free pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic allocations on demand and avoids using those blocks for order-0 allocations. This is more flexible and reliable than MIGRATE_RESERVE was. A MIGRATE_HIGHORDER pageblock is created when an atomic high-order allocation request steals a pageblock but limits the total number to 1% of the zone. Callers that speculatively abuse atomic allocations for long-lived high-order allocations to access the reserve will quickly fail. Note that SLUB is currently not such an abuser as it reclaims at least once. It is possible that the pageblock stolen has few suitable high-order pages and will need to steal again in the near future but there would need to be strong justification to search all pageblocks for an ideal candidate. The pageblocks are unreserved if an allocation fails after a direct reclaim attempt. The watermark checks account for the reserved pageblocks when the allocation request is not a high-order atomic allocation. The reserved pageblocks can not be used for order-0 allocations. This may allow temporary wastage until a failed reclaim reassigns the pageblock. This is deliberate as the intent of the reservation is to satisfy a limited number of atomic high-order short-lived requests if the system requires them. The stutter benchmark was used to evaluate this but while it was running there was a systemtap script that randomly allocated between 1 high-order page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This is much larger than the potential reserve and it does not attempt to be realistic. It is intended to stress random high-order allocations from an unknown source, show that there is a reduction in failures without introducing an anomaly where atomic allocations are more reliable than regular allocations. The amount of memory reserved varied throughout the workload as reserves were created and reclaimed under memory pressure. The allocation failures once the workload warmed up were as follows; 4.2-rc5-vanilla 70% 4.2-rc5-atomic-reserve 56% The failure rate was also measured while building multiple kernels. The failure rate was 14% but is 6% with this patch applied. Overall, this is a small reduction but the reserves are small relative to the number of allocation requests. In early versions of the patch, the failure rate reduced by a much larger amount but that required much larger reserves and perversely made atomic allocations seem more reliable than regular allocations. [yalin.wang2010@gmail.com: fix redundant check and a memory leak] Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vitaly Wool <vitalywool@gmail.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Nyalin wang <yalin.wang2010@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-