- 12 1月, 2013 2 次提交
-
-
由 Mel Gorman 提交于
Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when waiting for POLLIN on a local TCP socket. It was easier to trigger if there was disk IO and dirty pages at the same time and he bisected it to commit 1fb3f8ca ("mm: compaction: capture a suitable high-order page immediately when it is made available"). The intention of that patch was to improve high-order allocations under memory pressure after changes made to reclaim in 3.6 drastically hurt THP allocations but the approach was flawed. For Eric, the problem was that page->pfmemalloc was not being cleared for captured pages leading to a poor interaction with swap-over-NFS support causing the packets to be dropped. However, I identified a few more problems with the patch including the fact that it can increase contention on zone->lock in some cases which could result in async direct compaction being aborted early. In retrospect the capture patch took the wrong approach. What it should have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it was allocating for THP and avoided races that way. While the patch was showing to improve allocation success rates at the time, the benefit is marginal given the relative complexity and it should be revisited from scratch in the context of the other reclaim-related changes that have taken place since the patch was first written and tested. This patch partially reverts commit 1fb3f8ca ("mm: compaction: capture a suitable high-order page immediately when it is made available"). Reported-and-tested-by: NEric Wong <normalperson@yhbt.net> Tested-by: NEric Dumazet <eric.dumazet@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Eric Wong reported on 3.7 and 3.8-rc2 that ppoll() got stuck when waiting for POLLIN on a local TCP socket. It was easier to trigger if there was disk IO and dirty pages at the same time and he bisected it to commit 1fb3f8ca ("mm: compaction: capture a suitable high-order page immediately when it is made available"). The intention of that patch was to improve high-order allocations under memory pressure after changes made to reclaim in 3.6 drastically hurt THP allocations but the approach was flawed. For Eric, the problem was that page->pfmemalloc was not being cleared for captured pages leading to a poor interaction with swap-over-NFS support causing the packets to be dropped. However, I identified a few more problems with the patch including the fact that it can increase contention on zone->lock in some cases which could result in async direct compaction being aborted early. In retrospect the capture patch took the wrong approach. What it should have done is mark the pageblock being migrated as MIGRATE_ISOLATE if it was allocating for THP and avoided races that way. While the patch was showing to improve allocation success rates at the time, the benefit is marginal given the relative complexity and it should be revisited from scratch in the context of the other reclaim-related changes that have taken place since the patch was first written and tested. This patch partially reverts commit 1fb3f8ca "mm: compaction: capture a suitable high-order page immediately when it is made available". Reported-and-tested-by: NEric Wong <normalperson@yhbt.net> Tested-by: NEric Dumazet <eric.dumazet@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 12月, 2012 1 次提交
-
-
由 Bob Liu 提交于
Several place need to find the pmd by(mm_struct, address), so introduce a function to simplify it. [akpm@linux-foundation.org: fix warning] Signed-off-by: NBob Liu <lliubbo@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Ni zhan Chen <nizhan.chen@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 12月, 2012 1 次提交
-
-
由 Mel Gorman 提交于
Note: This is very heavily based on a patch from Peter Zijlstra with fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch put a lot of migration logic into mm/huge_memory.c where it does not belong. This version puts tries to share some of the migration logic with migrate_misplaced_page. However, it should be noted that now migrate.c is doing more with the pagetable manipulation than is preferred. The end result is barely recognisable so as before, the signed-offs had to be removed but will be re-added if the original authors are ok with it. Add THP migration for the NUMA working set scanning fault case. It uses the page lock to serialize. No migration pte dance is necessary because the pte is already unmapped when we decide to migrate. [dhillf@gmail.com: Fix memory leak on isolation failure] [dhillf@gmail.com: Fix transfer of last_nid information] Signed-off-by: NMel Gorman <mgorman@suse.de>
-
- 09 10月, 2012 12 次提交
-
-
由 David Rientjes 提交于
NR_MLOCK is only accounted in single page units: there's no logic to handle transparent hugepages. This patch checks the appropriate number of pages to adjust the statistics by so that the correct amount of memory is reflected. Currently: $ grep Mlocked /proc/meminfo Mlocked: 19636 kB #define MAP_SIZE (4 << 30) /* 4GB */ void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); mlock(ptr, MAP_SIZE); $ grep Mlocked /proc/meminfo Mlocked: 29844 kB munlock(ptr, MAP_SIZE); $ grep Mlocked /proc/meminfo Mlocked: 19636 kB And with this patch: $ grep Mlock /proc/meminfo Mlocked: 19636 kB mlock(ptr, MAP_SIZE); $ grep Mlock /proc/meminfo Mlocked: 4213664 kB munlock(ptr, MAP_SIZE); $ grep Mlock /proc/meminfo Mlocked: 19636 kB Signed-off-by: NDavid Rientjes <rientjes@google.com> Reported-by: NHugh Dickens <hughd@google.com> Acked-by: NHugh Dickins <hughd@google.com> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: NMichel Lespinasse <walken@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
Presently CMA cannot migrate mlocked pages so it ends up failing to allocate contiguous memory space. This patch makes mlocked pages be migrated out. Of course, it can affect realtime processes but in CMA usecase, contiguous memory allocation failing is far worse than access latency to an mlocked page being variable while CMA is running. If someone wants to make the system realtime, he shouldn't enable CMA because stalls can still happen at random times. [akpm@linux-foundation.org: tweak comment text, per Mel] Signed-off-by: NMinchan Kim <minchan@kernel.org> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
We had thought that pages could no longer get freed while still marked as mlocked; but Johannes Weiner posted this program to demonstrate that truncating an mlocked private file mapping containing COWed pages is still mishandled: #include <sys/types.h> #include <sys/mman.h> #include <sys/stat.h> #include <stdlib.h> #include <unistd.h> #include <fcntl.h> #include <stdio.h> int main(void) { char *map; int fd; system("grep mlockfreed /proc/vmstat"); fd = open("chigurh", O_CREAT|O_EXCL|O_RDWR); unlink("chigurh"); ftruncate(fd, 4096); map = mmap(NULL, 4096, PROT_WRITE, MAP_PRIVATE, fd, 0); map[0] = 11; mlock(map, sizeof(fd)); ftruncate(fd, 0); close(fd); munlock(map, sizeof(fd)); munmap(map, 4096); system("grep mlockfreed /proc/vmstat"); return 0; } The anon COWed pages are not caught by truncation's clear_page_mlock() of the pagecache pages; but unmap_mapping_range() unmaps them, so we ought to look out for them there in page_remove_rmap(). Indeed, why should truncation or invalidation be doing the clear_page_mlock() when removing from pagecache? mlock is a property of mapping in userspace, not a property of pagecache: an mlocked unmapped page is nonsensical. Reported-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
page_evictable(page, vma) is an irritant: almost all its callers pass NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page) explicitly in the couple of places it's needed. But in those places we don't even need page_evictable() itself! They're dealing with a freshly allocated anonymous page, which has no "mapping" and cannot be mlocked yet. Signed-off-by: NHugh Dickins <hughd@google.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
This is almost entirely based on Rik's previous patches and discussions with him about how this might be implemented. Order > 0 compaction stops when enough free pages of the correct page order have been coalesced. When doing subsequent higher order allocations, it is possible for compaction to be invoked many times. However, the compaction code always starts out looking for things to compact at the start of the zone, and for free pages to compact things to at the end of the zone. This can cause quadratic behaviour, with isolate_freepages starting at the end of the zone each time, even though previous invocations of the compaction code already filled up all free memory on that end of the zone. This can cause isolate_freepages to take enormous amounts of CPU with certain workloads on larger memory systems. This patch caches where the migration and free scanner should start from on subsequent compaction invocations using the pageblock-skip information. When compaction starts it begins from the cached restart points and will update the cached restart points until a page is isolated or a pageblock is skipped that would have been scanned by synchronous compaction. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NRik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: NRafael Aquini <aquini@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
When compaction was implemented it was known that scanning could potentially be excessive. The ideal was that a counter be maintained for each pageblock but maintaining this information would incur a severe penalty due to a shared writable cache line. It has reached the point where the scanning costs are a serious problem, particularly on long-lived systems where a large process starts and allocates a large number of THPs at the same time. Instead of using a shared counter, this patch adds another bit to the pageblock flags called PG_migrate_skip. If a pageblock is scanned by either migrate or free scanner and 0 pages were isolated, the pageblock is marked to be skipped in the future. When scanning, this bit is checked before any scanning takes place and the block skipped if set. The main difficulty with a patch like this is "when to ignore the cached information?" If it's ignored too often, the scanning rates will still be excessive. If the information is too stale then allocations will fail that might have otherwise succeeded. In this patch o CMA always ignores the information o If the migrate and free scanner meet then the cached information will be discarded if it's at least 5 seconds since the last time the cache was discarded o If there are a large number of allocation failures, discard the cache. The time-based heuristic is very clumsy but there are few choices for a better event. Depending solely on multiple allocation failures still allows excessive scanning when THP allocations are failing in quick succession due to memory pressure. Waiting until memory pressure is relieved would cause compaction to continually fail instead of using reclaim/compaction to try allocate the page. The time-based mechanism is clumsy but a better option is not obvious. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NRik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: NRafael Aquini <aquini@redhat.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
This reverts commit 7db8889a ("mm: have order > 0 compaction start off where it left") and commit de74f1cc ("mm: have order > 0 compaction start near a pageblock with free pages"). These patches were a good idea and tests confirmed that they massively reduced the amount of scanning but the implementation is complex and tricky to understand. A later patch will cache what pageblocks should be skipped and reimplements the concept of compact_cached_free_pfn on top for both migration and free scanners. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NRik van Riel <riel@redhat.com> Cc: Richard Davies <richard@arachsys.com> Cc: Shaohua Li <shli@kernel.org> Cc: Avi Kivity <avi@redhat.com> Acked-by: NRafael Aquini <aquini@redhat.com> Acked-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shaohua Li 提交于
isolate_migratepages_range() might isolate no pages if for example when zone->lru_lock is contended and running asynchronous compaction. In this case, we should abort compaction, otherwise, compact_zone will run a useless loop and make zone->lru_lock is even contended. An additional check is added to ensure that cc.migratepages and cc.freepages get properly drained whan compaction is aborted. [minchan@kernel.org: Putback pages isolated for migration if aborting] [akpm@linux-foundation.org: compact_zone_order requires non-NULL arg contended] [akpm@linux-foundation.org: make compact_zone_order() require non-NULL arg `contended'] [minchan@kernel.org: Putback pages isolated for migration if aborting] Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NShaohua Li <shli@fusionio.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
* Add ALLOC_CMA alloc flag and pass it to [__]zone_watermark_ok() (from Minchan Kim). * During watermark check decrease available free pages number by free CMA pages number if necessary (unmovable allocations cannot use pages from CMA areas). Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
Drop clean cache pages instead of migration during alloc_contig_range() to minimise allocation latency by reducing the amount of migration that is necessary. It's useful for CMA because latency of migration is more important than evicting the background process's working set. In addition, as pages are reclaimed then fewer free pages for migration targets are required so it avoids memory reclaiming to get free pages, which is a contributory factor to increased latency. I measured elapsed time of __alloc_contig_migrate_range() which migrates 10M in 40M movable zone in QEMU machine. Before - 146ms, After - 7ms [akpm@linux-foundation.org: fix nommu build] Signed-off-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NMinchan Kim <minchan@kernel.org> Reviewed-by: NMel Gorman <mgorman@suse.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Acked-by: NMichal Nazarewicz <mina86@mina86.com> Cc: Rik van Riel <riel@redhat.com> Tested-by: NKyungmin Park <kyungmin.park@samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michel Lespinasse 提交于
Make sure the #endif that terminates the standard #ifndef / #define / #endif construct gets labeled, and gets positioned at the end of the file as is normally the case. Signed-off-by: NMichel Lespinasse <walken@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
While compaction is migrating pages to free up large contiguous blocks for allocation it races with other allocation requests that may steal these blocks or break them up. This patch alters direct compaction to capture a suitable free page as soon as it becomes available to reduce this race. It uses similar logic to split_free_page() to ensure that watermarks are still obeyed. Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 8月, 2012 1 次提交
-
-
由 Mel Gorman 提交于
Jim Schutt reported a problem that pointed at compaction contending heavily on locks. The workload is straight-forward and in his own words; The systems in question have 24 SAS drives spread across 3 HBAs, running 24 Ceph OSD instances, one per drive. FWIW these servers are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160 Ceph Linux clients doing dd simultaneously to a Ceph file system backed by 12 of these servers. Early in the test everything looks fine procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu------- r b swpd free buff cache si so bi bo in cs us sy id wa st 31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0 27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0 28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0 6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0 22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0 and then it goes to pot procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu------- r b swpd free buff cache si so bi bo in cs us sy id wa st 163 8 0 464308 576 36791368 0 0 11 22210 866 536 3 13 79 4 0 207 14 0 917752 576 36181928 0 0 712 1345376 134598 47367 7 90 1 2 0 123 12 0 685516 576 36296148 0 0 429 1386615 158494 60077 8 84 5 3 0 123 12 0 598572 576 36333728 0 0 1107 1233281 147542 62351 7 84 5 4 0 622 7 0 660768 576 36118264 0 0 557 1345548 151394 59353 7 85 4 3 0 223 11 0 283960 576 36463868 0 0 46 1107160 121846 33006 6 93 1 1 0 Note that system CPU usage is very high blocks being written out has dropped by 42%. He analysed this with perf and found perf record -g -a sleep 10 perf report --sort symbol --call-graph fractal,5 34.63% [k] _raw_spin_lock_irqsave | |--97.30%-- isolate_freepages | compaction_alloc | unmap_and_move | migrate_pages | compact_zone | compact_zone_order | try_to_compact_pages | __alloc_pages_direct_compact | __alloc_pages_slowpath | __alloc_pages_nodemask | alloc_pages_vma | do_huge_pmd_anonymous_page | handle_mm_fault | do_page_fault | page_fault | | | |--87.39%-- skb_copy_datagram_iovec | | tcp_recvmsg | | inet_recvmsg | | sock_recvmsg | | sys_recvfrom | | system_call | | __recv | | | | | --100.00%-- (nil) | | | --12.61%-- memcpy --2.70%-- [...] There was other data but primarily it is all showing that compaction is contended heavily on the zone->lock and zone->lru_lock. commit [b2eef8c0: mm: compaction: minimise the time IRQs are disabled while isolating pages for migration] noted that it was possible for migration to hold the lru_lock for an excessive amount of time. Very broadly speaking this patch expands the concept. This patch introduces compact_checklock_irqsave() to check if a lock is contended or the process needs to be scheduled. If either condition is true then async compaction is aborted and the caller is informed. The page allocator will fail a THP allocation if compaction failed due to contention. This patch also introduces compact_trylock_irqsave() which will acquire the lock only if it is not contended and the process does not need to schedule. Reported-by: NJim Schutt <jaschut@sandia.gov> Tested-by: NJim Schutt <jaschut@sandia.gov> Signed-off-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 8月, 2012 4 次提交
-
-
由 Mel Gorman 提交于
Change the skb allocation API to indicate RX usage and use this to fall back to the PFMEMALLOC reserve when needed. SKBs allocated from the reserve are tagged in skb->pfmemalloc. If an SKB is allocated from the reserve and the socket is later found to be unrelated to page reclaim, the packet is dropped so that the memory remains available for page reclaim. Network protocols are expected to recover from this packet loss. [a.p.zijlstra@chello.nl: Ideas taken from various patches] [davem@davemloft.net: Use static branches, coding style corrections] [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build] Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NDavid S. Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
When a user or administrator requires swap for their application, they create a swap partition and file, format it with mkswap and activate it with swapon. Swap over the network is considered as an option in diskless systems. The two likely scenarios are when blade servers are used as part of a cluster where the form factor or maintenance costs do not allow the use of disks and thin clients. The Linux Terminal Server Project recommends the use of the Network Block Device (NBD) for swap according to the manual at https://sourceforge.net/projects/ltsp/files/Docs-Admin-Guide/LTSPManual.pdf/download There is also documentation and tutorials on how to setup swap over NBD at places like https://help.ubuntu.com/community/UbuntuLTSP/EnableNBDSWAP The nbd-client also documents the use of NBD as swap. Despite this, the fact is that a machine using NBD for swap can deadlock within minutes if swap is used intensively. This patch series addresses the problem. The core issue is that network block devices do not use mempools like normal block devices do. As the host cannot control where they receive packets from, they cannot reliably work out in advance how much memory they might need. Some years ago, Peter Zijlstra developed a series of patches that supported swap over an NFS that at least one distribution is carrying within their kernels. This patch series borrows very heavily from Peter's work to support swapping over NBD as a pre-requisite to supporting swap-over-NFS. The bulk of the complexity is concerned with preserving memory that is allocated from the PFMEMALLOC reserves for use by the network layer which is needed for both NBD and NFS. Patch 1 adds knowledge of the PFMEMALLOC reserves to SLAB and SLUB to preserve access to pages allocated under low memory situations to callers that are freeing memory. Patch 2 optimises the SLUB fast path to avoid pfmemalloc checks Patch 3 introduces __GFP_MEMALLOC to allow access to the PFMEMALLOC reserves without setting PFMEMALLOC. Patch 4 opens the possibility for softirqs to use PFMEMALLOC reserves for later use by network packet processing. Patch 5 only sets page->pfmemalloc when ALLOC_NO_WATERMARKS was required Patch 6 ignores memory policies when ALLOC_NO_WATERMARKS is set. Patches 7-12 allows network processing to use PFMEMALLOC reserves when the socket has been marked as being used by the VM to clean pages. If packets are received and stored in pages that were allocated under low-memory situations and are unrelated to the VM, the packets are dropped. Patch 11 reintroduces __skb_alloc_page which the networking folk may object to but is needed in some cases to propogate pfmemalloc from a newly allocated page to an skb. If there is a strong objection, this patch can be dropped with the impact being that swap-over-network will be slower in some cases but it should not fail. Patch 13 is a micro-optimisation to avoid a function call in the common case. Patch 14 tags NBD sockets as being SOCK_MEMALLOC so they can use PFMEMALLOC if necessary. Patch 15 notes that it is still possible for the PFMEMALLOC reserve to be depleted. To prevent this, direct reclaimers get throttled on a waitqueue if 50% of the PFMEMALLOC reserves are depleted. It is expected that kswapd and the direct reclaimers already running will clean enough pages for the low watermark to be reached and the throttled processes are woken up. Patch 16 adds a statistic to track how often processes get throttled Some basic performance testing was run using kernel builds, netperf on loopback for UDP and TCP, hackbench (pipes and sockets), iozone and sysbench. Each of them were expected to use the sl*b allocators reasonably heavily but there did not appear to be significant performance variances. For testing swap-over-NBD, a machine was booted with 2G of RAM with a swapfile backed by NBD. 8*NUM_CPU processes were started that create anonymous memory mappings and read them linearly in a loop. The total size of the mappings were 4*PHYSICAL_MEMORY to use swap heavily under memory pressure. Without the patches and using SLUB, the machine locks up within minutes and runs to completion with them applied. With SLAB, the story is different as an unpatched kernel run to completion. However, the patched kernel completed the test 45% faster. MICRO 3.5.0-rc2 3.5.0-rc2 vanilla swapnbd Unrecognised test vmscan-anon-mmap-write MMTests Statistics: duration Sys Time Running Test (seconds) 197.80 173.07 User+Sys Time Running Test (seconds) 206.96 182.03 Total Elapsed Time (seconds) 3240.70 1762.09 This patch: mm: sl[au]b: add knowledge of PFMEMALLOC reserve pages Allocations of pages below the min watermark run a risk of the machine hanging due to a lack of memory. To prevent this, only callers who have PF_MEMALLOC or TIF_MEMDIE set and are not processing an interrupt are allowed to allocate with ALLOC_NO_WATERMARKS. Once they are allocated to a slab though, nothing prevents other callers consuming free objects within those slabs. This patch limits access to slab pages that were alloced from the PFMEMALLOC reserves. When this patch is applied, pages allocated from below the low watermark are returned with page->pfmemalloc set and it is up to the caller to determine how the page should be protected. SLAB restricts access to any page with page->pfmemalloc set to callers which are known to able to access the PFMEMALLOC reserve. If one is not available, an attempt is made to allocate a new page rather than use a reserve. SLUB is a bit more relaxed in that it only records if the current per-CPU page was allocated from PFMEMALLOC reserve and uses another partial slab if the caller does not have the necessary GFP or process flags. This was found to be sufficient in tests to avoid hangs due to SLUB generally maintaining smaller lists than SLAB. In low-memory conditions it does mean that !PFMEMALLOC allocators can fail a slab allocation even though free objects are available because they are being preserved for callers that are freeing pages. [a.p.zijlstra@chello.nl: Original implementation] [sebastian@breakpoint.cc: Correct order of page flag clearing] Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Xishi Qiu 提交于
On architectures with CONFIG_HUGETLB_PAGE_SIZE_VARIABLE set, such as Itanium, pageblock_order is a variable with default value of 0. It's set to the right value by set_pageblock_order() in function free_area_init_core(). But pageblock_order may be used by sparse_init() before free_area_init_core() is called along path: sparse_init() ->sparse_early_usemaps_alloc_node() ->usemap_size() ->SECTION_BLOCKFLAGS_BITS ->((1UL << (PFN_SECTION_SHIFT - pageblock_order)) * NR_PAGEBLOCK_BITS) The uninitialized pageblock_size will cause memory wasting because usemap_size() returns a much bigger value then it's really needed. For example, on an Itanium platform, sparse_init() pageblock_order=0 usemap_size=24576 free_area_init_core() before pageblock_order=0, usemap_size=24576 free_area_init_core() after pageblock_order=12, usemap_size=8 That means 24K memory has been wasted for each section, so fix it by calling set_pageblock_order() from sparse_init(). Signed-off-by: NXishi Qiu <qiuxishi@huawei.com> Signed-off-by: NJiang Liu <liuj97@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Keping Chen <chenkeping@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rik van Riel 提交于
Order > 0 compaction stops when enough free pages of the correct page order have been coalesced. When doing subsequent higher order allocations, it is possible for compaction to be invoked many times. However, the compaction code always starts out looking for things to compact at the start of the zone, and for free pages to compact things to at the end of the zone. This can cause quadratic behaviour, with isolate_freepages starting at the end of the zone each time, even though previous invocations of the compaction code already filled up all free memory on that end of the zone. This can cause isolate_freepages to take enormous amounts of CPU with certain workloads on larger memory systems. The obvious solution is to have isolate_freepages remember where it left off last time, and continue at that point the next time it gets invoked for an order > 0 compaction. This could cause compaction to fail if cc->free_pfn and cc->migrate_pfn are close together initially, in that case we restart from the end of the zone and try once more. Forced full (order == -1) compactions are left alone. [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: s/laste/last/, use 80 cols] Signed-off-by: NRik van Riel <riel@redhat.com> Reported-by: NJim Schutt <jaschut@sandia.gov> Tested-by: NJim Schutt <jaschut@sandia.gov> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 6月, 2012 1 次提交
-
-
由 Linus Torvalds 提交于
This reverts commit 5ceb9ce6. That commit seems to be the cause of the mm compation list corruption issues that Dave Jones reported. The locking (or rather, absense there-of) is dubious, as is the use of the 'page' variable once it has been found to be outside the pageblock range. So revert it for now, we can re-visit this for 3.6. If we even need to: as Minchan Kim says, "The patch wasn't a bug fix and even test workload was very theoretical". Reported-and-tested-by: NDave Jones <davej@redhat.com> Acked-by: NHugh Dickins <hughd@google.com> Acked-by: NKOSAKI Motohiro <kosaki.motohiro@gmail.com> Acked-by: NMinchan Kim <minchan@kernel.org> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 6月, 2012 1 次提交
-
-
由 Al Viro 提交于
take it to mm/util.c, convert vm_mmap() to use of that one and take it to mm/util.c as well, convert both sys_mmap_pgoff() to use of vm_mmap_pgoff() Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 30 5月, 2012 2 次提交
-
-
When MIGRATE_UNMOVABLE pages are freed from MIGRATE_UNMOVABLE type pageblock (and some MIGRATE_MOVABLE pages are left in it) waiting until an allocation takes ownership of the block may take too long. The type of the pageblock remains unchanged so the pageblock cannot be used as a migration target during compaction. Fix it by: * Adding enum compact_mode (COMPACT_ASYNC_[MOVABLE,UNMOVABLE], and COMPACT_SYNC) and then converting sync field in struct compact_control to use it. * Adding nr_pageblocks_skipped field to struct compact_control and tracking how many destination pageblocks were of MIGRATE_UNMOVABLE type. If COMPACT_ASYNC_MOVABLE mode compaction ran fully in try_to_compact_pages() (COMPACT_COMPLETE) it implies that there is not a suitable page for allocation. In this case then check how if there were enough MIGRATE_UNMOVABLE pageblocks to try a second pass in COMPACT_ASYNC_UNMOVABLE mode. * Scanning the MIGRATE_UNMOVABLE pageblocks (during COMPACT_SYNC and COMPACT_ASYNC_UNMOVABLE compaction modes) and building a count based on finding PageBuddy pages, page_count(page) == 0 or PageLRU pages. If all pages within the MIGRATE_UNMOVABLE pageblock are in one of those three sets change the whole pageblock type to MIGRATE_MOVABLE. My particular test case (on a ARM EXYNOS4 device with 512 MiB, which means 131072 standard 4KiB pages in 'Normal' zone) is to: - allocate 120000 pages for kernel's usage - free every second page (60000 pages) of memory just allocated - allocate and use 60000 pages from user space - free remaining 60000 pages of kernel memory (now we have fragmented memory occupied mostly by user space pages) - try to allocate 100 order-9 (2048 KiB) pages for kernel's usage The results: - with compaction disabled I get 11 successful allocations - with compaction enabled - 14 successful allocations - with this patch I'm able to get all 100 successful allocations NOTE: If we can make kswapd aware of order-0 request during compaction, we can enhance kswapd with changing mode to COMPACT_ASYNC_FULL (COMPACT_ASYNC_MOVABLE + COMPACT_ASYNC_UNMOVABLE). Please see the following thread: http://marc.info/?l=linux-mm&m=133552069417068&w=2 [minchan@kernel.org: minor cleanups] Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: NKyungmin Park <kyungmin.park@samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ying Han 提交于
Andrew pointed out that the is_mlocked_vma() is misnamed. A function with name like that would expect bool return and no side-effects. Since it is called on the fault path for new page, rename it in this patch. Signed-off-by: NYing Han <yinghan@google.com> Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com> Reviewed-by: NMinchan Kim <minchan@kernel.org> [akpm@linux-foundation.org: s/mlock_vma_newpage/mlock_vma_newpage/, per Minchan] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 5月, 2012 1 次提交
-
-
由 Michal Nazarewicz 提交于
This commit exports some of the functions from compaction.c file outside of it adding their declaration into internal.h header file so that other mm related code can use them. This forced compaction.c to always be compiled (as opposed to being compiled only if CONFIG_COMPACTION is defined) but as to avoid introducing code that user did not ask for, part of the compaction.c is now wrapped in on #ifdef. Signed-off-by: NMichal Nazarewicz <mina86@mina86.com> Signed-off-by: NMarek Szyprowski <m.szyprowski@samsung.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: NRob Clark <rob.clark@linaro.org> Tested-by: NOhad Ben-Cohen <ohad@wizery.com> Tested-by: NBenjamin Gaignard <benjamin.gaignard@linaro.org> Tested-by: NRobert Nelson <robertcnelson@gmail.com> Tested-by: NBarry Song <Baohua.Song@csr.com>
-
- 03 11月, 2011 1 次提交
-
-
由 Andrea Arcangeli 提交于
Michel while working on the working set estimation code, noticed that calling get_page_unless_zero() on a random pfn_to_page(random_pfn) wasn't safe, if the pfn ended up being a tail page of a transparent hugepage under splitting by __split_huge_page_refcount(). He then found the problem could also theoretically materialize with page_cache_get_speculative() during the speculative radix tree lookups that uses get_page_unless_zero() in SMP if the radix tree page is freed and reallocated and get_user_pages is called on it before page_cache_get_speculative has a chance to call get_page_unless_zero(). So the best way to fix the problem is to keep page_tail->_count zero at all times. This will guarantee that get_page_unless_zero() can never succeed on any tail page. page_tail->_mapcount is guaranteed zero and is unused for all tail pages of a compound page, so we can simply account the tail page references there and transfer them to tail_page->_count in __split_huge_page_refcount() (in addition to the head_page->_mapcount). While debugging this s/_count/_mapcount/ change I also noticed get_page is called by direct-io.c on pages returned by get_user_pages. That wasn't entirely safe because the two atomic_inc in get_page weren't atomic. As opposed to other get_user_page users like secondary-MMU page fault to establish the shadow pagetables would never call any superflous get_page after get_user_page returns. It's safer to make get_page universally safe for tail pages and to use get_page_foll() within follow_page (inside get_user_pages()). get_page_foll() is safe to do the refcounting for tail pages without taking any locks because it is run within PT lock protected critical sections (PT lock for pte and page_table_lock for pmd_trans_huge). The standard get_page() as invoked by direct-io instead will now take the compound_lock but still only for tail pages. The direct-io paths are usually I/O bound and the compound_lock is per THP so very finegrined, so there's no risk of scalability issues with it. A simple direct-io benchmarks with all lockdep prove locking and spinlock debugging infrastructure enabled shows identical performance and no overhead. So it's worth it. Ideally direct-io should stop calling get_page() on pages returned by get_user_pages(). The spinlock in get_page() is already optimized away for no-THP builds but doing get_page() on tail pages returned by GUP is generally a rare operation and usually only run in I/O paths. This new refcounting on page_tail->_mapcount in addition to avoiding new RCU critical sections will also allow the working set estimation code to work without any further complexity associated to the tail page refcounting with THP. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Reported-by: NMichel Lespinasse <walken@google.com> Reviewed-by: NMichel Lespinasse <walken@google.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: <stable@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 5月, 2011 1 次提交
-
-
由 Namhyung Kim 提交于
When I was reading nommu code, I found that it handles the vma list/tree in an unusual way. IIUC, because there can be more than one identical/overrapped vmas in the list/tree, it sorts the tree more strictly and does a linear search on the tree. But it doesn't applied to the list (i.e. the list could be constructed in a different order than the tree so that we can't use the list when finding the first vma in that order). Since inserting/sorting a vma in the tree and link is done at the same time, we can easily construct both of them in the same order. And linear searching on the tree could be more costly than doing it on the list, it can be converted to use the list. Also, after the commit 297c5eee ("mm: make the vma list be doubly linked") made the list be doubly linked, there were a couple of code need to be fixed to construct the list properly. Patch 1/6 is a preparation. It maintains the list sorted same as the tree and construct doubly-linked list properly. Patch 2/6 is a simple optimization for the vma deletion. Patch 3/6 and 4/6 convert tree traversal to list traversal and the rest are simple fixes and cleanups. This patch: @vma added into @mm should be sorted by start addr, end addr and VMA struct addr in that order because we may get identical VMAs in the @mm. However this was true only for the rbtree, not for the list. This patch fixes this by remembering 'rb_prev' during the tree traversal like find_vma_prepare() does and linking the @vma via __vma_link_list(). After this patch, we can iterate the whole VMAs in correct order simply by using @mm->mmap list. [akpm@linux-foundation.org: avoid duplicating __vma_link_list()] Signed-off-by: NNamhyung Kim <namhyung@gmail.com> Acked-by: NGreg Ungerer <gerg@uclinux.org> Cc: David Howells <dhowells@redhat.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 31 3月, 2011 1 次提交
-
-
由 Lucas De Marchi 提交于
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: NLucas De Marchi <lucas.demarchi@profusion.mobi>
-
- 18 3月, 2011 1 次提交
-
-
由 Huang Ying 提交于
In most cases, get_user_pages and get_user_pages_fast should be used to pin user pages in memory. But sometimes, some special flags except FOLL_GET, FOLL_WRITE and FOLL_FORCE are needed, for example in following patch, KVM needs FOLL_HWPOISON. To support these users, __get_user_pages is exported directly. There are some symbol name conflicts in infiniband driver, fixed them too. Signed-off-by: NHuang Ying <ying.huang@intel.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Michel Lespinasse <walken@google.com> CC: Roland Dreier <roland@kernel.org> CC: Ralph Campbell <infinipath@qlogic.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
- 18 1月, 2011 1 次提交
-
-
由 Linus Torvalds 提交于
This reverts commit 744ed144. Chris Mason ended up chasing down some page allocation errors and pages stuck waiting on the IO scheduler, and was able to narrow it down to two commits: commit 744ed144 ("mm: batch activate_page() to reduce lock contention") and d8505dee ("mm: simplify code of swap.c"). This reverts the first of them. Reported-and-debugged-by: NChris Mason <chris.mason@oracle.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jens Axboe <jaxboe@fusionio.com> Cc: linux-mm <linux-mm@kvack.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Shaohua Li <shaohua.li@intel.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 1月, 2011 3 次提交
-
-
由 Shaohua Li 提交于
The zone->lru_lock is heavily contented in workload where activate_page() is frequently used. We could do batch activate_page() to reduce the lock contention. The batched pages will be added into zone list when the pool is full or page reclaim is trying to drain them. For example, in a 4 socket 64 CPU system, create a sparse file and 64 processes, processes shared map to the file. Each process read access the whole file and then exit. The process exit will do unmap_vmas() and cause a lot of activate_page() call. In such workload, we saw about 58% total time reduction with below patch. Other workloads with a lot of activate_page also benefits a lot too. I tested some microbenchmarks: case-anon-cow-rand-mt 0.58% case-anon-cow-rand -3.30% case-anon-cow-seq-mt -0.51% case-anon-cow-seq -5.68% case-anon-r-rand-mt 0.23% case-anon-r-rand 0.81% case-anon-r-seq-mt -0.71% case-anon-r-seq -1.99% case-anon-rx-rand-mt 2.11% case-anon-rx-seq-mt 3.46% case-anon-w-rand-mt -0.03% case-anon-w-rand -0.50% case-anon-w-seq-mt -1.08% case-anon-w-seq -0.12% case-anon-wx-rand-mt -5.02% case-anon-wx-seq-mt -1.43% case-fork 1.65% case-fork-sleep -0.07% case-fork-withmem 1.39% case-hugetlb -0.59% case-lru-file-mmap-read-mt -0.54% case-lru-file-mmap-read 0.61% case-lru-file-mmap-read-rand -2.24% case-lru-file-readonce -0.64% case-lru-file-readtwice -11.69% case-lru-memcg -1.35% case-mmap-pread-rand-mt 1.88% case-mmap-pread-rand -15.26% case-mmap-pread-seq-mt 0.89% case-mmap-pread-seq -69.72% case-mmap-xread-rand-mt 0.71% case-mmap-xread-seq-mt 0.38% The most significent are: case-lru-file-readtwice -11.69% case-mmap-pread-rand -15.26% case-mmap-pread-seq -69.72% which use activate_page a lot. others are basically variations because each run has slightly difference. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NShaohua Li <shaohua.li@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
Lately I've been working to make KVM use hugepages transparently without the usual restrictions of hugetlbfs. Some of the restrictions I'd like to see removed: 1) hugepages have to be swappable or the guest physical memory remains locked in RAM and can't be paged out to swap 2) if a hugepage allocation fails, regular pages should be allocated instead and mixed in the same vma without any failure and without userland noticing 3) if some task quits and more hugepages become available in the buddy, guest physical memory backed by regular pages should be relocated on hugepages automatically in regions under madvise(MADV_HUGEPAGE) (ideally event driven by waking up the kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes not null) 4) avoidance of reservation and maximization of use of hugepages whenever possible. Reservation (needed to avoid runtime fatal faliures) may be ok for 1 machine with 1 database with 1 database cache with 1 database cache size known at boot time. It's definitely not feasible with a virtualization hypervisor usage like RHEV-H that runs an unknown number of virtual machines with an unknown size of each virtual machine with an unknown amount of pagecache that could be potentially useful in the host for guest not using O_DIRECT (aka cache=off). hugepages in the virtualization hypervisor (and also in the guest!) are much more important than in a regular host not using virtualization, becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24 to 19 in case only the hypervisor uses transparent hugepages, and they decrease the tlb-miss cacheline accesses from 19 to 15 in case both the linux hypervisor and the linux guest both uses this patch (though the guest will limit the addition speedup to anonymous regions only for now...). Even more important is that the tlb miss handler is much slower on a NPT/EPT guest than for a regular shadow paging or no-virtualization scenario. So maximizing the amount of virtual memory cached by the TLB pays off significantly more with NPT/EPT than without (even if there would be no significant speedup in the tlb-miss runtime). The first (and more tedious) part of this work requires allowing the VM to handle anonymous hugepages mixed with regular pages transparently on regular anonymous vmas. This is what this patch tries to achieve in the least intrusive possible way. We want hugepages and hugetlb to be used in a way so that all applications can benefit without changes (as usual we leverage the KVM virtualization design: by improving the Linux VM at large, KVM gets the performance boost too). The most important design choice is: always fallback to 4k allocation if the hugepage allocation fails! This is the _very_ opposite of some large pagecache patches that failed with -EIO back then if a 64k (or similar) allocation failed... Second important decision (to reduce the impact of the feature on the existing pagetable handling code) is that at any time we can split an hugepage into 512 regular pages and it has to be done with an operation that can't fail. This way the reliability of the swapping isn't decreased (no need to allocate memory when we are short on memory to swap) and it's trivial to plug a split_huge_page* one-liner where needed without polluting the VM. Over time we can teach mprotect, mremap and friends to handle pmd_trans_huge natively without calling split_huge_page*. The fact it can't fail isn't just for swap: if split_huge_page would return -ENOMEM (instead of the current void) we'd need to rollback the mprotect from the middle of it (ideally including undoing the split_vma) which would be a big change and in the very wrong direction (it'd likely be simpler not to call split_huge_page at all and to teach mprotect and friends to handle hugepages instead of rolling them back from the middle). In short the very value of split_huge_page is that it can't fail. The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and incremental and it'll just be an "harmless" addition later if this initial part is agreed upon. It also should be noted that locking-wise replacing regular pages with hugepages is going to be very easy if compared to what I'm doing below in split_huge_page, as it will only happen when page_count(page) matches page_mapcount(page) if we can take the PG_lock and mmap_sem in write mode. collapse_huge_page will be a "best effort" that (unlike split_huge_page) can fail at the minimal sign of trouble and we can try again later. collapse_huge_page will be similar to how KSM works and the madvise(MADV_HUGEPAGE) will work similar to madvise(MADV_MERGEABLE). The default I like is that transparent hugepages are used at page fault time. This can be changed with /sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set to three values "always", "madvise", "never" which mean respectively that hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions, or never used. /sys/kernel/mm/transparent_hugepage/defrag instead controls if the hugepage allocation should defrag memory aggressively "always", only inside "madvise" regions, or "never". The pmd_trans_splitting/pmd_trans_huge locking is very solid. The put_page (from get_user_page users that can't use mmu notifier like O_DIRECT) that runs against a __split_huge_page_refcount instead was a pain to serialize in a way that would result always in a coherent page count for both tail and head. I think my locking solution with a compound_lock taken only after the page_first is valid and is still a PageHead should be safe but it surely needs review from SMP race point of view. In short there is no current existing way to serialize the O_DIRECT final put_page against split_huge_page_refcount so I had to invent a new one (O_DIRECT loses knowledge on the mapping status by the time gup_fast returns so...). And I didn't want to impact all gup/gup_fast users for now, maybe if we change the gup interface substantially we can avoid this locking, I admit I didn't think too much about it because changing the gup unpinning interface would be invasive. If we ignored O_DIRECT we could stick to the existing compound refcounting code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM (and any other mmu notifier user) would call it without FOLL_GET (and if FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the current task mmu notifier list yet). But O_DIRECT is fundamental for decent performance of virtualized I/O on fast storage so we can't avoid it to solve the race of put_page against split_huge_page_refcount to achieve a complete hugepage feature for KVM. Swap and oom works fine (well just like with regular pages ;). MMU notifier is handled transparently too, with the exception of the young bit on the pmd, that didn't have a range check but I think KVM will be fine because the whole point of hugepages is that EPT/NPT will also use a huge pmd when they notice gup returns pages with PageCompound set, so they won't care of a range and there's just the pmd young bit to check in that case. NOTE: in some cases if the L2 cache is small, this may slowdown and waste memory during COWs because 4M of memory are accessed in a single fault instead of 8k (the payoff is that after COW the program can run faster). So we might want to switch the copy_huge_page (and clear_huge_page too) to not temporal stores. I also extensively researched ways to avoid this cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k up to 1M (I can send those patches that fully implemented prefault) but I concluded they're not worth it and they add an huge additional complexity and they remove all tlb benefits until the full hugepage has been faulted in, to save a little bit of memory and some cache during app startup, but they still don't improve substantially the cache-trashing during startup if the prefault happens in >4k chunks. One reason is that those 4k pte entries copied are still mapped on a perfectly cache-colored hugepage, so the trashing is the worst one can generate in those copies (cow of 4k page copies aren't so well colored so they trashes less, but again this results in software running faster after the page fault). Those prefault patches allowed things like a pte where post-cow pages were local 4k regular anon pages and the not-yet-cowed pte entries were pointing in the middle of some hugepage mapped read-only. If it doesn't payoff substantially with todays hardware it will payoff even less in the future with larger l2 caches, and the prefault logic would blot the VM a lot. If one is emebdded transparent_hugepage can be disabled during boot with sysfs or with the boot commandline parameter transparent_hugepage=0 (or transparent_hugepage=2 to restrict hugepages inside madvise regions) that will ensure not a single hugepage is allocated at boot time. It is simple enough to just disable transparent hugepage globally and let transparent hugepages be allocated selectively by applications in the MADV_HUGEPAGE region (both at page fault time, and if enabled with the collapse_huge_page too through the kernel daemon). This patch supports only hugepages mapped in the pmd, archs that have smaller hugepages will not fit in this patch alone. Also some archs like power have certain tlb limits that prevents mixing different page size in the same regions so they will not fit in this framework that requires "graceful fallback" to basic PAGE_SIZE in case of physical memory fragmentation. hugetlbfs remains a perfect fit for those because its software limits happen to match the hardware limits. hugetlbfs also remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped to be found not fragmented after a certain system uptime and that would be very expensive to defragment with relocation, so requiring reservation. hugetlbfs is the "reservation way", the point of transparent hugepages is not to have any reservation at all and maximizing the use of cache and hugepages at all times automatically. Some performance result: vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep ages3 memset page fault 1566023 memset tlb miss 453854 memset second tlb miss 453321 random access tlb miss 41635 random access second tlb miss 41658 vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3 memset page fault 1566471 memset tlb miss 453375 memset second tlb miss 453320 random access tlb miss 41636 random access second tlb miss 41637 vmx andrea # ./largepages3 memset page fault 1566642 memset tlb miss 453417 memset second tlb miss 453313 random access tlb miss 41630 random access second tlb miss 41647 vmx andrea # ./largepages3 memset page fault 1566872 memset tlb miss 453418 memset second tlb miss 453315 random access tlb miss 41618 random access second tlb miss 41659 vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage vmx andrea # ./largepages3 memset page fault 2182476 memset tlb miss 460305 memset second tlb miss 460179 random access tlb miss 44483 random access second tlb miss 44186 vmx andrea # ./largepages3 memset page fault 2182791 memset tlb miss 460742 memset second tlb miss 459962 random access tlb miss 43981 random access second tlb miss 43988 ============ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/time.h> #define SIZE (3UL*1024*1024*1024) int main() { char *p = malloc(SIZE), *p2; struct timeval before, after; gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset page fault %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); memset(p, 0, SIZE); gettimeofday(&after, NULL); printf("memset second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); for (p2 = p; p2 < p+SIZE; p2 += 4096) *p2 = 0; gettimeofday(&after, NULL); printf("random access tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); gettimeofday(&before, NULL); for (p2 = p; p2 < p+SIZE; p2 += 4096) *p2 = 0; gettimeofday(&after, NULL); printf("random access second tlb miss %Lu\n", (after.tv_sec-before.tv_sec)*1000000UL + after.tv_usec-before.tv_usec); return 0; } ============ Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NRik van Riel <riel@redhat.com> Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michel Lespinasse 提交于
__get_user_pages gets a new 'nonblocking' parameter to signal that the caller is prepared to re-acquire mmap_sem and retry the operation if needed. This is used to split off long operations if they are going to block on a disk transfer, or when we detect contention on the mmap_sem. [akpm@linux-foundation.org: remove ref to rwsem_is_contended()] Signed-off-by: NMichel Lespinasse <walken@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 27 10月, 2010 1 次提交
-
-
由 KAMEZAWA Hiroyuki 提交于
page_order() is called by memory hotplug's user interface to check the section is removable or not. (is_mem_section_removable()) It calls page_order() withoug holding zone->lock. So, even if the caller does if (PageBuddy(page)) ret = page_order(page) ... The caller may hit BUG_ON(). For fixing this, there are 2 choices. 1. add zone->lock. 2. remove BUG_ON(). is_mem_section_removable() is used for some "advice" and doesn't need to be 100% accurate. This is_removable() can be called via user program.. We don't want to take this important lock for long by user's request. So, this patch removes BUG_ON(). Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NWu Fengguang <fengguang.wu@intel.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NMel Gorman <mel@csn.ul.ie> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 12月, 2009 5 次提交
-
-
由 Haicheng Li 提交于
In some use cases, user doesn't need extra filtering. E.g. user program can inject errors through madvise syscall to its own pages, however it might not know what the page state exactly is or which inode the page belongs to. So introduce an one-off interface "corrupt-filter-enable". Echo 0 to switch off page filters, and echo 1 to switch on the filters. [AK: changed default to 0] Signed-off-by: NHaicheng Li <haicheng.li@linux.intel.com> Signed-off-by: NWu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndi Kleen <ak@linux.intel.com>
-
由 Andi Kleen 提交于
The hwpoison test suite need to inject hwpoison to a collection of selected task pages, and must not touch pages not owned by them and thus kill important system processes such as init. (But it's OK to mis-hwpoison free/unowned pages as well as shared clean pages. Mis-hwpoison of shared dirty pages will kill all tasks, so the test suite will target all or non of such tasks in the first place.) The memory cgroup serves this purpose well. We can put the target processes under the control of a memory cgroup, and tell the hwpoison injection code to only kill pages associated with some active memory cgroup. The prerequisite for doing hwpoison stress tests with mem_cgroup is, the mem_cgroup code tracks task pages _accurately_ (unless page is locked). Which we believe is/should be true. The benefits are simplification of hwpoison injector code. Also the mem_cgroup code will automatically be tested by hwpoison test cases. The alternative interfaces pin-pfn/unpin-pfn can also delegate the (process and page flags) filtering functions reliably to user space. However prototype implementation shows that this scheme adds more complexity than we wanted. Example test case: mkdir /cgroup/hwpoison usemem -m 100 -s 1000 & echo `jobs -p` > /cgroup/hwpoison/tasks memcg_ino=$(ls -id /cgroup/hwpoison | cut -f1 -d' ') echo $memcg_ino > /debug/hwpoison/corrupt-filter-memcg page-types -p `pidof init` --hwpoison # shall do nothing page-types -p `pidof usemem` --hwpoison # poison its pages [AK: Fix documentation] [Add fix for problem noticed by Li Zefan <lizf@cn.fujitsu.com>; dentry in the css could be NULL] CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Hugh Dickins <hugh.dickins@tiscali.co.uk> CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> CC: Balbir Singh <balbir@linux.vnet.ibm.com> CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> CC: Li Zefan <lizf@cn.fujitsu.com> CC: Paul Menage <menage@google.com> CC: Nick Piggin <npiggin@suse.de> CC: Andi Kleen <andi@firstfloor.org> Signed-off-by: NWu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndi Kleen <ak@linux.intel.com>
-
由 Wu Fengguang 提交于
When specified, only poison pages if ((page_flags & mask) == value). - corrupt-filter-flags-mask - corrupt-filter-flags-value This allows stress testing of many kinds of pages. Strictly speaking, the buddy pages requires taking zone lock, to avoid setting PG_hwpoison on a "was buddy but now allocated to someone" page. However we can just do nothing because we set PG_locked in the beginning, this prevents the page allocator from allocating it to someone. (It will BUG() on the unexpected PG_locked, which is fine for hwpoison testing.) [AK: Add select PROC_PAGE_MONITOR to satisfy dependency] CC: Nick Piggin <npiggin@suse.de> Signed-off-by: NWu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndi Kleen <ak@linux.intel.com>
-
由 Wu Fengguang 提交于
__memory_failure()'s workflow is set PG_hwpoison //... unset PG_hwpoison if didn't pass hwpoison filter That could kill unrelated process if it happens to page fault on the page with the (temporary) PG_hwpoison. The race should be big enough to appear in stress tests. Fix it by grabbing the page and checking filter at inject time. This also avoids the very noisy "Injecting memory failure..." messages. - we don't touch madvise() based injection, because the filters are generally not necessary for it. - if we want to apply the filters to h/w aided injection, we'd better to rearrange the logic in __memory_failure() instead of this patch. AK: fix documentation, use drain all, cleanups CC: Haicheng Li <haicheng.li@intel.com> Signed-off-by: NWu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndi Kleen <ak@linux.intel.com>
-
由 Wu Fengguang 提交于
Filesystem data/metadata present the most tricky-to-isolate pages. It requires careful code review and stress testing to get them right. The fs/device filter helps to target the stress tests to some specific filesystem pages. The filter condition is block device's major/minor numbers: - corrupt-filter-dev-major - corrupt-filter-dev-minor When specified (non -1), only page cache pages that belong to that device will be poisoned. The filters are checked reliably on the locked and refcounted page. Haicheng: clear PG_hwpoison and drop bad page count if filter not OK AK: Add documentation CC: Haicheng Li <haicheng.li@intel.com> CC: Nick Piggin <npiggin@suse.de> Signed-off-by: NWu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndi Kleen <ak@linux.intel.com>
-