1. 18 3月, 2016 5 次提交
    • J
      mm: scale kswapd watermarks in proportion to memory · 795ae7a0
      Johannes Weiner 提交于
      In machines with 140G of memory and enterprise flash storage, we have
      seen read and write bursts routinely exceed the kswapd watermarks and
      cause thundering herds in direct reclaim.  Unfortunately, the only way
      to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
      system's emergency reserves - which is entirely unrelated to the
      system's latency requirements.  In order to get kswapd to maintain a
      250M buffer of free memory, the emergency reserves need to be set to 1G.
      That is a lot of memory wasted for no good reason.
      
      On the other hand, it's reasonable to assume that allocation bursts and
      overall allocation concurrency scale with memory capacity, so it makes
      sense to make kswapd aggressiveness a function of that as well.
      
      Change the kswapd watermark scale factor from the currently fixed 25% of
      the tunable emergency reserve to a tunable 0.1% of memory.
      
      Beyond 1G of memory, this will produce bigger watermark steps than the
      current formula in default settings.  Ensure that the new formula never
      chooses steps smaller than that, i.e.  25% of the emergency reserve.
      
      On a 140G machine, this raises the default watermark steps - the
      distance between min and low, and low and high - from 16M to 143M.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      795ae7a0
    • I
      mm/page_alloc.c: calculate 'available' memory in a separate function · d02bd27b
      Igor Redko 提交于
      Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
      statistics protocol, corresponding to 'Available' in /proc/meminfo.
      
      It indicates to the hypervisor how big the balloon can be inflated
      without pushing the guest system to swap.  This metric would be very
      useful in VM orchestration software to improve memory management of
      different VMs under overcommit.
      
      This patch (of 2):
      
      Factor out calculation of the available memory counter into a separate
      exportable function, in order to be able to use it in other parts of the
      kernel.
      
      In particular, it appears a relevant metric to report to the hypervisor
      via virtio-balloon statistics interface (in a followup patch).
      Signed-off-by: NIgor Redko <redkoi@virtuozzo.com>
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d02bd27b
    • V
      mm, compaction: introduce kcompactd · 698b1b30
      Vlastimil Babka 提交于
      Memory compaction can be currently performed in several contexts:
      
       - kswapd balancing a zone after a high-order allocation failure
       - direct compaction to satisfy a high-order allocation, including THP
         page fault attemps
       - khugepaged trying to collapse a hugepage
       - manually from /proc
      
      The purpose of compaction is two-fold.  The obvious purpose is to
      satisfy a (pending or future) high-order allocation, and is easy to
      evaluate.  The other purpose is to keep overal memory fragmentation low
      and help the anti-fragmentation mechanism.  The success wrt the latter
      purpose is more
      
      The current situation wrt the purposes has a few drawbacks:
      
       - compaction is invoked only when a high-order page or hugepage is not
         available (or manually).  This might be too late for the purposes of
         keeping memory fragmentation low.
       - direct compaction increases latency of allocations.  Again, it would
         be better if compaction was performed asynchronously to keep
         fragmentation low, before the allocation itself comes.
       - (a special case of the previous) the cost of compaction during THP
         page faults can easily offset the benefits of THP.
       - kswapd compaction appears to be complex, fragile and not working in
         some scenarios.  It could also end up compacting for a high-order
         allocation request when it should be reclaiming memory for a later
         order-0 request.
      
      To improve the situation, we should be able to benefit from an
      equivalent of kswapd, but for compaction - i.e. a background thread
      which responds to fragmentation and the need for high-order allocations
      (including hugepages) somewhat proactively.
      
      One possibility is to extend the responsibilities of kswapd, which could
      however complicate its design too much.  It should be better to let
      kswapd handle reclaim, as order-0 allocations are often more critical
      than high-order ones.
      
      Another possibility is to extend khugepaged, but this kthread is a
      single instance and tied to THP configs.
      
      This patch goes with the option of a new set of per-node kthreads called
      kcompactd, and lays the foundations, without introducing any new
      tunables.  The lifecycle mimics kswapd kthreads, including the memory
      hotplug hooks.
      
      For compaction, kcompactd uses the standard compaction_suitable() and
      ompact_finished() criteria and the deferred compaction functionality.
      Unlike direct compaction, it uses only sync compaction, as there's no
      allocation latency to minimize.
      
      This patch doesn't yet add a call to wakeup_kcompactd.  The kswapd
      compact/reclaim loop for high-order pages will be replaced by waking up
      kcompactd in the next patch with the description of what's wrong with
      the old approach.
      
      Waking up of the kcompactd threads is also tied to kswapd activity and
      follows these rules:
       - we don't want to affect any fastpaths, so wake up kcompactd only from
         the slowpath, as it's done for kswapd
       - if kswapd is doing reclaim, it's more important than compaction, so
         don't invoke kcompactd until kswapd goes to sleep
       - the target order used for kswapd is passed to kcompactd
      
      Future possible future uses for kcompactd include the ability to wake up
      kcompactd on demand in special situations, such as when hugepages are
      not available (currently not done due to __GFP_NO_KSWAPD) or when a
      fragmentation event (i.e.  __rmqueue_fallback()) occurs.  It's also
      possible to perform periodic compaction with kcompactd.
      
      [arnd@arndb.de: fix build errors with kcompactd]
      [paul.gortmaker@windriver.com: don't use modular references for non modular code]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      698b1b30
    • J
      sound: query dynamic DEBUG_PAGEALLOC setting · 505f6d22
      Joonsoo Kim 提交于
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not in runtime.
      
      [akpm@linux-foundation.org: export _debug_pagealloc_enabled to modules]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NTakashi Iwai <tiwai@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      505f6d22
    • N
      /proc/kpageflags: return KPF_BUDDY for "tail" buddy pages · 832fc1de
      Naoya Horiguchi 提交于
      Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
      is inconvenient when grasping how free pages are distributed.  This
      patch sets KPF_BUDDY for such pages.
      
      With this patch:
      
        $ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
        MemFree:         3134992 kB
                     flags      page-count       MB  symbolic-flags                     long-symbolic-flags
        0x0000000000000400          779272     3044  __________B_______________________________ buddy
        0x0000000000000c00            4385       17  __________BM______________________________ buddy,mmap
                     total          783657     3061
      
      783657 pages is 3134628 kB (roughly consistent with the global counter,)
      so it's OK.
      
      [akpm@linux-foundation.org: update comment, per Naoya]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com&gt;>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      832fc1de
  2. 16 3月, 2016 11 次提交
    • J
      mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous · 7cf91a98
      Joonsoo Kim 提交于
      There is a performance drop report due to hugepage allocation and in
      there half of cpu time are spent on pageblock_pfn_to_page() in
      compaction [1].
      
      In that workload, compaction is triggered to make hugepage but most of
      pageblocks are un-available for compaction due to pageblock type and
      skip bit so compaction usually fails.  Most costly operations in this
      case is to find valid pageblock while scanning whole zone range.  To
      check if pageblock is valid to compact, valid pfn within pageblock is
      required and we can obtain it by calling pageblock_pfn_to_page().  This
      function checks whether pageblock is in a single zone and return valid
      pfn if possible.  Problem is that we need to check it every time before
      scanning pageblock even if we re-visit it and this turns out to be very
      expensive in this workload.
      
      Although we have no way to skip this pageblock check in the system where
      hole exists at arbitrary position, we can use cached value for zone
      continuity and just do pfn_to_page() in the system where hole doesn't
      exist.  This optimization considerably speeds up in above workload.
      
      Before vs After
        Max: 1096 MB/s vs 1325 MB/s
        Min: 635 MB/s 1015 MB/s
        Avg: 899 MB/s 1194 MB/s
      
      Avg is improved by roughly 30% [2].
      
      [1]: http://www.spinics.net/lists/linux-mm/msg97378.html
      [2]: https://lkml.org/lkml/2015/12/9/23
      
      [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: NAaron Lu <aaron.lu@intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NAaron Lu <aaron.lu@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cf91a98
    • L
      mm/page_poisoning.c: allow for zero poisoning · 1414c7f4
      Laura Abbott 提交于
      By default, page poisoning uses a poison value (0xaa) on free.  If this
      is changed to 0, the page is not only sanitized but zeroing on alloc
      with __GFP_ZERO can be skipped as well.  The tradeoff is that detecting
      corruption from the poisoning is harder to detect.  This feature also
      cannot be used with hibernation since pages are not guaranteed to be
      zeroed after hibernation.
      
      Credit to Grsecurity/PaX team for inspiring this work
      Signed-off-by: NLaura Abbott <labbott@fedoraproject.org>
      Acked-by: NRafael J. Wysocki <rjw@rjwysocki.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1414c7f4
    • L
      mm/page_poison.c: enable PAGE_POISONING as a separate option · 8823b1db
      Laura Abbott 提交于
      Page poisoning is currently set up as a feature if architectures don't
      have architecture debug page_alloc to allow unmapping of pages.  It has
      uses apart from that though.  Clearing of the pages on free provides an
      increase in security as it helps to limit the risk of information leaks.
      Allow page poisoning to be enabled as a separate option independent of
      kernel_map pages since the two features do separate work.  Because of
      how hiberanation is implemented, the checks on alloc cannot occur if
      hibernation is enabled.  The runtime alloc checks can also be enabled
      with an option when !HIBERNATION.
      
      Credit to Grsecurity/PaX team for inspiring this work
      Signed-off-by: NLaura Abbott <labbott@fedoraproject.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8823b1db
    • V
      mm, debug: move bad flags printing to bad_page() · ff8e8116
      Vlastimil Babka 提交于
      Since bad_page() is the only user of the badflags parameter of
      dump_page_badflags(), we can move the code to bad_page() and simplify a
      bit.
      
      The dump_page_badflags() function is renamed to __dump_page() and can
      still be called separately from dump_page() for temporary debug prints
      where page_owner info is not desired.
      
      The only user-visible change is that page->mem_cgroup is printed before
      the bad flags.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff8e8116
    • V
      mm, page_owner: dump page owner info from dump_page() · 4e462112
      Vlastimil Babka 提交于
      The page_owner mechanism is useful for dealing with memory leaks.  By
      reading /sys/kernel/debug/page_owner one can determine the stack traces
      leading to allocations of all pages, and find e.g.  a buggy driver.
      
      This information might be also potentially useful for debugging, such as
      the VM_BUG_ON_PAGE() calls to dump_page().  So let's print the stored
      info from dump_page().
      
      Example output:
      
        page:ffffea000292f1c0 count:1 mapcount:0 mapping:ffff8800b2f6cc18 index:0x91d
        flags: 0x1fffff8001002c(referenced|uptodate|lru|mappedtodisk)
        page dumped because: VM_BUG_ON_PAGE(1)
        page->mem_cgroup:ffff8801392c5000
        page allocated via order 0, migratetype Movable, gfp_mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY)
         [<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
         [<ffffffff811b40c8>] alloc_pages_current+0x88/0x120
         [<ffffffff8115e386>] __page_cache_alloc+0xe6/0x120
         [<ffffffff8116ba6c>] __do_page_cache_readahead+0xdc/0x240
         [<ffffffff8116bd05>] ondemand_readahead+0x135/0x260
         [<ffffffff8116be9c>] page_cache_async_readahead+0x6c/0x70
         [<ffffffff811604c2>] generic_file_read_iter+0x3f2/0x760
         [<ffffffff811e0dc7>] __vfs_read+0xa7/0xd0
        page has been migrated, last migrate reason: compaction
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e462112
    • V
      mm, page_owner: print migratetype of page and pageblock, symbolic flags · 60f30350
      Vlastimil Babka 提交于
      The information in /sys/kernel/debug/page_owner includes the migratetype
      of the pageblock the page belongs to.  This is also checked against the
      page's migratetype (as declared by gfp_flags during its allocation), and
      the page is reported as Fallback if its migratetype differs from the
      pageblock's one.  t This is somewhat misleading because in fact fallback
      allocation is not the only reason why these two can differ.  It also
      doesn't direcly provide the page's migratetype, although it's possible
      to derive that from the gfp_flags.
      
      It's arguably better to print both page and pageblock's migratetype and
      leave the interpretation to the consumer than to suggest fallback
      allocation as the only possible reason.  While at it, we can print the
      migratetypes as string the same way as /proc/pagetypeinfo does, as some
      of the numeric values depend on kernel configuration.  For that, this
      patch moves the migratetype_names array from #ifdef CONFIG_PROC_FS part
      of mm/vmstat.c to mm/page_alloc.c and exports it.
      
      With the new format strings for flags, we can now also provide symbolic
      page and gfp flags in the /sys/kernel/debug/page_owner file.  This
      replaces the positional printing of page flags as single letters, which
      might have looked nicer, but was limited to a subset of flags, and
      required the user to remember the letters.
      
      Example page_owner entry after the patch:
      
        Page allocated via order 0, mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY)
        PFN 520 type Movable Block 1 type Movable Flags 0xfffff8001006c(referenced|uptodate|lru|active|mappedtodisk)
         [<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
         [<ffffffff811b4058>] alloc_pages_current+0x88/0x120
         [<ffffffff8115e386>] __page_cache_alloc+0xe6/0x120
         [<ffffffff8116ba6c>] __do_page_cache_readahead+0xdc/0x240
         [<ffffffff8116bd05>] ondemand_readahead+0x135/0x260
         [<ffffffff8116bfb1>] page_cache_sync_readahead+0x31/0x50
         [<ffffffff81160523>] generic_file_read_iter+0x453/0x760
         [<ffffffff811e0d57>] __vfs_read+0xa7/0xd0
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60f30350
    • V
      mm, page_alloc: print symbolic gfp_flags on allocation failure · c5c990e8
      Vlastimil Babka 提交于
      It would be useful to translate gfp_flags into string representation
      when printing in case of an allocation failure, especially as the flags
      have been undergoing some changes recently and the script
      ./scripts/gfp-translate needs a matching source version to be accurate.
      
      Example output:
      
        stapio: page allocation failure: order:9, mode:0x2080020(GFP_ATOMIC)
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c5c990e8
    • C
      mm/debug_pagealloc: ask users for default setting of debug_pagealloc · ea6eabb0
      Christian Borntraeger 提交于
      Since commit 031bc574 ("mm/debug-pagealloc: make debug-pagealloc
      boottime configurable") CONFIG_DEBUG_PAGEALLOC is by default not adding
      any page debugging.
      
      This resulted in several unnoticed bugs, e.g.
      
          https://lkml.kernel.org/g/<569F5E29.3090107@de.ibm.com>
      or
          https://lkml.kernel.org/g/<56A20F30.4050705@de.ibm.com>
      
      as this behaviour change was not even documented in Kconfig.
      
      Let's provide a new Kconfig symbol that allows to change the default
      back to enabled, e.g.  for debug kernels.  This also makes the change
      obvious to kernel packagers.
      
      Let's also change the Kconfig description for CONFIG_DEBUG_PAGEALLOC, to
      indicate that there are two stages of overhead.
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea6eabb0
    • A
      mm/page_alloc.c: rework code layout in memmap_init_zone() · b72d0ffb
      Andrew Morton 提交于
      This function is getting full of weird tricks to avoid word-wrapping.
      Use a goto to eliminate a tab stop then use the new space
      
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b72d0ffb
    • T
      mm/page_alloc.c: introduce kernelcore=mirror option · 342332e6
      Taku Izumi 提交于
      This patch extends existing "kernelcore" option and introduces
      kernelcore=mirror option.  By specifying "mirror" instead of specifying
      the amount of memory, non-mirrored (non-reliable) region will be
      arranged into ZONE_MOVABLE.
      
      [akpm@linux-foundation.org: fix build with CONFIG_HAVE_MEMBLOCK_NODE_MAP=n]
      Signed-off-by: NTaku Izumi <izumi.taku@jp.fujitsu.com>
      Tested-by: NSudeep Holla <sudeep.holla@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      342332e6
    • T
      mm/page_alloc.c: calculate zone_start_pfn at zone_spanned_pages_in_node() · d91749c1
      Taku Izumi 提交于
      Xeon E7 v3 based systems supports Address Range Mirroring and UEFI BIOS
      complied with UEFI spec 2.5 can notify which ranges are mirrored
      (reliable) via EFI memory map.  Now Linux kernel utilize its information
      and allocates boot time memory from reliable region.
      
      My requirement is:
        - allocate kernel memory from mirrored region
        - allocate user memory from non-mirrored region
      
      In order to meet my requirement, ZONE_MOVABLE is useful.  By arranging
      non-mirrored range into ZONE_MOVABLE, mirrored memory is used for kernel
      allocations.
      
      My idea is to extend existing "kernelcore" option and introduces
      kernelcore=mirror option.  By specifying "mirror" instead of specifying
      the amount of memory, non-mirrored region will be arranged into
      ZONE_MOVABLE.
      
      Earlier discussions are at:
       https://lkml.org/lkml/2015/10/9/24
       https://lkml.org/lkml/2015/10/15/9
       https://lkml.org/lkml/2015/11/27/18
       https://lkml.org/lkml/2015/12/8/836
      
      For example, suppose 2-nodes system with the following memory range:
      
        node 0 [mem 0x0000000000001000-0x000000109fffffff]
        node 1 [mem 0x00000010a0000000-0x000000209fffffff]
      and the following ranges are marked as reliable (mirrored):
        [0x0000000000000000-0x0000000100000000]
        [0x0000000100000000-0x0000000180000000]
        [0x0000000800000000-0x0000000880000000]
        [0x00000010a0000000-0x0000001120000000]
        [0x00000017a0000000-0x0000001820000000]
      
      If you specify kernelcore=mirror, ZONE_NORMAL and ZONE_MOVABLE are
      arranged like bellow:
      
       - node 0:
        ZONE_NORMAL : [0x0000000100000000-0x00000010a0000000]
        ZONE_MOVABLE: [0x0000000180000000-0x00000010a0000000]
       - node 1:
        ZONE_NORMAL : [0x00000010a0000000-0x00000020a0000000]
        ZONE_MOVABLE: [0x0000001120000000-0x00000020a0000000]
      
      In overlapped range, pages to be ZONE_MOVABLE in ZONE_NORMAL are treated
      as absent pages, and vice versa.
      
      This patch (of 2):
      
      Currently each zone's zone_start_pfn is calculated at
      free_area_init_core().  However zone's range is fixed at the time when
      invoking zone_spanned_pages_in_node().
      
      This patch changes how each zone->zone_start_pfn is calculated in
      zone_spanned_pages_in_node().
      Signed-off-by: NTaku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d91749c1
  3. 06 2月, 2016 1 次提交
    • V
      mm, hugetlb: don't require CMA for runtime gigantic pages · 080fe206
      Vlastimil Babka 提交于
      Commit 944d9fec ("hugetlb: add support for gigantic page allocation
      at runtime") has added the runtime gigantic page allocation via
      alloc_contig_range(), making this support available only when CONFIG_CMA
      is enabled.  Because it doesn't depend on MIGRATE_CMA pageblocks and the
      associated infrastructure, it is possible with few simple adjustments to
      require only CONFIG_MEMORY_ISOLATION instead of full CONFIG_CMA.
      
      After this patch, alloc_contig_range() and related functions are
      available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION
      enabled.  Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION.  This allows
      supporting runtime gigantic pages without the CMA-specific checks in
      page allocator fastpaths.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      080fe206
  4. 04 2月, 2016 1 次提交
  5. 16 1月, 2016 5 次提交
    • A
      mm/page_alloc.c: remove unused struct zone *z variable · f16f091b
      Alexander Kuleshov 提交于
      Remove unused struct zone *z variable which appeared in 86051ca5
      ("mm: fix usemap initialization").
      Signed-off-by: NAlexander Kuleshov <kuleshovmail@gmail.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f16f091b
    • D
      x86, mm: introduce vmem_altmap to augment vmemmap_populate() · 4b94ffdc
      Dan Williams 提交于
      In support of providing struct page for large persistent memory
      capacities, use struct vmem_altmap to change the default policy for
      allocating memory for the memmap array.  The default vmemmap_populate()
      allocates page table storage area from the page allocator.  Given
      persistent memory capacities relative to DRAM it may not be feasible to
      store the memmap in 'System Memory'.  Instead vmem_altmap represents
      pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
      requests.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b94ffdc
    • K
      thp: introduce deferred_split_huge_page() · 9a982250
      Kirill A. Shutemov 提交于
      Currently we don't split huge page on partial unmap.  It's not an ideal
      situation.  It can lead to memory overhead.
      
      Furtunately, we can detect partial unmap on page_remove_rmap().  But we
      cannot call split_huge_page() from there due to locking context.
      
      It's also counterproductive to do directly from munmap() codepath: in
      many cases we will hit this from exit(2) and splitting the huge page
      just to free it up in small pages is not what we really want.
      
      The patch introduce deferred_split_huge_page() which put the huge page
      into queue for splitting.  The splitting itself will happen when we get
      memory pressure via shrinker interface.  The page will be dropped from
      list on freeing through compound page destructor.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a982250
    • K
      mm: rework mapcount accounting to enable 4k mapping of THPs · 53f9263b
      Kirill A. Shutemov 提交于
      We're going to allow mapping of individual 4k pages of THP compound.  It
      means we need to track mapcount on per small page basis.
      
      Straight-forward approach is to use ->_mapcount in all subpages to track
      how many time this subpage is mapped with PMDs or PTEs combined.  But
      this is rather expensive: mapping or unmapping of a THP page with PMD
      would require HPAGE_PMD_NR atomic operations instead of single we have
      now.
      
      The idea is to store separately how many times the page was mapped as
      whole -- compound_mapcount.  This frees up ->_mapcount in subpages to
      track PTE mapcount.
      
      We use the same approach as with compound page destructor and compound
      order to store compound_mapcount: use space in first tail page,
      ->mapping this time.
      
      Any time we map/unmap whole compound page (THP or hugetlb) -- we
      increment/decrement compound_mapcount.  When we map part of compound
      page with PTE we operate on ->_mapcount of the subpage.
      
      page_mapcount() counts both: PTE and PMD mappings of the page.
      
      Basically, we have mapcount for a subpage spread over two counters.  It
      makes tricky to detect when last mapcount for a page goes away.
      
      We introduced PageDoubleMap() for this.  When we split THP PMD for the
      first time and there's other PMD mapping left we offset up ->_mapcount
      in all subpages by one and set PG_double_map on the compound page.
      These additional references go away with last compound_mapcount.
      
      This approach provides a way to detect when last mapcount goes away on
      per small page basis without introducing new overhead for most common
      cases.
      
      [akpm@linux-foundation.org: fix typo in comment]
      [mhocko@suse.com: ignore partial THP when moving task]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53f9263b
    • K
      mm: sanitize page->mapping for tail pages · 1c290f64
      Kirill A. Shutemov 提交于
      We don't define meaning of page->mapping for tail pages.  Currently it's
      always NULL, which can be inconsistent with head page and potentially
      lead to problems.
      
      Let's poison the pointer to catch all illigal uses.
      
      page_rmapping(), page_mapping() and page_anon_vma() are changed to look
      on head page.
      
      The only illegal use I've caught so far is __GPF_COMP pages from sound
      subsystem, mapped with PTEs.  do_shared_fault() is changed to use
      page_rmapping() instead of direct access to fault_page->mapping.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1c290f64
  6. 15 1月, 2016 10 次提交
  7. 13 12月, 2015 1 次提交
    • V
      mm: fix swapped Movable and Reclaimable in /proc/pagetypeinfo · 475a2f90
      Vlastimil Babka 提交于
      Commit 016c13da ("mm, page_alloc: use masks and shifts when
      converting GFP flags to migrate types") has swapped MIGRATE_MOVABLE and
      MIGRATE_RECLAIMABLE in the enum definition.  However, migratetype_names
      wasn't updated to reflect that.
      
      As a result, the file /proc/pagetypeinfo shows the counts for Movable as
      Reclaimable and vice versa.
      
      Additionally, commit 0aaa29a5 ("mm, page_alloc: reserve pageblocks
      for high-order atomic allocations on demand") introduced
      MIGRATE_HIGHATOMIC, but did not add a letter to distinguish it into
      show_migration_types(), so it doesn't appear in the listing of free
      areas during page alloc failures or oom kills.
      
      This patch fixes both problems.  The atomic reserves will show with a
      letter 'H' in the free areas listings.
      
      Fixes: 016c13da ("mm, page_alloc: use masks and shifts when converting GFP flags to migrate types")
      Fixes: 0aaa29a5 ("mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand")
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      475a2f90
  8. 11 11月, 2015 1 次提交
  9. 07 11月, 2015 5 次提交
    • K
      mm: use 'unsigned int' for page order · d00181b9
      Kirill A. Shutemov 提交于
      Let's try to be consistent about data type of page order.
      
      [sfr@canb.auug.org.au: fix build (type of pageblock_order)]
      [hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d00181b9
    • K
      mm: make compound_head() robust · 1d798ca3
      Kirill A. Shutemov 提交于
      Hugh has pointed that compound_head() call can be unsafe in some
      context. There's one example:
      
      	CPU0					CPU1
      
      isolate_migratepages_block()
        page_count()
          compound_head()
            !!PageTail() == true
      					put_page()
      					  tail->first_page = NULL
            head = tail->first_page
      					alloc_pages(__GFP_COMP)
      					   prep_compound_page()
      					     tail->first_page = head
      					     __SetPageTail(p);
            !!PageTail() == true
          <head == NULL dereferencing>
      
      The race is pure theoretical. I don't it's possible to trigger it in
      practice. But who knows.
      
      We can fix the race by changing how encode PageTail() and compound_head()
      within struct page to be able to update them in one shot.
      
      The patch introduces page->compound_head into third double word block in
      front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
      the rest bits are pointer to head page if bit zero is set.
      
      The patch moves page->pmd_huge_pte out of word, just in case if an
      architecture defines pgtable_t into something what can have the bit 0
      set.
      
      hugetlb_cgroup uses page->lru.next in the second tail page to store
      pointer struct hugetlb_cgroup. The patch switch it to use page->private
      in the second tail page instead. The space is free since ->first_page is
      removed from the union.
      
      The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
      limitation, since there's now space in first tail page to store struct
      hugetlb_cgroup pointer. But that's out of scope of the patch.
      
      That means page->compound_head shares storage space with:
      
       - page->lru.next;
       - page->next;
       - page->rcu_head.next;
      
      That's too long list to be absolutely sure, but looks like nobody uses
      bit 0 of the word.
      
      page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use
      call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future
      call_rcu_lazy() is not allowed as it makes use of the bit and we can
      get false positive PageTail().
      
      [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d798ca3
    • K
      mm: pack compound_dtor and compound_order into one word in struct page · f1e61557
      Kirill A. Shutemov 提交于
      The patch halves space occupied by compound_dtor and compound_order in
      struct page.
      
      For compound_order, it's trivial long -> short conversion.
      
      For get_compound_page_dtor(), we now use hardcoded table for destructor
      lookup and store its index in the struct page instead of direct pointer
      to destructor. It shouldn't be a big trouble to maintain the table: we
      have only two destructor and NULL currently.
      
      This patch free up one word in tail pages for reuse. This is preparation
      for the next patch.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1e61557
    • M
      mm, page_alloc: only enforce watermarks for order-0 allocations · 97a16fc8
      Mel Gorman 提交于
      The primary purpose of watermarks is to ensure that reclaim can always
      make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
      These assume that order-0 allocations are all that is necessary for
      forward progress.
      
      High-order watermarks serve a different purpose.  Kswapd had no high-order
      awareness before they were introduced
      (https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au).  This was
      particularly important when there were high-order atomic requests.  The
      watermarks both gave kswapd awareness and made a reserve for those atomic
      requests.
      
      There are two important side-effects of this.  The most important is that
      a non-atomic high-order request can fail even though free pages are
      available and the order-0 watermarks are ok.  The second is that
      high-order watermark checks are expensive as the free list counts up to
      the requested order must be examined.
      
      With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
      have high-order watermarks.  Kswapd and compaction still need high-order
      awareness which is handled by checking that at least one suitable
      high-order page is free.
      
      With the patch applied, there was little difference in the allocation
      failure rates as the atomic reserves are small relative to the number of
      allocation attempts.  The expected impact is that there will never be an
      allocation failure report that shows suitable pages on the free lists.
      
      The one potential side-effect of this is that in a vanilla kernel, the
      watermark checks may have kept a free page for an atomic allocation.  Now,
      we are 100% relying on the HighAtomic reserves and an early allocation to
      have allocated them.  If the first high-order atomic allocation is after
      the system is already heavily fragmented then it'll fail.
      
      [akpm@linux-foundation.org: simplify __zone_watermark_ok(), per Vlastimil]
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97a16fc8
    • M
      mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand · 0aaa29a5
      Mel Gorman 提交于
      High-order watermark checking exists for two reasons -- kswapd high-order
      awareness and protection for high-order atomic requests.  Historically the
      kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as
      high-order free pages for as long as possible.  This patch introduces
      MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic
      allocations on demand and avoids using those blocks for order-0
      allocations.  This is more flexible and reliable than MIGRATE_RESERVE was.
      
      A MIGRATE_HIGHORDER pageblock is created when an atomic high-order
      allocation request steals a pageblock but limits the total number to 1% of
      the zone.  Callers that speculatively abuse atomic allocations for
      long-lived high-order allocations to access the reserve will quickly fail.
       Note that SLUB is currently not such an abuser as it reclaims at least
      once.  It is possible that the pageblock stolen has few suitable
      high-order pages and will need to steal again in the near future but there
      would need to be strong justification to search all pageblocks for an
      ideal candidate.
      
      The pageblocks are unreserved if an allocation fails after a direct
      reclaim attempt.
      
      The watermark checks account for the reserved pageblocks when the
      allocation request is not a high-order atomic allocation.
      
      The reserved pageblocks can not be used for order-0 allocations.  This may
      allow temporary wastage until a failed reclaim reassigns the pageblock.
      This is deliberate as the intent of the reservation is to satisfy a
      limited number of atomic high-order short-lived requests if the system
      requires them.
      
      The stutter benchmark was used to evaluate this but while it was running
      there was a systemtap script that randomly allocated between 1 high-order
      page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC.  This
      is much larger than the potential reserve and it does not attempt to be
      realistic.  It is intended to stress random high-order allocations from an
      unknown source, show that there is a reduction in failures without
      introducing an anomaly where atomic allocations are more reliable than
      regular allocations.  The amount of memory reserved varied throughout the
      workload as reserves were created and reclaimed under memory pressure.
      The allocation failures once the workload warmed up were as follows;
      
      4.2-rc5-vanilla		70%
      4.2-rc5-atomic-reserve	56%
      
      The failure rate was also measured while building multiple kernels.  The
      failure rate was 14% but is 6% with this patch applied.
      
      Overall, this is a small reduction but the reserves are small relative to
      the number of allocation requests.  In early versions of the patch, the
      failure rate reduced by a much larger amount but that required much larger
      reserves and perversely made atomic allocations seem more reliable than
      regular allocations.
      
      [yalin.wang2010@gmail.com: fix redundant check and a memory leak]
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Nyalin wang <yalin.wang2010@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0aaa29a5