1. 27 April 2019 (4 commits)
  2. 20 April 2019 (1 commit)
    • mm/hotplug: treat CMA pages as unmovable · 1a9f2191
      Committed by Qian Cai
      has_unmovable_pages() is used by CMA and gigantic-page allocation as
      well as by memory hotplug.  The latter doesn't know how to offline a
      CMA pool properly yet, but if an unused (free) CMA page is
      encountered, has_unmovable_pages() happily considers it free memory
      and propagates this up the call chain.  The memory offlining code then
      frees the page without a proper CMA tear down, which leads to
      accounting issues.  Moreover, if the same memory range is onlined
      again, the memory never gets back to the CMA pool.
      
      State after memory offline:
      
       # grep cma /proc/vmstat
       nr_free_cma 205824
      
       # cat /sys/kernel/debug/cma/cma-kvm_cma/count
       209920
      
      Also, kmemleak still thinks those memory addresses are reserved (see
      the trace below) even though they have already been handed to the
      buddy allocator after onlining.  This patch fixes the situation by
      treating CMA pageblocks as unmovable except when has_unmovable_pages()
      is called as part of a CMA allocation; a sketch of the check follows
      the trace.
      
        Offlined Pages 4096
        kmemleak: Cannot insert 0xc000201f7d040008 into the object search tree (overlaps existing)
        Call Trace:
          dump_stack+0xb0/0xf4 (unreliable)
          create_object+0x344/0x380
          __kmalloc_node+0x3ec/0x860
          kvmalloc_node+0x58/0x110
          seq_read+0x41c/0x620
          __vfs_read+0x3c/0x70
          vfs_read+0xbc/0x1a0
          ksys_read+0x7c/0x140
          system_call+0x5c/0x70
        kmemleak: Kernel memory leak detector disabled
        kmemleak: Object 0xc000201cc8000000 (size 13757317120):
        kmemleak:   comm "swapper/0", pid 0, jiffies 4294937297
        kmemleak:   min_count = -1
        kmemleak:   count = 0
        kmemleak:   flags = 0x5
        kmemleak:   checksum = 0
        kmemleak:   backtrace:
             cma_declare_contiguous+0x2a4/0x3b0
             kvm_cma_reserve+0x11c/0x134
             setup_arch+0x300/0x3f8
             start_kernel+0x9c/0x6e8
             start_here_common+0x1c/0x4b0
        kmemleak: Automatic memory scanning thread ended
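
      A minimal sketch of the check described above, simplified rather than
      quoted from the upstream diff: inside has_unmovable_pages(), a page
      in a CMA pageblock only counts as movable when the scan is being done
      on behalf of a CMA allocation (migratetype here is the type the
      caller is isolating for; has_unmovable_pages() is assumed to return
      true when an unmovable page is found):

          /*
           * CMA allocations (alloc_contig_range) need to isolate CMA
           * pageblocks, so only they may treat a CMA page as movable;
           * memory offlining must see it as unmovable so the CMA pool
           * is torn down properly.
           */
          if (is_migrate_cma_page(page)) {
                  if (is_migrate_cma(migratetype))
                          continue;       /* CMA allocation path */
                  return true;            /* hotplug etc.: unmovable */
          }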
      
      [cai@lca.pw: use is_migrate_cma_page() and update commit log]
        Link: http://lkml.kernel.org/r/20190416170510.20048-1-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190413002623.8967-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a9f2191
  3. 30 March 2019 (1 commit)
    • mm/hotplug: fix offline undo_isolate_page_range() · 9b7ea46a
      Committed by Qian Cai
      Commit f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded
      memory to zones until online") introduced move_pfn_range_to_zone(),
      which calls memmap_init_zone() when onlining a memory block.
      memmap_init_zone() resets the pagetype flags and makes the migrate
      type MOVABLE.
      
      However, __offline_pages() also calls undo_isolate_page_range() after
      offline_isolated_pages() to do the same thing.  Because commit
      2ce13640 ("mm: __first_valid_page skip over offline pages") changed
      __first_valid_page() to skip offline pages, undo_isolate_page_range()
      here just wastes CPU cycles looping over the offlining PFN range while
      doing nothing: __first_valid_page() returns NULL, as
      offline_isolated_pages() has already marked all memory sections within
      the pfn range as offline via offline_mem_sections().
      
      Also, after calling the "useless" undo_isolate_page_range() here,
      __offline_pages() reaches the point of no return by notifying
      MEM_OFFLINE.  Those pages will be marked as MIGRATE_MOVABLE again once
      onlined.  The only thing left to do is to decrease the zone's counter
      of isolated pageblocks, which, while non-zero, keeps some
      page-allocation paths slower (a cost the above commit introduced).
      
      Even if alloc_contig_range() can be used to isolate 16GB-hugetlb pages
      on ppc64, an "int" should still be enough to represent the number of
      pageblocks there.  Fix an incorrect comment along the way.
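
      To illustrate why the call was useless, a simplified sketch of the
      loop in question, assuming the post-2ce13640 behaviour of
      __first_valid_page():

          /*
           * undo_isolate_page_range() over an already-offlined range:
           * offline_mem_sections() marked every section in
           * [start_pfn, end_pfn) offline, so __first_valid_page()
           * returns NULL for each pageblock and the loop does nothing.
           */
          for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
                  page = __first_valid_page(pfn, pageblock_nr_pages);
                  if (!page || !is_migrate_isolate_page(page))
                          continue;   /* always taken here: page is NULL */
                  unset_migratetype_isolate(page, migratetype);
          }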
      
      [cai@lca.pw: v4]
        Link: http://lkml.kernel.org/r/20190314150641.59358-1-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190313143133.46200-1-cai@lca.pw
      Fixes: 2ce13640 ("mm: __first_valid_page skip over offline pages")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b7ea46a
  4. 13 March 2019 (1 commit)
    • memblock: drop memblock_alloc_*_nopanic() variants · 26fb3dae
      Committed by Mike Rapoport
      As all the memblock allocation functions return NULL in case of error
      rather than panic(), the duplicates with _nopanic suffix can be removed.
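
      A caller of a removed _nopanic variant keeps the no-panic semantics by
      simply checking the return value; an illustrative (not quoted)
      example:

          /* memblock_alloc() returns NULL on failure, so the former
           * _nopanic behaviour is just an explicit check at the caller. */
          ptr = memblock_alloc(size, SMP_CACHE_BYTES);
          if (!ptr)
                  pr_warn("failed to allocate %lu bytes\n",
                          (unsigned long)size);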
      
      Link: http://lkml.kernel.org/r/1548057848-15136-22-git-send-email-rppt@linux.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Petr Mladek <pmladek@suse.com>		[printk]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Guo Ren <ren_guo@c-sky.com>				[c-sky]
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Juergen Gross <jgross@suse.com>			[Xen]
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Rob Herring <robh+dt@kernel.org>
      Cc: Rob Herring <robh@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26fb3dae
  5. 06 March 2019 (14 commits)
  6. 22 February 2019 (1 commit)
  7. 18 February 2019 (1 commit)
  8. 15 February 2019 (1 commit)
    • mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs · 2c2ade81
      Committed by Jann Horn
      The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
      number of references that we might need to create in the fastpath later,
      the bump-allocation fastpath only has to modify the non-atomic bias value
      that tracks the number of extra references we hold instead of the atomic
      refcount. The maximum number of allocations we can serve (under the
      assumption that no allocation is made with size 0) is nc->size, so that's
      the bias used.
      
      However, even when all memory in the allocation has been given away, a
      reference to the page is still held; and in the `offset < 0` slowpath, the
      page may be reused if everyone else has dropped their references.
      This means that the necessary number of references is actually
      `nc->size+1`.
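
      A sketch of the corrected accounting, simplified from
      page_frag_alloc()'s refill path (the upstream code uses a
      compile-time maximum size constant rather than the bare nc->size
      shown here):

          /* Refill: the cache keeps one reference for itself, so serving
           * "size" allocations needs size + 1 references in total. */
          size = nc->size;
          /* was: page_ref_add(page, size - 1); nc->pagecnt_bias = size; */
          page_ref_add(page, size);   /* page came with one ref already */
          nc->pagecnt_bias = size + 1;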
      
      Luckily, from a quick grep, it looks like the only path that can call
      page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
      requires CAP_NET_ADMIN in the init namespace and is only intended to be
      used for kernel testing and fuzzing.
      
      To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
      `offset < 0` path, below the virt_to_page() call, and then repeatedly call
      writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
      with a vector consisting of 15 elements containing 1 byte each.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2c2ade81
  9. 29 January 2019 (1 commit)
    • Revert "mm, memory_hotplug: initialize struct pages for the full memory section" · 4aa9fc2a
      Committed by Michal Hocko
      This reverts commit 2830bf6f.
      
      The underlying assumption that one sparse section belongs to a single
      NUMA node doesn't really hold.  Robert Shteynfeld has reported a boot
      failure. The boot log was not captured but his memory layout is as
      follows:
      
        Early memory node ranges
          node   1: [mem 0x0000000000001000-0x0000000000090fff]
          node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
          node   1: [mem 0x0000000100000000-0x0000001423ffffff]
          node   0: [mem 0x0000001424000000-0x0000002023ffffff]
      
      This means that node0 starts in the middle of a memory section which
      is also in node1.  memmap_init_zone tries to initialize the padding of
      a section even when it is outside of the given pfn range, because
      there are code paths (e.g. memory hotplug) which assume that the full
      memory section is always initialized.
      
      In this particular case, though, such a range is already initialized
      and most likely already managed by the page allocator.  Scribbling
      over those pages corrupts the internal state and likely blows up when
      any of those pages gets used.
      Reported-by: Robert Shteynfeld <robert.shteynfeld@gmail.com>
      Fixes: 2830bf6f ("mm, memory_hotplug: initialize struct pages for the full memory section")
      Cc: stable@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4aa9fc2a
  10. 09 January 2019 (1 commit)
    • mm, page_alloc: do not wake kswapd with zone lock held · 73444bc4
      Committed by Mel Gorman
      syzbot reported the following regression in the latest merge window and
      it was confirmed by Qian Cai that a similar bug was visible from a
      different context.
      
        ======================================================
        WARNING: possible circular locking dependency detected
        4.20.0+ #297 Not tainted
        ------------------------------------------------------
        syz-executor0/8529 is trying to acquire lock:
        000000005e7fb829 (&pgdat->kswapd_wait){....}, at:
        __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120
      
        but task is already holding lock:
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock
        include/linux/spinlock.h:329 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk
        mm/page_alloc.c:2548 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist
        mm/page_alloc.c:3021 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist
        mm/page_alloc.c:3050 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue
        mm/page_alloc.c:3072 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at:
        get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491
      
      It appears to be a false positive in that the only way the lock ordering
      should be inverted is if kswapd is waking itself and the wakeup
      allocates debugging objects which should already be allocated if it's
      kswapd doing the waking.  Nevertheless, the possibility exists and so
      it's best to avoid the problem.
      
      This patch flags a zone as needing a kswapd wakeup using the,
      surprisingly, unused zone flag field.  The flag is read without the
      lock held to do the wakeup.  It's possible that the flag-setting
      context is not the same as the flag-clearing context, or for small
      races to occur.  However, each race possibility is harmless and there
      is no visible degradation in fragmentation treatment.  A sketch of the
      mechanism follows.
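
      A sketch of the mechanism, using the flag name as merged upstream
      (treat the exact call sites as approximate):

          /* Set under zone->lock when a fragmentation event boosts the
           * watermark: */
          set_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);

          /* Later in the allocator, after zone->lock has been dropped: */
          if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
                  clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
                  wakeup_kswapd(zone, 0, 0, zone_idx(zone));
          }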
      
      While zone->flags could have continued to be unused, there is
      potential for moving some existing fields into the flags field
      instead, particularly read-mostly ones like zone->initialized and
      zone->contiguous.
      
      Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net
      Fixes: 1c30844d ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Qian Cai <cai@lca.pw>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73444bc4
  11. 29 December 2018 (14 commits)
    • mm/page_alloc.c: allow error injection · af3b8544
      Committed by Benjamin Poirier
      Model the call chain after should_failslab().  Likewise, we can now
      use a kprobe to override the return value of should_fail_alloc_page()
      and inject allocation failures into alloc_page*().
      
      This will allow injecting allocation failures using the BCC tools even
      without building the kernel with CONFIG_FAIL_PAGE_ALLOC and booting it
      with a fail_page_alloc= parameter, which incurs some overhead even
      when failures are not being injected.  On the other hand, this patch
      adds an unconditional call to should_fail_alloc_page() in the page
      allocation hotpath.  That overhead should be rather negligible with
      CONFIG_FAIL_PAGE_ALLOC=n when there's no kprobe attached, though.
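
      A sketch of the shape of the change, modelled on the should_failslab()
      pattern (helper names approximate):

          /* A noinline wrapper whose return value can be overridden via a
           * kprobe or BPF program, mirroring should_failslab(). */
          static noinline bool should_fail_alloc_page(gfp_t gfp_mask,
                                                      unsigned int order)
          {
                  return __should_fail_alloc_page(gfp_mask, order);
          }
          ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);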
      
      [vbabka@suse.cz: changelog addition]
      Link: http://lkml.kernel.org/r/20181214074330.18917-1-bpoirier@suse.com
      Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af3b8544
    • mm, page_alloc: enable pcpu_drain with zone capability · d9367bd0
      Committed by Wei Yang
      drain_all_pages is documented to drain per-cpu pages for a given zone (if
      non-NULL).  The current implementation doesn't match the description
      though.  It will drain all pcp pages for all zones that happen to have
      cached pages on the same cpu as the given zone.  This will lead to
      premature pcp cache draining for zones that are not of any interest to the
      caller - e.g.  compaction, hwpoison or memory offline.
      
      This forces the page allocator to take locks, with potential lock
      contention as a result.
      
      There is no real reason for this sub-optimal implementation.  Replace
      per-cpu work item with a dedicated structure which contains a pointer to
      the zone and pass it over to the worker.  This will get the zone
      information all the way down to the worker function and do the right job.
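
      A sketch of the dedicated structure and worker described above (close
      to, but not quoted from, the upstream change):

          /* Per-cpu work item that carries the target zone to the worker. */
          struct pcpu_drain {
                  struct zone *zone;
                  struct work_struct work;
          };
          static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);

          static void drain_local_pages_wq(struct work_struct *work)
          {
                  struct pcpu_drain *drain;

                  drain = container_of(work, struct pcpu_drain, work);
                  preempt_disable();
                  drain_local_pages(drain->zone); /* only the requested zone */
                  preempt_enable();
          }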
      
      [akpm@linux-foundation.org: avoid 80-col tricks]
      [mhocko@suse.com: refactor the whole changelog]
      Link: http://lkml.kernel.org/r/20181212142550.61686-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d9367bd0
    • mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init · 3c0c12cc
      Committed by Waiman Long
      When CONFIG_KASAN is enabled on large memory SMP systems, the deferred
      pages initialization can take a long time.  Below are the reported
      init times on an 8-socket 96-core 4TB IvyBridge system.
      
        1) Non-debug kernel without CONFIG_KASAN
           [    8.764222] node 1 initialised, 132086516 pages in 7027ms
      
        2) Debug kernel with CONFIG_KASAN
           [  146.288115] node 1 initialised, 132075466 pages in 143052ms
      
      So the page init time in a debug kernel was 20X that of the non-debug
      kernel.  The long init time can be problematic as the page
      initialization is done with interrupts disabled.  In this particular
      case, it caused the following warning messages as well as NMI
      backtraces of all the cores that were doing the initialization.
      
      [   68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
      [   68.241000] rcu: 	25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
      [   68.241000] rcu: 	44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
      [   68.241000] rcu: 	54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
      [   68.241000] rcu: 	60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
      [   68.241000] rcu: 	72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
      [   68.241000] rcu: 	84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
      [   68.241000] rcu: 	111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
      [   68.241000] rcu: 	(detected by 13, t=65018 jiffies, g=249, q=2)
      
      The long init time was mainly caused by the call to kasan_free_pages() to
      poison the newly initialized pages.  On a 4TB system, we are talking about
      almost 500GB of memory probably on the same node.
      
      In reality, we may not need to poison the newly initialized pages before
      they are ever allocated.  So KASAN poisoning of freed pages before the
      completion of deferred memory initialization is now disabled.  Those pages
      will be properly poisoned when they are allocated or freed after deferred
      pages initialization is done.
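
      A sketch of the guard (helper and static-key names as merged
      upstream, but treat the details as approximate):

          /* Skip KASAN poisoning on free until deferred struct-page
           * initialization has finished; the pages are poisoned normally
           * once they are allocated or freed after that point. */
          #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
          DECLARE_STATIC_KEY_TRUE(deferred_pages);

          static inline void kasan_free_nondeferred_pages(struct page *page,
                                                          int order)
          {
                  if (!static_branch_unlikely(&deferred_pages))
                          kasan_free_pages(page, order);
          }
          #else
          #define kasan_free_nondeferred_pages(p, o) kasan_free_pages(p, o)
          #endif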
      
      With this change, the new page initialization time became:
      
      [   21.948010] node 1 initialised, 132075466 pages in 18702ms
      
      This was still about double the non-debug kernel time, but was much
      better than before.
      
      Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c0c12cc
    • mm/pageblock: throw compile error if pageblock_bits cannot hold MIGRATE_TYPES · 125b860b
      Committed by Pingfan Liu
      Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated in
      code.  If someone adds an extra migrate type, they may forget to
      enlarge NR_PAGEBLOCK_BITS, so some safeguard is needed.
      
      NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, but the two macros live in
      different header files with a reverse dependency, which makes it hard
      to refer to MIGRATE_TYPES from pageblock-flags.h.  This patch enforces
      the relation at compile time; a sketch of the assertion follows.
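
      A sketch of the compile-time assertion (the macro name and its exact
      location in mm/page_alloc.c are approximate):

          /* Fail the build if MIGRATE_TYPES grows beyond what the
           * per-pageblock migratetype bits can encode. */
          BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits));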
      
      Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com
      Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      125b860b
    • mm/page_alloc.c: drop unneeded __meminit and __meminitdata · bbe5d993
      Committed by Oscar Salvador
      Since commit 03e85f9d ("mm/page_alloc: Introduce
      free_area_init_core_hotplug"), some functions are only called during
      system initialization: concretely, free_area_init_node() and the
      functions that hang from it.
      
      Also, some variables are no longer used after the system has gone
      through initialization, so this can be considered a late clean-up of
      that patch.
      
      This patch changes the functions from __meminit to __init, and the
      variables from __meminitdata to __initdata.
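
      An illustrative before/after (the function and variable are picked as
      examples; the patch touches several more):

          /* __init lets the function's text be discarded after boot, since
           * it is no longer reachable from the hotplug path. */
          static void __init free_area_init_core(struct pglist_data *pgdat);

          /* Likewise for boot-only data. */
          static unsigned long __initdata
                  arch_zone_lowest_possible_pfn[MAX_NR_ZONES];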
      
      In return, we get some KBs back:
      
      Before:
        Freeing unused kernel image memory: 2472K
      
      After:
        Freeing unused kernel image memory: 2480K
      
      Link: http://lkml.kernel.org/r/20181204111507.4808-1-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbe5d993
    • mm: check nr_initialised with PAGES_PER_SECTION directly in defer_init() · 23b68cfa
      Committed by Wei Yang
      When DEFERRED_STRUCT_PAGE_INIT is configured, only the first section
      of each node's highest zone is initialized before the deferred stage.
      
      static_init_pgcnt is used to store the number of pages like this:
      
          pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                                    pgdat->node_spanned_pages);
      
      because we don't want to overflow the zone's range.
      
      But this is not necessary, since defer_init() is called like this:
      
        memmap_init_zone()
          for pfn in [start_pfn, end_pfn)
            defer_init(pfn, end_pfn)
      
      In case (pgdat->node_spanned_pages < PAGES_PER_SECTION), the loop would
      stop before calling defer_init().
      
      Besides, comparing node_spanned_pages with PAGES_PER_SECTION is not
      correct anyway, since nr_initialised is zone based instead of node
      based.  Even if node_spanned_pages is bigger than PAGES_PER_SECTION,
      the node's highest zone may still have fewer pages than
      PAGES_PER_SECTION.  A sketch of the resulting defer_init() follows.
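
      A sketch of defer_init() after the change, simplified but close to
      the merged helper:

          static inline bool
          defer_init(int nid, unsigned long pfn, unsigned long end_pfn)
          {
                  static unsigned long prev_end_pfn, nr_initialised;

                  /* New zone: reset the per-zone counter. */
                  if (prev_end_pfn != end_pfn) {
                          prev_end_pfn = end_pfn;
                          nr_initialised = 0;
                  }

                  /* Always populate low zones for address-constrained
                   * allocations. */
                  if (end_pfn < pgdat_end_pfn(NODE_DATA(nid)))
                          return false;

                  /* Compare against PAGES_PER_SECTION directly; no
                   * static_init_pgcnt needed. */
                  nr_initialised++;
                  if ((nr_initialised > PAGES_PER_SECTION) &&
                      (pfn & (PAGES_PER_SECTION - 1)) == 0) {
                          NODE_DATA(nid)->first_deferred_pfn = pfn;
                          return true;
                  }
                  return false;
          }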
      
      Link: http://lkml.kernel.org/r/20181122094807.6985-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23b68cfa
    • mm, oom: reorganize the oom report in dump_header · ef8444ea
      Committed by yuzhoujian
      The OOM report contains several sections.  The first one is the
      allocation context that triggered the OOM.  Then we have the cpuset
      context, followed by the stack trace of the OOM path.  The third one
      is the OOM memory information, followed by the current memory state of
      all system tasks.  At last, we show the OOM-eligible tasks and the
      information about the chosen OOM victim.
      
      One thing that makes parsing more awkward than necessary is that we do
      not have a single, easily parsable line about the OOM context.  This
      patch reorganizes the OOM report into:
      
      1) who invoked oom and what was the allocation request
      
      [  515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
      
      2) OOM stack trace
      
      [  515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
      [  515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
      [  515.906821] Call Trace:
      [  515.908062]  dump_stack+0x5a/0x73
      [  515.909311]  dump_header+0x55/0x28c
      [  515.914260]  oom_kill_process+0x2d8/0x300
      [  515.916708]  out_of_memory+0x145/0x4a0
      [  515.917932]  __alloc_pages_slowpath+0x7d2/0xa16
      [  515.919157]  __alloc_pages_nodemask+0x277/0x290
      [  515.920367]  filemap_fault+0x3d0/0x6c0
      [  515.921529]  ? filemap_map_pages+0x2b8/0x420
      [  515.922709]  ext4_filemap_fault+0x2c/0x40 [ext4]
      [  515.923884]  __do_fault+0x20/0x80
      [  515.925032]  __handle_mm_fault+0xbc0/0xe80
      [  515.926195]  handle_mm_fault+0xfa/0x210
      [  515.927357]  __do_page_fault+0x233/0x4c0
      [  515.928506]  do_page_fault+0x32/0x140
      [  515.929646]  ? page_fault+0x8/0x30
      [  515.930770]  page_fault+0x1e/0x30
      
      3) OOM memory information
      
      [  515.958093] Mem-Info:
      [  515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
       active_file:4402672 inactive_file:483963 isolated_file:1344
       unevictable:0 dirty:4886753 writeback:0 unstable:0
       slab_reclaimable:148442 slab_unreclaimable:18741
       mapped:1347 shmem:1347 pagetables:58669 bounce:0
       free:88663 free_pcp:0 free_cma:0
      ...
      
      4) current memory state of all system tasks
      
      [  516.079544] [    744]     0   744     9211     1345   114688       82             0 systemd-journal
      [  516.082034] [    787]     0   787    31764        0   143360       92             0 lvmetad
      [  516.084465] [    792]     0   792    10930        1   110592      208         -1000 systemd-udevd
      [  516.086865] [   1199]     0  1199    13866        0   131072      112         -1000 auditd
      [  516.089190] [   1222]     0  1222    31990        1   110592      157             0 smartd
      [  516.091477] [   1225]     0  1225     4864       85    81920       43             0 irqbalance
      [  516.093712] [   1226]     0  1226    52612        0   258048      426             0 abrtd
      [  516.112128] [   1280]     0  1280   109774       55   299008      400             0 NetworkManager
      [  516.113998] [   1295]     0  1295    28817       37    69632       24             0 ksmtuned
      [  516.144596] [  10718]     0 10718  2622484  1721372 15998976   267219             0 panic
      [  516.145792] [  10719]     0 10719  2622484  1164767  9818112    53576             0 panic
      [  516.146977] [  10720]     0 10720  2622484  1174361  9904128    53709             0 panic
      [  516.148163] [  10721]     0 10721  2622484  1209070 10194944    54824             0 panic
      [  516.149329] [  10722]     0 10722  2622484  1745799 14774272    91138             0 panic
      
      5) oom context (constraints and the chosen victim).
      
      oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0
      
      An admin can easily get the full oom context on a single line, which
      makes parsing much easier.
      
      Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com
      Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <yang.s@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef8444ea
    • mm: reclaim small amounts of memory when an external fragmentation event occurs · 1c30844d
      Committed by Mel Gorman
      An external fragmentation event was previously described as
      
          When the page allocator fragments memory, it records the event using
          the mm_page_alloc_extfrag event. If the fallback_order is smaller
          than a pageblock order (order-9 on 64-bit x86) then it's considered
          an event that will cause external fragmentation issues in the future.
      
      The kernel reduces the probability of such events by increasing the
      watermark sizes by calling set_recommended_min_free_kbytes early in the
      lifetime of the system.  This works reasonably well in general but if
      there are enough sparsely populated pageblocks then the problem can still
      occur as enough memory is free overall and kswapd stays asleep.
      
      This patch introduces a watermark_boost_factor sysctl that allows a
      zone's watermark to be temporarily boosted when an external
      fragmentation-causing event occurs.  The boosting will stall
      allocations that would decrease free memory below the boosted low
      watermark, and kswapd is woken, if the calling context allows it, to
      reclaim an amount of memory relative to the size of the high watermark
      and the watermark_boost_factor until the boost is cleared.  When
      kswapd finishes, it wakes kcompactd at the pageblock order to clean
      some of the pageblocks that may have been affected by the
      fragmentation event.  kswapd avoids any writeback, slab shrinkage and
      swap from reclaim context during this operation to avoid excessive
      system disruption in the name of fragmentation avoidance.  Care is
      taken so that kswapd will do normal reclaim work if the system is
      really low on memory.  A sketch of the boosting helper follows.
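
      A sketch of the boosting helper, close to the merged version (the
      _watermark[] array comes from the accessor patch later in this list):

          /* Bump the zone's watermark boost by one pageblock's worth of
           * pages, capped relative to the high watermark by the
           * watermark_boost_factor sysctl (in fractions of 10000). */
          static inline void boost_watermark(struct zone *zone)
          {
                  unsigned long max_boost;

                  if (!watermark_boost_factor)
                          return;

                  max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
                                        watermark_boost_factor, 10000);
                  max_boost = max(pageblock_nr_pages, max_boost);

                  zone->watermark_boost = min(zone->watermark_boost +
                                              pageblock_nr_pages, max_boost);
          }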
      
      This was evaluated using the same workloads as "mm, page_alloc: Spread
      allocations across zones before introducing fragmentation".
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      4.20-rc3+patch1-4:                    18421 (98% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
      Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)
      
      Note that external fragmentation-causing events are massively reduced
      by this patch, whether in comparison to the previous kernel or the
      vanilla kernel.  The fault latency for huge pages appears to be
      increased, but that is only because THP allocations were successful
      with the patch applied.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      4.20-rc3+patch1-4:                   13464 (95% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
      Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
      Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
      Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)
      
      As before, massive reduction in external fragmentation events, some jitter
      on latencies and an increase in THP allocation success rates.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      4.20-rc3+patch1-4:                   14263 (93% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
      Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)
      
      There is a 93% reduction in fragmentation causing events, there is a big
      reduction in the huge page fault latency and allocation success rate is
      higher.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      4.20-rc3+patch1-4:                  11095 (93% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
      Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)
      
      There is a large reduction in fragmentation events with some jitter around
      the latencies and success rates.  As before, the high THP allocation
      success rate does mean the system is under a lot of pressure.  However, as
      the fragmentation events are reduced, it would be expected that the
      long-term allocation success rate would be higher.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c30844d
    • mm: use alloc_flags to record if kswapd can wake · 0a79cdad
      Committed by Mel Gorman
      This is a preparation patch that copies the GFP flag
      __GFP_KSWAPD_RECLAIM into alloc_flags; it avoids having to pass
      gfp_mask through a long call chain in a future patch.
      
      Note that the setting in the fast path happens in alloc_flags_nofragment()
      and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT.
      That's true in this patch but is not true later so it's done now for
      easier review to show where the flag needs to be recorded.
      
      No functional change.
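
      A sketch of the idea (the flag value is illustrative):

          #define ALLOC_KSWAPD    0x200   /* allow waking of kswapd */

          /* Record once whether kswapd may be woken for this allocation: */
          if (gfp_mask & __GFP_KSWAPD_RECLAIM)
                  alloc_flags |= ALLOC_KSWAPD;

          /* Later, the flag alone decides whether kswapd may be woken: */
          if (alloc_flags & ALLOC_KSWAPD)
                  wakeup_kswapd(zone, 0, order, zone_idx(zone));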
      
      [mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case]
        Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net
      Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a79cdad
    • mm: move zone watermark accesses behind an accessor · a9214443
      Committed by Mel Gorman
      This is a preparation patch only, no functional change.
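
      A sketch of the accessor (names as merged upstream, details
      approximate); a later patch can then fold a temporary boost into
      every watermark read:

          #define wmark_pages(z, i)       (z->_watermark[i])
          #define min_wmark_pages(z)      (wmark_pages(z, WMARK_MIN))
          #define low_wmark_pages(z)      (wmark_pages(z, WMARK_LOW))
          #define high_wmark_pages(z)     (wmark_pages(z, WMARK_HIGH))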
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9214443
    • mm, page_alloc: spread allocations across zones before introducing fragmentation · 6bb15450
      Committed by Mel Gorman
      Patch series "Fragmentation avoidance improvements", v5.
      
      It has been noted before that fragmentation avoidance (aka
      anti-fragmentation) is not perfect. Given sufficient time or an adverse
      workload, memory gets fragmented and the long-term success of high-order
      allocations degrades. This series defines an adverse workload, a definition
      of external fragmentation events (including serious) ones and a series
      that reduces the level of those fragmentation events.
      
      The details of the workload and the consequences are described in more
      detail in the changelogs. However, from patch 1, this is a high-level
      summary of the adverse workload. The exact details are found in the
      mmtests implementation.
      
      The broad details of the workload are as follows:
      
      1. Create an XFS filesystem (not specified in the configuration but done
         as part of the testing for this patch)
      2. Start 4 fio threads that write a number of 64K files inefficiently.
         Inefficiently means that files are created on first access and not
         created in advance (fio parameter create_on_open=1) and fallocate
         is not used (fallocate=none). With multiple IO issuers this creates
         a mix of slab and page cache allocations over time. The total size
         of the files is 150% physical memory so that the slabs and page cache
         pages get mixed
      3. Warm up a number of fio read-only threads accessing the same files
         created in step 2. This part runs for the same length of time it
         took to create the files. It'll fault back in old data and further
         interleave slab and page cache allocations. As it's now low on
         memory due to step 2, fragmentation occurs as pageblocks get
         stolen.
      4. While step 3 is still running, start a process that tries to allocate
         75% of memory as huge pages with a number of threads. The number of
         threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
         threads contending with fio, any other threads or forcing cross-NUMA
         scheduling. Note that the test has not been used on a machine with less
         than 8 cores. The benchmark records whether huge pages were allocated
         and what the fault latency was in microseconds
      5. Measure the number of events potentially causing external fragmentation,
         the fault latency and the huge page allocation success rate.
      6. Cleanup
      
      Overall the series reduces external fragmentation causing events by over 94%
      on 1 and 2 socket machines, which in turn impacts high-order allocation
      success rates over the long term. There are differences in latencies and
      high-order allocation success rates. Latencies are a mixed bag as they
      are vulnerable to exact system state and whether allocations succeeded
      so they are treated as a secondary metric.
      
      Patch 1 uses lower zones if they are populated and have free memory
      	instead of fragmenting a higher zone. It's special cased to
      	handle a Normal->DMA32 fallback with the reasons explained
      	in the changelog.
      
      Patch 2-4 boosts watermarks temporarily when an external fragmentation
      	event occurs. kswapd wakes to reclaim a small amount of old memory
      	and then wakes kcompactd on completion to recover the system
      	slightly. This introduces some overhead in the slowpath. The level
      	of boosting can be tuned or disabled depending on the tolerance
      	for fragmentation vs allocation latency.
      
      Patch 5 stalls some movable allocation requests to let kswapd from patch 4
      	make some progress. The duration of the stalls is very low but it
      	is possible to tune the system to avoid fragmentation events if
      	larger stalls can be tolerated.
      
      The bulk of the improvement in fragmentation avoidance is from patches
      1-4 but patch 5 can deal with a rare corner case and provides the option
      of tuning a system for THP allocation success rates in exchange for
      some stalls to control fragmentation.
      
      This patch (of 5):
      
      The page allocator zone lists are iterated based on the watermarks of each
      zone which does not take anti-fragmentation into account.  On x86, node 0
      may have multiple zones while other nodes have one zone.  A consequence is
      that tasks running on node 0 may fragment ZONE_NORMAL even though
      ZONE_DMA32 has plenty of free memory.  This patch special cases the
      allocator fast path such that it'll try an allocation from a lower local
      zone before fragmenting a higher zone.  In this case, stealing of
      pageblocks or orders larger than a pageblock are still allowed in the fast
      path as they are uninteresting from a fragmentation point of view.
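
      A simplified sketch of the fast-path special case (close to, but not
      quoted from, the upstream helper):

          /* When allocating from ZONE_NORMAL on a node that also has a
           * populated ZONE_DMA32, prefer the lower zone before fragmenting
           * the higher one. */
          static inline unsigned int
          alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
          {
          #ifdef CONFIG_ZONE_DMA32
                  if (zone_idx(zone) != ZONE_NORMAL)
                          return 0;

                  /* Assume ZONE_DMA32, if it exists, sits one slot below
                   * ZONE_NORMAL in this node's zone array. */
                  BUILD_BUG_ON(ZONE_NORMAL - ZONE_DMA32 != 1);
                  if (!populated_zone(--zone))
                          return 0;

                  return ALLOC_NOFRAGMENT;
          #else
                  return 0;
          #endif
          }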
      
      This was evaluated using a benchmark designed to fragment memory before
      attempting THP allocations.  It's implemented in mmtests as the following
      configurations
      
      configs/config-global-dhp__workload_thpfioscale
      configs/config-global-dhp__workload_thpfioscale-defrag
      configs/config-global-dhp__workload_thpfioscale-madvhugepage
      
      e.g. from mmtests
      ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1
      
      The broad details of the workload are as follows:
      
      1. Create an XFS filesystem (not specified in the configuration but done
         as part of the testing for this patch).
      2. Start 4 fio threads that write a number of 64K files inefficiently.
         Inefficiently means that files are created on first access and not
         created in advance (fio parameter create_on_open=1) and fallocate
         is not used (fallocate=none). With multiple IO issuers this creates
         a mix of slab and page cache allocations over time. The total size
         of the files is 150% physical memory so that the slabs and page cache
         pages get mixed.
      3. Warm up a number of fio read-only processes accessing the same files
         created in step 2. This part runs for the same length of time it
         took to create the files. It'll refault old data and further
         interleave slab and page cache allocations. As it's now low on
         memory due to step 2, fragmentation occurs as pageblocks get
         stolen.
      4. While step 3 is still running, start a process that tries to allocate
         75% of memory as huge pages with a number of threads. The number of
         threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
         threads contending with fio, any other threads or forcing cross-NUMA
         scheduling. Note that the test has not been used on a machine with less
         than 8 cores. The benchmark records whether huge pages were allocated
         and what the fault latency was in microseconds.
      5. Measure the number of events potentially causing external fragmentation,
         the fault latency and the huge page allocation success rate.
      6. Cleanup the test files.
      
      Note that, due to the use of IO and page cache, this benchmark is not
      suitable for running on large machines where the time to fragment
      memory may be excessive.  Also note that while this is one mix that
      generates fragmentation, it's not the only mix that does so.
      Differences in workloads that are more slab-intensive, or whether SLUB
      is used with high-order pages, may yield different results.
      
      When the page allocator fragments memory, it records the event using the
      mm_page_alloc_extfrag ftrace event.  If the fallback_order is smaller than
      a pageblock order (order-9 on 64-bit x86) then it's considered to be an
      "external fragmentation event" that may cause issues in the future.
      Hence, the primary metric here is the number of external fragmentation
      events that occur with order < 9.  The secondary metrics are
      allocation latency and huge page allocation success rates, but note
      that differences in latencies and in success rates can also affect
      the number of external fragmentation events, which is why they are
      secondary metrics.
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-1      662.92 (   0.00%)      653.58 *   1.41%*
      Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-1        0.00 (   0.00%)        0.00 (   0.00%)
      
      Fault latencies are slightly reduced while allocation success rates remain
      at zero as this configuration does not make any special effort to allocate
      THP and fio is heavily active at the time and either filling memory or
      keeping pages resident.  However, a 49% reduction of serious
      fragmentation events reduces the chances of external fragmentation
      being a problem in the future.
      
      Vlastimil asked during review for a breakdown of the allocation types
      that are falling back.
      
      vanilla
         3816 MIGRATE_UNMOVABLE
       800845 MIGRATE_MOVABLE
           33 MIGRATE_UNRECLAIMABLE
      
      patch
          735 MIGRATE_UNMOVABLE
       408135 MIGRATE_MOVABLE
           42 MIGRATE_UNRECLAIMABLE
      
      The majority of the fallbacks are due to movable allocations, and this
      is consistent for the workload throughout the series, so the breakdown
      will not be presented again.
      
      Movable fallbacks are sometimes considered "ok" to fall back because
      they can be migrated.  The problem is that they can fill an
      unmovable/reclaimable pageblock, causing those allocations to fall
      back later and polluting pageblocks with pages that cannot move.  If
      there is a movable fallback, it is pretty much guaranteed to affect an
      unmovable/reclaimable pageblock, and while it might not be enough to
      actually cause an unmovable/reclaimable fallback in the future, we
      cannot know that in advance, so the patch takes the only option
      available to it.  Hence, it's important to control them.  This point
      is also consistent throughout the series and will not be repeated.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-1     1495.14 (   0.00%)     1467.55 (   1.85%)
      Amean     fault-huge-1     1098.48 (   0.00%)     1127.11 (  -2.61%)
      
      thpfioscale Percentage Faults Huge
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-1       78.57 (   0.00%)       77.64 (  -1.18%)
      
      Fragmentation events were reduced quite a bit although this is known
      to be a little variable. The latencies and allocation success rates
      are similar but they were already quite high.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-5     1350.05 (   0.00%)     1346.45 (   0.27%)
      Amean     fault-huge-5     4181.01 (   0.00%)     3418.60 (  18.24%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-5        1.15 (   0.00%)        0.78 ( -31.88%)
      
      The reduction of external fragmentation events is slight and this is
      partially due to the removal of __GFP_THISNODE in commit ac5b2c18
      ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
      allocations can now spill over to remote nodes instead of fragmenting
      local memory.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-5     6138.97 (   0.00%)     6217.43 (  -1.28%)
      Amean     fault-huge-5     2294.28 (   0.00%)     3163.33 * -37.88%*
      
      thpfioscale Percentage Faults Huge
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-5       96.82 (   0.00%)       95.14 (  -1.74%)
      
      There was a slight reduction in external fragmentation events although the
      latencies were higher.  The allocation success rate is high enough that
      the system is struggling and there is quite a lot of parallel reclaim and
      compaction activity.  There is also a certain degree of luck on whether
      processes start on node 0 or not for this patch but the relevance is
      reduced later in the series.
      
      Overall, the patch reduces the number of external fragmentation causing
      events so the success of THP over long periods of time would be improved
      for this adverse workload.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6bb15450
    • mm/page_alloc.c: use a single function to free page · 742aa7fb
      Committed by Aaron Lu
      There are multiple places that free a page; they all do the same
      things, so a common function can be used to reduce code duplication.
      
      It also avoids a bug being fixed in one function but left in another.
      A sketch of such a helper follows.
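
      A sketch of the common helper, close to the merged version:

          /* Order-0 pages go through the per-cpu lists; everything else
           * through the regular free path. */
          static inline void free_the_page(struct page *page,
                                           unsigned int order)
          {
                  if (order == 0)         /* Via pcp? */
                          free_unref_page(page);
                  else
                          __free_pages_ok(page, order);
          }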
      
      Link: http://lkml.kernel.org/r/20181119134834.17765-3-aaron.lu@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pawel Staszewski <pstaszewski@itcare.pl>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      742aa7fb