1. 04 5月, 2017 1 次提交
    • J
      mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · c73322d0
      Johannes Weiner 提交于
      Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
      cleanups".
      
      Jia reported a scenario in which the kswapd of a node indefinitely spins
      at 100% CPU usage.  We have seen similar cases at Facebook.
      
      The kernel's current method of judging its ability to reclaim a node (or
      whether to back off and sleep) is based on the amount of scanned pages
      in proportion to the amount of reclaimable pages.  In Jia's and our
      scenarios, there are no reclaimable pages in the node, however, and the
      condition for backing off is never met.  Kswapd busyloops in an attempt
      to restore the watermarks while having nothing to work with.
      
      This series reworks the definition of an unreclaimable node based not on
      scanning but on whether kswapd is able to actually reclaim pages in
      MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criteria
      the page allocator uses for giving up on direct reclaim and invoking the
      OOM killer.  If it cannot free any pages, kswapd will go to sleep and
      leave further attempts to direct reclaim invocations, which will either
      make progress and re-enable kswapd, or invoke the OOM killer.
      
      Patch #1 fixes the immediate problem Jia reported, the remainder are
      smaller fixlets, cleanups, and overall phasing out of the old method.
      
      Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
      and directly related to #5, but in itself not relevant to the series.
      
      If the whole series is too ambitious for 4.11, I would consider the
      first three patches fixes, the rest cleanups.
      
      This patch (of 9):
      
      Jia He reports a problem with kswapd spinning at 100% CPU when
      requesting more hugepages than memory available in the system:
      
      $ echo 4000 >/proc/sys/vm/nr_hugepages
      
      top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
      Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
      KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
      
      At that time, there are no reclaimable pages left in the node, but as
      kswapd fails to restore the high watermarks it refuses to go to sleep.
      
      Kswapd needs to back away from nodes that fail to balance.  Up until
      commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") kswapd had such a mechanism.  It considered zones whose
      theoretically reclaimable pages it had reclaimed six times over as
      unreclaimable and backed away from them.  This guard was erroneously
      removed as the patch changed the definition of a balanced node.
      
      However, simply restoring this code wouldn't help in the case reported
      here: there *are* no reclaimable pages that could be scanned until the
      threshold is met.  Kswapd would stay awake anyway.
      
      Introduce a new and much simpler way of backing off.  If kswapd runs
      through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
      page, make it back off from the node.  This is the same number of shots
      direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
      that node until a direct reclaimer manages to reclaim some pages, thus
      proving the node reclaimable again.
      
      [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
        Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
      [shakeelb@google.com: fix condition for throttle_direct_reclaim]
        Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NShakeel Butt <shakeelb@google.com>
      Reported-by: NJia He <hejianet@gmail.com>
      Tested-by: NJia He <hejianet@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c73322d0
  2. 21 4月, 2017 1 次提交
  3. 08 4月, 2017 2 次提交
  4. 03 4月, 2017 1 次提交
    • M
      kernel-api.rst: fix a series of errors when parsing C files · 0e056eb5
      mchehab@s-opensource.com 提交于
      ./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
      ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/filemap.c:1283: ERROR: Unexpected indentation.
      ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
      ./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
      ./ipc/util.c:676: ERROR: Unexpected indentation.
      ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent.
      ./security/security.c:109: ERROR: Unexpected indentation.
      ./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent.
      ./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
      ./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./ipc/util.c:477: ERROR: Unknown target name: "s".
      Signed-off-by: NMauro Carvalho Chehab <mchehab@s-opensource.com>
      Acked-by: NBjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: NJonathan Corbet <corbet@lwn.net>
      0e056eb5
  5. 09 3月, 2017 1 次提交
    • T
      mm, page_alloc: Add missing check for memory holes · b4fb8f66
      Tony Luck 提交于
      Commit 13ad59df ("mm, page_alloc: avoid page_to_pfn() when merging
      buddies") moved the check for memory holes out of page_is_buddy() and
      had the callers do the check.
      
      But this wasn't done correctly in one place which caused ia64 to crash
      very early in boot.
      
      Update to fix that and make ia64 boot again.
      
      [ v2: Vlastimil pointed out we don't need to call page_to_pfn()
            since we already have the result of that in "buddy_pfn" ]
      
      Fixes: 13ad59df ("avoid page_to_pfn() when merging buddies")
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4fb8f66
  6. 02 3月, 2017 1 次提交
  7. 28 2月, 2017 1 次提交
  8. 25 2月, 2017 13 次提交
    • W
      mm/page_alloc.c: remove redundant init code for ZONE_MOVABLE · ad69444e
      Wei Yang 提交于
      arch_zone_lowest/highest_possible_pfn[] is set to 0 and [ZONE_MOVABLE]
      is skipped in the loop.  No need to reset them to 0 again.
      
      This patch just removes the redundant code.
      
      Link: http://lkml.kernel.org/r/20170209141731.60208-1-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad69444e
    • G
      mm/page_alloc: fix nodes for reclaim in fast path · e02dc017
      Gavin Shan 提交于
      When @node_reclaim_node isn't 0, the page allocator tries to reclaim
      pages if the amount of free memory in the zones are below the low
      watermark.  On Power platform, none of NUMA nodes are scanned for page
      reclaim because no nodes match the condition in zone_allows_reclaim().
      On Power platform, RECLAIM_DISTANCE is set to 10 which is the distance
      of Node-A to Node-A.  So the preferred node even won't be scanned for
      page reclaim.
      
         __alloc_pages_nodemask()
         get_page_from_freelist()
            zone_allows_reclaim()
      
      Anton proposed the test code as below:
      
         # cat alloc.c
            :
         int main(int argc, char *argv[])
         {
      	void *p;
      	unsigned long size;
      	unsigned long start, end;
      
      	start = time(NULL);
      	size = strtoul(argv[1], NULL, 0);
      	printf("To allocate %ldGB memory\n", size);
      
      	size <<= 30;
      	p = malloc(size);
      	assert(p);
      	memset(p, 0, size);
      
      	end = time(NULL);
      	printf("Used time: %ld seconds\n", end - start);
      	sleep(3600);
      	return 0;
         }
      
      The system I use for testing has two NUMA nodes.  Both have 128GB
      memory.  In below scnario, the page caches on node#0 should be reclaimed
      when it encounters pressure to accommodate request of allocation.
      
         # echo 2 > /proc/sys/vm/zone_reclaim_mode; \
           sync; \
           echo 3 > /proc/sys/vm/drop_caches; \
         # taskset -c 0 cat file.32G > /dev/null; \
           grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:       33619712 kB
         # taskset -c 0 ./alloc 128
         # grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:       33619840 kB
         # grep MemFree /sys/devices/system/node/node0/meminfo
           Node 0 MemFree:          186816 kB
      
      With the patch applied, the pagecache on node-0 is reclaimed when its
      free memory is running out.  It's the expected behaviour.
      
         # echo 2 > /proc/sys/vm/zone_reclaim_mode; \
           sync; \
           echo 3 > /proc/sys/vm/drop_caches
         # taskset -c 0 cat file.32G > /dev/null; \
           grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:       33605568 kB
         # taskset -c 0 ./alloc 128
         # grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:        1379520 kB
         # grep MemFree /sys/devices/system/node/node0/meminfo
           Node 0 MemFree:           317120 kB
      
      Fixes: 5f7a75ac ("mm: page_alloc: do not cache reclaim distances")
      Link: http://lkml.kernel.org/r/1486532455-29613-1-git-send-email-gwshan@linux.vnet.ibm.comSigned-off-by: NGavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: <stable@vger.kernel.org>	[3.16+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e02dc017
    • M
    • L
      mm: alloc_contig_range: allow to specify GFP mask · ca96b625
      Lucas Stach 提交于
      Currently alloc_contig_range assumes that the compaction should be done
      with the default GFP_KERNEL flags.  This is probably right for all
      current uses of this interface, but may change as CMA is used in more
      use-cases (including being the default DMA memory allocator on some
      platforms).
      
      Change the function prototype, to allow for passing through the GFP mask
      set by upper layers.
      
      Also respect global restrictions by applying memalloc_noio_flags to the
      passed in flags.
      
      Link: http://lkml.kernel.org/r/20170127172328.18574-1-l.stach@pengutronix.deSigned-off-by: NLucas Stach <l.stach@pengutronix.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alexander Graf <agraf@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ca96b625
    • Y
      mm/hotplug: enable memory hotplug for non-lru movable pages · 0efadf48
      Yisheng Xie 提交于
      We had considered all of the non-lru pages as unmovable before commit
      bda807d4 ("mm: migrate: support non-lru movable page migration").
      But now some of non-lru pages like zsmalloc, virtio-balloon pages also
      become movable.  So we can offline such blocks by using non-lru page
      migration.
      
      This patch straightforwardly adds non-lru migration code, which means
      adding non-lru related code to the functions which scan over pfn and
      collect pages to be migrated and isolate them before migration.
      Signed-off-by: NYisheng Xie <xieyisheng1@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0efadf48
    • M
      mm, page_alloc: use static global work_struct for draining per-cpu pages · bd233f53
      Mel Gorman 提交于
      As suggested by Vlastimil Babka and Tejun Heo, this patch uses a static
      work_struct to co-ordinate the draining of per-cpu pages on the
      workqueue.  Only one task can drain at a time but this is better than
      the previous scheme that allowed multiple tasks to send IPIs at a time.
      
      One consideration is whether parallel requests should synchronise
      against each other.  This patch does not synchronise for a global drain
      as the common case for such callers is expected to be multiple parallel
      direct reclaimers competing for pages when the watermark is close to
      min.  Draining the per-cpu list is unlikely to make much progress and
      serialising the drain is of dubious merit.  Drains are synchonrised for
      callers such as memory hotplug and CMA that care about the drain being
      complete when the function returns.
      
      Link: http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Suggested-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bd233f53
    • V
      mm, page_alloc: don't check cpuset allowed twice in fast-path · 51047820
      Vlastimil Babka 提交于
      Since commit 682a3385 ("mm, page_alloc: inline the fast path of the
      zonelist iterator") we replace a NULL nodemask with
      cpuset_current_mems_allowed in the fast path, so that
      get_page_from_freelist() filters nodes allowed by the cpuset via
      for_next_zone_zonelist_nodemask().
      
      In that case it's pointless to additionaly check __cpuset_zone_allowed()
      in each iteration, which we can avoid by not adding ALLOC_CPUSET to
      alloc_flags in that scenario.
      
      This saves some cycles in the allocator fast path on systems with one or
      more non-root cpuset configured.  In the slow path, ALLOC_CPUSET is
      reset according to __alloc_pages_slowpath().  Without configured
      cpusets, this code is disabled by a static key.
      
      Link: http://lkml.kernel.org/r/20170124150511.5710-2-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NAnshuman Khandual <khandual@linux.vnet.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51047820
    • V
      mm, page_alloc: remove redundant checks from alloc fastpath · df76cee6
      Vlastimil Babka 提交于
      The allocation fast path contains two similar checks for zoneref->zone
      being NULL, where zoneref points either to the first zone in the
      zonelist, or to the preferred zone.  These can be NULL either due to
      empty zonelist, or no zone being compatible with given nodemask or
      task's cpuset.
      
      These checks are unnecessary, because the zonelist walks in
      first_zones_zonelist() and get_page_from_freelist() handle a NULL
      starting zoneref->zone or preferred_zoneref->zone safely.  It's safe to
      fallback to __alloc_pages_slowpath() where we also have the check early
      enough.
      
      Link: http://lkml.kernel.org/r/20170124150511.5710-1-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df76cee6
    • M
      mm, page_alloc: only use per-cpu allocator for irq-safe requests · 374ad05a
      Mel Gorman 提交于
      Many workloads that allocate pages are not handling an interrupt at a
      time.  As allocation requests may be from IRQ context, it's necessary to
      disable/enable IRQs for every page allocation.  This cost is the bulk of
      the free path but also a significant percentage of the allocation path.
      
      This patch alters the locking and checks such that only irq-safe
      allocation requests use the per-cpu allocator.  All others acquire the
      irq-safe zone->lock and allocate from the buddy allocator.  It relies on
      disabling preemption to safely access the per-cpu structures.  It could
      be slightly modified to avoid soft IRQs using it but it's not clear it's
      worthwhile.
      
      This modification may slow allocations from IRQ context slightly but the
      main gain from the per-cpu allocator is that it scales better for
      allocations from multiple contexts.  There is an implicit assumption
      that intensive allocations from IRQ contexts on multiple CPUs from a
      single NUMA node are rare and that the fast majority of scaling issues
      are encountered in !IRQ contexts such as page faulting.  It's worth
      noting that this patch is not required for a bulk page allocator but it
      significantly reduces the overhead.
      
      The following is results from a page allocator micro-benchmark.  Only
      order-0 is interesting as higher orders do not use the per-cpu allocator
      
                                                4.10.0-rc2                 4.10.0-rc2
                                                   vanilla               irqsafe-v1r5
      Amean    alloc-odr0-1               287.15 (  0.00%)           219.00 ( 23.73%)
      Amean    alloc-odr0-2               221.23 (  0.00%)           183.23 ( 17.18%)
      Amean    alloc-odr0-4               187.00 (  0.00%)           151.38 ( 19.05%)
      Amean    alloc-odr0-8               167.54 (  0.00%)           132.77 ( 20.75%)
      Amean    alloc-odr0-16              156.00 (  0.00%)           123.00 ( 21.15%)
      Amean    alloc-odr0-32              149.00 (  0.00%)           118.31 ( 20.60%)
      Amean    alloc-odr0-64              138.77 (  0.00%)           116.00 ( 16.41%)
      Amean    alloc-odr0-128             145.00 (  0.00%)           118.00 ( 18.62%)
      Amean    alloc-odr0-256             136.15 (  0.00%)           125.00 (  8.19%)
      Amean    alloc-odr0-512             147.92 (  0.00%)           121.77 ( 17.68%)
      Amean    alloc-odr0-1024            147.23 (  0.00%)           126.15 ( 14.32%)
      Amean    alloc-odr0-2048            155.15 (  0.00%)           129.92 ( 16.26%)
      Amean    alloc-odr0-4096            164.00 (  0.00%)           136.77 ( 16.60%)
      Amean    alloc-odr0-8192            166.92 (  0.00%)           138.08 ( 17.28%)
      Amean    alloc-odr0-16384           159.00 (  0.00%)           138.00 ( 13.21%)
      Amean    free-odr0-1                165.00 (  0.00%)            89.00 ( 46.06%)
      Amean    free-odr0-2                113.00 (  0.00%)            63.00 ( 44.25%)
      Amean    free-odr0-4                 99.00 (  0.00%)            54.00 ( 45.45%)
      Amean    free-odr0-8                 88.00 (  0.00%)            47.38 ( 46.15%)
      Amean    free-odr0-16                83.00 (  0.00%)            46.00 ( 44.58%)
      Amean    free-odr0-32                80.00 (  0.00%)            44.38 ( 44.52%)
      Amean    free-odr0-64                72.62 (  0.00%)            43.00 ( 40.78%)
      Amean    free-odr0-128               78.00 (  0.00%)            42.00 ( 46.15%)
      Amean    free-odr0-256               80.46 (  0.00%)            57.00 ( 29.16%)
      Amean    free-odr0-512               96.38 (  0.00%)            64.69 ( 32.88%)
      Amean    free-odr0-1024             107.31 (  0.00%)            72.54 ( 32.40%)
      Amean    free-odr0-2048             108.92 (  0.00%)            78.08 ( 28.32%)
      Amean    free-odr0-4096             113.38 (  0.00%)            82.23 ( 27.48%)
      Amean    free-odr0-8192             112.08 (  0.00%)            82.85 ( 26.08%)
      Amean    free-odr0-16384            110.38 (  0.00%)            81.92 ( 25.78%)
      Amean    total-odr0-1               452.15 (  0.00%)           308.00 ( 31.88%)
      Amean    total-odr0-2               334.23 (  0.00%)           246.23 ( 26.33%)
      Amean    total-odr0-4               286.00 (  0.00%)           205.38 ( 28.19%)
      Amean    total-odr0-8               255.54 (  0.00%)           180.15 ( 29.50%)
      Amean    total-odr0-16              239.00 (  0.00%)           169.00 ( 29.29%)
      Amean    total-odr0-32              229.00 (  0.00%)           162.69 ( 28.96%)
      Amean    total-odr0-64              211.38 (  0.00%)           159.00 ( 24.78%)
      Amean    total-odr0-128             223.00 (  0.00%)           160.00 ( 28.25%)
      Amean    total-odr0-256             216.62 (  0.00%)           182.00 ( 15.98%)
      Amean    total-odr0-512             244.31 (  0.00%)           186.46 ( 23.68%)
      Amean    total-odr0-1024            254.54 (  0.00%)           198.69 ( 21.94%)
      Amean    total-odr0-2048            264.08 (  0.00%)           208.00 ( 21.24%)
      Amean    total-odr0-4096            277.38 (  0.00%)           219.00 ( 21.05%)
      Amean    total-odr0-8192            279.00 (  0.00%)           220.92 ( 20.82%)
      Amean    total-odr0-16384           269.38 (  0.00%)           219.92 ( 18.36%)
      
      This is the alloc, free and total overhead of allocating order-0 pages
      in batches of 1 page up to 16384 pages.  Avoiding disabling/enabling
      overhead massively reduces overhead.  Alloc overhead is roughly reduced
      by 14-20% in most cases.  The free path is reduced by 26-46% and the
      total reduction is significant.
      
      Many users require zeroing of pages from the page allocator which is the
      vast cost of allocation.  Hence, the impact on a basic page faulting
      benchmark is not that significant
      
                                    4.10.0-rc2            4.10.0-rc2
                                       vanilla          irqsafe-v1r5
      Hmean    page_test   656632.98 (  0.00%)   675536.13 (  2.88%)
      Hmean    brk_test   3845502.67 (  0.00%)  3867186.94 (  0.56%)
      Stddev   page_test    10543.29 (  0.00%)     4104.07 ( 61.07%)
      Stddev   brk_test     33472.36 (  0.00%)    15538.39 ( 53.58%)
      CoeffVar page_test        1.61 (  0.00%)        0.61 ( 62.15%)
      CoeffVar brk_test         0.87 (  0.00%)        0.40 ( 53.84%)
      Max      page_test   666513.33 (  0.00%)   678640.00 (  1.82%)
      Max      brk_test   3882800.00 (  0.00%)  3887008.66 (  0.11%)
      
      This is from aim9 and the most notable outcome is that fault variability
      is reduced by the patch.  The headline improvement is small as the
      overall fault cost, zeroing, page table insertion etc dominate relative
      to disabling/enabling IRQs in the per-cpu allocator.
      
      Similarly, little benefit was seen on networking benchmarks both
      localhost and between physical server/clients where other costs
      dominate.  It's possible that this will only be noticable on very high
      speed networks.
      
      Jesper Dangaard Brouer independently tested this with a separate
      microbenchmark from
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
      Micro-benchmarked with [1] page_bench02:
       modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
        rmmod page_bench02 ; dmesg --notime | tail -n 4
      
      Compared to baseline: 213 cycles(tsc) 53.417 ns
       - against this     : 184 cycles(tsc) 46.056 ns
       - Saving           : -29 cycles
       - Very close to expected 27 cycles saving [see below [2]]
      
      Micro benchmarking via time_bench_sample[3], we get the cost of these
      operations:
      
       time_bench: Type:for_loop                 Per elem: 0 cycles(tsc) 0.232 ns (step:0)
       time_bench: Type:spin_lock_unlock         Per elem: 33 cycles(tsc) 8.334 ns (step:0)
       time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
       time_bench: Type:irqsave_before_lock      Per elem: 57 cycles(tsc) 14.344 ns (step:0)
       time_bench: Type:spin_lock_unlock_irq     Per elem: 34 cycles(tsc) 8.560 ns (step:0)
       time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
       time_bench: Type:local_BH_disable_enable  Per elem: 19 cycles(tsc) 4.920 ns (step:0)
       time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
       time_bench: Type:local_irq_save_restore   Per elem: 38 cycles(tsc) 9.665 ns (step:0)
       [Mel's patch removes a ^^^^^^^^^^^^^^^^]            ^^^^^^^^^ expected saving - preempt cost
       time_bench: Type:preempt_disable_enable   Per elem: 11 cycles(tsc) 2.794 ns (step:0)
       [adds a preempt  ^^^^^^^^^^^^^^^^^^^^^^]            ^^^^^^^^^ adds this cost
       time_bench: Type:funcion_call_cost        Per elem: 6 cycles(tsc) 1.689 ns (step:0)
       time_bench: Type:func_ptr_call_cost       Per elem: 11 cycles(tsc) 2.767 ns (step:0)
       time_bench: Type:page_alloc_put           Per elem: 211 cycles(tsc) 52.803 ns (step:0)
      
      Thus, expected improvement is: 38-11 = 27 cycles.
      
      [mgorman@techsingularity.net: s/preempt_enable_no_resched/preempt_enable/]
        Link: http://lkml.kernel.org/r/20170208143128.25ahymqlyspjcixu@techsingularity.net
      Link: http://lkml.kernel.org/r/20170123153906.3122-5-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      374ad05a
    • M
      mm, page_alloc: do not depend on cpu hotplug locks inside the allocator · a459eeb7
      Michal Hocko 提交于
      Dmitry has reported the following lockdep splat
        lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
        __mutex_lock_common kernel/locking/mutex.c:521 [inline]
        mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621
        pcpu_alloc+0xbda/0x1280 mm/percpu.c:896
        __alloc_percpu+0x24/0x30 mm/percpu.c:1075
        smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:44
        cpuhp_invoke_callback+0x254/0x1480 kernel/cpu.c:136
        cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:493
        _cpu_up+0x1e3/0x2a0 kernel/cpu.c:1057
        do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
        cpu_up+0x18/0x20 kernel/cpu.c:1095
        smp_init+0xe9/0xee kernel/smp.c:564
        kernel_init_freeable+0x439/0x690 init/main.c:1010
        kernel_init+0x13/0x180 init/main.c:941
        ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433
      
      cpu_hotplug_begin
        cpu_hotplug.lock
      pcpu_alloc
        pcpu_alloc_mutex
      
        get_online_cpus+0x62/0x90 kernel/cpu.c:248
        drain_all_pages+0xf8/0x710 mm/page_alloc.c:2385
        __alloc_pages_direct_reclaim mm/page_alloc.c:3440 [inline]
        __alloc_pages_slowpath+0x8fd/0x2370 mm/page_alloc.c:3778
        __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3980
        __alloc_pages include/linux/gfp.h:426 [inline]
        __alloc_pages_node include/linux/gfp.h:439 [inline]
        alloc_pages_node include/linux/gfp.h:453 [inline]
        pcpu_alloc_pages mm/percpu-vm.c:93 [inline]
        pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
        pcpu_alloc+0xe01/0x1280 mm/percpu.c:998
        __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1062
        bpf_array_alloc_percpu kernel/bpf/arraymap.c:34 [inline]
        array_map_alloc+0x532/0x710 kernel/bpf/arraymap.c:99
        find_and_alloc_map kernel/bpf/syscall.c:34 [inline]
        map_create kernel/bpf/syscall.c:188 [inline]
        SYSC_bpf kernel/bpf/syscall.c:870 [inline]
        SyS_bpf+0xd64/0x2500 kernel/bpf/syscall.c:827
        entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      pcpu_alloc
        pcpu_alloc_mutex
      drain_all_pages
        get_online_cpus
          cpu_hotplug.lock
      
        cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:304
        _cpu_up+0xca/0x2a0 kernel/cpu.c:1011
        do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
        cpu_up+0x18/0x20 kernel/cpu.c:1095
        smp_init+0xe9/0xee kernel/smp.c:564
        kernel_init_freeable+0x439/0x690 init/main.c:1010
        kernel_init+0x13/0x180 init/main.c:941
        ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433
      
      cpu_hotplug_begin
        cpu_hotplug.lock
      
      Pulling cpu hotplug locks inside the page allocator is just too
      dangerous.  Let's remove the dependency by dropping get_online_cpus()
      from drain_all_pages.  This is not so simple though because now we do
      not have a protection against cpu hotplug which means 2 things:
      
        - the work item might be executed on a different cpu in worker from
          unbound pool so it doesn't run on pinned on the cpu
      
        - we have to make sure that we do not race with page_alloc_cpu_dead
          calling drain_pages_zone
      
      Disabling preemption in drain_local_pages_wq will solve the first
      problem drain_local_pages will determine its local CPU from the WQ
      context which will be stable after that point, page_alloc_cpu_dead is
      pinned to the CPU already.  The later condition is achieved by disabling
      IRQs in drain_pages_zone.
      
      Fixes: mm, page_alloc: drain per-cpu pages from workqueue context
      Link: http://lkml.kernel.org/r/20170207201950.20482-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a459eeb7
    • M
      mm, page_alloc: drain per-cpu pages from workqueue context · 0ccce3b9
      Mel Gorman 提交于
      The per-cpu page allocator can be drained immediately via
      drain_all_pages() which sends IPIs to every CPU.  In the next patch, the
      per-cpu allocator will only be used for interrupt-safe allocations which
      prevents draining it from IPI context.  This patch uses workqueues to
      drain the per-cpu lists instead.
      
      This is slower but no slowdown during intensive reclaim was measured and
      the paths that use drain_all_pages() are not that sensitive to
      performance.  This is particularly true as the path would only be
      triggered when reclaim is failing.  It also makes a some sense to avoid
      storming a machine with IPIs when it's under memory pressure.  Arguably,
      it should be further adjusted so that only one caller at a time is
      draining pages but it's beyond the scope of the current patch.
      
      Link: http://lkml.kernel.org/r/20170123153906.3122-4-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ccce3b9
    • M
      mm, page_alloc: split alloc_pages_nodemask() · 9cd75558
      Mel Gorman 提交于
      alloc_pages_nodemask does a number of preperation steps that determine
      what zones can be used for the allocation depending on a variety of
      factors.  This is fine but a hypothetical caller that wanted multiple
      order-0 pages has to do the preparation steps multiple times.  This
      patch structures __alloc_pages_nodemask such that it's relatively easy
      to build a bulk order-0 page allocator.  There is no functional change.
      
      Link: http://lkml.kernel.org/r/20170123153906.3122-3-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9cd75558
    • M
      mm, page_alloc: split buffered_rmqueue() · 066b2393
      Mel Gorman 提交于
      Patch series "Use per-cpu allocator for !irq requests and prepare for a
      bulk allocator", v5.
      
      This series is motivated by a conversation led by Jesper Dangaard Brouer
      at the last LSF/MM proposing a generic page pool for DMA-coherent pages.
      Part of his motivation was due to the overhead of allocating multiple
      order-0 that led some drivers to use high-order allocations and
      splitting them.  This is very slow in some cases.
      
      The first two patches in this series restructure the page allocator such
      that it is relatively easy to introduce an order-0 bulk page allocator.
      A patch exists to do that and has been handed over to Jesper until an
      in-kernel users is created.  The third patch prevents the per-cpu
      allocator being drained from IPI context as that can potentially corrupt
      the list after patch four is merged.  The final patch alters the per-cpu
      alloctor to make it exclusive to !irq requests.  This cuts
      allocation/free overhead by roughly 30%.
      
      Performance tests from both Jesper and me are included in the patch.
      
      This patch (of 4):
      
      buffered_rmqueue removes a page from a given zone and uses the per-cpu
      list for order-0.  This is fine but a hypothetical caller that wanted
      multiple order-0 pages has to disable/reenable interrupts multiple
      times.  This patch structures buffere_rmqueue such that it's relatively
      easy to build a bulk order-0 page allocator.  There is no functional
      change.
      
      [mgorman@techsingularity.net: failed per-cpu refill may blow up]
        Link: http://lkml.kernel.org/r/20170124112723.mshmgwq2ihxku2um@techsingularity.net
      Link: http://lkml.kernel.org/r/20170123153906.3122-2-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      066b2393
  9. 23 2月, 2017 13 次提交
    • D
      mm, page_alloc: warn_alloc nodemask is NULL when cpusets are disabled · 685dbf6f
      David Rientjes 提交于
      The patch "mm, page_alloc: warn_alloc print nodemask" implicitly sets
      the allocation nodemask to cpuset_current_mems_allowed when there is no
      effective mempolicy.  cpuset_current_mems_allowed is only effective when
      cpusets are enabled, which is also printed by warn_alloc(), so setting
      the nodemask to cpuset_current_mems_allowed is redundant and prevents
      debugging issues where ac->nodemask is not set properly in the page
      allocator.
      
      This provides better debugging output since
      cpuset_print_current_mems_allowed() is already provided.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701181347320.142399@chino.kir.corp.google.comSigned-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      685dbf6f
    • M
      mm: help __GFP_NOFAIL allocations which do not trigger OOM killer · 6c18ba7a
      Michal Hocko 提交于
      Now that __GFP_NOFAIL doesn't override decisions to skip the oom killer
      we are left with requests which require to loop inside the allocator
      without invoking the oom killer (e.g.  GFP_NOFS|__GFP_NOFAIL used by fs
      code) and so they might, in very unlikely situations, loop for ever -
      e.g.  other parallel request could starve them.
      
      This patch tries to limit the likelihood of such a lockup by giving
      these __GFP_NOFAIL requests a chance to move on by consuming a small
      part of memory reserves.  We are using ALLOC_HARDER which should be
      enough to prevent from the starvation by regular allocation requests,
      yet it shouldn't consume enough from the reserves to disrupt high
      priority requests (ALLOC_HIGH).
      
      While we are at it, let's introduce a helper __alloc_pages_cpuset_fallback
      which enforces the cpusets but allows to fallback to ignore them if the
      first attempt fails.  __GFP_NOFAIL requests can be considered important
      enough to allow cpuset runaway in order for the system to move on.  It
      is highly unlikely that any of these will be GFP_USER anyway.
      
      Link: http://lkml.kernel.org/r/20161220134904.21023-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6c18ba7a
    • M
      mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically · 06ad276a
      Michal Hocko 提交于
      __alloc_pages_may_oom makes sure to skip the OOM killer depending on the
      allocation request.  This includes lowmem requests, costly high order
      requests and others.  For a long time __GFP_NOFAIL acted as an override
      for all those rules.  This is not documented and it can be quite
      surprising as well.  E.g.  GFP_NOFS requests are not invoking the OOM
      killer but GFP_NOFS|__GFP_NOFAIL does so if we try to convert some of
      the existing open coded loops around allocator to nofail request (and we
      have done that in the past) then such a change would have a non trivial
      side effect which is far from obvious.  Note that the primary motivation
      for skipping the OOM killer is to prevent from pre-mature invocation.
      
      The exception has been added by commit 82553a93 ("oom: invoke oom
      killer for __GFP_NOFAIL").  The changelog points out that the oom killer
      has to be invoked otherwise the request would be looping for ever.  But
      this argument is rather weak because the OOM killer doesn't really
      guarantee a forward progress for those exceptional cases:
      
      - it will hardly help to form costly order which in turn can result in
        the system panic because of no oom killable task in the end - I believe
        we certainly do not want to put the system down just because there is a
        nasty driver asking for order-9 page with GFP_NOFAIL not realizing all
        the consequences.  It is much better this request would loop for ever
        than the massive system disruption
      
      - lowmem is also highly unlikely to be freed during OOM killer
      
      - GFP_NOFS request could trigger while there is still a lot of memory
        pinned by filesystems.
      
      This patch simply removes the __GFP_NOFAIL special case in order to have a
      more clear semantic without surprising side effects.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NNils Holland <nholland@tisys.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      06ad276a
    • M
      mm: consolidate GFP_NOFAIL checks in the allocator slowpath · 9a67f648
      Michal Hocko 提交于
      Tetsuo Handa has pointed out that commit 0a0337e0 ("mm, oom: rework
      oom detection") has subtly changed semantic for costly high order
      requests with __GFP_NOFAIL and withtout __GFP_REPEAT and those can fail
      right now.  My code inspection didn't reveal any such users in the tree
      but it is true that this might lead to unexpected allocation failures
      and subsequent OOPs.
      
      __alloc_pages_slowpath wrt.  GFP_NOFAIL is hard to follow currently.
      There are few special cases but we are lacking a catch all place to be
      sure we will not miss any case where the non failing allocation might
      fail.  This patch reorganizes the code a bit and puts all those special
      cases under nopage label which is the generic go-to-fail path.  Non
      failing allocations are retried or those that cannot retry like
      non-sleeping allocation go to the failure point directly.  This should
      make the code flow much easier to follow and make it less error prone
      for future changes.
      
      While we are there we have to move the stall check up to catch
      potentially looping non-failing allocations.
      
      [akpm@linux-foundation.org: fix alloc_flags may-be-used-uninitalized]
      Link: http://lkml.kernel.org/r/20161220134904.21023-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a67f648
    • M
      lib/show_mem.c: teach show_mem to work with the given nodemask · 9af744d7
      Michal Hocko 提交于
      show_mem() allows to filter out node specific data which is irrelevant
      to the allocation request via SHOW_MEM_FILTER_NODES.  The filtering is
      done in skip_free_areas_node which skips all nodes which are not in the
      mems_allowed of the current process.  This works most of the time as
      expected because the nodemask shouldn't be outside of the allocating
      task but there are some exceptions.  E.g.  memory hotplug might want to
      request allocations from outside of the allowed nodes (see
      new_node_page).
      
      Get rid of this hardcoded behavior and push the allocation mask down the
      show_mem path and use it instead of cpuset_current_mems_allowed.  NULL
      nodemask is interpreted as cpuset_current_mems_allowed.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9af744d7
    • M
      mm, page_alloc: warn_alloc print nodemask · a8e99259
      Michal Hocko 提交于
      warn_alloc is currently used for to report an allocation failure or an
      allocation stall.  We print some details of the allocation request like
      the gfp mask and the request order.  We do not print the allocation
      nodemask which is important when debugging the reason for the allocation
      failure as well.  We alreaddy print the nodemask in the OOM report.
      
      Add nodemask to warn_alloc and print it in warn_alloc as well.
      
      Link: http://lkml.kernel.org/r/20170117091543.25850-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a8e99259
    • M
      mm, page_alloc: do not report all nodes in show_mem · c02e50bb
      Michal Hocko 提交于
      Patch series "show_mem updates", v2.
      
      This is a mixture of one bug fix (patch 1), an enhancement (patch 2) and
      cleanups (the rest of the series).  First two patches should be really
      straightforward.  Patch 3 removes some arch specific show_mem
      implementations because I think they are quite outdated and do not
      really serve any useful purpose anymore.  I think we should really
      strive to have a consistent show_mem output regardless of the
      architecture.  If some architecture is really special and wants to dump
      something additional we should do that via an arch specific hook.
      
      The last patch adds nodemask parameter so that we do not rely on the
      hardcoded mems_allowed of the current task when doing the node
      filtering.  I consider this more a cleanup than a fix because basically
      all users use a nodemask which is a subset of mems_allowed.  There is
      only one call path in the memory hotplug which doesn't comply with this
      but that is hardly something to worry about.
      
      This patch (of 4):
      
      Commit 599d0c95 ("mm, vmscan: move LRU lists to node") has added per
      numa node statistics to show_mem but it forgot to add
      skip_free_areas_node to filter out nodes which are outside of the
      allocating task numa policy.  Add this check to not pollute the output
      with the pointless information.
      
      Link: http://lkml.kernel.org/r/20170117091543.25850-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c02e50bb
    • P
      mm: page_alloc: skip over regions of invalid pfns where possible · b92df1de
      Paul Burton 提交于
      When using a sparse memory model memmap_init_zone() when invoked with
      the MEMMAP_EARLY context will skip over pages which aren't valid - ie.
      which aren't in a populated region of the sparse memory map.  However if
      the memory map is extremely sparse then it can spend a long time
      linearly checking each PFN in a large non-populated region of the memory
      map & skipping it in turn.
      
      When CONFIG_HAVE_MEMBLOCK_NODE_MAP is enabled, we have sufficient
      information to quickly discover the next valid PFN given an invalid one
      by searching through the list of memory regions & skipping forwards to
      the first PFN covered by the memory region to the right of the
      non-populated region.  Implement this in order to speed up
      memmap_init_zone() for systems with extremely sparse memory maps.
      
      James said "I have tested this patch on a virtual model of a Samurai CPU
      with a sparse memory map.  The kernel boot time drops from 109 to
      62 seconds. "
      
      Link: http://lkml.kernel.org/r/20161125185518.29885-1-paul.burton@imgtec.comSigned-off-by: NPaul Burton <paul.burton@imgtec.com>
      Tested-by: NJames Hartley <james.hartley@imgtec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b92df1de
    • M
      oom, trace: add compaction retry tracepoint · 65190cff
      Michal Hocko 提交于
      Higher order requests oom debugging is currently quite hard.  We do have
      some compaction points which can tell us how the compaction is operating
      but there is no trace point to tell us about compaction retry logic.
      This patch adds a one which will have the following format
      
                  bash-3126  [001] ....  1498.220001: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=withdrawn retries=0 max_retries=16 should_retry=0
      
      we can see that the order 9 request is not retried even though we are in
      the highest compaction priority mode becase the last compaction attempt
      was withdrawn.  This means that compaction_zonelist_suitable must have
      returned false and there is no suitable zone to compact for this request
      and so no need to retry further.
      
      another example would be
                 <...>-3137  [001] ....    81.501689: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=failed retries=0 max_retries=16 should_retry=0
      
      in this case the order-9 compaction failed to find any suitable block.
      We do not retry anymore because this is a costly request and those do
      not go below COMPACT_PRIO_SYNC_LIGHT priority.
      
      Link: http://lkml.kernel.org/r/20161220130135.15719-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65190cff
    • M
      oom, trace: add oom detection tracepoints · d379f01d
      Michal Hocko 提交于
      should_reclaim_retry is the central decision point for declaring the
      OOM.  It might be really useful to expose data used for this decision
      making when debugging an unexpected oom situations.
      
      Say we have an OOM report:
      [   52.264001] mem_eater invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
      [   52.267549] CPU: 3 PID: 3148 Comm: mem_eater Tainted: G        W       4.8.0-oomtrace3-00006-gb21338b386d2 #1024
      
      Now we can check the tracepoint data to see how we have ended up in this
      situation:
             mem_eater-3148  [003] ....    52.432801: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11134 min_wmark=11084 no_progress_loops=1 wmark_check=1
             mem_eater-3148  [003] ....    52.433269: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11103 min_wmark=11084 no_progress_loops=1 wmark_check=1
             mem_eater-3148  [003] ....    52.433712: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11100 min_wmark=11084 no_progress_loops=2 wmark_check=1
             mem_eater-3148  [003] ....    52.434067: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11097 min_wmark=11084 no_progress_loops=3 wmark_check=1
             mem_eater-3148  [003] ....    52.434414: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11094 min_wmark=11084 no_progress_loops=4 wmark_check=1
             mem_eater-3148  [003] ....    52.434761: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11091 min_wmark=11084 no_progress_loops=5 wmark_check=1
             mem_eater-3148  [003] ....    52.435108: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11087 min_wmark=11084 no_progress_loops=6 wmark_check=1
             mem_eater-3148  [003] ....    52.435478: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11084 min_wmark=11084 no_progress_loops=7 wmark_check=0
             mem_eater-3148  [003] ....    52.435478: reclaim_retry_zone: node=0 zone=DMA order=0 reclaimable=0 available=1126 min_wmark=179 no_progress_loops=7 wmark_check=0
      
      The above shows that we can quickly deduce that the reclaim stopped
      making any progress (see no_progress_loops increased in each round) and
      while there were still some 51 reclaimable pages they couldn't be
      dropped for some reason (vmscan trace points would tell us more about
      that part).  available will represent reclaimable + free_pages scaled
      down per no_progress_loops factor.  This is essentially an optimistic
      estimate of how much memory we would have when reclaiming everything.
      This can be compared to min_wmark to get a rought idea but the
      wmark_check tells the result of the watermark check which is more
      precise (includes lowmem reserves, considers the order etc.).  As we can
      see no zone is eligible in the end and that is why we have triggered the
      oom in this situation.
      
      Please note that higher order requests might fail on the wmark_check
      even when there is much more memory available than min_wmark - e.g.
      when the memory is fragmented.  A follow up tracepoint will help to
      debug those situations.
      
      Link: http://lkml.kernel.org/r/20161220130135.15719-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d379f01d
    • V
      mm, page_alloc: avoid page_to_pfn() when merging buddies · 13ad59df
      Vlastimil Babka 提交于
      On architectures that allow memory holes, page_is_buddy() has to perform
      page_to_pfn() to check for the memory hole.  After the previous patch,
      we have the pfn already available in __free_one_page(), which is the
      only caller of page_is_buddy(), so move the check there and avoid
      page_to_pfn().
      
      Link: http://lkml.kernel.org/r/20161216120009.20064-2-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13ad59df
    • V
      mm, page_alloc: don't convert pfn to idx when merging · 76741e77
      Vlastimil Babka 提交于
      In __free_one_page() we do the buddy merging arithmetics on "page/buddy
      index", which is just the lower MAX_ORDER bits of pfn.  The operations
      we do that affect the higher bits are bitwise AND and subtraction (in
      that order), where the final result will be the same with the higher
      bits left unmasked, as long as these bits are equal for both buddies -
      which must be true by the definition of a buddy.
      
      We can therefore use pfn's directly instead of "index" and skip the
      zeroing of >MAX_ORDER bits.  This can help a bit by itself, although
      compiler might be smart enough already.  It also helps the next patch to
      avoid page_to_pfn() for memory hole checks.
      
      Link: http://lkml.kernel.org/r/20161216120009.20064-1-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      76741e77
    • M
      mm: throttle show_mem() from warn_alloc() · aa187507
      Michal Hocko 提交于
      Tetsuo has been stressing OOM killer path with many parallel allocation
      requests when he has noticed that it is not all that hard to swamp
      kernel logs with warn_alloc messages caused by allocation stalls.  Even
      though the allocation stall message is triggered only once in 10s there
      might be many different tasks hitting it roughly around the same time.
      
      A big part of the output is show_mem() which can generate a lot of
      output even on a small machines.  There is no reason to show the state
      of memory counter for each allocation stall, especially when multiple of
      them are reported in a short time period.  Chances are that not much has
      changed since the last report.  This patch simply rate limits show_mem
      called from warn_alloc to only dump something once per second.  This
      should be enough to give us a clue why an allocation might be stalling
      while burst of warnings will not swamp log with too much data.
      
      While we are at it, extract all the show_mem related handling (filters)
      into a separate function warn_alloc_show_mem.  This will make the code
      cleaner and as a bonus point we can distinguish which part of warn_alloc
      got throttled due to rate limiting as ___ratelimit dumps the caller.
      
      [akpm@linux-foundation.org: reduce scope of the ratelimit_states]
      Link: http://lkml.kernel.org/r/20161215101510.9030-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aa187507
  10. 25 1月, 2017 5 次提交
  11. 11 1月, 2017 1 次提交