1. 07 Jul 2017, 2 commits
    • mm: consider zone which is not fully populated to have holes · 2d070eab
      Authored by Michal Hocko
      __pageblock_pfn_to_page has two users currently, set_zone_contiguous
      which checks whether the given zone contains holes and
      pageblock_pfn_to_page which then carefully returns a first valid page
      from the given pfn range for the given zone.  This doesn't handle zones
      which are not fully populated though.  Memory pageblocks can be offlined
      or might not have been onlined yet.  In such a case the zone should be
      considered to have holes; otherwise pfn walkers can touch and play with
      offline pages.
      
      Current callers of pageblock_pfn_to_page in compaction seem to work
      properly right now because they only isolate PageBuddy
      (isolate_freepages_block) or PageLRU resp.  __PageMovable
      (isolate_migratepages_block) which will be always false for these pages.
      It would be safer to skip these pages altogether, though.
      
      In order to do this patch adds a new memory section state
      (SECTION_IS_ONLINE) which is set in memory_present (during boot time) or
      in online_pages_range during the memory hotplug.  Similarly
      offline_mem_sections clears the bit and it is called when the memory
      range is offlined.
      
      A pfn_to_online_page helper is then added; it checks the mem section
      and only returns a page if it is already online.
      
      Use the new helper in __pageblock_pfn_to_page and skip the whole page
      block in such a case.
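
      In sketch form (the names follow the commit text; the mainline version
      is an open-coded macro, so this inline is illustrative), the helper
      boils down to:

        /*
         * Only hand out a struct page if the pfn falls into a memory
         * section that is valid and marked SECTION_IS_ONLINE.
         */
        static inline struct page *pfn_to_online_page(unsigned long pfn)
        {
                unsigned long nr = pfn_to_section_nr(pfn);

                if (nr < NR_MEM_SECTIONS && online_section_nr(nr))
                        return pfn_to_page(pfn);

                return NULL;
        }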
      
      [mhocko@suse.com: check valid section number in pfn_to_online_page (Vlastimil),
       mark sections online after all struct pages are initialized in
       online_pages_range (Vlastimil)]
        Link: http://lkml.kernel.org/r/20170518164210.GD18333@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170515085827.16474-8-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d070eab
    • mm: remove return value from init_currently_empty_zone · dc0bbf3b
      Authored by Michal Hocko
      Patch series "mm: make movable onlining suck less", v4.
      
      Movable onlining is a real hack with many downsides - mainly
      reintroduction of lowmem/highmem issues we used to have on 32b systems -
      but it is the only way to make memory hotremove more reliable, which
      is something that people are asking for.
      
      The current semantics of movable memory onlining are really cumbersome,
      however.  The main reason for this is that the udev-driven approach is
      basically unusable, because udev races with the memory probing while
      only the last memory block, or the one adjacent to the existing
      zone_movable, is allowed to be onlined movable.  In short, the criterion
      for a successful online_movable changes under udev's feet.  A reliable
      udev approach would require a two-phase scheme where the first
      successful movable online would have to check all the previous blocks
      and online them in descending order.  This can hardly be considered
      sane.
      
      This patchset aims at making the onlining semantics more usable.  First
      of all it allows onlining memory as movable as long as it doesn't clash
      with the existing ZONE_NORMAL.  That means that ZONE_NORMAL and
      ZONE_MOVABLE cannot overlap.  Currently I preserve the original ordering
      semantics so the kernel zone always precedes the movable zone, but I
      plan to remove this restriction in the future because it is not really
      necessary.
      
      The first 3 patches are cleanups which should be ready to be merged
      right away (unless I have missed something subtle, of course).
      
      Patch 4 deals with ZONE_DEVICE dependencies down the __add_pages path.
      
      Patch 5 deals with implicit assumptions of register_one_node on pgdat
      initialization.
      
      Patches 6-10 deal with offline holes in the zone for pfn walkers.  I
      hope I got all of them right but people familiar with compaction should
      double check this.
      
      Patch 11 is the core of the change.  In order to make it easier to
      review I have tried to keep it as minimal as possible and the large
      code removal is moved to patch 14.
      
      Patch 12 is a trivial follow up cleanup.  Patch 13 fixes sparse warnings
      and finally patch 14 removes the unused code.
      
      I have tested the patches in kvm:
        # qemu-system-x86_64 -enable-kvm -monitor pty -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G ...
      
      and then probed the additional memory by
        (qemu) object_add memory-backend-ram,id=mem1,size=1G
        (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
      
      Then I have used this simple script to probe the memory block by hand
        # cat probe_memblock.sh
        #!/bin/sh
      
        BLOCK_NR=$1
      
        # echo $((0x100000000+$BLOCK_NR*(128<<20))) > /sys/devices/system/memory/probe
      
        # for i in $(seq 10); do sh probe_memblock.sh $i; done
        # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
        /sys/devices/system/memory/memory35/valid_zones:Normal Movable
        /sys/devices/system/memory/memory36/valid_zones:Normal Movable
        /sys/devices/system/memory/memory37/valid_zones:Normal Movable
        /sys/devices/system/memory/memory38/valid_zones:Normal Movable
        /sys/devices/system/memory/memory39/valid_zones:Normal Movable
      
      The main difference to the original implementation is that all new
      memblocks can be both online_kernel and online_movable initially because
      there is obviously no clash.  For comparison, the original
      implementation would have
      
        /sys/devices/system/memory/memory33/valid_zones:Normal
        /sys/devices/system/memory/memory34/valid_zones:Normal
        /sys/devices/system/memory/memory35/valid_zones:Normal
        /sys/devices/system/memory/memory36/valid_zones:Normal
        /sys/devices/system/memory/memory37/valid_zones:Normal
        /sys/devices/system/memory/memory38/valid_zones:Normal
        /sys/devices/system/memory/memory39/valid_zones:Normal Movable
      
      Now
        # echo online_movable > /sys/devices/system/memory/memory34/state
        # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
        /sys/devices/system/memory/memory36/valid_zones:Movable
        /sys/devices/system/memory/memory37/valid_zones:Movable
        /sys/devices/system/memory/memory38/valid_zones:Movable
        /sys/devices/system/memory/memory39/valid_zones:Movable
      
      Block 33 can still be onlined both kernel and movable while all
      the remaining ones can only be onlined movable.
      
      /proc/zoneinfo says
        Node 0, zone   Normal
          pages free     0
                min      0
                low      0
                high     0
                spanned  0
                present  0
        --
        Node 0, zone  Movable
          pages free     32753
                min      85
                low      117
                high     149
                spanned  32768
                present  32768
      
      Probing memory at a lower address will result in a new memory block (32)
      which will still allow both Normal and Movable.
      
        # sh probe_memblock.sh 0
        # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
      
      and online_kernel will convert it to the Normal zone properly
      while 33 can still be onlined both ways.
      
        # echo online_kernel > /sys/devices/system/memory/memory32/state
        # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
      
      /proc/zoneinfo will now tell
        Node 0, zone   Normal
          pages free     65441
                min      165
                low      230
                high     295
                spanned  65536
                present  65536
        --
        Node 0, zone  Movable
          pages free     32740
                min      82
                low      114
                high     146
                spanned  32768
                present  32768
      
      so both zones have one memblock spanned and present.
      
      Onlining 39 should associate this block to the movable zone
      
        # echo online > /sys/devices/system/memory/memory39/state
      
      /proc/zoneinfo will now tell
        Node 0, zone   Normal
          pages free     32765
                min      80
                low      112
                high     144
                spanned  32768
                present  32768
        --
        Node 0, zone  Movable
          pages free     65501
                min      160
                low      225
                high     290
                spanned  196608
                present  65536
      
      so we will have a movable zone which spans 6 memblocks, 2 present and 4
      representing a hole.
      
      Offlining both movable blocks will lead to a zone with no present
      pages, which is the expected behavior, I believe.
      
        # echo offline > /sys/devices/system/memory/memory39/state
        # echo offline > /sys/devices/system/memory/memory34/state
        # grep -A6 "Movable\|Normal" /proc/zoneinfo
        Node 0, zone   Normal
          pages free     32735
                min      90
                low      122
                high     154
                spanned  32768
                present  32768
        --
        Node 0, zone  Movable
          pages free     0
                min      0
                low      0
                high     0
                spanned  196608
                present  0
      
      As a bonus we will get a nice cleanup in the memory hotplug codebase.
      
      This patch (of 16):
      
      init_currently_empty_zone doesn't have any error to return, yet it is
      still an int and callers try to be defensive and handle a potential
      error.  Remove this nonsense and simplify all callers.
      
      This patch shouldn't have any visible effect.
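
      The change is essentially just a prototype adjustment; sketched:

        /* Before: an int return that could never signal a real error. */
        int init_currently_empty_zone(struct zone *zone,
                                      unsigned long zone_start_pfn,
                                      unsigned long size);

        /* After: void, so callers can drop their dead error handling. */
        void init_currently_empty_zone(struct zone *zone,
                                       unsigned long zone_start_pfn,
                                       unsigned long size);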
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc0bbf3b
  2. 03 Jun 2017, 2 commits
    • mm: consider memblock reservations for deferred memory initialization sizing · 864b9a39
      Authored by Michal Hocko
      We have seen an early OOM killer invocation on ppc64 systems with
      crashkernel=4096M:
      
      	kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=7, order=0, oom_score_adj=0
      	kthreadd cpuset=/ mems_allowed=7
      	CPU: 0 PID: 2 Comm: kthreadd Not tainted 4.4.68-1.gd7fe927-default #1
      	Call Trace:
      	  dump_stack+0xb0/0xf0 (unreliable)
      	  dump_header+0xb0/0x258
      	  out_of_memory+0x5f0/0x640
      	  __alloc_pages_nodemask+0xa8c/0xc80
      	  kmem_getpages+0x84/0x1a0
      	  fallback_alloc+0x2a4/0x320
      	  kmem_cache_alloc_node+0xc0/0x2e0
      	  copy_process.isra.25+0x260/0x1b30
      	  _do_fork+0x94/0x470
      	  kernel_thread+0x48/0x60
      	  kthreadd+0x264/0x330
      	  ret_from_kernel_thread+0x5c/0xa4
      
      	Mem-Info:
      	active_anon:0 inactive_anon:0 isolated_anon:0
      	 active_file:0 inactive_file:0 isolated_file:0
      	 unevictable:0 dirty:0 writeback:0 unstable:0
      	 slab_reclaimable:5 slab_unreclaimable:73
      	 mapped:0 shmem:0 pagetables:0 bounce:0
      	 free:0 free_pcp:0 free_cma:0
      	Node 7 DMA free:0kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:52428800kB managed:110016kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:320kB slab_unreclaimable:4672kB kernel_stack:1152kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      	lowmem_reserve[]: 0 0 0 0
      	Node 7 DMA: 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 0kB
      	0 total pagecache pages
      	0 pages in swap cache
      	Swap cache stats: add 0, delete 0, find 0/0
      	Free swap  = 0kB
      	Total swap = 0kB
      	819200 pages RAM
      	0 pages HighMem/MovableOnly
      	817481 pages reserved
      	0 pages cma reserved
      	0 pages hwpoisoned
      
      The reason is that the managed memory is too low (only 110MB) while the
      rest of the 50GB is still waiting for the deferred initialization to
      be done.  update_defer_init estimates the initial memory to initialize
      to at least 2GB, but it doesn't consider any memory allocated in that
      range.  In this particular case we've had
      
      	Reserving 4096MB of memory at 128MB for crashkernel (System RAM: 51200MB)
      
      so the low 2GB is mostly depleted.
      
      Fix this by considering memblock allocations in the initial static
      initialization estimation.  Move the max_initialise computation to
      reset_deferred_meminit and implement a simple memblock_reserved_memory
      helper which iterates all reserved blocks and sums the size of all that
      start below the given address.  The cumulative size is then added on top
      of the initial estimation.  This is still not ideal because
      reset_deferred_meminit doesn't consider holes and so the reservation
      might be above the initial estimation, which we ignore, but let's keep
      the logic simple until we really need to handle more complicated cases.
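
      A minimal sketch of such a helper, assuming the name from the text
      above (the merged version may differ in name and in where the
      bytes-to-pages conversion happens):

        /*
         * Sum the size of all memblock reservations that start below
         * @limit, so the deferred-init estimate can be enlarged by it.
         */
        static unsigned long __init memblock_reserved_memory(phys_addr_t limit)
        {
                struct memblock_region *r;
                phys_addr_t total = 0;

                for_each_memblock(reserved, r) {
                        if (r->base < limit)
                                total += r->size;
                }

                return PFN_DOWN(total); /* in pages */
        }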
      
      Fixes: 3a80a7fa ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
      Link: http://lkml.kernel.org/r/20170531104010.GI27783@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      864b9a39
    • mm/page_alloc.c: make sure OOM victim can try allocations with no watermarks once · c288983d
      Authored by Tetsuo Handa
      Roman Gushchin has reported that the OOM killer can trivially select the
      next OOM victim when a thread doing memory allocation from the page
      fault path was selected as the first OOM victim.
      
          allocate invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null),  order=0, oom_score_adj=0
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           oom_kill_process+0x219/0x3e0
           out_of_memory+0x11d/0x480
           __alloc_pages_slowpath+0xc84/0xd40
           __alloc_pages_nodemask+0x245/0x260
           alloc_pages_vma+0xa2/0x270
           __handle_mm_fault+0xca9/0x10c0
           handle_mm_fault+0xf3/0x210
           __do_page_fault+0x240/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          Out of memory: Kill process 492 (allocate) score 899 or sacrifice child
          Killed process 492 (allocate) total-vm:2052368kB, anon-rss:1894576kB, file-rss:4kB, shmem-rss:0kB
          allocate: page allocation failure: order:0, mode:0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=(null)
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           __alloc_pages_slowpath+0xd32/0xd40
           __alloc_pages_nodemask+0x245/0x260
           alloc_pages_vma+0xa2/0x270
           __handle_mm_fault+0xca9/0x10c0
           handle_mm_fault+0xf3/0x210
           __do_page_fault+0x240/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          oom_reaper: reaped process 492 (allocate), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
          ...
          allocate invoked oom-killer: gfp_mask=0x0(), nodemask=(null),  order=0, oom_score_adj=0
          allocate cpuset=/ mems_allowed=0
          CPU: 1 PID: 492 Comm: allocate Not tainted 4.12.0-rc1-mm1+ #181
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
          Call Trace:
           oom_kill_process+0x219/0x3e0
           out_of_memory+0x11d/0x480
           pagefault_out_of_memory+0x68/0x80
           mm_fault_error+0x8f/0x190
           ? handle_mm_fault+0xf3/0x210
           __do_page_fault+0x4b2/0x4e0
           trace_do_page_fault+0x37/0xe0
           do_async_page_fault+0x19/0x70
           async_page_fault+0x28/0x30
          ...
          Out of memory: Kill process 233 (firewalld) score 10 or sacrifice child
          Killed process 233 (firewalld) total-vm:246076kB, anon-rss:20956kB, file-rss:0kB, shmem-rss:0kB
      
      There is a race window in which the OOM reaper completes reclaiming the
      first victim's memory while nothing but mutex_trylock() prevents the
      first victim from calling out_of_memory() from pagefault_out_of_memory()
      after the memory allocation for the page fault path failed because the
      thread was selected as an OOM victim.
      
      This is a side effect of commit 9a67f648 ("mm: consolidate
      GFP_NOFAIL checks in the allocator slowpath") because that commit
      silently changed the behavior from
      
          /* Avoid allocations with no watermarks from looping endlessly */
      
      to
      
          /*
           * Give up allocations without trying memory reserves if selected
           * as an OOM victim
           */
      
      in __alloc_pages_slowpath() by moving the location of the TIF_MEMDIE
      flag check.  I noticed this change but didn't post a patch because I
      thought it was an acceptable change, apart from the noise from
      warn_alloc(), because !__GFP_NOFAIL allocations are allowed to fail.
      But we overlooked that failing memory allocations from the page fault
      path make a difference due to the race window explained above.
      
      While it might be possible to add a check to pagefault_out_of_memory()
      that prevents the first victim from calling out_of_memory() or remove
      out_of_memory() from pagefault_out_of_memory(), changing
      pagefault_out_of_memory() does not suppress noise by warn_alloc() when
      allocating thread was selected as an OOM victim.  There is little point
      with printing similar backtraces and memory information from both
      out_of_memory() and warn_alloc().
      
      Instead, if we guarantee that the current thread can try allocations
      with no watermarks once when it was selected as an OOM victim while
      looping inside __alloc_pages_slowpath(), we can follow the "who can use
      memory reserves" rules, suppress the noise from warn_alloc(), and
      prevent memory allocations from the page fault path from calling
      pagefault_out_of_memory().
      
      If we take the comment literally, this patch would do
      
        -    if (test_thread_flag(TIF_MEMDIE))
        -        goto nopage;
        +    if (alloc_flags == ALLOC_NO_WATERMARKS || (gfp_mask & __GFP_NOMEMALLOC))
        +        goto nopage;
      
      because gfp_pfmemalloc_allowed() returns false if __GFP_NOMEMALLOC is
      given.  But if I recall correctly (I couldn't find the message), the
      condition is meant to apply only to OOM victims despite the comment.
      Therefore, this patch preserves the TIF_MEMDIE check, as sketched below.
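
      In sketch form (the exact hunk may differ):

        /* Give up only once an OOM victim has had its one no-watermark
         * attempt, or explicitly avoids the reserves. */
        if (test_thread_flag(TIF_MEMDIE) &&
            (alloc_flags == ALLOC_NO_WATERMARKS ||
             (gfp_mask & __GFP_NOMEMALLOC)))
                goto nopage;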
      
      Fixes: 9a67f648 ("mm: consolidate GFP_NOFAIL checks in the allocator slowpath")
      Link: http://lkml.kernel.org/r/201705192112.IAF69238.OQOHSJLFOFFMtV@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Roman Gushchin <guro@fb.com>
      Tested-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.11]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c288983d
  3. 09 May 2017, 5 commits
    • mm: introduce memalloc_noreclaim_{save,restore} · 499118e9
      Authored by Vlastimil Babka
      The previous patch ("mm: prevent potential recursive reclaim due to
      clearing PF_MEMALLOC") has shown that simply setting and clearing
      PF_MEMALLOC in current->flags can result in wrongly clearing a
      pre-existing PF_MEMALLOC flag and potentially lead to recursive reclaim.
      Let's introduce helpers that support proper nesting by saving the
      previous state of the flag, similar to the existing memalloc_noio_* and
      memalloc_nofs_* helpers.  Convert existing setting/clearing of
      PF_MEMALLOC within mm to the new helpers.
      
      There are no known issues with the converted code, but the change makes
      it more robust.
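
      The helpers themselves are tiny; a sketch modelled on the existing
      memalloc_noio_* pair, as the commit describes:

        static inline unsigned int memalloc_noreclaim_save(void)
        {
                unsigned int flags = current->flags & PF_MEMALLOC;

                current->flags |= PF_MEMALLOC;
                return flags;   /* caller stashes the previous state */
        }

        static inline void memalloc_noreclaim_restore(unsigned int flags)
        {
                /* restore only the saved bit; never clear a nested setter */
                current->flags = (current->flags & ~PF_MEMALLOC) | flags;
        }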
      
      Link: http://lkml.kernel.org/r/20170405074700.29871-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Lee Duncan <lduncan@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      499118e9
    • mm: prevent potential recursive reclaim due to clearing PF_MEMALLOC · 62be1511
      Authored by Vlastimil Babka
      Patch series "more robust PF_MEMALLOC handling"
      
      This series aims to unify the setting and clearing of PF_MEMALLOC, which
      prevents recursive reclaim.  There are some places that clear the flag
      unconditionally from current->flags, which may result in clearing a
      pre-existing flag.  This already resulted in a bug report that Patch 1
      fixes (without the new helpers, to make backporting easier).  Patch 2
      introduces the new helpers, modelled after existing memalloc_noio_* and
      memalloc_nofs_* helpers, and converts mm core to use them.  Patches 3
      and 4 convert non-mm code.
      
      This patch (of 4):
      
      __alloc_pages_direct_compact() sets PF_MEMALLOC to prevent deadlock
      during page migration by lock_page() (see the comment in
      __unmap_and_move()).  Then it unconditionally clears the flag, which can
      clear a pre-existing PF_MEMALLOC flag and result in recursive reclaim.
      This was not a problem until commit a8161d1e ("mm, page_alloc:
      restructure direct compaction handling in slowpath"), because direct
      compaction was called only after direct reclaim, which was skipped when
      the PF_MEMALLOC flag was set.
      
      Even now it's only a theoretical issue, as the new callsite of
      __alloc_pages_direct_compact() is reached only for costly orders and
      when gfp_pfmemalloc_allowed() is true, which means either
      __GFP_NOMEMALLOC is in gfp_flags or in_interrupt() is true.  There is no
      such known context, but let's play it safe and make
      __alloc_pages_direct_compact() robust for cases where PF_MEMALLOC is
      already set.
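
      Sketch of the fix (open-coded, without the helpers the next patch
      introduces, to make backporting easier):

        unsigned int noreclaim_flag = current->flags & PF_MEMALLOC;

        /* set PF_MEMALLOC, but remember whether it was already set ... */
        current->flags |= PF_MEMALLOC;
        *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags,
                                               ac, prio);
        /* ... so that restoring cannot clear a pre-existing flag */
        current->flags = (current->flags & ~PF_MEMALLOC) | noreclaim_flag;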
      
      Fixes: a8161d1e ("mm, page_alloc: restructure direct compaction handling in slowpath")
      Link: http://lkml.kernel.org/r/20170405074700.29871-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Boris Brezillon <boris.brezillon@free-electrons.com>
      Cc: Chris Leech <cleech@redhat.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Lee Duncan <lduncan@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      62be1511
    • mm, compaction: restrict async compaction to pageblocks of same migratetype · 282722b0
      Authored by Vlastimil Babka
      The migrate scanner in async compaction is currently limited to
      MIGRATE_MOVABLE pageblocks.  This is a heuristic intended to reduce
      latency, based on the assumption that non-MOVABLE pageblocks are
      unlikely to contain movable pages.
      
      However, with the exception of THPs, most high-order allocations are
      not movable.  Should the async compaction succeed, this increases the
      chance that the non-MOVABLE allocations will fall back to a MOVABLE
      pageblock, making the long-term fragmentation worse.
      
      This patch attempts to help the situation by changing async direct
      compaction so that the migrate scanner only scans the pageblocks of the
      requested migratetype.  If it's a non-MOVABLE type and there are such
      pageblocks that do contain movable pages, chances are that the
      allocation can succeed within one of such pageblocks, removing the need
      for a fallback.  If that fails, the subsequent sync attempt will ignore
      this restriction.
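
      A sketch of the migrate-scanner check this implies (the helper name and
      exact predicate are illustrative, not necessarily the final patch):

        static bool suitable_migration_source(struct compact_control *cc,
                                              struct page *page)
        {
                /* sync compaction and kcompactd keep scanning everything */
                if (cc->mode != MIGRATE_ASYNC || !cc->direct_compaction)
                        return true;

                /* async direct compaction: same-migratetype blocks only */
                return get_pageblock_migratetype(page) == cc->migratetype;
        }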
      
      In testing based on 4.9 kernel with stress-highalloc from mmtests
      configured for order-4 GFP_KERNEL allocations, this patch has reduced
      the number of unmovable allocations falling back to movable pageblocks
      by 30%.  The number of movable allocations falling back is reduced by
      12%.
      
      Link: http://lkml.kernel.org/r/20170307131545.28577-8-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      282722b0
    • mm, page_alloc: count movable pages when stealing from pageblock · 02aa0cdd
      Authored by Vlastimil Babka
      When stealing pages from pageblock of a different migratetype, we count
      how many free pages were stolen, and change the pageblock's migratetype
      if more than half of the pageblock was free.  This might be too
      conservative, as there might be other pages that are not free, but were
      allocated with the same migratetype as our allocation requested.
      
      While we cannot determine the migratetype of allocated pages precisely
      (at least without the page_owner functionality enabled), we can count
      pages that compaction would try to isolate for migration - those are
      either on LRU or __PageMovable().  The rest can be assumed to be
      MIGRATE_RECLAIMABLE or MIGRATE_UNMOVABLE, which we cannot easily
      distinguish.  This counting can be done as part of free page stealing
      with little additional overhead.
      
      The page stealing code is changed so that it considers free pages plus
      pages of the "good" migratetype for the decision whether to change
      pageblock's migratetype.
      
      The result should be more accurate migratetype of pageblocks wrt the
      actual pages in the pageblocks, when stealing from semi-occupied
      pageblocks.  This should help the efficiency of page grouping by
      mobility.
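
      In sketch form, the stealing path walks the pageblock and classifies
      pages the way compaction's migrate scanner would (CMA and
      page_group_by_mobility_disabled handling omitted):

        unsigned long pfn, free_pages = 0, movable_pages = 0, alike_pages;

        for (pfn = start_pfn; pfn <= end_pfn;) {
                struct page *page = pfn_to_page(pfn);

                if (PageBuddy(page)) {          /* free: skip the buddy */
                        free_pages += 1 << page_order(page);
                        pfn += 1 << page_order(page);
                        continue;
                }
                if (PageLRU(page) || __PageMovable(page))
                        movable_pages++;        /* compaction could move it */
                pfn++;
        }

        /* which allocated pages count as "alike" depends on the request */
        if (start_type == MIGRATE_MOVABLE)
                alike_pages = movable_pages;
        else
                alike_pages = pageblock_nr_pages - free_pages - movable_pages;

        /* claim the whole pageblock if free plus alike pages dominate */
        if (free_pages + alike_pages >= pageblock_nr_pages / 2)
                set_pageblock_migratetype(pfn_to_page(start_pfn), start_type);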
      
      In testing based on 4.9 kernel with stress-highalloc from mmtests
      configured for order-4 GFP_KERNEL allocations, this patch has reduced
      the number of unmovable allocations falling back to movable pageblocks
      by 47%.  The number of movable allocations falling back to other
      pageblocks are increased by 55%, but these events don't cause permanent
      fragmentation, so the tradeoff should be positive.  Later patches also
      offset the movable fallback increase to some extent.
      
      [akpm@linux-foundation.org: merge fix]
      Link: http://lkml.kernel.org/r/20170307131545.28577-5-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      02aa0cdd
    • mm, page_alloc: split smallest stolen page in fallback · 3bc48f96
      Authored by Vlastimil Babka
      The __rmqueue_fallback() function is called when there's no free page of
      requested migratetype, and we need to steal from a different one.
      
      There are various heuristics to make this event infrequent and reduce
      permanent fragmentation.  The main one is to try stealing from a
      pageblock that has the most free pages, and possibly steal them all at
      once and convert the whole pageblock.  Precise searching for such a
      pageblock would be expensive, so instead the heuristic walks the free
      lists from MAX_ORDER down to the requested order and assumes that the
      block with the highest-order free page is likely to also have the most
      free pages in total.
      
      Chances are that together with the highest-order page, we steal also
      pages of lower orders from the same block.  But then we still split the
      highest order page.  This is wasteful and can contribute to
      fragmentation instead of avoiding it.
      
      This patch thus changes __rmqueue_fallback() to just steal the page(s)
      and put them on the freelist of the requested migratetype, and only
      report whether it was successful.  Then we pick (and eventually split)
      the smallest page with __rmqueue_smallest().  This all happens under
      zone lock, so nobody can steal it from us in the process.  This should
      reduce fragmentation due to fallbacks.  At worst we are only stealing a
      single highest-order page and waste some cycles by moving it between
      lists and then removing it, but fallback is not exactly hot path so that
      should not be a concern.  As a side benefit the patch removes some
      duplicate code by reusing __rmqueue_smallest().
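
      A sketch of the reworked flow (CMA handling omitted): the fallback now
      only moves pages to the requested freelist and reports success, and the
      actual page, the smallest that fits, comes from retrying
      __rmqueue_smallest():

        static struct page *__rmqueue(struct zone *zone, unsigned int order,
                                      int migratetype)
        {
                struct page *page;

        retry:
                page = __rmqueue_smallest(zone, order, migratetype);
                if (unlikely(!page) &&
                    __rmqueue_fallback(zone, order, migratetype))
                        goto retry;

                return page;
        }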
      
      [vbabka@suse.cz: fix endless loop in the modified __rmqueue()]
        Link: http://lkml.kernel.org/r/59d71b35-d556-4fc9-ee2e-1574259282fd@suse.cz
      Link: http://lkml.kernel.org/r/20170307131545.28577-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3bc48f96
  4. 04 May 2017, 8 commits
    • mm, page_alloc: remove debug_guardpage_minorder() test in warn_alloc() · 0f7896f1
      Authored by Tetsuo Handa
      Commit c0a32fc5 ("mm: more intensive memory corruption debugging")
      changed to check debug_guardpage_minorder() > 0 when reporting
      allocation failures.  The reasoning was
      
        When we use guard page to debug memory corruption, it shrinks
        available pages to 1/2, 1/4, 1/8 and so on, depending on parameter
        value. In such case memory allocation failures can be common and
        printing errors can flood dmesg. If somebody debug corruption,
        allocation failures are not the things he/she is interested about.
      
      but this is misguided.
      
      Allocation requests with __GFP_NOWARN flag by definition do not cause
      flooding of allocation failure messages.  Allocation requests with
      __GFP_NORETRY flag likely also have __GFP_NOWARN flag.  Costly
      allocation requests likely also have __GFP_NOWARN flag.
      
      Allocation requests without __GFP_DIRECT_RECLAIM flag likely also have
      __GFP_NOWARN flag or __GFP_HIGH flag.  Non-costly allocation requests
      with __GFP_DIRECT_RECLAIM flag basically retry forever due to the "too
      small to fail" memory-allocation rule.
      
      Therefore, as a whole, shrinking available pages via the
      debug_guardpage_minorder= kernel boot parameter might cause flooding of
      OOM killer messages but is unlikely to cause flooding of allocation
      failure messages.  Let's remove the debug_guardpage_minorder() > 0
      check, which is likely pointless.
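
      The removed condition, in sketch form (the exact hunk in warn_alloc()
      may differ slightly):

        -       if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
        -           debug_guardpage_minorder() > 0)
        +       if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
                        return;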
      
      Link: http://lkml.kernel.org/r/1491910035-4231-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Rafael J . Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f7896f1
    • mm: enable page poisoning early at boot · bd33ef36
      Authored by Vinayak Menon
      On SPARSEMEM systems page poisoning is enabled after buddy is up,
      because of the dependency on page extension init.  This causes the pages
      released by free_all_bootmem not to be poisoned.  This either delays or
      misses the identification of some issues because the pages have to
      undergo another cycle of alloc-free-alloc for any corruption to be
      detected.
      
      Enable page poisoning early by getting rid of the PAGE_EXT_DEBUG_POISON
      flag.  Since all the free pages will now be poisoned, the flag need not
      be verified before checking the poison during an alloc.
      
      [vinmenon@codeaurora.org: fix Kconfig]
        Link: http://lkml.kernel.org/r/1490878002-14423-1-git-send-email-vinmenon@codeaurora.org
      Link: http://lkml.kernel.org/r/1490358246-11001-1-git-send-email-vinmenon@codeaurora.org
      Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
      Acked-by: Laura Abbott <labbott@redhat.com>
      Tested-by: Laura Abbott <labbott@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd33ef36
    • mm: page_alloc: __GFP_NOWARN shouldn't suppress stall warnings · 82251963
      Authored by Johannes Weiner
      __GFP_NOWARN, which is usually added to avoid warnings from callsites
      that expect to fail and have fallbacks, currently also suppresses
      allocation stall warnings.  These trigger when an allocation is stuck
      inside the allocator for 10 seconds or longer.
      
      But there is no class of allocations that can get legitimately stuck in
      the allocator for this long.  This always indicates a problem.
      
      Always emit stall warnings.  Restrict __GFP_NOWARN to alloc failures.
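
      In sketch form, the __GFP_NOWARN test moves out of the shared
      warn_alloc() entry and into the failure-specific callsite (illustrative
      hunks; the exact shape in the patch may differ):

        /* warn_alloc() itself no longer bails out for __GFP_NOWARN */
        -       if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
        +       if (!__ratelimit(&nopage_rs))
                        return;

        /* only the allocation-failure report keeps honouring the flag */
        +       if (!(gfp_mask & __GFP_NOWARN))
        +               warn_alloc(gfp_mask, "page allocation failure");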
      
      Link: http://lkml.kernel.org/r/20170125181150.GA16398@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82251963
    • mm: introduce memalloc_nofs_{save,restore} API · 7dea19f9
      Authored by Michal Hocko
      GFP_NOFS context is used for the following 5 reasons currently:
      
       - to prevent from deadlocks when the lock held by the allocation
         context would be needed during the memory reclaim
      
       - to prevent from stack overflows during the reclaim because the
         allocation is performed from a deep context already
      
       - to prevent lockups when the allocation context depends on other
         reclaimers to make a forward progress indirectly
      
       - just in case because this would be safe from the fs POV
      
       - silence lockdep false positives
      
      Unfortunately, overuse of this allocation context brings some problems
      to the MM.  Memory reclaim is much weaker (especially during heavy FS
      metadata workloads), and the OOM killer cannot be invoked because the MM
      layer doesn't have enough information about how much memory is freeable
      by the FS layer.
      
      In many cases it is far from clear why the weaker context is even used
      and so it might be used unnecessarily.  We would like to get rid of
      those as much as possible.  One way to do that is to use the flag in
      scopes rather than isolated cases.  Such a scope is declared when really
      necessary, tracked per task and all the allocation requests from within
      the context will simply inherit the GFP_NOFS semantic.
      
      Not only is this easier to understand and maintain, because there are
      far fewer problematic contexts than specific allocation requests, it
      also helps code paths where the FS layer interacts with other layers
      (e.g.  crypto, security modules, MM etc...) and there is no easy way to
      convey the allocation context between the layers.
      
      Introduce memalloc_nofs_{save,restore} API to control the scope of
      GFP_NOFS allocation context.  This is basically copying
      memalloc_noio_{save,restore} API we have for other restricted allocation
      context, GFP_NOIO.  The PF_MEMALLOC_NOFS flag already exists and is
      just an alias for PF_FSTRANS, which had been XFS-specific until
      recently.  There are no PF_FSTRANS users anymore so let's just drop it.
      
      PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
      implicitly, the same way PF_MEMALLOC_NOIO drops __GFP_IO.
      memalloc_noio_flags is renamed to current_gfp_context because it now
      cares about both PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.  XFS
      code paths preserve their semantics.  kmem_flags_convert() doesn't need
      to evaluate the flag anymore.
      
      This patch shouldn't introduce any functional changes.
      
      Let's hope that filesystems will drop direct GFP_NOFS (resp.  ~__GFP_FS)
      usage as much as possible and only use a properly documented
      memalloc_nofs_{save,restore} checkpoints where they are appropriate.
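
      A sketch of the API, mirroring the memalloc_noio_* helpers it is copied
      from; a filesystem brackets a transaction scope with save/restore and
      every allocation inside, in any layer, implicitly loses __GFP_FS:

        static inline unsigned int memalloc_nofs_save(void)
        {
                unsigned int flags = current->flags & PF_MEMALLOC_NOFS;

                current->flags |= PF_MEMALLOC_NOFS;
                return flags;
        }

        static inline void memalloc_nofs_restore(unsigned int flags)
        {
                current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
        }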
      
      [akpm@linux-foundation.org: fix comment typo, reflow comment]
      Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7dea19f9
    • mm: use is_migrate_highatomic() to simplify the code · a6ffdc07
      Authored by Xishi Qiu
      Introduce two helpers, is_migrate_highatomic() and is_migrate_highatomic_page().
      
      Simplify the code, no functional changes.
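
      The helpers are presumably along these lines (static inlines per the
      akpm note below):

        static inline bool is_migrate_highatomic(enum migratetype migratetype)
        {
                return migratetype == MIGRATE_HIGHATOMIC;
        }

        static inline bool is_migrate_highatomic_page(struct page *page)
        {
                return get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC;
        }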
      
      [akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
      Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.com
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6ffdc07
    • mm: remove unnecessary back-off function when retrying page reclaim · 491d79ae
      Authored by Johannes Weiner
      The backoff mechanism is not needed.  If we have MAX_RECLAIM_RETRIES
      loops without progress, we'll OOM anyway; backing off might cut one or
      two iterations off that in the rare OOM case.  If we have intermittent
      success reclaiming a few pages, the backoff function gets reset also,
      and so is of little help in these scenarios.
      
      We might want a backoff function for when there IS progress, but not
      enough to be satisfactory.  But this isn't that.  Remove it.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-10-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      491d79ae
    • mm: delete NR_PAGES_SCANNED and pgdat_reclaimable() · c822f622
      Authored by Johannes Weiner
      NR_PAGES_SCANNED counts number of pages scanned since the last page free
      event in the allocator.  This was used primarily to measure the
      reclaimability of zones and nodes, and determine when reclaim should
      give up on them.  In that role, it has been replaced in the preceding
      patches by a different mechanism.
      
      Being implemented as an efficient vmstat counter, it was automatically
      exported to userspace as well.  It's however unlikely that anyone
      outside the kernel is using this counter in any meaningful way.
      
      Remove the counter and the unused pgdat_reclaimable().
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c822f622
    • mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · c73322d0
      Authored by Johannes Weiner
      Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
      cleanups".
      
      Jia reported a scenario in which the kswapd of a node indefinitely spins
      at 100% CPU usage.  We have seen similar cases at Facebook.
      
      The kernel's current method of judging its ability to reclaim a node (or
      whether to back off and sleep) is based on the amount of scanned pages
      in proportion to the amount of reclaimable pages.  In Jia's and our
      scenarios, there are no reclaimable pages in the node, however, and the
      condition for backing off is never met.  Kswapd busyloops in an attempt
      to restore the watermarks while having nothing to work with.
      
      This series reworks the definition of an unreclaimable node based not on
      scanning but on whether kswapd is able to actually reclaim pages in
      MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criteria
      the page allocator uses for giving up on direct reclaim and invoking the
      OOM killer.  If it cannot free any pages, kswapd will go to sleep and
      leave further attempts to direct reclaim invocations, which will either
      make progress and re-enable kswapd, or invoke the OOM killer.
      
      Patch #1 fixes the immediate problem Jia reported, the remainder are
      smaller fixlets, cleanups, and overall phasing out of the old method.
      
      Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
      and directly related to #5, but in itself not relevant to the series.
      
      If the whole series is too ambitious for 4.11, I would consider the
      first three patches fixes, the rest cleanups.
      
      This patch (of 9):
      
      Jia He reports a problem with kswapd spinning at 100% CPU when
      requesting more hugepages than memory available in the system:
      
      $ echo 4000 >/proc/sys/vm/nr_hugepages
      
      top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
      Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
      KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
      
      At that time, there are no reclaimable pages left in the node, but as
      kswapd fails to restore the high watermarks it refuses to go to sleep.
      
      Kswapd needs to back away from nodes that fail to balance.  Up until
      commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") kswapd had such a mechanism.  It considered zones whose
      theoretically reclaimable pages it had reclaimed six times over as
      unreclaimable and backed away from them.  This guard was erroneously
      removed as the patch changed the definition of a balanced node.
      
      However, simply restoring this code wouldn't help in the case reported
      here: there *are* no reclaimable pages that could be scanned until the
      threshold is met.  Kswapd would stay awake anyway.
      
      Introduce a new and much simpler way of backing off.  If kswapd runs
      through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
      page, make it back off from the node.  This is the same number of shots
      direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
      that node until a direct reclaimer manages to reclaim some pages, thus
      proving the node reclaimable again.
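
      A sketch of the mechanism (field and callsite names follow the commit
      description; the exact hunks may differ):

        /* balance_pgdat(): count consecutive runs with zero progress */
        if (!sc.nr_reclaimed)
                pgdat->kswapd_failures++;
        else
                pgdat->kswapd_failures = 0;

        /* prepare_kswapd_sleep(): back off once the node looks hopeless */
        if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
                return true;    /* let kswapd sleep */

        /* direct reclaim: progress proves the node reclaimable again */
        if (nr_reclaimed)
                pgdat->kswapd_failures = 0;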
      
      [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
        Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
      [shakeelb@google.com: fix condition for throttle_direct_reclaim]
        Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Jia He <hejianet@gmail.com>
      Tested-by: Jia He <hejianet@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c73322d0
  5. 21 Apr 2017, 1 commit
  6. 08 Apr 2017, 2 commits
  7. 04 Apr 2017, 1 commit
    • ftrace: Have init/main.c call ftrace directly to free init memory · b80f0f6c
      Authored by Steven Rostedt (VMware)
      Relying on free_reserved_area() to call ftrace to free init memory
      proved not to be sufficient.  The issue is that on x86, when
      debug_pagealloc is enabled, the init memory is not freed but simply
      marked as not present.  Since ftrace was uninformed of this, starting
      function tracing still tries to update pages that are not present
      according to the page tables, causing ftrace to BUG, as well as killing
      the kernel itself.
      
      Instead of relying on free_reserved_area(), have init/main.c call ftrace
      directly just before it frees the init memory. Then it needs to use
      __init_begin and __init_end to know where the init memory location is.
      Looking at all archs (and testing what I can), it appears that this should
      work for each of them.
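
      In sketch form, the call site in kernel_init() simply becomes:

        /* tell ftrace which init text is about to go away, then free it */
        ftrace_free_init_mem();
        free_initmem();
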
      Reported-by: kernel test robot <xiaolong.ye@intel.com>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      b80f0f6c
  8. 03 Apr 2017, 1 commit
    • kernel-api.rst: fix a series of errors when parsing C files · 0e056eb5
      Authored by mchehab@s-opensource.com
      ./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
      ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/filemap.c:1283: ERROR: Unexpected indentation.
      ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
      ./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
      ./ipc/util.c:676: ERROR: Unexpected indentation.
      ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent.
      ./security/security.c:109: ERROR: Unexpected indentation.
      ./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent.
      ./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
      ./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./ipc/util.c:477: ERROR: Unknown target name: "s".
      Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
      0e056eb5
  9. 25 Mar 2017, 1 commit
  10. 09 Mar 2017, 1 commit
    • mm, page_alloc: Add missing check for memory holes · b4fb8f66
      Authored by Tony Luck
      Commit 13ad59df ("mm, page_alloc: avoid page_to_pfn() when merging
      buddies") moved the check for memory holes out of page_is_buddy() and
      had the callers do the check.
      
      But this wasn't done correctly in one place which caused ia64 to crash
      very early in boot.
      
      Update to fix that and make ia64 boot again.
      
      [ v2: Vlastimil pointed out we don't need to call page_to_pfn()
            since we already have the result of that in "buddy_pfn" ]
      
      Fixes: 13ad59df ("avoid page_to_pfn() when merging buddies")
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4fb8f66
  11. 02 Mar 2017, 1 commit
  12. 28 Feb 2017, 1 commit
  13. 25 Feb 2017, 13 commits
    • mm/page_alloc.c: remove redundant init code for ZONE_MOVABLE · ad69444e
      Authored by Wei Yang
      arch_zone_lowest/highest_possible_pfn[] is set to 0 and [ZONE_MOVABLE]
      is skipped in the loop.  No need to reset them to 0 again.
      
      This patch just removes the redundant code.
      
      Link: http://lkml.kernel.org/r/20170209141731.60208-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad69444e
    • mm/page_alloc: fix nodes for reclaim in fast path · e02dc017
      Authored by Gavin Shan
      When node_reclaim_mode isn't 0, the page allocator tries to reclaim
      pages if the amount of free memory in the zones is below the low
      watermark.  On the Power platform, however, no NUMA node is scanned for
      page reclaim because no node matches the condition in
      zone_allows_reclaim(): RECLAIM_DISTANCE is set to 10 there, which is
      the distance from Node-A to Node-A, so even the preferred node won't be
      scanned.
      
         __alloc_pages_nodemask()
         get_page_from_freelist()
            zone_allows_reclaim()
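
      The fix is presumably a one-character relaxation in
      zone_allows_reclaim(), letting nodes at exactly RECLAIM_DISTANCE
      (which includes the local node on Power) qualify:

        static bool zone_allows_reclaim(struct zone *local_zone,
                                        struct zone *zone)
        {
                return node_distance(zone_to_nid(local_zone),
                                     zone_to_nid(zone)) <= RECLAIM_DISTANCE;
        }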
      
      Anton proposed the test code as below:
      
         # cat alloc.c
         #include <assert.h>
         #include <stdio.h>
         #include <stdlib.h>
         #include <string.h>
         #include <time.h>
         #include <unistd.h>
      
         int main(int argc, char *argv[])
         {
         	void *p;
         	unsigned long size;
         	unsigned long start, end;
      
         	assert(argc == 2);	/* usage: ./alloc <size-in-GB> */
         	start = time(NULL);
         	size = strtoul(argv[1], NULL, 0);
         	printf("To allocate %luGB memory\n", size);
      
         	size <<= 30;
         	p = malloc(size);
         	assert(p);
         	memset(p, 0, size);
      
         	end = time(NULL);
         	printf("Used time: %lu seconds\n", end - start);
         	sleep(3600);
         	return 0;
         }
      
      The system I use for testing has two NUMA nodes, each with 128GB of
      memory.  In the scenario below, the page cache on node#0 should be
      reclaimed when the node comes under pressure to accommodate the
      allocation request.
      
         # echo 2 > /proc/sys/vm/zone_reclaim_mode; \
           sync; \
           echo 3 > /proc/sys/vm/drop_caches; \
         # taskset -c 0 cat file.32G > /dev/null; \
           grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:       33619712 kB
         # taskset -c 0 ./alloc 128
         # grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:       33619840 kB
         # grep MemFree /sys/devices/system/node/node0/meminfo
           Node 0 MemFree:          186816 kB
      
      With the patch applied, the pagecache on node-0 is reclaimed when its
      free memory is running out.  It's the expected behaviour.
      
         # echo 2 > /proc/sys/vm/zone_reclaim_mode; \
           sync; \
           echo 3 > /proc/sys/vm/drop_caches
         # taskset -c 0 cat file.32G > /dev/null; \
           grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:       33605568 kB
         # taskset -c 0 ./alloc 128
         # grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:        1379520 kB
         # grep MemFree /sys/devices/system/node/node0/meminfo
           Node 0 MemFree:           317120 kB
      
      Fixes: 5f7a75ac ("mm: page_alloc: do not cache reclaim distances")
      Link: http://lkml.kernel.org/r/1486532455-29613-1-git-send-email-gwshan@linux.vnet.ibm.com
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: <stable@vger.kernel.org>	[3.16+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e02dc017
    • L
      mm: alloc_contig_range: allow to specify GFP mask · ca96b625
      Committed by Lucas Stach
      Currently alloc_contig_range assumes that the compaction should be done
      with the default GFP_KERNEL flags.  This is probably right for all
      current uses of this interface, but may change as CMA is used in more
      use-cases (including being the default DMA memory allocator on some
      platforms).
      
      Change the function prototype to allow passing through the GFP mask
      set by upper layers.
      
      Also respect global restrictions by applying memalloc_noio_flags to
      the passed-in flags.
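      
      Assuming only what is described above, the prototype after this
      change should look roughly like (a sketch, not the verbatim header):
      
         int alloc_contig_range(unsigned long start, unsigned long end,
                                unsigned migratetype, gfp_t gfp_mask);
      
      Existing callers keep their behaviour by passing GFP_KERNEL, and the
      implementation filters the mask through memalloc_noio_flags() before
      use.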
      
      Link: http://lkml.kernel.org/r/20170127172328.18574-1-l.stach@pengutronix.de
      Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alexander Graf <agraf@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca96b625
    • Y
      mm/hotplug: enable memory hotplug for non-lru movable pages · 0efadf48
      Committed by Yisheng Xie
      We had considered all non-lru pages as unmovable before commit
      bda807d4 ("mm: migrate: support non-lru movable page migration").
      But now some non-lru pages, like zsmalloc and virtio-balloon pages,
      have also become movable, so we can offline such blocks by using
      non-lru page migration.
      
      This patch straightforwardly adds non-lru migration code, i.e.
      non-lru related handling in the functions that scan over pfns,
      collect the pages to be migrated, and isolate them before migration.
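      
      A hedged sketch of the isolation pattern added to those scanners
      (isolate_movable_page() comes from bda807d4; exact call sites and
      return-value handling abridged):
      
         if (PageLRU(page))
                 ret = isolate_lru_page(page);
         else if (__PageMovable(page))
                 /* non-lru movable: zsmalloc, virtio-balloon, ... */
                 ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);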
      
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0efadf48
    • M
      mm, page_alloc: use static global work_struct for draining per-cpu pages · bd233f53
      Committed by Mel Gorman
      As suggested by Vlastimil Babka and Tejun Heo, this patch uses a static
      work_struct to co-ordinate the draining of per-cpu pages on the
      workqueue.  Only one task can drain at a time but this is better than
      the previous scheme that allowed multiple tasks to send IPIs at a time.
      
      One consideration is whether parallel requests should synchronise
      against each other.  This patch does not synchronise for a global drain
      as the common case for such callers is expected to be multiple parallel
      direct reclaimers competing for pages when the watermark is close to
      min.  Draining the per-cpu list is unlikely to make much progress and
      serialising the drain is of dubious merit.  Drains are synchronised
      for callers such as memory hotplug and CMA that care about the drain
      being complete when the function returns.
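      
      A hedged sketch of the resulting coordination in drain_all_pages()
      (details abridged):
      
         static DEFINE_MUTEX(pcpu_drain_mutex);
         static DEFINE_PER_CPU(struct work_struct, pcpu_drain);
      
         void drain_all_pages(struct zone *zone)
         {
                 /*
                  * Global drains (zone == NULL) do not wait on a drain that
                  * is already in flight; zone-specific callers such as
                  * memory hotplug and CMA queue up behind the mutex so the
                  * drain is complete on return.
                  */
                 if (unlikely(!mutex_trylock(&pcpu_drain_mutex))) {
                         if (!zone)
                                 return;
                         mutex_lock(&pcpu_drain_mutex);
                 }
                 /* ... schedule and flush the per-cpu drain works ... */
                 mutex_unlock(&pcpu_drain_mutex);
         }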
      
      Link: http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd233f53
    • V
      mm, page_alloc: don't check cpuset allowed twice in fast-path · 51047820
      Committed by Vlastimil Babka
      Since commit 682a3385 ("mm, page_alloc: inline the fast path of the
      zonelist iterator") we replace a NULL nodemask with
      cpuset_current_mems_allowed in the fast path, so that
      get_page_from_freelist() filters nodes allowed by the cpuset via
      for_next_zone_zonelist_nodemask().
      
      In that case it's pointless to additionally check
      __cpuset_zone_allowed() in each iteration, which we can avoid by not
      adding ALLOC_CPUSET to alloc_flags in that scenario.
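      
      A hedged sketch of the resulting fast-path logic (cpusets_enabled()
      being the static-key check mentioned below):
      
         if (cpusets_enabled()) {
                 *alloc_mask |= __GFP_HARDWALL;
                 if (!ac->nodemask)
                         /* nodemask already reflects the cpuset: no ALLOC_CPUSET */
                         ac->nodemask = &cpuset_current_mems_allowed;
                 else
                         *alloc_flags |= ALLOC_CPUSET;
         }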
      
      This saves some cycles in the allocator fast path on systems with one
      or more non-root cpusets configured.  In the slow path, ALLOC_CPUSET
      is reset according to __alloc_pages_slowpath().  Without configured
      cpusets, this code is disabled by a static key.
      
      Link: http://lkml.kernel.org/r/20170124150511.5710-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51047820
    • V
      mm, page_alloc: remove redundant checks from alloc fastpath · df76cee6
      Committed by Vlastimil Babka
      The allocation fast path contains two similar checks for zoneref->zone
      being NULL, where zoneref points either to the first zone in the
      zonelist, or to the preferred zone.  These can be NULL either due to
      empty zonelist, or no zone being compatible with given nodemask or
      task's cpuset.
      
      These checks are unnecessary, because the zonelist walks in
      first_zones_zonelist() and get_page_from_freelist() handle a NULL
      starting zoneref->zone or preferred_zoneref->zone safely.  It's safe
      to fall back to __alloc_pages_slowpath(), where we also have the
      check early enough.
      
      Link: http://lkml.kernel.org/r/20170124150511.5710-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df76cee6
    • M
      mm, page_alloc: only use per-cpu allocator for irq-safe requests · 374ad05a
      Committed by Mel Gorman
      Many workloads that allocate pages are not handling an interrupt at
      the time.  Because allocation requests may come from IRQ context,
      IRQs must be disabled/enabled around every page allocation.  This
      cost is the bulk of the free path and also a significant percentage
      of the allocation path.
      
      This patch alters the locking and checks such that only irq-safe
      allocation requests use the per-cpu allocator.  All others acquire the
      irq-safe zone->lock and allocate from the buddy allocator.  It relies on
      disabling preemption to safely access the per-cpu structures.  It could
      be slightly modified to avoid soft IRQs using it but it's not clear it's
      worthwhile.
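      
      A hedged sketch of the resulting policy (helper names and argument
      lists abridged):
      
         if (likely(order == 0) && !in_interrupt()) {
                 /* order-0 from !irq context: per-cpu list, no IRQ toggling */
                 preempt_disable();
                 page = rmqueue_pcplist(zone, migratetype);
                 preempt_enable();
         } else {
                 /* IRQ context or high-order: irq-safe zone lock, buddy lists */
                 spin_lock_irqsave(&zone->lock, flags);
                 page = __rmqueue(zone, order, migratetype);
                 spin_unlock_irqrestore(&zone->lock, flags);
         }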
      
      This modification may slow allocations from IRQ context slightly but the
      main gain from the per-cpu allocator is that it scales better for
      allocations from multiple contexts.  There is an implicit assumption
      that intensive allocations from IRQ contexts on multiple CPUs from a
      single NUMA node are rare and that the vast majority of scaling issues
      are encountered in !IRQ contexts such as page faulting.  It's worth
      noting that this patch is not required for a bulk page allocator but it
      significantly reduces the overhead.
      
      The following is results from a page allocator micro-benchmark.  Only
      order-0 is interesting as higher orders do not use the per-cpu allocator
      
                                                4.10.0-rc2                 4.10.0-rc2
                                                   vanilla               irqsafe-v1r5
      Amean    alloc-odr0-1               287.15 (  0.00%)           219.00 ( 23.73%)
      Amean    alloc-odr0-2               221.23 (  0.00%)           183.23 ( 17.18%)
      Amean    alloc-odr0-4               187.00 (  0.00%)           151.38 ( 19.05%)
      Amean    alloc-odr0-8               167.54 (  0.00%)           132.77 ( 20.75%)
      Amean    alloc-odr0-16              156.00 (  0.00%)           123.00 ( 21.15%)
      Amean    alloc-odr0-32              149.00 (  0.00%)           118.31 ( 20.60%)
      Amean    alloc-odr0-64              138.77 (  0.00%)           116.00 ( 16.41%)
      Amean    alloc-odr0-128             145.00 (  0.00%)           118.00 ( 18.62%)
      Amean    alloc-odr0-256             136.15 (  0.00%)           125.00 (  8.19%)
      Amean    alloc-odr0-512             147.92 (  0.00%)           121.77 ( 17.68%)
      Amean    alloc-odr0-1024            147.23 (  0.00%)           126.15 ( 14.32%)
      Amean    alloc-odr0-2048            155.15 (  0.00%)           129.92 ( 16.26%)
      Amean    alloc-odr0-4096            164.00 (  0.00%)           136.77 ( 16.60%)
      Amean    alloc-odr0-8192            166.92 (  0.00%)           138.08 ( 17.28%)
      Amean    alloc-odr0-16384           159.00 (  0.00%)           138.00 ( 13.21%)
      Amean    free-odr0-1                165.00 (  0.00%)            89.00 ( 46.06%)
      Amean    free-odr0-2                113.00 (  0.00%)            63.00 ( 44.25%)
      Amean    free-odr0-4                 99.00 (  0.00%)            54.00 ( 45.45%)
      Amean    free-odr0-8                 88.00 (  0.00%)            47.38 ( 46.15%)
      Amean    free-odr0-16                83.00 (  0.00%)            46.00 ( 44.58%)
      Amean    free-odr0-32                80.00 (  0.00%)            44.38 ( 44.52%)
      Amean    free-odr0-64                72.62 (  0.00%)            43.00 ( 40.78%)
      Amean    free-odr0-128               78.00 (  0.00%)            42.00 ( 46.15%)
      Amean    free-odr0-256               80.46 (  0.00%)            57.00 ( 29.16%)
      Amean    free-odr0-512               96.38 (  0.00%)            64.69 ( 32.88%)
      Amean    free-odr0-1024             107.31 (  0.00%)            72.54 ( 32.40%)
      Amean    free-odr0-2048             108.92 (  0.00%)            78.08 ( 28.32%)
      Amean    free-odr0-4096             113.38 (  0.00%)            82.23 ( 27.48%)
      Amean    free-odr0-8192             112.08 (  0.00%)            82.85 ( 26.08%)
      Amean    free-odr0-16384            110.38 (  0.00%)            81.92 ( 25.78%)
      Amean    total-odr0-1               452.15 (  0.00%)           308.00 ( 31.88%)
      Amean    total-odr0-2               334.23 (  0.00%)           246.23 ( 26.33%)
      Amean    total-odr0-4               286.00 (  0.00%)           205.38 ( 28.19%)
      Amean    total-odr0-8               255.54 (  0.00%)           180.15 ( 29.50%)
      Amean    total-odr0-16              239.00 (  0.00%)           169.00 ( 29.29%)
      Amean    total-odr0-32              229.00 (  0.00%)           162.69 ( 28.96%)
      Amean    total-odr0-64              211.38 (  0.00%)           159.00 ( 24.78%)
      Amean    total-odr0-128             223.00 (  0.00%)           160.00 ( 28.25%)
      Amean    total-odr0-256             216.62 (  0.00%)           182.00 ( 15.98%)
      Amean    total-odr0-512             244.31 (  0.00%)           186.46 ( 23.68%)
      Amean    total-odr0-1024            254.54 (  0.00%)           198.69 ( 21.94%)
      Amean    total-odr0-2048            264.08 (  0.00%)           208.00 ( 21.24%)
      Amean    total-odr0-4096            277.38 (  0.00%)           219.00 ( 21.05%)
      Amean    total-odr0-8192            279.00 (  0.00%)           220.92 ( 20.82%)
      Amean    total-odr0-16384           269.38 (  0.00%)           219.92 ( 18.36%)
      
      This is the alloc, free and total overhead of allocating order-0 pages
      in batches of 1 page up to 16384 pages.  Avoiding the IRQ
      disabling/enabling massively reduces the overhead.  Alloc overhead is
      roughly reduced by 14-20% in most cases.  The free path is reduced by
      total reduction is significant.
      
      Many users require zeroing of pages from the page allocator, and
      zeroing is the vast majority of the allocation cost.  Hence, the
      impact on a basic page-faulting benchmark is not that significant:
      
                                    4.10.0-rc2            4.10.0-rc2
                                       vanilla          irqsafe-v1r5
      Hmean    page_test   656632.98 (  0.00%)   675536.13 (  2.88%)
      Hmean    brk_test   3845502.67 (  0.00%)  3867186.94 (  0.56%)
      Stddev   page_test    10543.29 (  0.00%)     4104.07 ( 61.07%)
      Stddev   brk_test     33472.36 (  0.00%)    15538.39 ( 53.58%)
      CoeffVar page_test        1.61 (  0.00%)        0.61 ( 62.15%)
      CoeffVar brk_test         0.87 (  0.00%)        0.40 ( 53.84%)
      Max      page_test   666513.33 (  0.00%)   678640.00 (  1.82%)
      Max      brk_test   3882800.00 (  0.00%)  3887008.66 (  0.11%)
      
      This is from aim9 and the most notable outcome is that fault variability
      is reduced by the patch.  The headline improvement is small as the
      overall fault cost, zeroing, page table insertion etc dominate relative
      to disabling/enabling IRQs in the per-cpu allocator.
      
      Similarly, little benefit was seen on networking benchmarks, both
      localhost and between physical server/clients, where other costs
      dominate.  It's possible that this will only be noticeable on very
      high-speed networks.
      
      Jesper Dangaard Brouer independently tested this with a separate
      microbenchmark from
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
      Micro-benchmarked with [1] page_bench02:
       modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
        rmmod page_bench02 ; dmesg --notime | tail -n 4
      
      Compared to baseline: 213 cycles(tsc) 53.417 ns
       - against this     : 184 cycles(tsc) 46.056 ns
       - Saving           : -29 cycles
       - Very close to expected 27 cycles saving [see below [2]]
      
      Micro-benchmarking via time_bench_sample[3] gives the cost of these
      operations:
      
       time_bench: Type:for_loop                 Per elem: 0 cycles(tsc) 0.232 ns (step:0)
       time_bench: Type:spin_lock_unlock         Per elem: 33 cycles(tsc) 8.334 ns (step:0)
       time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
       time_bench: Type:irqsave_before_lock      Per elem: 57 cycles(tsc) 14.344 ns (step:0)
       time_bench: Type:spin_lock_unlock_irq     Per elem: 34 cycles(tsc) 8.560 ns (step:0)
       time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
       time_bench: Type:local_BH_disable_enable  Per elem: 19 cycles(tsc) 4.920 ns (step:0)
       time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
       time_bench: Type:local_irq_save_restore   Per elem: 38 cycles(tsc) 9.665 ns (step:0)
       [Mel's patch removes a ^^^^^^^^^^^^^^^^]            ^^^^^^^^^ expected saving - preempt cost
       time_bench: Type:preempt_disable_enable   Per elem: 11 cycles(tsc) 2.794 ns (step:0)
       [adds a preempt  ^^^^^^^^^^^^^^^^^^^^^^]            ^^^^^^^^^ adds this cost
       time_bench: Type:funcion_call_cost        Per elem: 6 cycles(tsc) 1.689 ns (step:0)
       time_bench: Type:func_ptr_call_cost       Per elem: 11 cycles(tsc) 2.767 ns (step:0)
       time_bench: Type:page_alloc_put           Per elem: 211 cycles(tsc) 52.803 ns (step:0)
      
      Thus, expected improvement is: 38-11 = 27 cycles.
      
      [mgorman@techsingularity.net: s/preempt_enable_no_resched/preempt_enable/]
        Link: http://lkml.kernel.org/r/20170208143128.25ahymqlyspjcixu@techsingularity.net
      Link: http://lkml.kernel.org/r/20170123153906.3122-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      374ad05a
    • M
      mm, page_alloc: do not depend on cpu hotplug locks inside the allocator · a459eeb7
      Committed by Michal Hocko
      Dmitry has reported the following lockdep splat:
        lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
        __mutex_lock_common kernel/locking/mutex.c:521 [inline]
        mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621
        pcpu_alloc+0xbda/0x1280 mm/percpu.c:896
        __alloc_percpu+0x24/0x30 mm/percpu.c:1075
        smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:44
        cpuhp_invoke_callback+0x254/0x1480 kernel/cpu.c:136
        cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:493
        _cpu_up+0x1e3/0x2a0 kernel/cpu.c:1057
        do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
        cpu_up+0x18/0x20 kernel/cpu.c:1095
        smp_init+0xe9/0xee kernel/smp.c:564
        kernel_init_freeable+0x439/0x690 init/main.c:1010
        kernel_init+0x13/0x180 init/main.c:941
        ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433
      
      cpu_hotplug_begin
        cpu_hotplug.lock
      pcpu_alloc
        pcpu_alloc_mutex
      
        get_online_cpus+0x62/0x90 kernel/cpu.c:248
        drain_all_pages+0xf8/0x710 mm/page_alloc.c:2385
        __alloc_pages_direct_reclaim mm/page_alloc.c:3440 [inline]
        __alloc_pages_slowpath+0x8fd/0x2370 mm/page_alloc.c:3778
        __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3980
        __alloc_pages include/linux/gfp.h:426 [inline]
        __alloc_pages_node include/linux/gfp.h:439 [inline]
        alloc_pages_node include/linux/gfp.h:453 [inline]
        pcpu_alloc_pages mm/percpu-vm.c:93 [inline]
        pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
        pcpu_alloc+0xe01/0x1280 mm/percpu.c:998
        __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1062
        bpf_array_alloc_percpu kernel/bpf/arraymap.c:34 [inline]
        array_map_alloc+0x532/0x710 kernel/bpf/arraymap.c:99
        find_and_alloc_map kernel/bpf/syscall.c:34 [inline]
        map_create kernel/bpf/syscall.c:188 [inline]
        SYSC_bpf kernel/bpf/syscall.c:870 [inline]
        SyS_bpf+0xd64/0x2500 kernel/bpf/syscall.c:827
        entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      pcpu_alloc
        pcpu_alloc_mutex
      drain_all_pages
        get_online_cpus
          cpu_hotplug.lock
      
        cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:304
        _cpu_up+0xca/0x2a0 kernel/cpu.c:1011
        do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
        cpu_up+0x18/0x20 kernel/cpu.c:1095
        smp_init+0xe9/0xee kernel/smp.c:564
        kernel_init_freeable+0x439/0x690 init/main.c:1010
        kernel_init+0x13/0x180 init/main.c:941
        ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433
      
      cpu_hotplug_begin
        cpu_hotplug.lock
      
      Pulling cpu hotplug locks inside the page allocator is just too
      dangerous.  Let's remove the dependency by dropping get_online_cpus()
      from drain_all_pages.  This is not so simple though, because we then
      no longer have protection against cpu hotplug, which means two
      things:
      
        - the work item might be executed by a worker from the unbound
          pool on a different cpu, so it is not pinned to the cpu it is
          draining
      
        - we have to make sure that we do not race with page_alloc_cpu_dead
          calling drain_pages_zone
      
      Disabling preemption in drain_local_pages_wq solves the first
      problem: drain_local_pages determines its local CPU from the WQ
      context, which is stable after that point (page_alloc_cpu_dead is
      pinned to the CPU already).  The latter condition is achieved by
      disabling IRQs in drain_pages_zone.
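      
      A hedged sketch of the first part of the fix:
      
         static void drain_local_pages_wq(struct work_struct *work)
         {
                 /*
                  * A worker from the unbound pool can migrate between
                  * CPUs; disabling preemption pins it so that
                  * drain_local_pages() sees a stable local CPU for the
                  * duration of the drain.
                  */
                 preempt_disable();
                 drain_local_pages(NULL);
                 preempt_enable();
         }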
      
      Fixes: mm, page_alloc: drain per-cpu pages from workqueue context
      Link: http://lkml.kernel.org/r/20170207201950.20482-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a459eeb7
    • M
      mm, page_alloc: drain per-cpu pages from workqueue context · 0ccce3b9
      Committed by Mel Gorman
      The per-cpu page allocator can be drained immediately via
      drain_all_pages() which sends IPIs to every CPU.  In the next patch, the
      per-cpu allocator will only be used for interrupt-safe allocations which
      prevents draining it from IPI context.  This patch uses workqueues to
      drain the per-cpu lists instead.
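      
      A hedged sketch of the queue-and-flush scheme (pcpu_drain being the
      static per-cpu work_struct this series converges on, see bd233f53
      above):
      
         for_each_online_cpu(cpu) {
                 struct work_struct *work = per_cpu_ptr(&pcpu_drain, cpu);
      
                 INIT_WORK(work, drain_local_pages_wq);
                 schedule_work_on(cpu, work);
         }
         for_each_online_cpu(cpu)
                 flush_work(per_cpu_ptr(&pcpu_drain, cpu));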
      
      This is slower, but no slowdown during intensive reclaim was
      measured, and the paths that use drain_all_pages() are not that
      sensitive to performance.  This is particularly true as the path
      would only be triggered when reclaim is failing.  It also makes some
      sense to avoid storming a machine with IPIs when it's under memory
      pressure.  Arguably, it should be further adjusted so that only one
      caller at a time is draining pages, but that's beyond the scope of
      the current patch.
      
      Link: http://lkml.kernel.org/r/20170123153906.3122-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ccce3b9
    • M
      mm, page_alloc: split alloc_pages_nodemask() · 9cd75558
      Committed by Mel Gorman
      alloc_pages_nodemask does a number of preparation steps that
      determine which zones can be used for the allocation, depending on a
      variety of factors.  This is fine, but a hypothetical caller that
      wanted multiple order-0 pages would have to do the preparation steps
      multiple times.  This patch structures __alloc_pages_nodemask such
      that it's relatively easy to build a bulk order-0 page allocator.
      There is no functional change.
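      
      A hedged sketch of the resulting structure (argument lists abridged):
      
         struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                                             struct zonelist *zonelist,
                                             nodemask_t *nodemask)
         {
                 struct alloc_context ac = { };
                 unsigned int alloc_flags = ALLOC_WMARK_LOW;
                 gfp_t alloc_mask;
                 struct page *page;
      
                 /* one-off preparation a bulk allocator could share */
                 if (!prepare_alloc_pages(gfp_mask, order, zonelist, nodemask,
                                          &ac, &alloc_mask, &alloc_flags))
                         return NULL;
                 finalise_ac(gfp_mask, order, &ac);
      
                 page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
                 if (!page)
                         page = __alloc_pages_slowpath(alloc_mask, order, &ac);
                 return page;
         }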
      
      Link: http://lkml.kernel.org/r/20170123153906.3122-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9cd75558
    • M
      mm, page_alloc: split buffered_rmqueue() · 066b2393
      Committed by Mel Gorman
      Patch series "Use per-cpu allocator for !irq requests and prepare for a
      bulk allocator", v5.
      
      This series is motivated by a conversation led by Jesper Dangaard Brouer
      at the last LSF/MM proposing a generic page pool for DMA-coherent pages.
      Part of his motivation was the overhead of allocating multiple
      order-0 pages, which led some drivers to use high-order allocations
      and split them.  This is very slow in some cases.
      
      The first two patches in this series restructure the page allocator
      such that it is relatively easy to introduce an order-0 bulk page
      allocator.  A patch exists to do that and has been handed over to
      Jesper until an in-kernel user is created.  The third patch prevents
      the per-cpu allocator from being drained from IPI context, as that
      could corrupt the list after patch four is merged.  The final patch
      alters the per-cpu allocator to make it exclusive to !irq requests.
      This cuts allocation/free overhead by roughly 30%.
      
      Performance tests from both Jesper and me are included in the patch.
      
      This patch (of 4):
      
      buffered_rmqueue removes a page from a given zone and uses the
      per-cpu list for order-0.  This is fine, but a hypothetical caller
      that wanted multiple order-0 pages would have to disable/re-enable
      interrupts multiple times.  This patch structures buffered_rmqueue
      such that it's relatively easy to build a bulk order-0 page
      allocator.  There is no functional change.
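      
      A hedged sketch of the split (argument lists abridged): the per-cpu
      leg becomes a helper that a future bulk allocator could call
      repeatedly inside a single IRQ-disabled section:
      
         static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
                                               struct per_cpu_pages *pcp,
                                               struct list_head *list)
         {
                 struct page *page;
      
                 do {
                         if (list_empty(list)) {
                                 pcp->count += rmqueue_bulk(zone, 0, pcp->batch,
                                                            list, migratetype);
                                 if (unlikely(list_empty(list)))
                                         return NULL;
                         }
                         page = list_first_entry(list, struct page, lru);
                         list_del(&page->lru);
                         pcp->count--;
                 } while (check_new_pcp(page));
      
                 return page;
         }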
      
      [mgorman@techsingularity.net: failed per-cpu refill may blow up]
        Link: http://lkml.kernel.org/r/20170124112723.mshmgwq2ihxku2um@techsingularity.net
      Link: http://lkml.kernel.org/r/20170123153906.3122-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      066b2393
  14. 23 Feb 2017, 1 commit