1. 04 5月, 2017 1 次提交
    • J
      mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · c73322d0
      Johannes Weiner 提交于
      Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
      cleanups".
      
      Jia reported a scenario in which the kswapd of a node indefinitely spins
      at 100% CPU usage.  We have seen similar cases at Facebook.
      
      The kernel's current method of judging its ability to reclaim a node (or
      whether to back off and sleep) is based on the amount of scanned pages
      in proportion to the amount of reclaimable pages.  In Jia's and our
      scenarios, there are no reclaimable pages in the node, however, and the
      condition for backing off is never met.  Kswapd busyloops in an attempt
      to restore the watermarks while having nothing to work with.
      
      This series reworks the definition of an unreclaimable node based not on
      scanning but on whether kswapd is able to actually reclaim pages in
      MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criteria
      the page allocator uses for giving up on direct reclaim and invoking the
      OOM killer.  If it cannot free any pages, kswapd will go to sleep and
      leave further attempts to direct reclaim invocations, which will either
      make progress and re-enable kswapd, or invoke the OOM killer.
      
      Patch #1 fixes the immediate problem Jia reported, the remainder are
      smaller fixlets, cleanups, and overall phasing out of the old method.
      
      Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
      and directly related to #5, but in itself not relevant to the series.
      
      If the whole series is too ambitious for 4.11, I would consider the
      first three patches fixes, the rest cleanups.
      
      This patch (of 9):
      
      Jia He reports a problem with kswapd spinning at 100% CPU when
      requesting more hugepages than memory available in the system:
      
      $ echo 4000 >/proc/sys/vm/nr_hugepages
      
      top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
      Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
      KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
      
      At that time, there are no reclaimable pages left in the node, but as
      kswapd fails to restore the high watermarks it refuses to go to sleep.
      
      Kswapd needs to back away from nodes that fail to balance.  Up until
      commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") kswapd had such a mechanism.  It considered zones whose
      theoretically reclaimable pages it had reclaimed six times over as
      unreclaimable and backed away from them.  This guard was erroneously
      removed as the patch changed the definition of a balanced node.
      
      However, simply restoring this code wouldn't help in the case reported
      here: there *are* no reclaimable pages that could be scanned until the
      threshold is met.  Kswapd would stay awake anyway.
      
      Introduce a new and much simpler way of backing off.  If kswapd runs
      through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
      page, make it back off from the node.  This is the same number of shots
      direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
      that node until a direct reclaimer manages to reclaim some pages, thus
      proving the node reclaimable again.
      
      [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
        Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
      [shakeelb@google.com: fix condition for throttle_direct_reclaim]
        Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NShakeel Butt <shakeelb@google.com>
      Reported-by: NJia He <hejianet@gmail.com>
      Tested-by: NJia He <hejianet@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c73322d0
  2. 20 4月, 2017 1 次提交
    • M
      mm: make mm_percpu_wq non freezable · 80d136e1
      Michal Hocko 提交于
      Geert has reported a freeze during PM resume and some additional
      debugging has shown that the device_resume worker cannot make a forward
      progress because it waits for an event which is stuck waiting in
      drain_all_pages:
      
        INFO: task kworker/u4:0:5 blocked for more than 120 seconds.
              Not tainted 4.11.0-rc7-koelsch-00029-g005882e5-dirty #3476
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        kworker/u4:0    D    0     5      2 0x00000000
        Workqueue: events_unbound async_run_entry_fn
          __schedule
          schedule
          schedule_timeout
          wait_for_common
          dpm_wait_for_superior
          device_resume
          async_resume
          async_run_entry_fn
          process_one_work
          worker_thread
          kthread
        [...]
        bash            D    0  1703   1694 0x00000000
          __schedule
          schedule
          schedule_timeout
          wait_for_common
          flush_work
          drain_all_pages
          start_isolate_page_range
          alloc_contig_range
          cma_alloc
          __alloc_from_contiguous
          cma_allocator_alloc
          __dma_alloc
          arm_dma_alloc
          sh_eth_ring_init
          sh_eth_open
          sh_eth_resume
          dpm_run_callback
          device_resume
          dpm_resume
          dpm_resume_end
          suspend_devices_and_enter
          pm_suspend
          state_store
          kernfs_fop_write
          __vfs_write
          vfs_write
          SyS_write
        [...]
        Showing busy workqueues and worker pools:
        [...]
        workqueue mm_percpu_wq: flags=0xc
          pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=0/0
            delayed: drain_local_pages_wq, vmstat_update
          pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=0/0
            delayed: drain_local_pages_wq BAR(1703), vmstat_update
      
      Tetsuo has properly noted that mm_percpu_wq is created as WQ_FREEZABLE
      so it is frozen this early during resume so we are effectively
      deadlocked.  Fix this by dropping WQ_FREEZABLE when creating
      mm_percpu_wq.  We really want to have it operational all the time.
      
      Fixes: ce612879 ("mm: move pcp and lru-pcp draining into single wq")
      Reported-and-tested-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Debugged-by: NTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      80d136e1
  3. 08 4月, 2017 1 次提交
  4. 01 4月, 2017 1 次提交
    • M
      mm: move mm_percpu_wq initialization earlier · 597b7305
      Michal Hocko 提交于
      Yang Li has reported that drain_all_pages triggers a WARN_ON which means
      that this function is called earlier than the mm_percpu_wq is
      initialized on arm64 with CMA configured:
      
        WARNING: CPU: 2 PID: 1 at mm/page_alloc.c:2423 drain_all_pages+0x244/0x25c
        Modules linked in:
        CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.11.0-rc1-next-20170310-00027-g64dfbc5 #127
        Hardware name: Freescale Layerscape 2088A RDB Board (DT)
        task: ffffffc07c4a6d00 task.stack: ffffffc07c4a8000
        PC is at drain_all_pages+0x244/0x25c
        LR is at start_isolate_page_range+0x14c/0x1f0
        [...]
         drain_all_pages+0x244/0x25c
         start_isolate_page_range+0x14c/0x1f0
         alloc_contig_range+0xec/0x354
         cma_alloc+0x100/0x1fc
         dma_alloc_from_contiguous+0x3c/0x44
         atomic_pool_init+0x7c/0x208
         arm64_dma_init+0x44/0x4c
         do_one_initcall+0x38/0x128
         kernel_init_freeable+0x1a0/0x240
         kernel_init+0x10/0xfc
         ret_from_fork+0x10/0x20
      
      Fix this by moving the whole setup_vmstat which is an initcall right now
      to init_mm_internals which will be called right after the WQ subsystem
      is initialized.
      
      Link: http://lkml.kernel.org/r/20170315164021.28532-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NYang Li <pku.leo@gmail.com>
      Tested-by: NYang Li <pku.leo@gmail.com>
      Tested-by: NXiaolong Ye <xiaolong.ye@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      597b7305
  5. 10 3月, 2017 1 次提交
  6. 23 2月, 2017 1 次提交
  7. 02 12月, 2016 3 次提交
  8. 08 10月, 2016 4 次提交
    • J
      seq/proc: modify seq_put_decimal_[u]ll to take a const char *, not char · 75ba1d07
      Joe Perches 提交于
      Allow some seq_puts removals by taking a string instead of a single
      char.
      
      [akpm@linux-foundation.org: update vmstat_show(), per Joe]
      Link: http://lkml.kernel.org/r/667e1cf3d436de91a5698170a1e98d882905e956.1470704995.git.joe@perches.comSigned-off-by: NJoe Perches <joe@perches.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75ba1d07
    • A
      proc: much faster /proc/vmstat · 68ba0326
      Alexey Dobriyan 提交于
      Every current KDE system has process named ksysguardd polling files
      below once in several seconds:
      
      	$ strace -e trace=open -p $(pidof ksysguardd)
      	Process 1812 attached
      	open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 8
      	open("/etc/mtab", O_RDONLY|O_CLOEXEC)   = 8
      	open("/proc/net/dev", O_RDONLY)         = 8
      	open("/proc/net/wireless", O_RDONLY)    = -1 ENOENT (No such file or directory)
      	open("/proc/stat", O_RDONLY)            = 8
      	open("/proc/vmstat", O_RDONLY)          = 8
      
      Hell knows what it is doing but speed up reading /proc/vmstat by 33%!
      
      Benchmark is open+read+close 1.000.000 times.
      
      			BEFORE
      $ perf stat -r 10 taskset -c 3 ./proc-vmstat
      
       Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):
      
            13146.768464      task-clock (msec)         #    0.960 CPUs utilized            ( +-  0.60% )
                      15      context-switches          #    0.001 K/sec                    ( +-  1.41% )
                       1      cpu-migrations            #    0.000 K/sec                    ( +- 11.11% )
                     104      page-faults               #    0.008 K/sec                    ( +-  0.57% )
          45,489,799,349      cycles                    #    3.460 GHz                      ( +-  0.03% )
           9,970,175,743      stalled-cycles-frontend   #   21.92% frontend cycles idle     ( +-  0.10% )
           2,800,298,015      stalled-cycles-backend    #   6.16% backend cycles idle       ( +-  0.32% )
          79,241,190,850      instructions              #    1.74  insn per cycle
                                                        #    0.13  stalled cycles per insn  ( +-  0.00% )
          17,616,096,146      branches                  # 1339.956 M/sec                    ( +-  0.00% )
             176,106,232      branch-misses             #    1.00% of all branches          ( +-  0.18% )
      
            13.691078109 seconds time elapsed                                          ( +-  0.03% )
            ^^^^^^^^^^^^
      
      			AFTER
      $ perf stat -r 10 taskset -c 3 ./proc-vmstat
      
       Performance counter stats for 'taskset -c 3 ./proc-vmstat' (10 runs):
      
             8688.353749      task-clock (msec)         #    0.950 CPUs utilized            ( +-  1.25% )
                      10      context-switches          #    0.001 K/sec                    ( +-  2.13% )
                       1      cpu-migrations            #    0.000 K/sec
                     104      page-faults               #    0.012 K/sec                    ( +-  0.56% )
          30,384,010,730      cycles                    #    3.497 GHz                      ( +-  0.07% )
          12,296,259,407      stalled-cycles-frontend   #   40.47% frontend cycles idle     ( +-  0.13% )
           3,370,668,651      stalled-cycles-backend    #  11.09% backend cycles idle       ( +-  0.69% )
          28,969,052,879      instructions              #    0.95  insn per cycle
                                                        #    0.42  stalled cycles per insn  ( +-  0.01% )
           6,308,245,891      branches                  #  726.058 M/sec                    ( +-  0.00% )
             214,685,502      branch-misses             #    3.40% of all branches          ( +-  0.26% )
      
             9.146081052 seconds time elapsed                                          ( +-  0.07% )
             ^^^^^^^^^^^
      
      vsnprintf() is slow because:
      
      1. format_decode() is busy looking for format specifier: 2 branches
         per character (not in this case, but in others)
      
      2. approximately million branches while parsing format mini language
         and everywhere
      
      3.  just look at what string() does /proc/vmstat is good case because
         most of its content are strings
      
      Link: http://lkml.kernel.org/r/20160806125455.GA1187@p183.telecom.bySigned-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68ba0326
    • T
      cpu: fix node state for whether it contains CPU · 03e86dba
      Tim Chen 提交于
      In current kernel code, we only call node_set_state(cpu_to_node(cpu),
      N_CPU) when a cpu is hot plugged.  But we do not set the node state for
      N_CPU when the cpus are brought online during boot.
      
      So this could lead to failure when we check to see if a node contains
      cpu with node_state(node_id, N_CPU).
      
      One use case is in the node_reclaime function:
      
              /*
               * Only run node reclaim on the local node or on nodes that do
               * not
               * have associated processors. This will favor the local
               * processor
               * over remote processors and spread off node memory allocations
               * as wide as possible.
               */
              if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id !=
                      numa_node_id())
                      return NODE_RECLAIM_NOSCAN;
      
      I instrumented the kernel to call this function after boot and it always
      returns 0 on a x86 desktop machine until I apply the attached patch.
      
         int num_cpu_node(void)
         {
             int i, nr_cpu_nodes = 0;
      
             for_each_node(i) {
                     if (node_state(i, N_CPU))
                             ++ nr_cpu_nodes;
             }
      
             return nr_cpu_nodes;
         }
      
      Fix this by checking each node for online CPU when we initialize
      vmstat that's responsible for maintaining node state.
      
      Link: http://lkml.kernel.org/r/20160829175922.GA21775@linux.intel.comSigned-off-by: NTim Chen <tim.c.chen@linux.intel.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: <Huang@linux.intel.com>
      Cc: Ying <ying.huang@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      03e86dba
    • J
      mm/page_owner: move page_owner specific function to page_owner.c · e2f612e6
      Joonsoo Kim 提交于
      There is no reason that page_owner specific function resides on
      vmstat.c.
      
      Link: http://lkml.kernel.org/r/1471315879-32294-4-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2f612e6
  9. 29 7月, 2016 12 次提交
    • M
      mm: remove reclaim and compaction retry approximations · 5a1c84b4
      Mel Gorman 提交于
      If per-zone LRU accounting is available then there is no point
      approximating whether reclaim and compaction should retry based on pgdat
      statistics.  This is effectively a revert of "mm, vmstat: remove zone
      and node double accounting by approximating retries" with the difference
      that inactive/active stats are still available.  This preserves the
      history of why the approximation was retried and why it had to be
      reverted to handle OOM kills on 32-bit systems.
      
      Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a1c84b4
    • M
      mm: add per-zone lru list stat · 71c799f4
      Minchan Kim 提交于
      When I did stress test with hackbench, I got OOM message frequently
      which didn't ever happen in zone-lru.
      
        gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
        ..
        ..
         __alloc_pages_nodemask+0xe52/0xe60
         ? new_slab+0x39c/0x3b0
         new_slab+0x39c/0x3b0
         ___slab_alloc.constprop.87+0x6da/0x840
         ? __alloc_skb+0x3c/0x260
         ? _raw_spin_unlock_irq+0x27/0x60
         ? trace_hardirqs_on_caller+0xec/0x1b0
         ? finish_task_switch+0xa6/0x220
         ? poll_select_copy_remaining+0x140/0x140
         __slab_alloc.isra.81.constprop.86+0x40/0x6d
         ? __alloc_skb+0x3c/0x260
         kmem_cache_alloc+0x22c/0x260
         ? __alloc_skb+0x3c/0x260
         __alloc_skb+0x3c/0x260
         alloc_skb_with_frags+0x4e/0x1a0
         sock_alloc_send_pskb+0x16a/0x1b0
         ? wait_for_unix_gc+0x31/0x90
         ? alloc_set_pte+0x2ad/0x310
         unix_stream_sendmsg+0x28d/0x340
         sock_sendmsg+0x2d/0x40
         sock_write_iter+0x6c/0xc0
         __vfs_write+0xc0/0x120
         vfs_write+0x9b/0x1a0
         ? __might_fault+0x49/0xa0
         SyS_write+0x44/0x90
         do_fast_syscall_32+0xa6/0x1e0
         sysenter_past_esp+0x45/0x74
      
        Mem-Info:
        active_anon:104698 inactive_anon:105791 isolated_anon:192
         active_file:433 inactive_file:283 isolated_file:22
         unevictable:0 dirty:0 writeback:296 unstable:0
         slab_reclaimable:6389 slab_unreclaimable:78927
         mapped:474 shmem:0 pagetables:101426 bounce:0
         free:10518 free_pcp:334 free_cma:0
        Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
        DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 809 1965 1965
        Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
        lowmem_reserve[]: 0 0 9247 9247
        HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
        Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
        HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        25121 total pagecache pages
        24160 pages in swap cache
        Swap cache stats: add 86371, delete 62211, find 42865/60187
        Free swap  = 4015560kB
        Total swap = 4192252kB
        524186 pages RAM
        295934 pages HighMem/MovableOnly
        9658 pages reserved
        0 pages cma reserved
      
      The order-0 allocation for normal zone failed while there are a lot of
      reclaimable memory(i.e., anonymous memory with free swap).  I wanted to
      analyze the problem but it was hard because we removed per-zone lru stat
      so I couldn't know how many of anonymous memory there are in normal/dma
      zone.
      
      When we investigate OOM problem, reclaimable memory count is crucial
      stat to find a problem.  Without it, it's hard to parse the OOM message
      so I believe we should keep it.
      
      With per-zone lru stat,
      
        gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
        Mem-Info:
        active_anon:101103 inactive_anon:102219 isolated_anon:0
         active_file:503 inactive_file:544 isolated_file:0
         unevictable:0 dirty:0 writeback:34 unstable:0
         slab_reclaimable:6298 slab_unreclaimable:74669
         mapped:863 shmem:0 pagetables:100998 bounce:0
         free:23573 free_pcp:1861 free_cma:0
        Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
        DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 809 1965 1965
        Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
        lowmem_reserve[]: 0 0 9247 9247
        HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
        Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
        HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        54409 total pagecache pages
        53215 pages in swap cache
        Swap cache stats: add 300982, delete 247765, find 157978/226539
        Free swap  = 3803244kB
        Total swap = 4192252kB
        524186 pages RAM
        295934 pages HighMem/MovableOnly
        9642 pages reserved
        0 pages cma reserved
      
      With that, we can see normal zone has a 86M reclaimable memory so we can
      know something goes wrong(I will fix the problem in next patch) in
      reclaim.
      
      [mgorman@techsingularity.net: rename zone LRU stats in /proc/vmstat]
       Link: http://lkml.kernel.org/r/20160725072300.GK10438@techsingularity.net
      Link: http://lkml.kernel.org/r/1469110261-7365-2-git-send-email-mgorman@techsingularity.netSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71c799f4
    • M
      mm, vmstat: remove zone and node double accounting by approximating retries · bca67592
      Mel Gorman 提交于
      The number of LRU pages, dirty pages and writeback pages must be
      accounted for on both zones and nodes because of the reclaim retry
      logic, compaction retry logic and highmem calculations all depending on
      per-zone stats.
      
      Many lowmem allocations are immune from OOM kill due to a check in
      __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
      03668b3c ("oom: avoid oom killer for lowmem allocations").  The
      exception is costly high-order allocations or allocations that cannot
      fail.  If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
      allocations then it would fall through to __alloc_pages_direct_compact.
      
      This patch will blindly retry reclaim for zone-constrained allocations
      in should_reclaim_retry up to MAX_RECLAIM_RETRIES.  This is not ideal
      but without per-zone stats there are not many alternatives.  The impact
      it that zone-constrained allocations may delay before considering the
      OOM killer.
      
      As there is no guarantee enough memory can ever be freed to satisfy
      compaction, this patch avoids retrying compaction for zone-contrained
      allocations.
      
      In combination, that means that the per-node stats can be used when
      deciding whether to continue reclaim using a rough approximation.  While
      it is possible this will make the wrong decision on occasion, it will
      not infinite loop as the number of reclaim attempts is capped by
      MAX_RECLAIM_RETRIES.
      
      The final step is calculating the number of dirtyable highmem pages.  As
      those calculations only care about the global count of file pages in
      highmem.  This patch uses a global counter used instead of per-zone
      stats as it is sufficient.
      
      In combination, this allows the per-zone LRU and dirty state counters to
      be removed.
      
      [mgorman@techsingularity.net: fix acct_highmem_file_pages()]
        Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.netLink: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Suggested by: Michal Hocko <mhocko@kernel.org>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bca67592
    • M
      mm, vmstat: print node-based stats in zoneinfo file · e2ecc8a7
      Mel Gorman 提交于
      There are a number of stats that were previously accessible via zoneinfo
      that are now invisible.  While it is possible to create a new file for
      the node stats, this may be missed by users.  Instead this patch prints
      the stats under the first populated zone in /proc/zoneinfo.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-34-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2ecc8a7
    • M
      mm: vmstat: account per-zone stalls and pages skipped during reclaim · 7cc30fcf
      Mel Gorman 提交于
      The vmstat allocstall was fairly useful in the general sense but
      node-based LRUs change that.  It's important to know if a stall was for
      an address-limited allocation request as this will require skipping
      pages from other zones.  This patch adds pgstall_* counters to replace
      allocstall.  The sum of the counters will equal the old allocstall so it
      can be trivially recalculated.  A high number of address-limited
      allocation requests may result in a lot of useless LRU scanning for
      suitable pages.
      
      As address-limited allocations require pages to be skipped, it's
      important to know how much useless LRU scanning took place so this patch
      adds pgskip* counters.  This yields the following model
      
      1. The number of address-space limited stalls can be accounted for (pgstall)
      2. The amount of useless work required to reclaim the data is accounted (pgskip)
      3. The total number of scans is available from pgscan_kswapd and pgscan_direct
         so from that the ratio of useful to useless scans can be calculated.
      
      [mgorman@techsingularity.net: s/pgstall/allocstall/]
        Link: http://lkml.kernel.org/r/1468404004-5085-3-git-send-email-mgorman@techsingularity.netLink: http://lkml.kernel.org/r/1467970510-21195-33-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cc30fcf
    • M
      mm, page_alloc: remove fair zone allocation policy · e6cbd7f2
      Mel Gorman 提交于
      The fair zone allocation policy interleaves allocation requests between
      zones to avoid an age inversion problem whereby new pages are reclaimed
      to balance a zone.  Reclaim is now node-based so this should no longer
      be an issue and the fair zone allocation policy is not free.  This patch
      removes it.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-30-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6cbd7f2
    • M
      mm: move vmscan writes and file write accounting to the node · c4a25635
      Mel Gorman 提交于
      As reclaim is now node-based, it follows that page write activity due to
      page reclaim should also be accounted for on the node.  For consistency,
      also account page writes and page dirtying on a per-node basis.
      
      After this patch, there are a few remaining zone counters that may appear
      strange but are fine.  NUMA stats are still per-zone as this is a
      user-space interface that tools consume.  NR_MLOCK, NR_SLAB_*,
      NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
      potentially pin low memory and cannot trivially be reclaimed on demand.
      This information is still useful for debugging a page allocation failure
      warning.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-21-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4a25635
    • M
      mm: move most file-based accounting to the node · 11fb9989
      Mel Gorman 提交于
      There are now a number of accounting oddities such as mapped file pages
      being accounted for on the node while the total number of file pages are
      accounted on the zone.  This can be coped with to some extent but it's
      confusing so this patch moves the relevant file-based accounted.  Due to
      throttling logic in the page allocator for reliable OOM detection, it is
      still necessary to track dirty and writeback pages on a per-zone basis.
      
      [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
        Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11fb9989
    • M
      mm: move page mapped accounting to the node · 50658e2e
      Mel Gorman 提交于
      Reclaim makes decisions based on the number of pages that are mapped but
      it's mixing node and zone information.  Account NR_FILE_MAPPED and
      NR_ANON_PAGES pages on the node.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      50658e2e
    • M
      mm, workingset: make working set detection node-aware · 1e6b1085
      Mel Gorman 提交于
      Working set and refault detection is still zone-based, fix it.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-16-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1e6b1085
    • M
      mm, vmscan: move LRU lists to node · 599d0c95
      Mel Gorman 提交于
      This moves the LRU lists from the zone to the node and related data such
      as counters, tracing, congestion tracking and writeback tracking.
      
      Unfortunately, due to reclaim and compaction retry logic, it is
      necessary to account for the number of LRU pages on both zone and node
      logic.  Most reclaim logic is based on the node counters but the retry
      logic uses the zone counters which do not distinguish inactive and
      active sizes.  It would be possible to leave the LRU counters on a
      per-zone basis but it's a heavier calculation across multiple cache
      lines that is much more frequent than the retry checks.
      
      Other than the LRU counters, this is mostly a mechanical patch but note
      that it introduces a number of anomalies.  For example, the scans are
      per-zone but using per-node counters.  We also mark a node as congested
      when a zone is congested.  This causes weird problems that are fixed
      later but is easier to review.
      
      In the event that there is excessive overhead on 32-bit systems due to
      the nodes being on LRU then there are two potential solutions
      
      1. Long-term isolation of highmem pages when reclaim is lowmem
      
         When pages are skipped, they are immediately added back onto the LRU
         list. If lowmem reclaim persisted for long periods of time, the same
         highmem pages get continually scanned. The idea would be that lowmem
         keeps those pages on a separate list until a reclaim for highmem pages
         arrives that splices the highmem pages back onto the LRU. It potentially
         could be implemented similar to the UNEVICTABLE list.
      
         That would reduce the skip rate with the potential corner case is that
         highmem pages have to be scanned and reclaimed to free lowmem slab pages.
      
      2. Linear scan lowmem pages if the initial LRU shrink fails
      
         This will break LRU ordering but may be preferable and faster during
         memory pressure than skipping LRU pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      599d0c95
    • M
      mm, vmstat: add infrastructure for per-node vmstats · 75ef7184
      Mel Gorman 提交于
      Patchset: "Move LRU page reclaim from zones to nodes v9"
      
      This series moves LRUs from the zones to the node.  While this is a
      current rebase, the test results were based on mmotm as of June 23rd.
      Conceptually, this series is simple but there are a lot of details.
      Some of the broad motivations for this are;
      
      1. The residency of a page partially depends on what zone the page was
         allocated from.  This is partially combatted by the fair zone allocation
         policy but that is a partial solution that introduces overhead in the
         page allocator paths.
      
      2. Currently, reclaim on node 0 behaves slightly different to node 1. For
         example, direct reclaim scans in zonelist order and reclaims even if
         the zone is over the high watermark regardless of the age of pages
         in that LRU. Kswapd on the other hand starts reclaim on the highest
         unbalanced zone. A difference in distribution of file/anon pages due
         to when they were allocated results can result in a difference in
         again. While the fair zone allocation policy mitigates some of the
         problems here, the page reclaim results on a multi-zone node will
         always be different to a single-zone node.
         it was scheduled on as a result.
      
      3. kswapd and the page allocator scan zones in the opposite order to
         avoid interfering with each other but it's sensitive to timing.  This
         mitigates the page allocator using pages that were allocated very recently
         in the ideal case but it's sensitive to timing. When kswapd is allocating
         from lower zones then it's great but during the rebalancing of the highest
         zone, the page allocator and kswapd interfere with each other. It's worse
         if the highest zone is small and difficult to balance.
      
      4. slab shrinkers are node-based which makes it harder to identify the exact
         relationship between slab reclaim and LRU reclaim.
      
      The reason we have zone-based reclaim is that we used to have
      large highmem zones in common configurations and it was necessary
      to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
      less of a concern as machines with lots of memory will (or should) use
      64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
      rare. Machines that do use highmem should have relatively low highmem:lowmem
      ratios than we worried about in the past.
      
      Conceptually, moving to node LRUs should be easier to understand. The
      page allocator plays fewer tricks to game reclaim and reclaim behaves
      similarly on all nodes.
      
      The series has been tested on a 16 core UMA machine and a 2-socket 48
      core NUMA machine. The UMA results are presented in most cases as the NUMA
      machine behaved similarly.
      
      pagealloc
      ---------
      
      This is a microbenchmark that shows the benefit of removing the fair zone
      allocation policy. It was tested uip to order-4 but only orders 0 and 1 are
      shown as the other orders were comparable.
      
                                                 4.7.0-rc4                  4.7.0-rc4
                                            mmotm-20160623                 nodelru-v9
      Min      total-odr0-1               490.00 (  0.00%)           457.00 (  6.73%)
      Min      total-odr0-2               347.00 (  0.00%)           329.00 (  5.19%)
      Min      total-odr0-4               288.00 (  0.00%)           273.00 (  5.21%)
      Min      total-odr0-8               251.00 (  0.00%)           239.00 (  4.78%)
      Min      total-odr0-16              234.00 (  0.00%)           222.00 (  5.13%)
      Min      total-odr0-32              223.00 (  0.00%)           211.00 (  5.38%)
      Min      total-odr0-64              217.00 (  0.00%)           208.00 (  4.15%)
      Min      total-odr0-128             214.00 (  0.00%)           204.00 (  4.67%)
      Min      total-odr0-256             250.00 (  0.00%)           230.00 (  8.00%)
      Min      total-odr0-512             271.00 (  0.00%)           269.00 (  0.74%)
      Min      total-odr0-1024            291.00 (  0.00%)           282.00 (  3.09%)
      Min      total-odr0-2048            303.00 (  0.00%)           296.00 (  2.31%)
      Min      total-odr0-4096            311.00 (  0.00%)           309.00 (  0.64%)
      Min      total-odr0-8192            316.00 (  0.00%)           314.00 (  0.63%)
      Min      total-odr0-16384           317.00 (  0.00%)           315.00 (  0.63%)
      Min      total-odr1-1               742.00 (  0.00%)           712.00 (  4.04%)
      Min      total-odr1-2               562.00 (  0.00%)           530.00 (  5.69%)
      Min      total-odr1-4               457.00 (  0.00%)           433.00 (  5.25%)
      Min      total-odr1-8               411.00 (  0.00%)           381.00 (  7.30%)
      Min      total-odr1-16              381.00 (  0.00%)           356.00 (  6.56%)
      Min      total-odr1-32              372.00 (  0.00%)           346.00 (  6.99%)
      Min      total-odr1-64              372.00 (  0.00%)           343.00 (  7.80%)
      Min      total-odr1-128             375.00 (  0.00%)           351.00 (  6.40%)
      Min      total-odr1-256             379.00 (  0.00%)           351.00 (  7.39%)
      Min      total-odr1-512             385.00 (  0.00%)           355.00 (  7.79%)
      Min      total-odr1-1024            386.00 (  0.00%)           358.00 (  7.25%)
      Min      total-odr1-2048            390.00 (  0.00%)           362.00 (  7.18%)
      Min      total-odr1-4096            390.00 (  0.00%)           362.00 (  7.18%)
      Min      total-odr1-8192            388.00 (  0.00%)           363.00 (  6.44%)
      
      This shows a steady improvement throughout. The primary benefit is from
      reduced system CPU usage which is obvious from the overall times;
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623nodelru-v8
      User          189.19      191.80
      System       2604.45     2533.56
      Elapsed      2855.30     2786.39
      
      The vmstats also showed that the fair zone allocation policy was definitely
      removed as can be seen here;
      
                                   4.7.0-rc3   4.7.0-rc3
                               mmotm-20160623 nodelru-v8
      DMA32 allocs               28794729769           0
      Normal allocs              48432501431 77227309877
      Movable allocs                       0           0
      
      tiobench on ext4
      ----------------
      
      tiobench is a benchmark that artifically benefits if old pages remain resident
      while new pages get reclaimed. The fair zone allocation policy mitigates this
      problem so pages age fairly. While the benchmark has problems, it is important
      that tiobench performance remains constant as it implies that page aging
      problems that the fair zone allocation policy fixes are not re-introduced.
      
                                               4.7.0-rc4             4.7.0-rc4
                                          mmotm-20160623            nodelru-v9
      Min      PotentialReadSpeed        89.65 (  0.00%)       90.21 (  0.62%)
      Min      SeqRead-MB/sec-1          82.68 (  0.00%)       82.01 ( -0.81%)
      Min      SeqRead-MB/sec-2          72.76 (  0.00%)       72.07 ( -0.95%)
      Min      SeqRead-MB/sec-4          75.13 (  0.00%)       74.92 ( -0.28%)
      Min      SeqRead-MB/sec-8          64.91 (  0.00%)       65.19 (  0.43%)
      Min      SeqRead-MB/sec-16         62.24 (  0.00%)       62.22 ( -0.03%)
      Min      RandRead-MB/sec-1          0.88 (  0.00%)        0.88 (  0.00%)
      Min      RandRead-MB/sec-2          0.95 (  0.00%)        0.92 ( -3.16%)
      Min      RandRead-MB/sec-4          1.43 (  0.00%)        1.34 ( -6.29%)
      Min      RandRead-MB/sec-8          1.61 (  0.00%)        1.60 ( -0.62%)
      Min      RandRead-MB/sec-16         1.80 (  0.00%)        1.90 (  5.56%)
      Min      SeqWrite-MB/sec-1         76.41 (  0.00%)       76.85 (  0.58%)
      Min      SeqWrite-MB/sec-2         74.11 (  0.00%)       73.54 ( -0.77%)
      Min      SeqWrite-MB/sec-4         80.05 (  0.00%)       80.13 (  0.10%)
      Min      SeqWrite-MB/sec-8         72.88 (  0.00%)       73.20 (  0.44%)
      Min      SeqWrite-MB/sec-16        75.91 (  0.00%)       76.44 (  0.70%)
      Min      RandWrite-MB/sec-1         1.18 (  0.00%)        1.14 ( -3.39%)
      Min      RandWrite-MB/sec-2         1.02 (  0.00%)        1.03 (  0.98%)
      Min      RandWrite-MB/sec-4         1.05 (  0.00%)        0.98 ( -6.67%)
      Min      RandWrite-MB/sec-8         0.89 (  0.00%)        0.92 (  3.37%)
      Min      RandWrite-MB/sec-16        0.92 (  0.00%)        0.93 (  1.09%)
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 approx-v9
      User          645.72      525.90
      System        403.85      331.75
      Elapsed      6795.36     6783.67
      
      This shows that the series has little or not impact on tiobench which is
      desirable and a reduction in system CPU usage. It indicates that the fair
      zone allocation policy was removed in a manner that didn't reintroduce
      one class of page aging bug. There were only minor differences in overall
      reclaim activity
      
                                   4.7.0-rc4   4.7.0-rc4
                                mmotm-20160623nodelru-v8
      Minor Faults                    645838      647465
      Major Faults                       573         640
      Swap Ins                             0           0
      Swap Outs                            0           0
      DMA allocs                           0           0
      DMA32 allocs                  46041453    44190646
      Normal allocs                 78053072    79887245
      Movable allocs                       0           0
      Allocation stalls                   24          67
      Stall zone DMA                       0           0
      Stall zone DMA32                     0           0
      Stall zone Normal                    0           2
      Stall zone HighMem                   0           0
      Stall zone Movable                   0          65
      Direct pages scanned             10969       30609
      Kswapd pages scanned          93375144    93492094
      Kswapd pages reclaimed        93372243    93489370
      Direct pages reclaimed           10969       30609
      Kswapd efficiency                  99%         99%
      Kswapd velocity              13741.015   13781.934
      Direct efficiency                 100%        100%
      Direct velocity                  1.614       4.512
      Percentage direct scans             0%          0%
      
      kswapd activity was roughly comparable. There were differences in direct
      reclaim activity but negligible in the context of the overall workload
      (velocity of 4 pages per second with the patches applied, 1.6 pages per
      second in the baseline kernel).
      
      pgbench read-only large configuration on ext4
      ---------------------------------------------
      
      pgbench is a database benchmark that can be sensitive to page reclaim
      decisions. This also checks if removing the fair zone allocation policy
      is safe
      
      pgbench Transactions
                              4.7.0-rc4             4.7.0-rc4
                         mmotm-20160623            nodelru-v8
      Hmean    1       188.26 (  0.00%)      189.78 (  0.81%)
      Hmean    5       330.66 (  0.00%)      328.69 ( -0.59%)
      Hmean    12      370.32 (  0.00%)      380.72 (  2.81%)
      Hmean    21      368.89 (  0.00%)      369.00 (  0.03%)
      Hmean    30      382.14 (  0.00%)      360.89 ( -5.56%)
      Hmean    32      428.87 (  0.00%)      432.96 (  0.95%)
      
      Negligible differences again. As with tiobench, overall reclaim activity
      was comparable.
      
      bonnie++ on ext4
      ----------------
      
      No interesting performance difference, negligible differences on reclaim
      stats.
      
      paralleldd on ext4
      ------------------
      
      This workload uses varying numbers of dd instances to read large amounts of
      data from disk.
      
                                     4.7.0-rc3             4.7.0-rc3
                                mmotm-20160623            nodelru-v9
      Amean    Elapsd-1       186.04 (  0.00%)      189.41 ( -1.82%)
      Amean    Elapsd-3       192.27 (  0.00%)      191.38 (  0.46%)
      Amean    Elapsd-5       185.21 (  0.00%)      182.75 (  1.33%)
      Amean    Elapsd-7       183.71 (  0.00%)      182.11 (  0.87%)
      Amean    Elapsd-12      180.96 (  0.00%)      181.58 ( -0.35%)
      Amean    Elapsd-16      181.36 (  0.00%)      183.72 ( -1.30%)
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 nodelru-v9
      User         1548.01     1552.44
      System       8609.71     8515.08
      Elapsed      3587.10     3594.54
      
      There is little or no change in performance but some drop in system CPU usage.
      
                                   4.7.0-rc3   4.7.0-rc3
                              mmotm-20160623  nodelru-v9
      Minor Faults                    362662      367360
      Major Faults                      1204        1143
      Swap Ins                            22           0
      Swap Outs                         2855        1029
      DMA allocs                           0           0
      DMA32 allocs                  31409797    28837521
      Normal allocs                 46611853    49231282
      Movable allocs                       0           0
      Direct pages scanned                 0           0
      Kswapd pages scanned          40845270    40869088
      Kswapd pages reclaimed        40830976    40855294
      Direct pages reclaimed               0           0
      Kswapd efficiency                  99%         99%
      Kswapd velocity              11386.711   11369.769
      Direct efficiency                 100%        100%
      Direct velocity                  0.000       0.000
      Percentage direct scans             0%          0%
      Page writes by reclaim            2855        1029
      Page writes file                     0           0
      Page writes anon                  2855        1029
      Page reclaim immediate             771        1628
      Sector Reads                 293312636   293536360
      Sector Writes                 18213568    18186480
      Page rescued immediate               0           0
      Slabs scanned                   128257      132747
      Direct inode steals                181          56
      Kswapd inode steals                 59        1131
      
      It basically shows that kswapd was active at roughly the same rate in
      both kernels. There was also comparable slab scanning activity and direct
      reclaim was avoided in both cases. There appears to be a large difference
      in numbers of inodes reclaimed but the workload has few active inodes and
      is likely a timing artifact.
      
      stutter
      -------
      
      stutter simulates a simple workload. One part uses a lot of anonymous
      memory, a second measures mmap latency and a third copies a large file.
      The primary metric is checking for mmap latency.
      
      stutter
                                   4.7.0-rc4             4.7.0-rc4
                              mmotm-20160623            nodelru-v8
      Min         mmap     16.6283 (  0.00%)     13.4258 ( 19.26%)
      1st-qrtle   mmap     54.7570 (  0.00%)     34.9121 ( 36.24%)
      2nd-qrtle   mmap     57.3163 (  0.00%)     46.1147 ( 19.54%)
      3rd-qrtle   mmap     58.9976 (  0.00%)     47.1882 ( 20.02%)
      Max-90%     mmap     59.7433 (  0.00%)     47.4453 ( 20.58%)
      Max-93%     mmap     60.1298 (  0.00%)     47.6037 ( 20.83%)
      Max-95%     mmap     73.4112 (  0.00%)     82.8719 (-12.89%)
      Max-99%     mmap     92.8542 (  0.00%)     88.8870 (  4.27%)
      Max         mmap   1440.6569 (  0.00%)    121.4201 ( 91.57%)
      Mean        mmap     59.3493 (  0.00%)     42.2991 ( 28.73%)
      Best99%Mean mmap     57.2121 (  0.00%)     41.8207 ( 26.90%)
      Best95%Mean mmap     55.9113 (  0.00%)     39.9620 ( 28.53%)
      Best90%Mean mmap     55.6199 (  0.00%)     39.3124 ( 29.32%)
      Best50%Mean mmap     53.2183 (  0.00%)     33.1307 ( 37.75%)
      Best10%Mean mmap     45.9842 (  0.00%)     20.4040 ( 55.63%)
      Best5%Mean  mmap     43.2256 (  0.00%)     17.9654 ( 58.44%)
      Best1%Mean  mmap     32.9388 (  0.00%)     16.6875 ( 49.34%)
      
      This shows a number of improvements with the worst-case outlier greatly
      improved.
      
      Some of the vmstats are interesting
      
                                   4.7.0-rc4   4.7.0-rc4
                                mmotm-20160623nodelru-v8
      Swap Ins                           163         502
      Swap Outs                            0           0
      DMA allocs                           0           0
      DMA32 allocs                 618719206  1381662383
      Normal allocs                891235743   564138421
      Movable allocs                       0           0
      Allocation stalls                 2603           1
      Direct pages scanned            216787           2
      Kswapd pages scanned          50719775    41778378
      Kswapd pages reclaimed        41541765    41777639
      Direct pages reclaimed          209159           0
      Kswapd efficiency                  81%         99%
      Kswapd velocity              16859.554   14329.059
      Direct efficiency                  96%          0%
      Direct velocity                 72.061       0.001
      Percentage direct scans             0%          0%
      Page writes by reclaim         6215049           0
      Page writes file               6215049           0
      Page writes anon                     0           0
      Page reclaim immediate           70673          90
      Sector Reads                  81940800    81680456
      Sector Writes                100158984    98816036
      Page rescued immediate               0           0
      Slabs scanned                  1366954       22683
      
      While this is not guaranteed in all cases, this particular test showed
      a large reduction in direct reclaim activity. It's also worth noting
      that no page writes were issued from reclaim context.
      
      This series is not without its hazards. There are at least three areas
      that I'm concerned with even though I could not reproduce any problems in
      that area.
      
      1. Reclaim/compaction is going to be affected because the amount of reclaim is
         no longer targetted at a specific zone. Compaction works on a per-zone basis
         so there is no guarantee that reclaiming a few THP's worth page pages will
         have a positive impact on compaction success rates.
      
      2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
         are called is now different. This may or may not be a problem but if it
         is, it'll be because shrinkers are not called enough and some balancing
         is required.
      
      3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
         distributed between zones and the fair zone allocation policy used to do
         something very similar for anon. The distribution is now different but not
         necessarily in any way that matters but it's still worth bearing in mind.
      
      VM statistic counters for reclaim decisions are zone-based.  If the kernel
      is to reclaim on a per-node basis then we need to track per-node
      statistics but there is no infrastructure for that.  The most notable
      change is that the old node_page_state is renamed to
      sum_zone_node_page_state.  The new node_page_state takes a pglist_data and
      uses per-node stats but none exist yet.  There is some renaming such as
      vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
      of mod_state to mod_zone_state.  Otherwise, this is mostly a mechanical
      patch with no functional change.  There is a lot of similarity between the
      node and zone helpers which is unfortunate but there was no obvious way of
      reusing the code and maintaining type safety.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75ef7184
  10. 27 7月, 2016 3 次提交
  11. 04 6月, 2016 1 次提交
  12. 21 5月, 2016 1 次提交
  13. 20 5月, 2016 6 次提交
    • M
      mm, page_alloc: inline pageblock lookup in page free fast paths · 0b423ca2
      Mel Gorman 提交于
      The function call overhead of get_pfnblock_flags_mask() is measurable in
      the page free paths.  This patch uses an inlined version that is faster.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b423ca2
    • M
      mm, page_alloc: inline zone_statistics · 060e7417
      Mel Gorman 提交于
      zone_statistics has one call-site but it's a public function.  Make it
      static and inline.
      
      The performance difference on a page allocator microbenchmark is;
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                            statbranch-v1r20           statinline-v1r20
        Min      alloc-odr0-1               419.00 (  0.00%)           412.00 (  1.67%)
        Min      alloc-odr0-2               305.00 (  0.00%)           301.00 (  1.31%)
        Min      alloc-odr0-4               250.00 (  0.00%)           247.00 (  1.20%)
        Min      alloc-odr0-8               219.00 (  0.00%)           215.00 (  1.83%)
        Min      alloc-odr0-16              203.00 (  0.00%)           199.00 (  1.97%)
        Min      alloc-odr0-32              195.00 (  0.00%)           191.00 (  2.05%)
        Min      alloc-odr0-64              191.00 (  0.00%)           187.00 (  2.09%)
        Min      alloc-odr0-128             189.00 (  0.00%)           185.00 (  2.12%)
        Min      alloc-odr0-256             198.00 (  0.00%)           193.00 (  2.53%)
        Min      alloc-odr0-512             210.00 (  0.00%)           207.00 (  1.43%)
        Min      alloc-odr0-1024            216.00 (  0.00%)           213.00 (  1.39%)
        Min      alloc-odr0-2048            221.00 (  0.00%)           220.00 (  0.45%)
        Min      alloc-odr0-4096            227.00 (  0.00%)           226.00 (  0.44%)
        Min      alloc-odr0-8192            232.00 (  0.00%)           229.00 (  1.29%)
        Min      alloc-odr0-16384           232.00 (  0.00%)           229.00 (  1.29%)
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      060e7417
    • M
      mm, page_alloc: reduce branches in zone_statistics · b9f00e14
      Mel Gorman 提交于
      zone_statistics has more branches than it really needs to take an
      unlikely GFP flag into account.  Reduce the number and annotate the
      unlikely flag.
      
      The performance difference on a page allocator microbenchmark is;
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                            nocompound-v1r10           statbranch-v1r10
        Min      alloc-odr0-1               417.00 (  0.00%)           419.00 ( -0.48%)
        Min      alloc-odr0-2               308.00 (  0.00%)           305.00 (  0.97%)
        Min      alloc-odr0-4               253.00 (  0.00%)           250.00 (  1.19%)
        Min      alloc-odr0-8               221.00 (  0.00%)           219.00 (  0.90%)
        Min      alloc-odr0-16              205.00 (  0.00%)           203.00 (  0.98%)
        Min      alloc-odr0-32              199.00 (  0.00%)           195.00 (  2.01%)
        Min      alloc-odr0-64              193.00 (  0.00%)           191.00 (  1.04%)
        Min      alloc-odr0-128             191.00 (  0.00%)           189.00 (  1.05%)
        Min      alloc-odr0-256             200.00 (  0.00%)           198.00 (  1.00%)
        Min      alloc-odr0-512             212.00 (  0.00%)           210.00 (  0.94%)
        Min      alloc-odr0-1024            219.00 (  0.00%)           216.00 (  1.37%)
        Min      alloc-odr0-2048            225.00 (  0.00%)           221.00 (  1.78%)
        Min      alloc-odr0-4096            231.00 (  0.00%)           227.00 (  1.73%)
        Min      alloc-odr0-8192            234.00 (  0.00%)           232.00 (  0.85%)
        Min      alloc-odr0-16384           234.00 (  0.00%)           232.00 (  0.85%)
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9f00e14
    • H
      mm: /proc/sys/vm/stat_refresh to force vmstat update · 52b6f46b
      Hugh Dickins 提交于
      Provide /proc/sys/vm/stat_refresh to force an immediate update of
      per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
      before checking counts when testing.  Originally added to work around a
      bug which left counts stranded indefinitely on a cpu going idle (an
      inaccuracy magnified when small below-batch numbers represent "huge"
      amounts of memory), but I believe that bug is now fixed: nonetheless,
      this is still a useful knob.
      
      Its schedule_on_each_cpu() is probably too expensive just to fold into
      reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
      Allow a write or a read to do the same: nothing to read, but "grep -h
      Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient.  Oh, and
      since global_page_state() itself is careful to disguise any underflow as
      0, hack in an "Invalid argument" and pr_warn() if a counter is negative
      after the refresh - this helped to fix a misaccounting of
      NR_ISOLATED_FILE in my migration code.
      
      But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
      often go negative some of the time.  I have not yet worked out why, but
      have no evidence that it's actually harmful.  Punt for the moment by
      just ignoring the anomaly on those.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52b6f46b
    • J
      mm/vmstat: make node_page_state() handles all zones by itself · e87d59f7
      Joonsoo Kim 提交于
      node_page_state() manually adds statistics per each zone and returns
      total value for all zones.  Whenever we add a new zone, we need to
      consider this function and it's really troublesome.  Make it handle all
      zones by itself.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e87d59f7
    • J
      mm/vmstat: add zone range overlapping check · a91c43c7
      Joonsoo Kim 提交于
      There is a system thats node's pfns are overlapped as follows:
      
        -----pfn-------->
        N0 N1 N2 N0 N1 N2
      
      Therefore, we need to care this overlapping when iterating pfn range.
      
      There are two places in vmstat.c that iterates pfn range and they don't
      consider this overlapping.  Add it.
      
      Without this patch, above system could over count pageblock number on a
      zone.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a91c43c7
  14. 18 3月, 2016 2 次提交
    • K
      thp, vmstats: count deferred split events · f9719a03
      Kirill A. Shutemov 提交于
      Count how many times we put a THP in split queue.  Currently, it happens
      on partial unmap of a THP.
      
      Rapidly growing value can indicate that an application behaves
      unfriendly wrt THP: often fault in huge page and then unmap part of it.
      This leads to unnecessary memory fragmentation and the application may
      require tuning.
      
      The event also can help with debugging kernel [mis-]behaviour.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9719a03
    • V
      mm, compaction: introduce kcompactd · 698b1b30
      Vlastimil Babka 提交于
      Memory compaction can be currently performed in several contexts:
      
       - kswapd balancing a zone after a high-order allocation failure
       - direct compaction to satisfy a high-order allocation, including THP
         page fault attemps
       - khugepaged trying to collapse a hugepage
       - manually from /proc
      
      The purpose of compaction is two-fold.  The obvious purpose is to
      satisfy a (pending or future) high-order allocation, and is easy to
      evaluate.  The other purpose is to keep overal memory fragmentation low
      and help the anti-fragmentation mechanism.  The success wrt the latter
      purpose is more
      
      The current situation wrt the purposes has a few drawbacks:
      
       - compaction is invoked only when a high-order page or hugepage is not
         available (or manually).  This might be too late for the purposes of
         keeping memory fragmentation low.
       - direct compaction increases latency of allocations.  Again, it would
         be better if compaction was performed asynchronously to keep
         fragmentation low, before the allocation itself comes.
       - (a special case of the previous) the cost of compaction during THP
         page faults can easily offset the benefits of THP.
       - kswapd compaction appears to be complex, fragile and not working in
         some scenarios.  It could also end up compacting for a high-order
         allocation request when it should be reclaiming memory for a later
         order-0 request.
      
      To improve the situation, we should be able to benefit from an
      equivalent of kswapd, but for compaction - i.e. a background thread
      which responds to fragmentation and the need for high-order allocations
      (including hugepages) somewhat proactively.
      
      One possibility is to extend the responsibilities of kswapd, which could
      however complicate its design too much.  It should be better to let
      kswapd handle reclaim, as order-0 allocations are often more critical
      than high-order ones.
      
      Another possibility is to extend khugepaged, but this kthread is a
      single instance and tied to THP configs.
      
      This patch goes with the option of a new set of per-node kthreads called
      kcompactd, and lays the foundations, without introducing any new
      tunables.  The lifecycle mimics kswapd kthreads, including the memory
      hotplug hooks.
      
      For compaction, kcompactd uses the standard compaction_suitable() and
      ompact_finished() criteria and the deferred compaction functionality.
      Unlike direct compaction, it uses only sync compaction, as there's no
      allocation latency to minimize.
      
      This patch doesn't yet add a call to wakeup_kcompactd.  The kswapd
      compact/reclaim loop for high-order pages will be replaced by waking up
      kcompactd in the next patch with the description of what's wrong with
      the old approach.
      
      Waking up of the kcompactd threads is also tied to kswapd activity and
      follows these rules:
       - we don't want to affect any fastpaths, so wake up kcompactd only from
         the slowpath, as it's done for kswapd
       - if kswapd is doing reclaim, it's more important than compaction, so
         don't invoke kcompactd until kswapd goes to sleep
       - the target order used for kswapd is passed to kcompactd
      
      Future possible future uses for kcompactd include the ability to wake up
      kcompactd on demand in special situations, such as when hugepages are
      not available (currently not done due to __GFP_NO_KSWAPD) or when a
      fragmentation event (i.e.  __rmqueue_fallback()) occurs.  It's also
      possible to perform periodic compaction with kcompactd.
      
      [arnd@arndb.de: fix build errors with kcompactd]
      [paul.gortmaker@windriver.com: don't use modular references for non modular code]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      698b1b30
  15. 16 3月, 2016 2 次提交
    • V
      mm, page_owner: convert page_owner_inited to static key · 7dd80b8a
      Vlastimil Babka 提交于
      CONFIG_PAGE_OWNER attempts to impose negligible runtime overhead when
      enabled during compilation, but not actually enabled during runtime by
      boot param page_owner=on.  This overhead can be further reduced using
      the static key mechanism, which this patch does.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7dd80b8a
    • V
      mm, page_owner: print migratetype of page and pageblock, symbolic flags · 60f30350
      Vlastimil Babka 提交于
      The information in /sys/kernel/debug/page_owner includes the migratetype
      of the pageblock the page belongs to.  This is also checked against the
      page's migratetype (as declared by gfp_flags during its allocation), and
      the page is reported as Fallback if its migratetype differs from the
      pageblock's one.  t This is somewhat misleading because in fact fallback
      allocation is not the only reason why these two can differ.  It also
      doesn't direcly provide the page's migratetype, although it's possible
      to derive that from the gfp_flags.
      
      It's arguably better to print both page and pageblock's migratetype and
      leave the interpretation to the consumer than to suggest fallback
      allocation as the only possible reason.  While at it, we can print the
      migratetypes as string the same way as /proc/pagetypeinfo does, as some
      of the numeric values depend on kernel configuration.  For that, this
      patch moves the migratetype_names array from #ifdef CONFIG_PROC_FS part
      of mm/vmstat.c to mm/page_alloc.c and exports it.
      
      With the new format strings for flags, we can now also provide symbolic
      page and gfp flags in the /sys/kernel/debug/page_owner file.  This
      replaces the positional printing of page flags as single letters, which
      might have looked nicer, but was limited to a subset of flags, and
      required the user to remember the letters.
      
      Example page_owner entry after the patch:
      
        Page allocated via order 0, mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY)
        PFN 520 type Movable Block 1 type Movable Flags 0xfffff8001006c(referenced|uptodate|lru|active|mappedtodisk)
         [<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
         [<ffffffff811b4058>] alloc_pages_current+0x88/0x120
         [<ffffffff8115e386>] __page_cache_alloc+0xe6/0x120
         [<ffffffff8116ba6c>] __do_page_cache_readahead+0xdc/0x240
         [<ffffffff8116bd05>] ondemand_readahead+0x135/0x260
         [<ffffffff8116bfb1>] page_cache_sync_readahead+0x31/0x50
         [<ffffffff81160523>] generic_file_read_iter+0x453/0x760
         [<ffffffff811e0d57>] __vfs_read+0xa7/0xd0
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60f30350