1. 30 July 2016, 1 commit
    • Revert "vfs: add lookup_hash() helper" · 20d00ee8
      Linus Torvalds authored
      This reverts commit 3c9fe8cd.
      
      As Miklos points out in commit c1b2cc1a, the "lookup_hash()" helper
      is now unused, and in fact, with the hash salting changes, since the
      hash of a dentry name now depends on the directory dentry it is in, the
      helper function isn't even really likely to be useful.
      
      So rather than keep it around in case somebody else might end up finding
      a use for it, let's just remove the helper and not trick people into
      thinking it might be a useful thing.
      
      For example, I had obviously completely missed how the helper didn't
      follow the normal dentry hashing patterns, and how the hash salting
      patch broke overlayfs.  Things would quietly build and look sane, but
      not work.
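
      For illustration only, a rough sketch (not from this commit) of what
      the salting change means for callers: the name hash is now computed
      relative to the parent dentry, so a directory-independent helper has
      no sensible input for the salt.

        /* sketch: the parent dentry acts as the hash salt */
        unsigned int hash = full_name_hash(parent /* salt */, name, len);
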
      Suggested-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20d00ee8
  2. 29 July 2016, 39 commits
    • mm: export filemap_check_errors() to modules · d72d9e2a
      Miklos Szeredi authored
      Can be used by fuse, btrfs and f2fs to replace open-coded variants.
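
      A hedged sketch of the open-coded pattern being replaced (details
      vary per filesystem):

        /* open-coded check-and-clear of mapping error bits ... */
        if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
                err = -ENOSPC;
        if (test_and_clear_bit(AS_EIO, &mapping->flags))
                err = -EIO;

        /* ... becomes a single exported call */
        err = filemap_check_errors(mapping);
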
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      d72d9e2a
    • mm, compaction: simplify contended compaction handling · c3486f53
      Vlastimil Babka authored
      Async compaction detects contention either due to failing trylock on
      zone->lock or lru_lock, or by need_resched().  Since 1f9efdef ("mm,
      compaction: khugepaged should not give up due to need_resched()") the
      code got quite complicated to distinguish these two up to the
      __alloc_pages_slowpath() level, so different decisions could be taken
      for khugepaged allocations.
      
      After the recent changes, khugepaged allocations don't check for
      contended compaction anymore, so we again don't need to distinguish lock
      and sched contention, and simplify the current convoluted code a lot.
      
      However, I believe it's also possible to simplify even more and
      completely remove the check for contended compaction after the initial
      async compaction for costly orders, which was originally aimed at THP
      page fault allocations.  There are several reasons why this can be done
      now:
      
      - with the new defaults, THP page faults no longer do reclaim/compaction at
        all, unless the system admin has overridden the default, or application has
        indicated via madvise that it can benefit from THP's. In both cases, it
        means that the potential extra latency is expected and worth the benefits.
      - even if reclaim/compaction proceeds after this patch where it previously
        wouldn't, the second compaction attempt is still async and will detect the
        contention and back off, if the contention persists
      - there are still heuristics like deferred compaction and pageblock skip bits
        in place that prevent excessive THP page fault latencies
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-9-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3486f53
    • mm, compaction: introduce direct compaction priority · a5508cd8
      Vlastimil Babka authored
      In the context of direct compaction, for some types of allocations we
      would like the compaction to either succeed or definitely fail while
      trying as hard as possible.  Current async/sync_light migration mode is
      insufficient, as there are heuristics such as caching scanner positions,
      marking pageblocks as unsuitable or deferring compaction for a zone.  At
      least the final compaction attempt should be able to override these
      heuristics.
      
      To communicate how hard compaction should try, we replace migration mode
      with a new enum compact_priority and change the relevant function
      signatures.  In compact_zone_order() where struct compact_control is
      constructed, the priority is mapped to suitable control flags.  This
      patch itself has no functional change, as the current priority levels
      are mapped back to the same migration modes as before.  Expanding them
      will be done next.
      
      Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is
      removed, as the only caller exists under CONFIG_COMPACTION.
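
      A sketch of the new enum, ordered from most to least effort:

        enum compact_priority {
                COMPACT_PRIO_SYNC_LIGHT,
                MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
                COMPACT_PRIO_ASYNC,
                INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
        };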
      
      Link: http://lkml.kernel.org/r/20160721073614.24395-8-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5508cd8
    • mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations · 25160354
      Vlastimil Babka authored
      After the previous patch, we can distinguish costly allocations that
      should be really lightweight, such as THP page faults, with
      __GFP_NORETRY.  This means we don't need to recognize khugepaged
      allocations via PF_KTHREAD anymore.  We can also change THP page faults
      in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
      khugepaged, as the process has indicated that it benefits from THP's and
      is willing to pay some initial latency costs.
      
      We can also make the flags handling less cryptic by distinguishing
      GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
      GFP_TRANSHUGE (only direct reclaim, khugepaged default).  Adding
      __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
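
      A sketch of the resulting definitions (modulo exact flag masking):

        #define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
                                      __GFP_NOMEMALLOC | __GFP_NOWARN) & \
                                     ~__GFP_RECLAIM)
        #define GFP_TRANSHUGE       (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)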
      
      The patch effectively changes the current GFP_TRANSHUGE users as
      follows:
      
      * get_huge_zero_page() - the zero page lifetime should be relatively
        long and it's shared by multiple users, so it's worth spending some
        effort on it.  We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
        This also restores direct reclaim to this allocation, which was
        unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
        by default to madvise and add a stall-free defrag option")
      
      * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
        is not an issue.  So if khugepaged "defrag" is enabled (the default), do
        reclaim via GFP_TRANSHUGE without __GFP_NORETRY.  We can remove the
        PF_KTHREAD check from page alloc.
      
        As a side-effect, khugepaged will now no longer check if the initial
        compaction was deferred or contended.  This is OK, as khugepaged sleep
        times between collapse attempts are long enough to prevent noticeable
        disruption, so we should allow it to spend some effort.
      
      * migrate_misplaced_transhuge_page() - already was masking out
        __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
        equivalent.
      
      * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
        are now allocating without __GFP_NORETRY.  Other vma's keep using
        __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
        it's allowed only for madvised vma's).  The rest is conversion to
        GFP_TRANSHUGE(_LIGHT).
      
      [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
      Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25160354
    • mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB · 80a9201a
      Alexander Potapenko authored
      For KASAN builds:
       - switch SLUB allocator to using stackdepot instead of storing the
         allocation/deallocation stacks in the objects;
       - change the freelist hook so that parts of the freelist can be put
         into the quarantine.
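
      With stackdepot, the per-object alloc/free metadata shrinks to a
      handle rather than an inline stack trace, roughly:

        struct kasan_track {
                u32 pid;
                depot_stack_handle_t stack; /* handle into the shared depot */
        };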
      
      [aryabinin@virtuozzo.com: fixes]
        Link: http://lkml.kernel.org/r/1468601423-28676-1-git-send-email-aryabinin@virtuozzo.com
      Link: http://lkml.kernel.org/r/1468347165-41906-3-git-send-email-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80a9201a
    • mm, kasan: account for object redzone in SLUB's nearest_obj() · c146a2b9
      Alexander Potapenko authored
      When looking up the nearest SLUB object for a given address, correctly
      calculate its offset if SLAB_RED_ZONE is enabled for that cache.
      
      Previously, when KASAN detected an error on an object from a cache
      with SLAB_RED_ZONE set, the actual start address of the object was
      miscalculated, which led to random stacks being reported.
      
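      A sketch of the fix: the object start must skip the left redzone,
      along the lines of:

        /* account for the left redzone when SLAB_RED_ZONE is set */
        if (cache->flags & SLAB_RED_ZONE)
                object += cache->red_left_pad;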
      
      Fixes: 7ed2f9e6 ("mm, kasan: SLAB support")
      Link: http://lkml.kernel.org/r/1468347165-41906-2-git-send-email-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Steven Rostedt (Red Hat) <rostedt@goodmis.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Kuthonuzo Luruo <kuthonuzo.luruo@hpe.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c146a2b9
    • mm/memblock.c: add new infrastructure to address the mem limit issue · a571d4eb
      Dennis Chen authored
      In some cases, memblock is queried by the kernel to determine whether a
      specified address is RAM or not.  For example, the ACPI core needs this
      information to determine which attributes to use when mapping ACPI
      regions (acpi_os_ioremap).  Use of incorrect memory types can result in
      faults, data corruption, or other issues.
      
      Removing memory with memblock_enforce_memory_limit() throws away this
      information, and so a kernel booted with 'mem=' may suffer from the
      issues described above.  To avoid this, we need to keep those NOMAP
      regions instead of removing all above the limit, which preserves the
      information we need while preventing other use of those regions.
      
      This patch adds new infrastructure to retain all NOMAP memblock regions
      while removing others, to cater for this.
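
      In sketch form, callers cap usable memory with the new helper:

        /* enforce 'mem=' but keep NOMAP regions in the memblock map */
        memblock_mem_limit_remove_map(limit);

      as opposed to memblock_enforce_memory_limit(), which drops NOMAP
      regions along with everything else above the limit.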
      
      Link: http://lkml.kernel.org/r/1468475036-5852-2-git-send-email-dennis.chen@arm.com
      Signed-off-by: Dennis Chen <dennis.chen@arm.com>
      Acked-by: Steve Capper <steve.capper@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Kaly Xin <kaly.xin@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a571d4eb
    • kdb: use task_cpu() instead of task_thread_info()->cpu · e558af65
      Andy Lutomirski authored
      We'll need this cleanup to make the cpu field in thread_info be
      optional.
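
      The substitution is mechanical, e.g.:

        int cpu = task_cpu(p); /* instead of task_thread_info(p)->cpu */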
      
      Link: http://lkml.kernel.org/r/da298328dc77ea494576c2f20a934218e758a6fa.1468523549.git.luto@kernel.org
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e558af65
    • mm: fix memcg stack accounting for sub-page stacks · efdc9490
      Andy Lutomirski authored
      We should account for stacks regardless of stack size, and we need to
      account in sub-page units if THREAD_SIZE < PAGE_SIZE.  Change the units
      to kilobytes and move it into account_kernel_stack().
      
      Fixes: 12580e4b ("mm: memcontrol: report kernel stack usage in cgroup2 memory.stat")
      Link: http://lkml.kernel.org/r/9b5314e3ee5eda61b0317ec1563768602c1ef438.1468523549.git.luto@kernel.org
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      efdc9490
    • mm: track NR_KERNEL_STACK in KiB instead of number of stacks · d30dd8be
      Andy Lutomirski authored
      Currently, NR_KERNEL_STACK tracks the number of kernel stacks in a zone.
      This only makes sense if each kernel stack exists entirely in one zone,
      and allowing vmapped stacks could break this assumption.
      
      Since frv has THREAD_SIZE < PAGE_SIZE, we need to track kernel stack
      allocations in a unit that divides both THREAD_SIZE and PAGE_SIZE on all
      architectures.  Keep it simple and use KiB.
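
      A sketch of the accounting in KiB, which works for any THREAD_SIZE:

        mod_zone_page_state(zone, NR_KERNEL_STACK_KB,
                            THREAD_SIZE / 1024 * account);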
      
      Link: http://lkml.kernel.org/r/083c71e642c5fa5f1b6898902e1b2db7b48940d4.1468523549.git.luto@kernel.org
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Reviewed-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d30dd8be
    • mm: cleanup ifdef guards for vmem_altmap · 11db0486
      Dan Williams authored
      Now that ZONE_DEVICE depends on SPARSEMEM_VMEMMAP we can simplify some
      ifdef guards to just ZONE_DEVICE.
      
      Link: http://lkml.kernel.org/r/146687646788.39261.8020536391978771940.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Eric Sandeen <sandeen@redhat.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11db0486
    • mm, THP: clean up return value of madvise_free_huge_pmd · 319904ad
      Huang Ying authored
      The return value of madvise_free_huge_pmd() was not clearly defined
      before.  As suggested by Minchan Kim, change the return type to bool:
      return true if we do MADV_FREE successfully on the entire pmd page,
      otherwise return false.  Comments are added too.
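
      The resulting signature, roughly:

        /* returns true iff MADV_FREE succeeded on the entire pmd page */
        bool madvise_free_huge_pmd(struct mmu_gather *tlb,
                                   struct vm_area_struct *vma, pmd_t *pmd,
                                   unsigned long addr, unsigned long next);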
      
      Link: http://lkml.kernel.org/r/1467135452-16688-2-git-send-email-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      319904ad
    • mm: remove reclaim and compaction retry approximations · 5a1c84b4
      Mel Gorman authored
      If per-zone LRU accounting is available then there is no point
      approximating whether reclaim and compaction should retry based on pgdat
      statistics.  This is effectively a revert of "mm, vmstat: remove zone
      and node double accounting by approximating retries" with the difference
      that inactive/active stats are still available.  This preserves the
      history of why the approximation was tried and why it had to be
      reverted to handle OOM kills on 32-bit systems.
      
      Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a1c84b4
    • mm, vmscan: remove highmem_file_pages · bb4cc2be
      Mel Gorman authored
      With the reintroduction of per-zone LRU stats, highmem_file_pages is
      redundant so remove it.
      
      [mgorman@techsingularity.net: wrong stat is being accumulated in highmem_dirtyable_memory]
        Link: http://lkml.kernel.org/r/20160725092324.GM10438@techsingularity.net
      Link: http://lkml.kernel.org/r/1469110261-7365-3-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb4cc2be
    • mm: add per-zone lru list stat · 71c799f4
      Minchan Kim authored
      While stress testing with hackbench, I got frequent OOM messages,
      which never happened with zone-lru.
      
        gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
        ..
        ..
         __alloc_pages_nodemask+0xe52/0xe60
         ? new_slab+0x39c/0x3b0
         new_slab+0x39c/0x3b0
         ___slab_alloc.constprop.87+0x6da/0x840
         ? __alloc_skb+0x3c/0x260
         ? _raw_spin_unlock_irq+0x27/0x60
         ? trace_hardirqs_on_caller+0xec/0x1b0
         ? finish_task_switch+0xa6/0x220
         ? poll_select_copy_remaining+0x140/0x140
         __slab_alloc.isra.81.constprop.86+0x40/0x6d
         ? __alloc_skb+0x3c/0x260
         kmem_cache_alloc+0x22c/0x260
         ? __alloc_skb+0x3c/0x260
         __alloc_skb+0x3c/0x260
         alloc_skb_with_frags+0x4e/0x1a0
         sock_alloc_send_pskb+0x16a/0x1b0
         ? wait_for_unix_gc+0x31/0x90
         ? alloc_set_pte+0x2ad/0x310
         unix_stream_sendmsg+0x28d/0x340
         sock_sendmsg+0x2d/0x40
         sock_write_iter+0x6c/0xc0
         __vfs_write+0xc0/0x120
         vfs_write+0x9b/0x1a0
         ? __might_fault+0x49/0xa0
         SyS_write+0x44/0x90
         do_fast_syscall_32+0xa6/0x1e0
         sysenter_past_esp+0x45/0x74
      
        Mem-Info:
        active_anon:104698 inactive_anon:105791 isolated_anon:192
         active_file:433 inactive_file:283 isolated_file:22
         unevictable:0 dirty:0 writeback:296 unstable:0
         slab_reclaimable:6389 slab_unreclaimable:78927
         mapped:474 shmem:0 pagetables:101426 bounce:0
         free:10518 free_pcp:334 free_cma:0
        Node 0 active_anon:418792kB inactive_anon:423164kB active_file:1732kB inactive_file:1132kB unevictable:0kB isolated(anon):768kB isolated(file):88kB mapped:1896kB dirty:0kB writeback:1184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1478632 all_unreclaimable? yes
        DMA free:3304kB min:68kB low:84kB high:100kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:4088kB kernel_stack:0kB pagetables:2480kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 809 1965 1965
        Normal free:3436kB min:3604kB low:4504kB high:5404kB present:897016kB managed:858460kB mlocked:0kB slab_reclaimable:25556kB slab_unreclaimable:311712kB kernel_stack:164608kB pagetables:30844kB bounce:0kB free_pcp:620kB local_pcp:104kB free_cma:0kB
        lowmem_reserve[]: 0 0 9247 9247
        HighMem free:33808kB min:512kB low:1796kB high:3080kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:372252kB bounce:0kB free_pcp:428kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 2*4kB (UM) 2*8kB (UM) 0*16kB 1*32kB (U) 1*64kB (U) 2*128kB (UM) 1*256kB (U) 1*512kB (M) 0*1024kB 1*2048kB (U) 0*4096kB = 3192kB
        Normal: 33*4kB (MH) 79*8kB (ME) 11*16kB (M) 4*32kB (M) 2*64kB (ME) 2*128kB (EH) 7*256kB (EH) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3244kB
        HighMem: 2590*4kB (UM) 1568*8kB (UM) 491*16kB (UM) 60*32kB (UM) 6*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 33064kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        25121 total pagecache pages
        24160 pages in swap cache
        Swap cache stats: add 86371, delete 62211, find 42865/60187
        Free swap  = 4015560kB
        Total swap = 4192252kB
        524186 pages RAM
        295934 pages HighMem/MovableOnly
        9658 pages reserved
        0 pages cma reserved
      
      The order-0 allocation for the normal zone failed while there was a
      lot of reclaimable memory (i.e., anonymous memory with free swap).  I
      wanted to analyze the problem, but that was hard because we removed
      the per-zone lru stats, so I couldn't tell how much anonymous memory
      there was in the normal/dma zones.
      
      When we investigate an OOM problem, the reclaimable memory count is a
      crucial stat for finding the problem.  Without it, the OOM message is
      hard to parse, so I believe we should keep it.
      
      With per-zone lru stat,
      
        gfp_mask=0x26004c0(GFP_KERNEL|__GFP_REPEAT|__GFP_NOTRACK), order=0
        Mem-Info:
        active_anon:101103 inactive_anon:102219 isolated_anon:0
         active_file:503 inactive_file:544 isolated_file:0
         unevictable:0 dirty:0 writeback:34 unstable:0
         slab_reclaimable:6298 slab_unreclaimable:74669
         mapped:863 shmem:0 pagetables:100998 bounce:0
         free:23573 free_pcp:1861 free_cma:0
        Node 0 active_anon:404412kB inactive_anon:409040kB active_file:2012kB inactive_file:2176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:3452kB dirty:0kB writeback:136kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1320845 all_unreclaimable? yes
        DMA free:3296kB min:68kB low:84kB high:100kB active_anon:5540kB inactive_anon:0kB active_file:0kB inactive_file:0kB present:15992kB managed:15916kB mlocked:0kB slab_reclaimable:248kB slab_unreclaimable:2628kB kernel_stack:792kB pagetables:2316kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 809 1965 1965
        Normal free:3600kB min:3604kB low:4504kB high:5404kB active_anon:86304kB inactive_anon:0kB active_file:160kB inactive_file:376kB present:897016kB managed:858524kB mlocked:0kB slab_reclaimable:24944kB slab_unreclaimable:296048kB kernel_stack:163832kB pagetables:35892kB bounce:0kB free_pcp:3076kB local_pcp:656kB free_cma:0kB
        lowmem_reserve[]: 0 0 9247 9247
        HighMem free:86156kB min:512kB low:1796kB high:3080kB active_anon:312852kB inactive_anon:410024kB active_file:1924kB inactive_file:2012kB present:1183736kB managed:1183736kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:365784kB bounce:0kB free_pcp:3868kB local_pcp:720kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 8*4kB (UM) 8*8kB (UM) 4*16kB (M) 2*32kB (UM) 2*64kB (UM) 1*128kB (M) 3*256kB (UME) 2*512kB (UE) 1*1024kB (E) 0*2048kB 0*4096kB = 3296kB
        Normal: 240*4kB (UME) 160*8kB (UME) 23*16kB (ME) 3*32kB (UE) 3*64kB (UME) 2*128kB (ME) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
        HighMem: 10942*4kB (UM) 3102*8kB (UM) 866*16kB (UM) 76*32kB (UM) 11*64kB (UM) 4*128kB (UM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 86344kB
        Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
        54409 total pagecache pages
        53215 pages in swap cache
        Swap cache stats: add 300982, delete 247765, find 157978/226539
        Free swap  = 3803244kB
        Total swap = 4192252kB
        524186 pages RAM
        295934 pages HighMem/MovableOnly
        9642 pages reserved
        0 pages cma reserved
      
      With that, we can see the normal zone has 86M of reclaimable memory,
      so we can tell something is going wrong in reclaim (I will fix the
      problem in the next patch).
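
      A sketch of the per-zone counters this adds to enum zone_stat_item:

        NR_ZONE_LRU_BASE, /* used only for compaction/reclaim retries */
        NR_ZONE_INACTIVE_ANON = NR_ZONE_LRU_BASE,
        NR_ZONE_ACTIVE_ANON,
        NR_ZONE_INACTIVE_FILE,
        NR_ZONE_ACTIVE_FILE,
        NR_ZONE_UNEVICTABLE,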
      
      [mgorman@techsingularity.net: rename zone LRU stats in /proc/vmstat]
       Link: http://lkml.kernel.org/r/20160725072300.GK10438@techsingularity.net
      Link: http://lkml.kernel.org/r/1469110261-7365-2-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71c799f4
    • mm, vmscan: Update all zone LRU sizes before updating memcg · 7ee36a14
      Mel Gorman authored
      Minchan Kim reported hitting the following warning on a 32-bit
      system, although it can affect 64-bit systems too.
      
        WARNING: CPU: 4 PID: 1322 at mm/memcontrol.c:998 mem_cgroup_update_lru_size+0x103/0x110
        mem_cgroup_update_lru_size(f44b4000, 1, -7): zid 1 lru_size 1 but empty
        Modules linked in:
        CPU: 4 PID: 1322 Comm: cp Not tainted 4.7.0-rc4-mm1+ #143
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
          dump_stack+0x76/0xaf
          __warn+0xea/0x110
          ? mem_cgroup_update_lru_size+0x103/0x110
          warn_slowpath_fmt+0x3b/0x40
          mem_cgroup_update_lru_size+0x103/0x110
          isolate_lru_pages.isra.61+0x2e2/0x360
          shrink_active_list+0xac/0x2a0
          ? __delay+0xe/0x10
          shrink_node_memcg+0x53c/0x7a0
          shrink_node+0xab/0x2a0
          do_try_to_free_pages+0xc6/0x390
          try_to_free_pages+0x245/0x590
      
      LRU list contents and counts are updated separately.  Counts are updated
      before pages are added to the LRU and updated after pages are removed.
      The warning above is from a check in mem_cgroup_update_lru_size that
      ensures that list sizes of zero are empty.
      
      The problem is that node-lru needs to account for highmem pages if
      CONFIG_HIGHMEM is set.  One impact of the implementation is that the
      sizes are updated in multiple passes when pages from multiple zones were
      isolated.  This happens whether HIGHMEM is set or not.  When multiple
      zones are isolated, it's possible for a debugging check in memcg to be
      tripped.
      
      This patch forces all the zone counts to be updated before the memcg
      function is called.
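
      A sketch of the approach: sum the per-zone deltas and apply them to
      the zone counts first, then do the memcg update once afterwards:

        static void update_lru_sizes(struct lruvec *lruvec, enum lru_list lru,
                                     unsigned long *nr_zone_taken)
        {
                int zid;

                for (zid = 0; zid < MAX_NR_ZONES; zid++) {
                        if (!nr_zone_taken[zid])
                                continue;
                        /* zone counts first; the memcg update follows */
                        __update_lru_size(lruvec, lru, zid,
                                          -nr_zone_taken[zid]);
                }
        }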
      
      Link: http://lkml.kernel.org/r/1468588165-12461-6-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Tested-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ee36a14
    • mm, vmstat: remove zone and node double accounting by approximating retries · bca67592
      Mel Gorman authored
      The number of LRU pages, dirty pages and writeback pages must be
      accounted for on both zones and nodes because of the reclaim retry
      logic, compaction retry logic and highmem calculations all depending on
      per-zone stats.
      
      Many lowmem allocations are immune from OOM kill due to a check in
      __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
      03668b3c ("oom: avoid oom killer for lowmem allocations").  The
      exception is costly high-order allocations or allocations that cannot
      fail.  If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
      allocations then it would fall through to __alloc_pages_direct_compact.
      
      This patch will blindly retry reclaim for zone-constrained allocations
      in should_reclaim_retry up to MAX_RECLAIM_RETRIES.  This is not ideal
      but without per-zone stats there are not many alternatives.  The impact
      is that zone-constrained allocations may delay before considering the
      OOM killer.
      
      As there is no guarantee enough memory can ever be freed to satisfy
      compaction, this patch avoids retrying compaction for zone-constrained
      allocations.
      
      In combination, that means that the per-node stats can be used when
      deciding whether to continue reclaim using a rough approximation.  While
      it is possible this will make the wrong decision on occasion, it will
      not infinite loop as the number of reclaim attempts is capped by
      MAX_RECLAIM_RETRIES.
      
      The final step is calculating the number of dirtyable highmem pages.
      Those calculations only care about the global count of file pages in
      highmem, so this patch uses a global counter instead of per-zone
      stats, which is sufficient.
      
      In combination, this allows the per-zone LRU and dirty state counters to
      be removed.
      
      [mgorman@techsingularity.net: fix acct_highmem_file_pages()]
        Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bca67592
    • mm: vmstat: account per-zone stalls and pages skipped during reclaim · 7cc30fcf
      Mel Gorman authored
      The vmstat allocstall was fairly useful in the general sense but
      node-based LRUs change that.  It's important to know if a stall was for
      an address-limited allocation request as this will require skipping
      pages from other zones.  This patch adds pgstall_* counters to replace
      allocstall.  The sum of the counters will equal the old allocstall so it
      can be trivially recalculated.  A high number of address-limited
      allocation requests may result in a lot of useless LRU scanning for
      suitable pages.
      
      As address-limited allocations require pages to be skipped, it's
      important to know how much useless LRU scanning took place so this patch
      adds pgskip* counters.  This yields the following model
      
      1. The number of address-space limited stalls can be accounted for (pgstall)
      2. The amount of useless work required to reclaim the data is accounted (pgskip)
      3. The total number of scans is available from pgscan_kswapd and pgscan_direct
         so from that the ratio of useful to useless scans can be calculated.
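
      A sketch of the new event counters (per the s/pgstall/allocstall/
      note below, the final names are allocstall_*/pgskip_*):

        /* in enum vm_event_item, one counter per classzone */
        FOR_ALL_ZONES(ALLOCSTALL),
        FOR_ALL_ZONES(PGSCAN_SKIP),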
      
      [mgorman@techsingularity.net: s/pgstall/allocstall/]
        Link: http://lkml.kernel.org/r/1468404004-5085-3-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-33-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7cc30fcf
    • mm: vmstat: replace __count_zone_vm_events with a zone id equivalent · 16709d1d
      Mel Gorman authored
      This is partially a preparation patch for more vmstat work but it also
      has the slight advantage that __count_zid_vm_events is cheaper to
      calculate than __count_zone_vm_events().
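
      The zone-id variant is a straight offset from the _NORMAL event,
      roughly:

        #define __count_zid_vm_events(item, zid, delta) \
                __count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)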
      
      Link: http://lkml.kernel.org/r/1467970510-21195-32-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16709d1d
    • mm, page_alloc: remove fair zone allocation policy · e6cbd7f2
      Mel Gorman authored
      The fair zone allocation policy interleaves allocation requests between
      zones to avoid an age inversion problem whereby new pages are reclaimed
      to balance a zone.  Reclaim is now node-based so this should no longer
      be an issue and the fair zone allocation policy is not free.  This patch
      removes it.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-30-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6cbd7f2
    • mm, vmscan: add classzone information to tracepoints · e5146b12
      Mel Gorman authored
      This is convenient when tracking down why the skip count is high because
      it'll show what classzone kswapd woke up at and what zones are being
      isolated.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-29-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e5146b12
    • mm: convert zone_reclaim to node_reclaim · a5f5f91d
      Mel Gorman authored
      As reclaim is now per-node based, convert zone_reclaim to be
      node_reclaim.  It is possible that a node will be reclaimed multiple
      times if it has multiple zones but this is unavoidable without caching
      all nodes traversed so far.  The documentation and interface to
      userspace is the same from a configuration perspective and will be
      similar in behaviour unless the node-local allocation requests were also
      limited to lower zones.
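
      The interface change, in sketch form:

        /* was: int zone_reclaim(struct zone *, gfp_t, unsigned int) */
        int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
                         unsigned int order);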
      
      Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5f5f91d
    • mm: move vmscan writes and file write accounting to the node · c4a25635
      Mel Gorman authored
      As reclaim is now node-based, it follows that page write activity due to
      page reclaim should also be accounted for on the node.  For consistency,
      also account page writes and page dirtying on a per-node basis.
      
      After this patch, there are a few remaining zone counters that may appear
      strange but are fine.  NUMA stats are still per-zone as this is a
      user-space interface that tools consume.  NR_MLOCK, NR_SLAB_*,
      NR_PAGETABLE, NR_KERNEL_STACK and NR_BOUNCE are all allocations that
      potentially pin low memory and cannot trivially be reclaimed on demand.
      This information is still useful for debugging a page allocation failure
      warning.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-21-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4a25635
    • mm: move most file-based accounting to the node · 11fb9989
      Mel Gorman authored
      There are now a number of accounting oddities such as mapped file pages
      being accounted for on the node while the total number of file pages are
      accounted on the zone.  This can be coped with to some extent but it's
      confusing, so this patch moves the relevant file-based accounting to
      the node.  Due to throttling logic in the page allocator for reliable
      OOM detection, it is still necessary to track dirty and writeback
      pages on a per-zone basis.
      
      [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
        Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11fb9989
    • mm: rename NR_ANON_PAGES to NR_ANON_MAPPED · 4b9d0fab
      Mel Gorman authored
      NR_FILE_PAGES  is the number of        file pages.
      NR_FILE_MAPPED is the number of mapped file pages.
      NR_ANON_PAGES  is the number of mapped anon pages.
      
      This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
      NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
      have
      
      NR_FILE_PAGES  is the number of        file pages.
      NR_FILE_MAPPED is the number of mapped file pages.
      NR_ANON_MAPPED is the number of mapped anon pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b9d0fab
    • mm: move page mapped accounting to the node · 50658e2e
      Mel Gorman authored
      Reclaim makes decisions based on the number of pages that are mapped but
      it's mixing node and zone information.  Account NR_FILE_MAPPED and
      NR_ANON_PAGES pages on the node.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50658e2e
    • mm, page_alloc: consider dirtyable memory in terms of nodes · 281e3726
      Mel Gorman authored
      Historically dirty pages were spread among zones but now that LRUs are
      per-node it is more appropriate to consider dirty pages in a node.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-17-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      281e3726
    • mm, workingset: make working set detection node-aware · 1e6b1085
      Mel Gorman authored
      Working set and refault detection is still zone-based, fix it.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-16-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1e6b1085
    • mm, memcg: move memcg limit enforcement from zones to nodes · ef8f2327
      Mel Gorman authored
      Memcg needs adjustment after moving LRUs to the node.  Limits are
      tracked per memcg but the soft-limit excess is tracked per zone.  As
      global page reclaim is based on the node, it is easy to imagine a
      situation where a zone soft limit is exceeded even though the memcg
      limit is fine.
      
      This patch moves the soft limit tree to the node.  Technically, all the
      variable names should also change but people are already familiar with
      the meaning of "mz" even if "mn" would be a more appropriate name now.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-15-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef8f2327
    • mm, vmscan: make shrink_node decisions more node-centric · a9dd0a83
      Mel Gorman authored
      Earlier patches focused on having direct reclaim and kswapd use data
      that is node-centric for reclaiming but shrink_node() itself still uses
      too much zone information.  This patch removes unnecessary zone-based
      information with the most important decision being whether to continue
      reclaim or not.  Some memcg APIs are adjusted as a result even though
      memcg itself still uses some zone information.
      
      [mgorman@techsingularity.net: optimization]
        Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9dd0a83
    • mm, vmscan: simplify the logic deciding whether kswapd sleeps · 38087d9b
      Mel Gorman authored
      kswapd goes through some complex steps trying to figure out if it should
      stay awake based on the classzone_idx and the requested order.  It is
      unnecessarily complex and passes in an invalid classzone_idx to
      balance_pgdat().  What matters most of all is whether a larger order has
      been requested and whether kswapd successfully reclaimed at the previous
      order.  This patch irons out the logic to check just that and the end
      result is less headache inducing.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-10-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38087d9b
    • mm, vmscan: remove balance gap · 31483b6a
      Mel Gorman authored
      The balance gap was introduced to apply equal pressure to all zones when
      reclaiming for a higher zone.  With node-based LRU, the need for the
      balance gap is removed and the code is dead so remove it.
      
      [vbabka@suse.cz: Also remove KSWAPD_ZONE_BALANCE_GAP_RATIO]
      Link: http://lkml.kernel.org/r/1467970510-21195-9-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31483b6a
    • mm, mmzone: clarify the usage of zone padding · 0f661148
      Mel Gorman authored
      Zone padding separates write-intensive fields used by page allocation,
      compaction and vmstats but the comments are a little misleading and need
      clarification.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-5-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f661148
    • mm, vmscan: move LRU lists to node · 599d0c95
      Mel Gorman authored
      This moves the LRU lists from the zone to the node and related data such
      as counters, tracing, congestion tracking and writeback tracking.
      
      Unfortunately, due to reclaim and compaction retry logic, it is
      necessary to account for the number of LRU pages on both zone and node
      logic.  Most reclaim logic is based on the node counters but the retry
      logic uses the zone counters which do not distinguish inactive and
      active sizes.  It would be possible to leave the LRU counters on a
      per-zone basis but it's a heavier calculation across multiple cache
      lines that is much more frequent than the retry checks.
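
      Structurally, the LRU moves into pglist_data, along the lines of:

        /* sketch: the node now owns the lruvec */
        static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
        {
                return &pgdat->lruvec;
        }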
      
      Other than the LRU counters, this is mostly a mechanical patch but note
      that it introduces a number of anomalies.  For example, the scans are
      per-zone but using per-node counters.  We also mark a node as congested
      when a zone is congested.  This causes weird problems that are fixed
      later but is easier to review.
      
      In the event that there is excessive overhead on 32-bit systems due to
      the nodes being on LRU then there are two potential solutions
      
      1. Long-term isolation of highmem pages when reclaim is lowmem
      
         When pages are skipped, they are immediately added back onto the LRU
         list. If lowmem reclaim persisted for long periods of time, the same
         highmem pages get continually scanned. The idea would be that lowmem
         keeps those pages on a separate list until a reclaim for highmem pages
         arrives that splices the highmem pages back onto the LRU. It potentially
         could be implemented similar to the UNEVICTABLE list.
      
         That would reduce the skip rate with the potential corner case is that
         highmem pages have to be scanned and reclaimed to free lowmem slab pages.
      
      2. Linear scan lowmem pages if the initial LRU shrink fails
      
         This will break LRU ordering but may be preferable and faster during
         memory pressure than skipping LRU pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      599d0c95
    • mm, vmscan: move lru_lock to the node · a52633d8
      Mel Gorman authored
      Node-based reclaim requires node-based LRUs and locking.  This is a
      preparation patch that just moves the lru_lock to the node so later
      patches are easier to review.  It is a mechanical change but note this
      patch makes contention worse because the LRU lock is hotter and direct
      reclaim and kswapd can contend on the same lock even when reclaiming
      from different zones.
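
      A sketch of the helper that keeps callers readable while the lock
      lives on the node:

        static inline spinlock_t *zone_lru_lock(struct zone *zone)
        {
                return &zone->zone_pgdat->lru_lock;
        }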
      
      Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a52633d8
    • mm, vmstat: add infrastructure for per-node vmstats · 75ef7184
      Mel Gorman authored
      Patchset: "Move LRU page reclaim from zones to nodes v9"
      
      This series moves LRUs from the zones to the node.  While this is a
      current rebase, the test results were based on mmotm as of June 23rd.
      Conceptually, this series is simple but there are a lot of details.
      Some of the broad motivations for this are;
      
      1. The residency of a page partially depends on what zone the page was
         allocated from.  This is partially combatted by the fair zone allocation
         policy but that is a partial solution that introduces overhead in the
         page allocator paths.
      
      2. Currently, reclaim on node 0 behaves slightly different to node 1. For
         example, direct reclaim scans in zonelist order and reclaims even if
         the zone is over the high watermark regardless of the age of pages
         in that LRU. Kswapd on the other hand starts reclaim on the highest
         unbalanced zone. A difference in the distribution of file/anon pages,
         due to when they were allocated, can result in a difference in aging.
         While the fair zone allocation policy mitigates some of the problems
         here, the page reclaim results on a multi-zone node will always be
         different to a single-zone node.
      
      3. kswapd and the page allocator scan zones in the opposite order to
         avoid interfering with each other but it's sensitive to timing.  This
         mitigates the page allocator using pages that were allocated very recently
         in the ideal case. When kswapd is allocating
         from lower zones then it's great but during the rebalancing of the highest
         zone, the page allocator and kswapd interfere with each other. It's worse
         if the highest zone is small and difficult to balance.
      
      4. slab shrinkers are node-based which makes it harder to identify the exact
         relationship between slab reclaim and LRU reclaim.
      
      The reason we have zone-based reclaim is that we used to have
      large highmem zones in common configurations and it was necessary
      to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
      less of a concern as machines with lots of memory will (or should) use
      64-bit kernels. Combinations of 32-bit hardware and large amounts of
      memory are rare. Machines that do use highmem should have lower
      highmem:lowmem ratios than we worried about in the past.
      
      Conceptually, moving to node LRUs should be easier to understand. The
      page allocator plays fewer tricks to game reclaim and reclaim behaves
      similarly on all nodes.
      
      The series has been tested on a 16 core UMA machine and a 2-socket 48
      core NUMA machine. The UMA results are presented in most cases as the NUMA
      machine behaved similarly.
      
      pagealloc
      ---------
      
      This is a microbenchmark that shows the benefit of removing the fair zone
      allocation policy. It was tested up to order-4 but only orders 0 and 1 are
      shown as the other orders were comparable.
      
                                                 4.7.0-rc4                  4.7.0-rc4
                                            mmotm-20160623                 nodelru-v9
      Min      total-odr0-1               490.00 (  0.00%)           457.00 (  6.73%)
      Min      total-odr0-2               347.00 (  0.00%)           329.00 (  5.19%)
      Min      total-odr0-4               288.00 (  0.00%)           273.00 (  5.21%)
      Min      total-odr0-8               251.00 (  0.00%)           239.00 (  4.78%)
      Min      total-odr0-16              234.00 (  0.00%)           222.00 (  5.13%)
      Min      total-odr0-32              223.00 (  0.00%)           211.00 (  5.38%)
      Min      total-odr0-64              217.00 (  0.00%)           208.00 (  4.15%)
      Min      total-odr0-128             214.00 (  0.00%)           204.00 (  4.67%)
      Min      total-odr0-256             250.00 (  0.00%)           230.00 (  8.00%)
      Min      total-odr0-512             271.00 (  0.00%)           269.00 (  0.74%)
      Min      total-odr0-1024            291.00 (  0.00%)           282.00 (  3.09%)
      Min      total-odr0-2048            303.00 (  0.00%)           296.00 (  2.31%)
      Min      total-odr0-4096            311.00 (  0.00%)           309.00 (  0.64%)
      Min      total-odr0-8192            316.00 (  0.00%)           314.00 (  0.63%)
      Min      total-odr0-16384           317.00 (  0.00%)           315.00 (  0.63%)
      Min      total-odr1-1               742.00 (  0.00%)           712.00 (  4.04%)
      Min      total-odr1-2               562.00 (  0.00%)           530.00 (  5.69%)
      Min      total-odr1-4               457.00 (  0.00%)           433.00 (  5.25%)
      Min      total-odr1-8               411.00 (  0.00%)           381.00 (  7.30%)
      Min      total-odr1-16              381.00 (  0.00%)           356.00 (  6.56%)
      Min      total-odr1-32              372.00 (  0.00%)           346.00 (  6.99%)
      Min      total-odr1-64              372.00 (  0.00%)           343.00 (  7.80%)
      Min      total-odr1-128             375.00 (  0.00%)           351.00 (  6.40%)
      Min      total-odr1-256             379.00 (  0.00%)           351.00 (  7.39%)
      Min      total-odr1-512             385.00 (  0.00%)           355.00 (  7.79%)
      Min      total-odr1-1024            386.00 (  0.00%)           358.00 (  7.25%)
      Min      total-odr1-2048            390.00 (  0.00%)           362.00 (  7.18%)
      Min      total-odr1-4096            390.00 (  0.00%)           362.00 (  7.18%)
      Min      total-odr1-8192            388.00 (  0.00%)           363.00 (  6.44%)
      
      This shows a steady improvement throughout. The primary benefit is from
      reduced system CPU usage, which is obvious from the overall times:
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 nodelru-v8
      User          189.19      191.80
      System       2604.45     2533.56
      Elapsed      2855.30     2786.39
      
      The vmstats also showed that the fair zone allocation policy had indeed
      been removed, as can be seen here:
      
                                   4.7.0-rc3   4.7.0-rc3
                               mmotm-20160623 nodelru-v8
      DMA32 allocs               28794729769           0
      Normal allocs              48432501431 77227309877
      Movable allocs                       0           0
      
      tiobench on ext4
      ----------------
      
      tiobench is a benchmark that artificially benefits if old pages remain
      resident while new pages get reclaimed. The fair zone allocation policy
      mitigates this problem so pages age fairly. While the benchmark has
      problems, it is important that tiobench performance remains constant,
      as that implies the page aging problems the fair zone allocation policy
      fixes are not re-introduced.
      
                                               4.7.0-rc4             4.7.0-rc4
                                          mmotm-20160623            nodelru-v9
      Min      PotentialReadSpeed        89.65 (  0.00%)       90.21 (  0.62%)
      Min      SeqRead-MB/sec-1          82.68 (  0.00%)       82.01 ( -0.81%)
      Min      SeqRead-MB/sec-2          72.76 (  0.00%)       72.07 ( -0.95%)
      Min      SeqRead-MB/sec-4          75.13 (  0.00%)       74.92 ( -0.28%)
      Min      SeqRead-MB/sec-8          64.91 (  0.00%)       65.19 (  0.43%)
      Min      SeqRead-MB/sec-16         62.24 (  0.00%)       62.22 ( -0.03%)
      Min      RandRead-MB/sec-1          0.88 (  0.00%)        0.88 (  0.00%)
      Min      RandRead-MB/sec-2          0.95 (  0.00%)        0.92 ( -3.16%)
      Min      RandRead-MB/sec-4          1.43 (  0.00%)        1.34 ( -6.29%)
      Min      RandRead-MB/sec-8          1.61 (  0.00%)        1.60 ( -0.62%)
      Min      RandRead-MB/sec-16         1.80 (  0.00%)        1.90 (  5.56%)
      Min      SeqWrite-MB/sec-1         76.41 (  0.00%)       76.85 (  0.58%)
      Min      SeqWrite-MB/sec-2         74.11 (  0.00%)       73.54 ( -0.77%)
      Min      SeqWrite-MB/sec-4         80.05 (  0.00%)       80.13 (  0.10%)
      Min      SeqWrite-MB/sec-8         72.88 (  0.00%)       73.20 (  0.44%)
      Min      SeqWrite-MB/sec-16        75.91 (  0.00%)       76.44 (  0.70%)
      Min      RandWrite-MB/sec-1         1.18 (  0.00%)        1.14 ( -3.39%)
      Min      RandWrite-MB/sec-2         1.02 (  0.00%)        1.03 (  0.98%)
      Min      RandWrite-MB/sec-4         1.05 (  0.00%)        0.98 ( -6.67%)
      Min      RandWrite-MB/sec-8         0.89 (  0.00%)        0.92 (  3.37%)
      Min      RandWrite-MB/sec-16        0.92 (  0.00%)        0.93 (  1.09%)
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 approx-v9
      User          645.72      525.90
      System        403.85      331.75
      Elapsed      6795.36     6783.67
      
      This shows that the series has little or no impact on tiobench, which is
      desirable, along with a reduction in system CPU usage. It indicates that
      the fair zone allocation policy was removed in a manner that didn't
      reintroduce one class of page aging bug. There were only minor
      differences in overall reclaim activity:
      
                                   4.7.0-rc4   4.7.0-rc4
                                mmotm-20160623 nodelru-v8
      Minor Faults                    645838      647465
      Major Faults                       573         640
      Swap Ins                             0           0
      Swap Outs                            0           0
      DMA allocs                           0           0
      DMA32 allocs                  46041453    44190646
      Normal allocs                 78053072    79887245
      Movable allocs                       0           0
      Allocation stalls                   24          67
      Stall zone DMA                       0           0
      Stall zone DMA32                     0           0
      Stall zone Normal                    0           2
      Stall zone HighMem                   0           0
      Stall zone Movable                   0          65
      Direct pages scanned             10969       30609
      Kswapd pages scanned          93375144    93492094
      Kswapd pages reclaimed        93372243    93489370
      Direct pages reclaimed           10969       30609
      Kswapd efficiency                  99%         99%
      Kswapd velocity              13741.015   13781.934
      Direct efficiency                 100%        100%
      Direct velocity                  1.614       4.512
      Percentage direct scans             0%          0%
      
      kswapd activity was roughly comparable. There were differences in direct
      reclaim activity, but they are negligible in the context of the overall
      workload (a velocity of 4 pages per second with the patches applied
      versus 1.6 pages per second in the baseline kernel).
      
      pgbench read-only large configuration on ext4
      ---------------------------------------------
      
      pgbench is a database benchmark that can be sensitive to page reclaim
      decisions. This also checks whether removing the fair zone allocation
      policy is safe.
      
      pgbench Transactions
                              4.7.0-rc4             4.7.0-rc4
                         mmotm-20160623            nodelru-v8
      Hmean    1       188.26 (  0.00%)      189.78 (  0.81%)
      Hmean    5       330.66 (  0.00%)      328.69 ( -0.59%)
      Hmean    12      370.32 (  0.00%)      380.72 (  2.81%)
      Hmean    21      368.89 (  0.00%)      369.00 (  0.03%)
      Hmean    30      382.14 (  0.00%)      360.89 ( -5.56%)
      Hmean    32      428.87 (  0.00%)      432.96 (  0.95%)
      
      Negligible differences again. As with tiobench, overall reclaim activity
      was comparable.
      
      bonnie++ on ext4
      ----------------
      
      No interesting performance difference, negligible differences on reclaim
      stats.
      
      paralleldd on ext4
      ------------------
      
      This workload uses varying numbers of dd instances to read large amounts of
      data from disk.
      
                                     4.7.0-rc3             4.7.0-rc3
                                mmotm-20160623            nodelru-v9
      Amean    Elapsd-1       186.04 (  0.00%)      189.41 ( -1.82%)
      Amean    Elapsd-3       192.27 (  0.00%)      191.38 (  0.46%)
      Amean    Elapsd-5       185.21 (  0.00%)      182.75 (  1.33%)
      Amean    Elapsd-7       183.71 (  0.00%)      182.11 (  0.87%)
      Amean    Elapsd-12      180.96 (  0.00%)      181.58 ( -0.35%)
      Amean    Elapsd-16      181.36 (  0.00%)      183.72 ( -1.30%)
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 nodelru-v9
      User         1548.01     1552.44
      System       8609.71     8515.08
      Elapsed      3587.10     3594.54
      
      There is little or no change in performance but some drop in system CPU usage.
      
                                   4.7.0-rc3   4.7.0-rc3
                              mmotm-20160623  nodelru-v9
      Minor Faults                    362662      367360
      Major Faults                      1204        1143
      Swap Ins                            22           0
      Swap Outs                         2855        1029
      DMA allocs                           0           0
      DMA32 allocs                  31409797    28837521
      Normal allocs                 46611853    49231282
      Movable allocs                       0           0
      Direct pages scanned                 0           0
      Kswapd pages scanned          40845270    40869088
      Kswapd pages reclaimed        40830976    40855294
      Direct pages reclaimed               0           0
      Kswapd efficiency                  99%         99%
      Kswapd velocity              11386.711   11369.769
      Direct efficiency                 100%        100%
      Direct velocity                  0.000       0.000
      Percentage direct scans             0%          0%
      Page writes by reclaim            2855        1029
      Page writes file                     0           0
      Page writes anon                  2855        1029
      Page reclaim immediate             771        1628
      Sector Reads                 293312636   293536360
      Sector Writes                 18213568    18186480
      Page rescued immediate               0           0
      Slabs scanned                   128257      132747
      Direct inode steals                181          56
      Kswapd inode steals                 59        1131
      
      It basically shows that kswapd was active at roughly the same rate in
      both kernels. There was also comparable slab scanning activity and direct
      reclaim was avoided in both cases. There appears to be a large difference
      in the number of inodes reclaimed, but the workload has few active
      inodes, so this is likely a timing artifact.
      
      stutter
      -------
      
      stutter simulates a simple workload. One part uses a lot of anonymous
      memory, a second measures mmap latency and a third copies a large file.
      The primary metric is mmap latency.
      
      stutter
                                   4.7.0-rc4             4.7.0-rc4
                              mmotm-20160623            nodelru-v8
      Min         mmap     16.6283 (  0.00%)     13.4258 ( 19.26%)
      1st-qrtle   mmap     54.7570 (  0.00%)     34.9121 ( 36.24%)
      2nd-qrtle   mmap     57.3163 (  0.00%)     46.1147 ( 19.54%)
      3rd-qrtle   mmap     58.9976 (  0.00%)     47.1882 ( 20.02%)
      Max-90%     mmap     59.7433 (  0.00%)     47.4453 ( 20.58%)
      Max-93%     mmap     60.1298 (  0.00%)     47.6037 ( 20.83%)
      Max-95%     mmap     73.4112 (  0.00%)     82.8719 (-12.89%)
      Max-99%     mmap     92.8542 (  0.00%)     88.8870 (  4.27%)
      Max         mmap   1440.6569 (  0.00%)    121.4201 ( 91.57%)
      Mean        mmap     59.3493 (  0.00%)     42.2991 ( 28.73%)
      Best99%Mean mmap     57.2121 (  0.00%)     41.8207 ( 26.90%)
      Best95%Mean mmap     55.9113 (  0.00%)     39.9620 ( 28.53%)
      Best90%Mean mmap     55.6199 (  0.00%)     39.3124 ( 29.32%)
      Best50%Mean mmap     53.2183 (  0.00%)     33.1307 ( 37.75%)
      Best10%Mean mmap     45.9842 (  0.00%)     20.4040 ( 55.63%)
      Best5%Mean  mmap     43.2256 (  0.00%)     17.9654 ( 58.44%)
      Best1%Mean  mmap     32.9388 (  0.00%)     16.6875 ( 49.34%)
      
      This shows a number of improvements with the worst-case outlier greatly
      improved.
      
      Some of the vmstats are interesting:
      
                                   4.7.0-rc4   4.7.0-rc4
                            mmotm-20160623 nodelru-v8
      Swap Ins                           163         502
      Swap Outs                            0           0
      DMA allocs                           0           0
      DMA32 allocs                 618719206  1381662383
      Normal allocs                891235743   564138421
      Movable allocs                       0           0
      Allocation stalls                 2603           1
      Direct pages scanned            216787           2
      Kswapd pages scanned          50719775    41778378
      Kswapd pages reclaimed        41541765    41777639
      Direct pages reclaimed          209159           0
      Kswapd efficiency                  81%         99%
      Kswapd velocity              16859.554   14329.059
      Direct efficiency                  96%          0%
      Direct velocity                 72.061       0.001
      Percentage direct scans             0%          0%
      Page writes by reclaim         6215049           0
      Page writes file               6215049           0
      Page writes anon                     0           0
      Page reclaim immediate           70673          90
      Sector Reads                  81940800    81680456
      Sector Writes                100158984    98816036
      Page rescued immediate               0           0
      Slabs scanned                  1366954       22683
      
      While this is not guaranteed in all cases, this particular test showed
      a large reduction in direct reclaim activity. It's also worth noting
      that no page writes were issued from reclaim context.
      
      This series is not without its hazards. There are at least three areas
      that I'm concerned about, even though I could not reproduce any problems
      in those areas.
      
      1. Reclaim/compaction is going to be affected because the amount of reclaim is
         no longer targeted at a specific zone. Compaction works on a per-zone basis
         so there is no guarantee that reclaiming a few THPs' worth of pages will
         have a positive impact on compaction success rates.
      
      2. The slab/LRU reclaim ratio is affected because the frequency with which
         the shrinkers are called is now different. This may or may not be a
         problem, but if it is, it'll be because shrinkers are not called enough
         and some balancing is required.
      
      3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
         distributed between zones and the fair zone allocation policy used to do
         something very similar for anon. The distribution is now different, though
         not necessarily in any way that matters, but it's still worth bearing in mind.
      
      VM statistic counters for reclaim decisions are zone-based.  If the kernel
      is to reclaim on a per-node basis then we need to track per-node
      statistics, but there is no infrastructure for that yet.  The most notable
      change is that the old node_page_state is renamed to
      sum_zone_node_page_state.  The new node_page_state takes a pglist_data and
      uses per-node stats, though none exist yet.  There is some renaming, such
      as vm_stat to vm_zone_stat, the addition of vm_node_stat, and the renaming
      of mod_state to mod_zone_state.  Otherwise, this is mostly a mechanical
      patch with no functional change.  There is a lot of similarity between the
      node and zone helpers, which is unfortunate, but there was no obvious way
      of reusing the code while maintaining type safety.
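
      For concreteness, a hedged sketch of how the split reads to callers
      (NODE_DATA usage and the stat item names are illustrative, not
      necessarily items this patch converts):

          /* zone counters, summed over all zones of a node: */
          unsigned long free = sum_zone_node_page_state(nid, NR_FREE_PAGES);

          /* true per-node counters; the node_stat_item entries only
           * arrive in later patches of the series: */
          unsigned long file = node_page_state(NODE_DATA(nid), NR_FILE_PAGES);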
      
      Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75ef7184
    • J
      mm: fix vm-scalability regression in cgroup-aware workingset code · 55779ec7
      Johannes Weiner 提交于
      Commit 23047a96 ("mm: workingset: per-cgroup cache thrash
      detection") added a page->mem_cgroup lookup to the cache eviction,
      refault, and activation paths, as well as locking to the activation
      path, and the vm-scalability tests showed a regression of -23%.
      
      While the test in question is an artificial worst-case scenario that
      doesn't occur in real workloads - reading two sparse files in parallel
      at full CPU speed just to hammer the LRU paths - there are still some
      optimizations that can be done in those paths.
      
      Inline the lookup functions to eliminate calls.  Also, page->mem_cgroup
      doesn't need to be stabilized when counting an activation; we merely
      need to hold the RCU lock to prevent the memcg from being freed.
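
      A minimal sketch of the activation-side pattern this describes; the
      counting helper is hypothetical and the real counting site differs:

          struct mem_cgroup *memcg;

          rcu_read_lock();        /* prevents the memcg from being freed */
          memcg = READ_ONCE(page->mem_cgroup);
          if (memcg)
                  count_memcg_activation(memcg); /* hypothetical counter */
          rcu_read_unlock();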
      
      This cuts down on overhead quite a bit:
      
      23047a96 063f6715e77a7be5770d6081fe
      ---------------- --------------------------
               %stddev     %change         %stddev
                   \          |                \
        21621405 +- 0%     +11.3%   24069657 +- 2%  vm-scalability.throughput
      
      [linux@roeck-us.net: drop unnecessary include file]
      [hannes@cmpxchg.org: add WARN_ON_ONCE()s]
        Link: http://lkml.kernel.org/r/20160707194024.GA26580@cmpxchg.org
      Link: http://lkml.kernel.org/r/20160624175101.GA3024@cmpxchg.org
      Reported-by: NYe Xiaolong <xiaolong.ye@intel.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      55779ec7
    • M
      mm, oom_reaper: do not attempt to reap a task more than twice · 11a410d5
      Michal Hocko 提交于
      oom_reaper relies on the mmap_sem for read to do its job.  Many places
      which might block readers have been converted to use down_write_killable
      and that has reduced the chances of contention a lot.  Some paths where
      the mmap_sem is held for write can take other locks, and they might
      either not be prepared to fail on a pending fatal signal or be too
      impractical to change.
      
      This patch introduces the MMF_OOM_NOT_REAPABLE flag, which gets set after
      the first attempt to reap a task's mm fails.  If the flag is already set
      when a later attempt fails, we set MMF_OOM_REAPED to hide this mm from
      the oom killer completely so it can go and choose another victim.
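
      A hedged sketch of the two-strike logic described above (surrounding
      reaper code paraphrased, not quoted from the patch):

          /* an attempt to reap mm has just failed: */
          if (test_and_set_bit(MMF_OOM_NOT_REAPABLE, &mm->flags)) {
                  /* the bit was already set, so this is at least the
                   * second failure: hide the mm from the oom killer
                   * so it can choose another victim */
                  set_bit(MMF_OOM_REAPED, &mm->flags);
          }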
      
      As a result, the risk of an OOM deadlock - where the oom victim is
      blocked indefinitely so the oom killer cannot make any progress - should
      be mitigated considerably, while we still try hard to perform all
      reclaim attempts and keep the behavior predictable.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11a410d5
    • M
      mm, oom: fortify task_will_free_mem() · 1af8bb43
      Michal Hocko 提交于
      task_will_free_mem is rather weak.  It doesn't really tell whether the
      task has a chance to drop its mm.  98748bd7 ("oom: consider
      multi-threaded tasks in task_will_free_mem") made a first step toward
      making it more robust for multi-threaded applications, so now we know
      that the whole process is going down and will probably drop the mm.
      
      This patch builds on top of that for more complex scenarios where the
      mm is shared between different processes - CLONE_VM without
      CLONE_SIGHAND, or in-kernel use_mm().
      
      Make sure that all processes sharing the mm are killed or exiting.  This
      will allow us to replace try_oom_reaper with wake_oom_reaper, because
      task_will_free_mem now implies the task is reapable.  Therefore all
      paths which bypass the oom killer are now reapable, so they shouldn't
      lock up the oom killer.
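
      In outline, the strengthened check looks roughly like this (helper
      names paraphrased from the description; the real code also handles
      locking and mm refcounts):

          static bool task_will_free_mem(struct task_struct *task)
          {
                  struct mm_struct *mm = task->mm;
                  struct task_struct *p;

                  /* the old single-process heuristic applies first */
                  if (!__task_will_free_mem(task))
                          return false;

                  /* every other process sharing the mm must also be
                   * exiting or killed, or the mm is not reapable */
                  for_each_process(p) {
                          if (p->mm != mm || same_thread_group(p, task))
                                  continue;
                          if (!__task_will_free_mem(p))
                                  return false;
                  }
                  return true;
          }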
      
      Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1af8bb43