1. 13 December 2016, 27 commits
    • P
      mm/mempolicy.c: forbid static or relative flags for local NUMA mode · 8d303e44
      Piotr Kwapulinski authored
      The MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES flags are irrelevant
      when setting them for MPOL_LOCAL NUMA memory policy via set_mempolicy or
      mbind.
      
      Return the "invalid argument" from set_mempolicy and mbind whenever any
      of these flags is passed along with MPOL_LOCAL.
      
      It is consistent with MPOL_PREFERRED passed with empty nodemask.
      
      It also slightly shortens the execution time in paths where these flags
      are checked, e.g. when rebinding the NUMA nodes after a change in the
      cgroup cpuset mems (mpol_rebind_preferred()) or when just printing the
      mempolicy structure (/proc/PID/numa_maps).  Isolated tests were done.
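
      A minimal user-space illustration of the new behaviour (a sketch only;
      it assumes <numaif.h> exposes MPOL_LOCAL and MPOL_F_STATIC_NODES and
      that the program is linked with -lnuma):

        #include <errno.h>
        #include <numaif.h>     /* set_mempolicy(), MPOL_* flags */
        #include <stdio.h>

        int main(void)
        {
                /*
                 * MPOL_LOCAL takes no nodemask, so combining it with
                 * MPOL_F_STATIC_NODES (or MPOL_F_RELATIVE_NODES) is
                 * meaningless; with this patch the kernel rejects it.
                 */
                if (set_mempolicy(MPOL_LOCAL | MPOL_F_STATIC_NODES, NULL, 0) == -1)
                        printf("rejected: errno=%d (EINVAL=%d)\n", errno, EINVAL);
                return 0;
        }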
      
      Link: http://lkml.kernel.org/r/20161027163037.4089-1-kwapulinski.piotr@gmail.com
      Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Liang Chen <liangchen.linux@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Nathan Zimmer <nzimmer@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d303e44
    • L
      mm: fix up get_user_pages* comments · 80a79516
      Lorenzo Stoakes authored
      In the previous round of get_user_pages* changes, the comments attached
      to __get_user_pages_unlocked() and get_user_pages_unlocked() were
      rendered incorrect; this patch corrects them.
      
      In addition, the get_user_pages_unlocked() comment was already outdated,
      as it referred to the tsk and mm parameters that were removed in
      c12d2da5 ("mm/gup: Remove the macro overload API migration helpers from
      the get_user*() APIs").  This patch fixes that as well.
      
      Link: http://lkml.kernel.org/r/20161025233435.5338-1-lstoakes@gmail.com
      Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80a79516
    • A
      mm: remove the page size change check in tlb_remove_page · 692a68c1
      Aneesh Kumar K.V authored
      Now that we check for page size change early in the loop, we can
      partially revert e9d55e15 ("mm: change the interface for
      __tlb_remove_page").
      
      This simplifies the code considerably by removing the need to track the
      last address with which we adjusted the range.  We also go back to the
      older way of filling the mmu_gather array, i.e., we add an entry and
      then check whether the gather batch is full.
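
      A rough sketch of that "add an entry, then check whether the batch is
      full" pattern (structure and names approximate, not the verbatim kernel
      code):

        static inline void tlb_remove_page_sketch(struct mmu_gather *tlb,
                                                  struct page *page)
        {
                struct mmu_gather_batch *batch = tlb->active;

                batch->pages[batch->nr++] = page;       /* add the entry ...   */
                if (batch->nr == batch->max)            /* ... then check full */
                        tlb_flush_mmu(tlb);
        }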
      
      Link: http://lkml.kernel.org/r/20161026084839.27299-6-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      692a68c1
    • A
      mm: add tlb_remove_check_page_size_change to track page size change · 07e32661
      Aneesh Kumar K.V authored
      With commit e77b0852 ("mm/mmu_gather: track page size with mmu
      gather and force flush if page size change") we added the ability to
      force a TLB flush when the page size changes within a mmu_gather loop.
      We did that by checking for a page size change every time we added a
      page to the mmu_gather for lazy flush/remove.  We can improve this by
      moving the page size change check earlier instead of doing it on every
      page we add.
      
      This also helps us to do a TLB flush when invalidating a range covering
      a DAX mapping.  For a DAX mapping we don't have a backing struct page
      and hence we don't call tlb_remove_page(), which earlier forced the TLB
      flush on a page size change.  Moving the page size change check earlier
      means we will now do the same even for DAX mappings.
      
      We also avoid doing this check on architectures other than powerpc.
      
      In a later patch we will remove page size check from tlb_remove_page().
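
      A hedged sketch of the early check (simplified, not the exact helper
      added here): the caller announces the page size once per region instead
      of checking it on every page:

        static inline void tlb_check_page_size_change(struct mmu_gather *tlb,
                                                      unsigned int page_size)
        {
                /* force a flush when the page size being torn down changes */
                if (tlb->page_size && tlb->page_size != page_size)
                        tlb_flush_mmu(tlb);
                tlb->page_size = page_size;
        }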
      
      Link: http://lkml.kernel.org/r/20161026084839.27299-5-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      07e32661
    • A
      mm/hugetlb: add tlb_remove_hugetlb_entry for handling hugetlb pages · b528e4b6
      Aneesh Kumar K.V authored
      This adds tlb_remove_hugetlb_entry(), similar to tlb_remove_pmd_tlb_entry().
      
      Link: http://lkml.kernel.org/r/20161026084839.27299-4-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b528e4b6
    • A
      mm: use the correct page size when removing the page · c0f2e176
      Aneesh Kumar K.V authored
      We are removing a pmd hugepage here.  Use the correct page size.
      
      Link: http://lkml.kernel.org/r/20161026084839.27299-2-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0f2e176
    • A
      shmem: avoid maybe-uninitialized warning · 23f919d4
      Arnd Bergmann authored
      After enabling -Wmaybe-uninitialized warnings, we get a false-positive
      warning for shmem:
      
        mm/shmem.c: In function `shmem_getpage_gfp':
        include/linux/spinlock.h:332:21: error: `info' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      This can be easily avoided, since the correct 'info' pointer is known at
      the time we first enter the function, so we can simply move the
      initialization up.  Moving it before the first label avoids the warning
      and lets us remove two later initializations.
      
      Note that the function is so hard to read that it confuses not only the
      compiler but also most readers; without this patch it could easily
      break if one of the 'goto's changed.
      
      Link: https://www.spinics.net/lists/kernel/msg2368133.html
      Link: http://lkml.kernel.org/r/20161024205725.786455-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23f919d4
    • M
      mm, compaction: fix NR_ISOLATED_* stats for pfn based migration · 6afcf8ef
      Ming Ling authored
      Since commit bda807d4 ("mm: migrate: support non-lru movable page
      migration") isolate_migratepages_block() can isolate !PageLRU pages,
      which acct_isolated() then accounts as NR_ISOLATED_*.  Accounting these
      non-LRU pages as NR_ISOLATED_{ANON,FILE} doesn't make any sense and can
      misguide heuristics based on those counters, such as
      pgdat_reclaimable_pages resp. too_many_isolated, which would lead to
      unexpected stalls during direct reclaim without any good reason.  Note
      that __alloc_contig_migrate_range can isolate a lot of pages at once.
      
      On mobile devices such as an Android phone with 512M of RAM, a large
      zram swap may be used.  In some cases zram (zsmalloc) uses too many
      non-LRU but migratable pages, for example:
      
            MemTotal: 468148 kB
            Normal free:5620kB
            Free swap:4736kB
            Total swap:409596kB
            ZRAM: 164616kB(zsmalloc non-lru pages)
            active_anon:60700kB
            inactive_anon:60744kB
            active_file:34420kB
            inactive_file:37532kB
      
      Fix this by accounting only LRU pages to NR_ISOLATED_* in
      isolate_migratepages_block(), right after they are isolated and we
      still know they were on the LRU.  Drop acct_isolated() because it is
      called after the fact and we have lost that information; batching the
      per-cpu counter doesn't bring much improvement anyway.  Also make sure
      that we uncharge only LRU pages when putting them back on the LRU in
      putback_movable_pages() resp. when unmap_and_move() migrates the page.
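
      A hedged sketch of where the accounting now happens (inside the
      isolation loop; not a verbatim diff):

        /*
         * Sketch: account a page as isolated right after it has been taken
         * off the LRU in isolate_migratepages_block().  Non-LRU movable
         * pages (e.g. zsmalloc-backed ones) skip this accounting entirely.
         */
        inc_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));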
      
      [mhocko@suse.com: replace acct_isolated() with direct counting]
      Fixes: bda807d4 ("mm: migrate: support non-lru movable page migration")
      Link: http://lkml.kernel.org/r/20161019080240.9682-1-mhocko@kernel.org
      Signed-off-by: Ming Ling <ming.ling@spreadtrum.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6afcf8ef
    • M
      mm, mempolicy: clean up __GFP_THISNODE confusion in policy_zonelist · 6d840958
      Michal Hocko authored
      __GFP_THISNODE is documented to enforce that the allocation be satisfied
      from the requested node with no fallbacks or placement policy
      enforcements.  policy_zonelist seemingly breaks this semantic if the
      current policy is MPOL_BIND: instead of taking the requested node it
      falls back to the first node in the mask if the requested one is not in
      the mask.  This is confusing to say the least, because in fact we should
      never take that path.  First, tasks shouldn't be scheduled on CPUs with
      nodes outside of their mempolicy binding.  And secondly, policy_zonelist
      is called from only 3 places:
      
       - huge_zonelist - should never see __GFP_THISNODE on this path
      
       - alloc_pages_vma - which shouldn't depend on __GFP_THISNODE either
      
       - alloc_pages_current - which uses default_policy if __GFP_THISNODE is
         used
      
      So we shouldn't even need to care about this possibility and can drop
      the confusing code.  Let's keep a WARN_ON_ONCE in place to catch
      potential users and fix them up properly (i.e. make them use a different
      allocation function which ignores mempolicy).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20161013125958.32155-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d840958
    • D
      mm, thp: avoid unlikely branches for split_huge_pmd · fd60775a
      David Rientjes authored
      While doing MADV_DONTNEED on a large area of thp memory, I noticed we
      encountered many unlikely() branches in profiles for each backing
      hugepage.  This is because zap_pmd_range() would call split_huge_pmd(),
      which rechecked the conditions that were already validated, but as part
      of an unlikely() branch.
      
      Avoid the unlikely() branches by calling __split_huge_pmd() directly in
      contexts where the pmd is already known to be good.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1610181600300.84525@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd60775a
    • Z
      mm/vmalloc.c: simplify /proc/vmallocinfo implementation · 3f500069
      zijun_hu authored
      Many seq_file helpers exist to simplify the implementation of virtual
      files, especially /proc nodes.  However, the helpers for iterating over
      a list_head, although available, are not currently used to implement
      /proc/vmallocinfo.
      
      Simplify the /proc/vmallocinfo implementation by using the existing
      seq_file helpers.
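
      For illustration, the seq_operations callbacks can be expressed
      directly with the list_head iteration helpers (a sketch assuming the
      walk stays over vmap_area_list under vmap_area_lock, as
      /proc/vmallocinfo does):

        static void *s_start(struct seq_file *m, loff_t *pos)
        {
                spin_lock(&vmap_area_lock);
                return seq_list_start(&vmap_area_list, *pos);
        }

        static void *s_next(struct seq_file *m, void *p, loff_t *pos)
        {
                return seq_list_next(p, &vmap_area_list, pos);
        }

        static void s_stop(struct seq_file *m, void *p)
        {
                spin_unlock(&vmap_area_lock);
        }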
      
      Link: http://lkml.kernel.org/r/57FDF2E5.1000201@zoho.com
      Signed-off-by: zijun_hu <zijun_hu@htc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f500069
    • M
      mm: make unreserve highatomic functions reliable · 29fac03b
      Minchan Kim authored
      Currently, unreserve_highatomic_pageblock() bails out as soon as it
      finds a highatomic pageblock, regardless of whether it actually moved
      any free pages off of it.  That can defeat the goal of the unreserve
      logic, which is to save a process from OOM.
      
      This patch makes the unreserve function bail out only once it has moved
      some pages to the !highatomic free list, avoiding such false positives.
      
      Another potential problem is that, due to a race between page freeing
      and the highatomic reserve function, pages can sit on the highatomic
      free list even though the pageblock no longer has the highatomic
      migratetype.  In that case, unreserve_highatomic_pageblock() can be a
      no-op if the highatomic reserve count is less than pageblock_nr_pages.
      We can solve this simply by draining all of the reserved pages before
      the OOM; it acts as a safeguard, exhausting the reserved pages before
      converging to OOM.
      
      Link: http://lkml.kernel.org/r/1476259429-18279-5-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29fac03b
    • M
      mm: try to exhaust highatomic reserve before the OOM · 04c8716f
      Minchan Kim authored
      I got an OOM report from the production team on a v4.4 kernel.  The
      system had enough free memory but failed to allocate a GFP_KERNEL
      order-0 page and finally encountered an OOM kill.  It occurred during a
      QA process which launches several apps, switches between them and so
      on.  It happened rarely; IOW, in the normal situation it was not a
      problem, but if we are unlucky and several apps use peak memory at the
      same time, it can happen.  If we manage to get past that phase, the
      system keeps working well.
      
      I could reproduce it easily with my test (a memory spike).  See below.
      
      The reason is that the free pages (19M) of the DMA32 zone are reserved
      for HIGHORDERATOMIC and are not unreserved before the OOM.
      
        balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
        balloon cpuset=/ mems_allowed=0
        CPU: 1 PID: 8473 Comm: balloon Tainted: G        W  OE   4.8.0-rc7-00219-g3f74c9559583-dirty #3161
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          dump_header+0x5c/0x1ce
          oom_kill_process+0x22e/0x400
          out_of_memory+0x1ac/0x210
          __alloc_pages_nodemask+0x101e/0x1040
          handle_mm_fault+0xa0a/0xbf0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:383949 inactive_anon:106724 isolated_anon:0
         active_file:15 inactive_file:44 isolated_file:0
         unevictable:0 dirty:0 writeback:24 unstable:0
         slab_reclaimable:2483 slab_unreclaimable:3326
         mapped:0 shmem:0 pagetables:1906 bounce:0
         free:6898 free_pcp:291 free_cma:0
        Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
        DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
        DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
        51131 total pagecache pages
        50795 pages in swap cache
        Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
        Free swap  = 8kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12658 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      Another example, where the limit was exceeded due to the race, is:
      
        in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
        CPU: 0 PID: 476 Comm: in:imklog Tainted: G            E   4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          warn_alloc_failed+0xdb/0x130
          __alloc_pages_nodemask+0x4d6/0xdb0
          new_slab+0x339/0x490
          ___slab_alloc.constprop.74+0x367/0x480
          __slab_alloc.constprop.73+0x20/0x40
          __kmalloc+0x1a4/0x1e0
          alloc_indirect.isra.14+0x1d/0x50
          virtqueue_add_sgs+0x1c4/0x470
          __virtblk_add_req+0xae/0x1f0
          virtio_queue_rq+0x12d/0x290
          __blk_mq_run_hw_queue+0x239/0x370
          blk_mq_run_hw_queue+0x8f/0xb0
          blk_mq_insert_requests+0x18c/0x1a0
          blk_mq_flush_plug_list+0x125/0x140
          blk_flush_plug_list+0xc7/0x220
          blk_finish_plug+0x2c/0x40
          __do_page_cache_readahead+0x196/0x230
          filemap_fault+0x448/0x4f0
          ext4_filemap_fault+0x36/0x50
          __do_fault+0x75/0x140
          handle_mm_fault+0x84d/0xbe0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:363826 inactive_anon:121283 isolated_anon:32
         active_file:65 inactive_file:152 isolated_file:0
         unevictable:0 dirty:0 writeback:46 unstable:0
         slab_reclaimable:2778 slab_unreclaimable:3070
         mapped:112 shmem:0 pagetables:1822 bounce:0
         free:9469 free_pcp:231 free_cma:0
        Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
        DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
        DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
        2775 total pagecache pages
        2536 pages in swap cache
        Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
        Free swap  = 108744kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12648 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      It's weird that the zone shows enough free memory above the min
      watermark but OOMs on a 4K GFP_KERNEL allocation because of the
      reserved highatomic pages.  As a last resort, try to unreserve the
      highatomic pages again and, if that moved any pages to the
      non-highatomic free list, retry reclaim once more.
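
      A hedged sketch of the resulting last-resort step in the allocator's
      retry logic (variable names approximate):

        /*
         * Sketch: once the no-progress loops are exhausted, draining a
         * highatomic pageblock (force = true) is the final attempt; if it
         * moved pages back to the normal free lists, reclaim is retried
         * once more instead of declaring OOM.
         */
        if (no_progress_loops > MAX_RECLAIM_RETRIES) {
                /* Before OOM, exhaust the highatomic reserve */
                return unreserve_highatomic_pageblock(ac, true);
        }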
      
      Link: http://lkml.kernel.org/r/1476259429-18279-4-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04c8716f
    • M
      mm: prevent double decrease of nr_reserved_highatomic · 4855e4a7
      Minchan Kim authored
      There is a race between page freeing and unreserving highatomic pageblocks.
      
       CPU 0				    CPU 1
      
          free_hot_cold_page
            mt = get_pfnblock_migratetype
            set_pcppage_migratetype(page, mt)
          				    unreserve_highatomic_pageblock
          				    spin_lock_irqsave(&zone->lock)
          				    move_freepages_block
          				    set_pageblock_migratetype(page)
          				    spin_unlock_irqrestore(&zone->lock)
            free_pcppages_bulk
              __free_one_page(mt) <- mt is stale
      
      Because of the above race, a page on CPU 0 could end up on a
      non-highatomic free list, since the pageblock's type has been changed.
      The highatomic unreserve logic can then decrease the reserved count for
      the same pageblock several times, creating a mismatch between
      nr_reserved_highatomic and the number of reserved pageblocks.
      
      So, this patch verifies whether the pageblock is highatomic and
      decreases the count only if the pageblock is highatomic.
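
      A hedged sketch of the check (simplified from the unreserve loop):

        /*
         * Sketch: only adjust nr_reserved_highatomic when the pageblock is
         * still really MIGRATE_HIGHATOMIC, so a stale page sitting on the
         * highatomic free list cannot decrement the counter for the same
         * pageblock more than once.
         */
        if (get_pageblock_migratetype(page) == MIGRATE_HIGHATOMIC)
                zone->nr_reserved_highatomic -= min(pageblock_nr_pages,
                                                    zone->nr_reserved_highatomic);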
      
      Link: http://lkml.kernel.org/r/1476259429-18279-3-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4855e4a7
    • M
      mm: don't steal highatomic pageblock · 88ed365e
      Minchan Kim authored
      Patch series "use up highorder free pages before OOM", v3.
      
      I got an OOM report from the production team on a v4.4 kernel.  The
      system had enough free memory but failed to allocate a GFP_KERNEL
      order-0 page and finally encountered an OOM kill.  It occurred during a
      QA process which launches several apps, switches between them and so
      on.  It happened rarely; IOW, in the normal situation it was not a
      problem, but if we are unlucky and several apps use peak memory at the
      same time, it can happen.  If we manage to get past that phase, the
      system keeps working well.
      
      I could reproduce it easily with my test (a memory spike).  See below.
      
      The reason is that the free pages (19M) of the DMA32 zone are reserved
      for HIGHORDERATOMIC and are not unreserved before the OOM.
      
        balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
        balloon cpuset=/ mems_allowed=0
        CPU: 1 PID: 8473 Comm: balloon Tainted: G        W  OE   4.8.0-rc7-00219-g3f74c9559583-dirty #3161
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          dump_header+0x5c/0x1ce
          oom_kill_process+0x22e/0x400
          out_of_memory+0x1ac/0x210
          __alloc_pages_nodemask+0x101e/0x1040
          handle_mm_fault+0xa0a/0xbf0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:383949 inactive_anon:106724 isolated_anon:0
         active_file:15 inactive_file:44 isolated_file:0
         unevictable:0 dirty:0 writeback:24 unstable:0
         slab_reclaimable:2483 slab_unreclaimable:3326
         mapped:0 shmem:0 pagetables:1906 bounce:0
         free:6898 free_pcp:291 free_cma:0
        Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
        DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
        DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
        51131 total pagecache pages
        50795 pages in swap cache
        Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
        Free swap  = 8kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12658 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      Another example, where the limit was exceeded due to the race, is:
      
        in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
        CPU: 0 PID: 476 Comm: in:imklog Tainted: G            E   4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          warn_alloc_failed+0xdb/0x130
          __alloc_pages_nodemask+0x4d6/0xdb0
          new_slab+0x339/0x490
          ___slab_alloc.constprop.74+0x367/0x480
          __slab_alloc.constprop.73+0x20/0x40
          __kmalloc+0x1a4/0x1e0
          alloc_indirect.isra.14+0x1d/0x50
          virtqueue_add_sgs+0x1c4/0x470
          __virtblk_add_req+0xae/0x1f0
          virtio_queue_rq+0x12d/0x290
          __blk_mq_run_hw_queue+0x239/0x370
          blk_mq_run_hw_queue+0x8f/0xb0
          blk_mq_insert_requests+0x18c/0x1a0
          blk_mq_flush_plug_list+0x125/0x140
          blk_flush_plug_list+0xc7/0x220
          blk_finish_plug+0x2c/0x40
          __do_page_cache_readahead+0x196/0x230
          filemap_fault+0x448/0x4f0
          ext4_filemap_fault+0x36/0x50
          __do_fault+0x75/0x140
          handle_mm_fault+0x84d/0xbe0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:363826 inactive_anon:121283 isolated_anon:32
         active_file:65 inactive_file:152 isolated_file:0
         unevictable:0 dirty:0 writeback:46 unstable:0
         slab_reclaimable:2778 slab_unreclaimable:3070
         mapped:112 shmem:0 pagetables:1822 bounce:0
         free:9469 free_pcp:231 free_cma:0
        Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
        DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
        DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
        2775 total pagecache pages
        2536 pages in swap cache
        Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
        Free swap  = 108744kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12648 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      During the investigation, I found some problems with highatomic, so
      this patch series aims to solve them; the final goal is to unreserve
      every highatomic free page before the OOM kill.
      
      This patch (of 4):
      
      In the page freeing path, the migratetype is racy, so a highatomic page
      could be freed into a non-highatomic free list.  If that page is then
      allocated, the VM can change the pageblock from highatomic to something
      else.  In that case, highatomic pageblock accounting is broken and
      stops working (e.g., the VM can no longer reserve highatomic pageblocks
      even though it hasn't reached the 1% limit).
      
      So, this patch prohibits changing a pageblock from highatomic to
      another type.  That is fine because MIGRATE_HIGHATOMIC is not listed in
      the fallback array, so stealing would only happen due to unexpected
      races, which are really rare.  Also, such a prohibition keeps
      highatomic pageblocks around longer, which is better for high-order
      atomic page allocations.
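
      A hedged sketch of the rule this enforces in the fallback/steal path
      (names approximate, not the verbatim diff):

        /*
         * Sketch: when deciding whether to convert ("steal") a pageblock to
         * the requested migratetype, leave MIGRATE_HIGHATOMIC pageblocks
         * alone; they only lose that type through an explicit unreserve.
         */
        int mt = get_pageblock_migratetype(page);

        if (mt != MIGRATE_HIGHATOMIC)
                set_pageblock_migratetype(page, start_migratetype);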
      
      Link: http://lkml.kernel.org/r/1476259429-18279-2-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88ed365e
    • A
      kmemleak: fix reference to Documentation · 22901c6c
      Andreas Platschek authored
      Documentation/kmemleak.txt was moved to Documentation/dev-tools/kmemleak.rst;
      this fixes the references to point to the new location.
      
      Link: http://lkml.kernel.org/r/1476544946-18804-1-git-send-email-andreas.platschek@opentech.at
      Signed-off-by: Andreas Platschek <andreas.platschek@opentech.at>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22901c6c
    • A
      mm/hugetlb.c: use the right pte val for compare in hugetlb_cow · 3999f52e
      Aneesh Kumar K.V authored
      We cannot use the pte value used in set_pte_at() for the pte_same()
      comparison, because archs like ppc64 filter/add new pte flags in
      set_pte_at().  Instead, fetch the pte value inside hugetlb_cow().  We
      compare the pte value to make sure the pte didn't change since we
      dropped the page table lock; hugetlb_cow() gets called with the page
      table lock held, so we can take a copy of the pte value before we drop
      the lock.
      
      With hugetlbfs, we optimize the MAP_PRIVATE write fault path when there
      is no previous mapping (huge_pte_none entries) by forcing a CoW in the
      fault path.  This avoids taking an additional fault to convert a
      read-only mapping to read/write.  Here we were comparing a recently
      instantiated pte (via set_pte_at()) to the pte values from the Linux
      page table.  As explained above, on ppc64 such a pte_same() check
      returned the wrong result, resulting in us taking an additional fault
      on ppc64.
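
      A hedged sketch of the flow after the fix (argument names approximate):

        /*
         * Sketch: the pte is snapshotted while the page table lock is still
         * held, and that snapshot -- not a value freshly produced by
         * set_pte_at() -- feeds the pte_same() check once the lock has been
         * dropped and re-taken around the potentially sleeping allocation.
         */
        pte_t entry = huge_ptep_get(ptep);      /* copy taken under the PTL */

        spin_unlock(ptl);
        /* ... allocate the new page and copy the data; may sleep ... */
        spin_lock(ptl);

        if (likely(pte_same(huge_ptep_get(ptep), entry))) {
                /* pte unchanged while we slept: safe to install the new page */
        }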
      
      Fixes: 6a119eae ("powerpc/mm: Add a _PAGE_PTE bit")
      Link: http://lkml.kernel.org/r/20161018154245.18023-1-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Scott Wood <scottwood@freescale.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3999f52e
    • T
      mm/gup.c: make unnecessarily global vma_permits_fault() static · 771ab430
      Tobias Klauser authored
      Make vma_permits_fault() static as it is only used in mm/gup.c.
      
      This fixes a sparse warning.
      
      Link: http://lkml.kernel.org/r/20161017122353.31598-1-tklauser@distanz.ch
      Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      771ab430
    • S
      mm/vmscan.c: set correct defer count for shrinker · 5f33a080
      Shaohua Li authored
      Our system uses significantly more slab memory with memcg enabled on
      the latest kernel.  With a 3.10 kernel, slab uses 2G of memory, while
      with a 4.6 kernel, 6G is used.  The shrinker has a problem.  Say we
      have two memcgs for one shrinker.  In do_shrink_slab():
      
      1. Check cg1.  nr_deferred = 0, assume total_scan = 700.  batch size
         is 1024, then no memory is freed.  nr_deferred = 700
      
      2. Check cg2.  nr_deferred = 700.  Assume freeable = 20, then
         total_scan = 10 or 40.  Let's assume it's 10.  No memory is freed.
         nr_deferred = 10.
      
      The deferred share of cg1 is lost in this case.  kswapd will free no
      memory even if it runs the above steps again and again.
      
      The fix makes sure one memcg's deferred share isn't lost.
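
      A hedged sketch of the fix in do_shrink_slab() (variable names follow
      the kernel loosely):

        /*
         * Sketch: carry over only the deferred work that this invocation
         * did not actually scan, instead of stuffing the (possibly tiny)
         * total_scan computed for the current memcg back into nr_deferred.
         */
        if (next_deferred >= scanned)
                next_deferred -= scanned;
        else
                next_deferred = 0;

        if (next_deferred > 0)
                new_nr = atomic_long_add_return(next_deferred,
                                                &shrinker->nr_deferred[nid]);
        else
                new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);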
      
      Link: http://lkml.kernel.org/r/2414be961b5d25892060315fbb56bb19d81d0c07.1476227351.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f33a080
    • A
      mm/mprotect.c: don't touch single threaded PTEs which are on the right node · 3e321587
      Andi Kleen authored
      We had some problems with pages getting unmapped in single threaded
      affinitized processes.  It was tracked down to NUMA scanning.
      
      In this case it doesn't make any sense to unmap pages if the process is
      single threaded and the page is already on the node the process is
      running on.
      
      Add a check for this case into the numa protection code, and skip
      unmapping if true.
      
      In theory the process could be migrated later, but we will eventually
      rescan and unmap and migrate then.
      
      In theory this could be made more fancy: remembering this state per
      process or even per mm.  However, that would need extra tracking and be
      more complicated, and the simple check seems to work fine so far.
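
      A hedged sketch of the check added to the prot_numa path of the pte
      loop (simplified; names approximate):

        /*
         * Sketch: for a single-threaded private mapping, skip the PTE when
         * the page already sits on the node we are running on; there is
         * nothing to gain from unmapping it for NUMA hinting.
         */
        if (prot_numa) {
                struct page *page = vm_normal_page(vma, addr, oldpte);

                if (page && atomic_read(&vma->vm_mm->mm_users) == 1 &&
                    page_to_nid(page) == numa_node_id())
                        continue;
        }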
      
      [ak@linux.intel.com: v3: Minor updates from Mel. Change code layout]
        Link: http://lkml.kernel.org/r/1476382117-5440-1-git-send-email-andi@firstfloor.org
      Link: http://lkml.kernel.org/r/1476288949-20970-1-git-send-email-andi@firstfloor.org
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e321587
    • D
      mm, slab: maintain total slab count instead of active count · bf00bd34
      David Rientjes authored
      Rather than tracking the number of active slabs for each node, track the
      total number of slabs.  This is a minor improvement that avoids active
      slab tracking when a slab goes from free to partial or partial to free.
      
      For slab debugging, this also removes an explicit free count since it
      can easily be inferred by the difference in number of total objects and
      number of active objects.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1612042020110.115755@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf00bd34
    • G
      mm, slab: faster active and free stats · f728b0a5
      Greg Thelen authored
      Reading /proc/slabinfo or monitoring slabtop(1) can become very
      expensive if there are many slab caches and if there are very lengthy
      per-node partial and/or free lists.
      
      Commit 07a63c41 ("mm/slab: improve performance of gathering slabinfo
      stats") addressed the per-node full lists which showed a significant
      improvement when no objects were freed.  This patch has the same
      motivation and optimizes the remainder of the usecases where there are
      very lengthy partial and free lists.
      
      This patch maintains per-node active_slabs (full and partial) and
      free_slabs rather than iterating the lists at runtime when reading
      /proc/slabinfo.
      
      When allocating 100GB of slab from a test cache where every slab page is
      on the partial list, reading /proc/slabinfo (includes all other slab
      caches on the system) takes ~247ms on average with 48 samples.
      
      As a result of this patch, the same read takes ~0.856ms on average.
      
      [rientjes@google.com: changelog]
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1611081505240.13403@chino.kir.corp.google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f728b0a5
    • T
      mm/slab_common.c: check kmem_create_cache flags are common · e70954fd
      Thomas Garnier authored
      Verify that the kmem_cache_create() flags are not allocator specific.
      This is done before removing flags that are not available with the
      current configuration.
      
      The current kmem_cache_create() removes incorrect flags but does not
      validate that callers are using them correctly.  This change ensures
      that callers are not trying to create caches with flags that won't be
      used because they are allocator specific.
      
      Link: http://lkml.kernel.org/r/1478553075-120242-2-git-send-email-thgarnie@google.com
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e70954fd
    • A
      slub: avoid false-postive warning · 84582c8a
      Arnd Bergmann authored
      The slub allocator gives us some incorrect warnings when
      CONFIG_PROFILE_ANNOTATED_BRANCHES is set, as the unlikely() macro
      prevents it from seeing that the return code matches what it was before:
      
        mm/slub.c: In function `kmem_cache_free_bulk':
        mm/slub.c:262:23: error: `df.s' may be used uninitialized in this function [-Werror=maybe-uninitialized]
        mm/slub.c:2943:3: error: `df.cnt' may be used uninitialized in this function [-Werror=maybe-uninitialized]
        mm/slub.c:2933:4470: error: `df.freelist' may be used uninitialized in this function [-Werror=maybe-uninitialized]
        mm/slub.c:2943:3: error: `df.tail' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      I have not been able to come up with a perfect way for dealing with
      this, the three options I see are:
      
       - add a bogus initialization, which would increase the runtime overhead
       - replace unlikely() with unlikely_notrace()
       - remove the unlikely() annotation completely
      
      I checked the object code for a typical x86 configuration and the last
      two cases produce the same result, so I went for the last one, which is
      the simplest.
      
      Link: http://lkml.kernel.org/r/20161024155704.3114445-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      84582c8a
    • V
      slub: move synchronize_sched out of slab_mutex on shrink · 89e364db
      Vladimir Davydov authored
      synchronize_sched() is a heavy operation, and calling it for each cache
      owned by a memory cgroup being destroyed may take quite some time.  What
      is worse, it's currently called under the slab_mutex, stalling all work
      items doing cache creation/destruction.
      
      Actually, there isn't much point in calling synchronize_sched() for each
      cache - it's enough to call it just once - after setting cpu_partial for
      all caches and before shrinking them.  This way, we can also move it out
      of the slab_mutex, which we have to hold for iterating over the slab
      cache list.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=172991
      Link: http://lkml.kernel.org/r/0a10d71ecae3db00fb4421bcd3f82bcc911f4be4.1475329751.git.vdavydov.dev@gmail.com
      Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reported-by: Doug Smythies <dsmythies@telus.net>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89e364db
    • V
      mm: memcontrol: use special workqueue for creating per-memcg caches · 13583c3d
      Vladimir Davydov authored
      Creating a lot of cgroups at the same time might stall all worker
      threads with kmem cache creation work items, because kmem cache creation is
      done with the slab_mutex held.  The problem was amplified by commits
      801faf0d ("mm/slab: lockless decision to grow cache") in case of
      SLAB and 81ae6d03 ("mm/slub.c: replace kick_all_cpus_sync() with
      synchronize_sched() in kmem_cache_shrink()") in case of SLUB, which
      increased the maximal time the slab_mutex can be held.
      
      To prevent that from happening, let's use a special ordered single
      threaded workqueue for kmem cache creation.  This shouldn't introduce
      any functional changes regarding how kmem caches are created, as the
      work function holds the global slab_mutex during its whole runtime
      anyway, making it impossible to run more than one work at a time.  By
      using a single threaded workqueue, we just avoid creating a thread per
      each work.  Ordering is required to avoid a situation when a cgroup's
      work is put off indefinitely because there are other cgroups to serve,
      in other words to guarantee fairness.
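
      A hedged sketch of the workqueue setup (the workqueue name and the
      identifiers here are illustrative, not the exact ones added by the
      patch):

        /*
         * Sketch: an ordered workqueue allows at most one work item in
         * flight, so memcg kmem cache creation requests are serialized and
         * served in order without tying up a worker thread per request.
         */
        static struct workqueue_struct *memcg_cache_wq;

        static int __init memcg_cache_wq_init(void)
        {
                memcg_cache_wq = alloc_ordered_workqueue("memcg_kmem_cache", 0);
                return memcg_cache_wq ? 0 : -ENOMEM;
        }

        /* cache-creation work is then queued here instead of system_wq:
         *      queue_work(memcg_cache_wq, &cw->work);
         */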
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981
      Link: http://lkml.kernel.org/r/20161004131417.GC1862@esperanza
      Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reported-by: Doug Smythies <dsmythies@telus.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      13583c3d
  2. 07 December 2016, 1 commit
    • L
      shmem: fix shm fallocate() list corruption · 10d20bd2
      Linus Torvalds authored
      The shmem hole punching with fallocate(FALLOC_FL_PUNCH_HOLE) does not
      want to race with generating new pages by faulting them in.
      
      However, the wait-queue used to delay the page faulting has a serious
      problem: the wait queue head (in shmem_fallocate()) is allocated on the
      stack, and the code expects that "wake_up_all()" will make sure that all
      the queue entries are gone before the stack frame is de-allocated.
      
      And that is not at all necessarily the case.
      
      Yes, a normal wake-up sequence will remove the wait-queue entry that
      caused the wakeup (see "autoremove_wake_function()"), but the key
      wording there is "that caused the wakeup".  When there are multiple
      possible wakeup sources, the wait queue entry may well stay around.
      
      And _particularly_ in a page fault path, we may be faulting in new pages
      from user space while we also have other things going on, and there may
      well be other pending wakeups.
      
      So despite the "wake_up_all()", it's not at all guaranteed that all list
      entries are removed from the wait queue head on the stack.
      
      Fix this by introducing a new wakeup function that removes the list
      entry unconditionally, even if the target process had already woken up
      for other reasons.  Use that "synchronous" function to set up the
      waiters in shmem_fault().
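
      A hedged sketch of such a wake function, written against the 4.9-era
      wait API (wait_queue_t with a task_list member):

        /*
         * Sketch: wake the task as usual, but always unlink the wait queue
         * entry, so nothing can remain queued on a wait_queue_head_t that
         * lives on another task's stack frame.
         */
        static int synchronous_wake_function(wait_queue_t *wait,
                                             unsigned int mode, int sync,
                                             void *key)
        {
                int ret = default_wake_function(wait, mode, sync, key);

                list_del_init(&wait->task_list);
                return ret;
        }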
      
      This problem has never been seen in the wild afaik, but Dave Jones has
      reported it on and off while running trinity.  We thought we fixed the
      stack corruption with the blk-mq rq_list locking fix (commit
      7fe31130: "blk-mq: update hardware and software queues for sleeping
      alloc"), but it turns out there was _another_ stack corruptor hiding
      in the trinity runs.
      
      Vegard Nossum (also running trinity) was able to trigger this one fairly
      consistently, and made us look once again at the shmem code due to the
      faults often being in that area.
      
      Reported-and-tested-by: Vegard Nossum <vegard.nossum@oracle.com>.
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      10d20bd2
  3. 03 December 2016, 2 commits
    • M
      mm, vmscan: add cond_resched() into shrink_node_memcg() · bd041733
      Michal Hocko authored
      Boris Zhmurov has reported RCU stalls during the kswapd reclaim:
      
        INFO: rcu_sched detected stalls on CPUs/tasks:
         23-...: (22 ticks this GP) idle=92f/140000000000000/0 softirq=2638404/2638404 fqs=23
         (detected by 4, t=6389 jiffies, g=786259, c=786258, q=42115)
        Task dump for CPU 23:
        kswapd1         R  running task        0   148      2 0x00000008
        Call Trace:
          shrink_node+0xd2/0x2f0
          kswapd+0x2cb/0x6a0
          mem_cgroup_shrink_node+0x160/0x160
          kthread+0xbd/0xe0
          __switch_to+0x1fa/0x5c0
          ret_from_fork+0x1f/0x40
          kthread_create_on_node+0x180/0x180
      
      A closer code inspection has shown that we might indeed miss all the
      scheduling points in the reclaim path if no pages can be isolated from
      the LRU list.  This is a pathological case, but other reports from
      Donald Buczek have shown that we might indeed hit such a path:
      
              clusterd-989   [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
               kswapd1-86    [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
        [...]
               kswapd1-86    [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1
      
      This is a minute-long snapshot which didn't take a single page from the
      LRU.  It is not entirely clear why only 1303 pages have been scanned
      during that time (maybe there was heavy IRQ activity interfering).
      
      In any case it looks like we can really hit long periods without
      scheduling on non-preemptive kernels, so an explicit cond_resched() in
      shrink_node_memcg(), which is independent of the reclaim operation, is
      due.
      
      Link: http://lkml.kernel.org/r/20161202095841.16648-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Boris Zhmurov <bb@kernelpanic.ru>
      Tested-by: Boris Zhmurov <bb@kernelpanic.ru>
      Reported-by: Donald Buczek <buczek@molgen.mpg.de>
      Reported-by: "Christopher S. Aker" <caker@theshore.net>
      Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd041733
    • M
      mm: workingset: fix NULL ptr in count_shadow_nodes · 20ab67a5
      Michal Hocko authored
      Commit 0a6b76dd ("mm: workingset: make shadow node shrinker memcg
      aware") made the workingset shadow node shrinker memcg aware.  The
      implementation is not correct, though, because memcg_kmem_enabled()
      might be true while we are doing global reclaim, when sc->memcg can be
      NULL, which is exactly what Marek has seen:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000400
        IP: [<ffffffff8122d520>] mem_cgroup_node_nr_lru_pages+0x20/0x40
        PGD 0
        Oops: 0000 [#1] SMP
        CPU: 0 PID: 60 Comm: kswapd0 Tainted: G           O   4.8.10-12.pvops.qubes.x86_64 #1
        task: ffff880011863b00 task.stack: ffff880011868000
        RIP: mem_cgroup_node_nr_lru_pages+0x20/0x40
        RSP: e02b:ffff88001186bc70  EFLAGS: 00010293
        RAX: 0000000000000000 RBX: ffff88001186bd20 RCX: 0000000000000002
        RDX: 000000000000000c RSI: 0000000000000000 RDI: 0000000000000000
        RBP: ffff88001186bc70 R08: 28f5c28f5c28f5c3 R09: 0000000000000000
        R10: 0000000000006c34 R11: 0000000000000333 R12: 00000000000001f6
        R13: ffffffff81c6f6a0 R14: 0000000000000000 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff880013c00000(0000) knlGS:ffff880013d00000
        CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000400 CR3: 00000000122f2000 CR4: 0000000000042660
        Call Trace:
          count_shadow_nodes+0x9a/0xa0
          shrink_slab.part.42+0x119/0x3e0
          shrink_node+0x22c/0x320
          kswapd+0x32c/0x700
          kthread+0xd8/0xf0
          ret_from_fork+0x1f/0x40
        Code: 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 3b 35 dd eb b1 00 55 48 89 e5 73 2c 89 d2 31 c9 31 c0 4c 63 ce 48 0f a3 ca 73 13 <4a> 8b b4 cf 00 04 00 00 41 89 c8 4a 03 84 c6 80 00 00 00 83 c1
        RIP  mem_cgroup_node_nr_lru_pages+0x20/0x40
         RSP <ffff88001186bc70>
        CR2: 0000000000000400
        ---[ end trace 100494b9edbdfc4d ]---
      
      This patch fixes the issue by checking sc->memcg rather than
      memcg_kmem_enabled(), which is sufficient because shrink_slab() makes
      sure that only memcg-aware shrinkers get a non-NULL memcg, and only if
      memcg_kmem_enabled() is true.
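
      A hedged sketch of the check in count_shadow_nodes() (close to, but not
      necessarily verbatim, the fix):

        /*
         * Sketch: key the sizing decision on the shrink_control's memcg,
         * which is NULL during global reclaim, rather than on
         * memcg_kmem_enabled().
         */
        if (sc->memcg)
                pages = mem_cgroup_node_nr_lru_pages(sc->memcg, sc->nid,
                                                     LRU_ALL_FILE);
        else
                pages = node_present_pages(sc->nid);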
      
      Fixes: 0a6b76dd ("mm: workingset: make shadow node shrinker memcg aware")
      Link: http://lkml.kernel.org/r/20161201132156.21450-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Marek Marczykowski-Górecki <marmarek@mimuw.edu.pl>
      Tested-by: Marek Marczykowski-Górecki <marmarek@mimuw.edu.pl>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.6+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      20ab67a5
  4. 01 December 2016, 5 commits
  5. 30 November 2016, 1 commit
  6. 18 November 2016, 1 commit
    • A
      mremap: fix race between mremap() and page cleanning · 5d190420
      Aaron Lu authored
      Prior to 3.15, there was a race between zap_pte_range() and
      page_mkclean() where writes to a page could be lost.  Dave Hansen
      discovered by inspection that there is a similar race between
      move_ptes() and page_mkclean().
      
      We've been able to reproduce the issue by enlarging the race window with
      a msleep(), but have not been able to hit it without modifying the code.
      So, we think it's a real issue, but one that is difficult or impossible
      to hit in practice.
      
      The zap_pte_range() issue was fixed by commit 1cf35d47 ("mm: split
      'tlb_flush_mmu()' into tlb flushing and memory freeing parts").  This
      patch fixes the race between page_mkclean() and mremap().
      
      Here is one possible way to hit the race: suppose a process mmapped a
      file with READ | WRITE and SHARED, it has two threads and they are bound
      to 2 different CPUs, e.g.  CPU1 and CPU2.  mmap returned X, then thread
      1 did a write to addr X so that CPU1 now has a writable TLB for addr X
      on it.  Thread 2 starts mremaping from addr X to Y while thread 1
      cleaned the page and then did another write to the old addr X again.
      The 2nd write from thread 1 could succeed but the value will get lost.
      
              thread 1                           thread 2
           (bound to CPU1)                    (bound to CPU2)
      
        1: write 1 to addr X to get a
           writeable TLB on this CPU
      
                                              2: mremap starts
      
                                              3: move_ptes emptied PTE for addr X
                                                 and setup new PTE for addr Y and
                                                 then dropped PTL for X and Y
      
        4: page laundering for N by doing
           fadvise FADV_DONTNEED. When done,
           pageframe N is deemed clean.
      
        5: *write 2 to addr X
      
                                              6: tlb flush for addr X
      
        7: munmap (Y, pagesize) to make the
           page unmapped
      
        8: fadvise with FADV_DONTNEED again
           to kick the page off the pagecache
      
        9: pread the page from file to verify
           the value. If 1 is there, it means
           we have lost the written 2.
      
        *the write may or may not cause segmentation fault, it depends on
        if the TLB is still on the CPU.
      
      Please note that this is only one specific way in which the race could
      occur; it doesn't mean that the race can only occur in exactly the above
      config.  For example, more than 2 threads could be involved and
      fadvise() could be done in another thread, etc.
      
      For anonymous pages, there is a similar race between mremap() and page
      reclaim: for THP, a huge PMD is moved by mremap to a new huge PMD, then
      the new huge PMD gets unmapped/split/paged out before the TLB flush for
      the old huge PMD has happened in move_page_tables(), and we could still
      write data through the stale TLB entry.  Normal anonymous pages have a
      similar situation.
      
      To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd()
      and, if one is found, do the flush before dropping the PTL.  If we do
      the flush for every move_ptes()/move_huge_pmd() call, then we do not
      need to do the flush in move_page_tables() for the whole range.  But if
      we didn't, we still need to do the whole-range flush.
      
      Alternatively, we could track which part of the range was flushed in
      move_ptes()/move_huge_pmd() and which was not, to avoid flushing the
      whole range in move_page_tables().  But that would require multiple TLB
      flushes for the different sub-ranges and should be less efficient than
      the single whole-range flush.
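
      A hedged sketch of the move_ptes() side of the fix (simplified;
      variable names approximate):

        /*
         * Sketch: while moving each pte under the page table lock, remember
         * whether any of them was dirty; a dirty pte implies a writable TLB
         * entry may still exist on some CPU, so flush the old range before
         * the lock is dropped.
         */
        bool force_flush = false;

        pte = ptep_get_and_clear(mm, old_addr, old_pte);
        if (pte_present(pte) && pte_dirty(pte))
                force_flush = true;
        pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
        set_pte_at(mm, new_addr, new_pte, pte);

        /* ... after the loop, still holding the PTL ... */
        if (force_flush)
                flush_tlb_range(vma, old_start, old_end);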
      
      KBuild test on my Sandybridge desktop doesn't show any noticeable change.
      v4.9-rc4:
        real    5m14.048s
        user    32m19.800s
        sys     4m50.320s
      
      With this commit:
        real    5m13.888s
        user    32m19.330s
        sys     4m51.200s
      Reported-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d190420
  7. 12 November 2016, 3 commits