1. 13 December 2016, 1 commit
  2. 03 December 2016, 1 commit
    • mm, vmscan: add cond_resched() into shrink_node_memcg() · bd041733
      Committed by Michal Hocko
      Boris Zhmurov has reported RCU stalls during the kswapd reclaim:
      
        INFO: rcu_sched detected stalls on CPUs/tasks:
         23-...: (22 ticks this GP) idle=92f/140000000000000/0 softirq=2638404/2638404 fqs=23
         (detected by 4, t=6389 jiffies, g=786259, c=786258, q=42115)
        Task dump for CPU 23:
        kswapd1         R  running task        0   148      2 0x00000008
        Call Trace:
          shrink_node+0xd2/0x2f0
          kswapd+0x2cb/0x6a0
          mem_cgroup_shrink_node+0x160/0x160
          kthread+0xbd/0xe0
          __switch_to+0x1fa/0x5c0
          ret_from_fork+0x1f/0x40
          kthread_create_on_node+0x180/0x180
      
      A closer code inspection has shown that we might indeed miss all the
      scheduling points in the reclaim path if no pages can be isolated from
      the LRU list.  This is a pathological case, but other reports from Donald
      Buczek have shown that we might indeed hit such a path:
      
              clusterd-989   [009] .... 118023.654491: mm_vmscan_direct_reclaim_end: nr_reclaimed=193
               kswapd1-86    [001] dN.. 118023.987475: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239830 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.320968: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239844 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.654375: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239858 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118024.987036: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239872 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.319651: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239886 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.652248: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239900 nr_taken=0 file=1
               kswapd1-86    [001] dN.. 118025.984870: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4239914 nr_taken=0 file=1
        [...]
               kswapd1-86    [001] dN.. 118084.274403: mm_vmscan_lru_isolate: isolate_mode=0 classzone=0 order=0 nr_requested=32 nr_scanned=4241133 nr_taken=0 file=1
      
      This is a minute-long snapshot during which not a single page was taken
      from the LRU.  It is not entirely clear why only 1303 pages have been
      scanned during that time (maybe heavy IRQ activity was interfering).
      
      In any case it looks like we can really hit long periods without
      scheduling on non-preemptive kernels, so an explicit cond_resched() in
      shrink_node_memcg(), independent of the reclaim progress, is due (a
      minimal sketch follows this entry).
      
      Link: http://lkml.kernel.org/r/20161202095841.16648-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Boris Zhmurov <bb@kernelpanic.ru>
      Tested-by: Boris Zhmurov <bb@kernelpanic.ru>
      Reported-by: Donald Buczek <buczek@molgen.mpg.de>
      Reported-by: "Christopher S. Aker" <caker@theshore.net>
      Reported-by: Paul Menzel <pmenzel@molgen.mpg.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd041733
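
      A minimal sketch of the idea behind this fix, with illustrative names
      standing in for the real mm/vmscan.c internals (only cond_resched() is
      the actual kernel API; assume <linux/sched.h> is available):

        /* Illustrative only, not the kernel's shrink_node_memcg(). */
        struct sketch_lruvec { unsigned long nr_to_scan; };

        /* Stand-in for shrink_list(); may legitimately return 0 when no
         * pages can be isolated from the LRU (the pathological case). */
        static unsigned long sketch_shrink_list(struct sketch_lruvec *lruvec)
        {
                return 0;
        }

        static void sketch_shrink_node_memcg(struct sketch_lruvec *lruvec)
        {
                while (lruvec->nr_to_scan) {
                        unsigned long taken = sketch_shrink_list(lruvec);

                        if (taken > lruvec->nr_to_scan)
                                taken = lruvec->nr_to_scan;
                        lruvec->nr_to_scan -= taken;

                        /* The fix: an unconditional scheduling point, so a
                         * run of zero-progress iterations cannot hog the CPU
                         * on a non-preemptive kernel. */
                        cond_resched();
                }
        }
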
  3. 28 October 2016, 1 commit
    • mm: memcontrol: do not recurse in direct reclaim · 89a28483
      Committed by Johannes Weiner
      On 4.0, we saw a stack corruption from a page fault entering direct
      memory cgroup reclaim, calling into btrfs_releasepage(), which then
      tried to allocate an extent and recursed back into a kmem charge ad
      nauseam:
      
        [...]
        btrfs_releasepage+0x2c/0x30
        try_to_release_page+0x32/0x50
        shrink_page_list+0x6da/0x7a0
        shrink_inactive_list+0x1e5/0x510
        shrink_lruvec+0x605/0x7f0
        shrink_zone+0xee/0x320
        do_try_to_free_pages+0x174/0x440
        try_to_free_mem_cgroup_pages+0xa7/0x130
        try_charge+0x17b/0x830
        memcg_charge_kmem+0x40/0x80
        new_slab+0x2d9/0x5a0
        __slab_alloc+0x2fd/0x44f
        kmem_cache_alloc+0x193/0x1e0
        alloc_extent_state+0x21/0xc0
        __clear_extent_bit+0x2b5/0x400
        try_release_extent_mapping+0x1a3/0x220
        __btrfs_releasepage+0x31/0x70
        btrfs_releasepage+0x2c/0x30
        try_to_release_page+0x32/0x50
        shrink_page_list+0x6da/0x7a0
        shrink_inactive_list+0x1e5/0x510
        shrink_lruvec+0x605/0x7f0
        shrink_zone+0xee/0x320
        do_try_to_free_pages+0x174/0x440
        try_to_free_mem_cgroup_pages+0xa7/0x130
        try_charge+0x17b/0x830
        mem_cgroup_try_charge+0x65/0x1c0
        handle_mm_fault+0x117f/0x1510
        __do_page_fault+0x177/0x420
        do_page_fault+0xc/0x10
        page_fault+0x22/0x30
      
      On later kernels, kmem charging is opt-in rather than opt-out, and that
      particular kmem allocation in btrfs_releasepage() is no longer being
      charged and won't recurse and overrun the stack anymore.
      
      But it's not impossible for an accounted allocation to happen from the
      memcg direct reclaim context, and we needed to reproduce this crash many
      times before we even got a useful stack trace out of it.
      
      Like other direct reclaimers, mark tasks in memcg reclaim PF_MEMALLOC to
      avoid recursing into any other form of direct reclaim.  Then let
      recursive charges from PF_MEMALLOC contexts bypass the cgroup limit (the
      shape of the change is sketched after this entry).
      
      Link: http://lkml.kernel.org/r/20161025141050.GA13019@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89a28483
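
      The shape of the change, as a condensed sketch (the stand-in functions
      are illustrative; current, PF_MEMALLOC and the entry points named in the
      comments are the real kernel identifiers):

        /* Stand-ins for do_try_to_free_pages() and the charge retry path. */
        static unsigned long sketch_reclaim(void) { return 0; }
        static int sketch_reclaim_and_retry_charge(void) { return 0; }

        /* 1) In try_to_free_mem_cgroup_pages(): mark the task like any other
         *    direct reclaimer, so nested allocations can tell where they are. */
        static unsigned long sketch_memcg_reclaim(void)
        {
                unsigned long nr_reclaimed;

                current->flags |= PF_MEMALLOC;
                nr_reclaimed = sketch_reclaim();
                current->flags &= ~PF_MEMALLOC;

                return nr_reclaimed;
        }

        /* 2) In try_charge(): a charge issued while PF_MEMALLOC is set
         *    bypasses the limit instead of recursing into reclaim again. */
        static int sketch_try_charge(void)
        {
                if (current->flags & PF_MEMALLOC)
                        return 0;       /* force the charge, briefly exceed the limit */

                return sketch_reclaim_and_retry_charge();
        }
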
  4. 08 October 2016, 5 commits
    • mm: use zonelist name instead of using hardcoded index · c9634cf0
      Committed by Aneesh Kumar K.V
      Use the existing enums instead of a hardcoded index when looking at the
      zonelist.  This makes the code more readable.  No functional change is
      made by this patch (a small illustration follows this entry).
      
      Link: http://lkml.kernel.org/r/1472227078-24852-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9634cf0
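
      A small illustration of the kind of change (pg_data_t, node_zonelists
      and ZONELIST_FALLBACK are the real definitions from <linux/mmzone.h>;
      the helper is only an example, not code from the patch):

        static struct zonelist *sketch_fallback_zonelist(pg_data_t *pgdat)
        {
                /* before:  return &pgdat->node_zonelists[0];
                 * after:   the enum name says what index 0 means */
                return &pgdat->node_zonelists[ZONELIST_FALLBACK];
        }
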
    • mm, vmscan: get rid of throttle_vm_writeout · bf484383
      Committed by Michal Hocko
      throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused by
      excessive pageout activity during reclaim.  Too many pages could be put
      under writeback, so the LRUs would fill up with unreclaimable pages until
      the IO completed, and in turn the OOM killer could be invoked.
      
      There have been some important changes introduced in the reclaim path
      since then, though.  Writers are throttled by balance_dirty_pages when
      initiating buffered IO, and later, under memory pressure, direct reclaim
      is throttled by wait_iff_congested if the node is considered congested by
      dirty pages on the LRUs and the underlying bdi is congested by the queued
      IO.  kswapd is throttled as well if it encounters pages marked for
      immediate reclaim or under writeback, which signals that there are too
      many pages under writeback already.  Finally, should_reclaim_retry does
      congestion_wait if the reclaim cannot make any progress and there are too
      many dirty/writeback pages.
      
      Another important aspect is that we do not issue any IO from the direct
      reclaim context anymore.  Under a heavy parallel load this could queue a
      lot of IO which would be very scattered and thus inefficient, which would
      just make the problem worse.
      
      These three mechanisms should throttle and keep the amount of IO in a
      steady state even under heavy IO and memory pressure, so yet another
      throttling point doesn't really seem helpful.  Quite the contrary:
      Mikulas Patocka has reported that swap backed by dm-crypt doesn't work
      properly because the swapout IO cannot make sufficient progress, as the
      writeout path depends on the dm_crypt worker, which has to allocate
      memory to perform the encryption.  In order to guarantee forward progress
      it relies on the mempool allocator.  mempool_alloc(), however, prefers to
      use the underlying (usually page) allocator before it grabs objects from
      the pool.  Such an allocation can dive into memory reclaim and
      consequently into throttle_vm_writeout.  If there are too many dirty
      pages or pages under writeback, it gets throttled even though it is in
      fact the flusher needed to clear the pending pages.
      
        kworker/u4:0    D ffff88003df7f438 10488     6      2	0x00000000
        Workqueue: kcryptd kcryptd_crypt [dm_crypt]
        Call Trace:
          schedule+0x3c/0x90
          schedule_timeout+0x1d8/0x360
          io_schedule_timeout+0xa4/0x110
          congestion_wait+0x86/0x1f0
          throttle_vm_writeout+0x44/0xd0
          shrink_zone_memcg+0x613/0x720
          shrink_zone+0xe0/0x300
          do_try_to_free_pages+0x1ad/0x450
          try_to_free_pages+0xef/0x300
          __alloc_pages_nodemask+0x879/0x1210
          alloc_pages_current+0xa1/0x1f0
          new_slab+0x2d7/0x6a0
          ___slab_alloc+0x3fb/0x5c0
          __slab_alloc+0x51/0x90
          kmem_cache_alloc+0x27b/0x310
          mempool_alloc_slab+0x1d/0x30
          mempool_alloc+0x91/0x230
          bio_alloc_bioset+0xbd/0x260
          kcryptd_crypt+0x114/0x3b0 [dm_crypt]
      
      Let's just drop throttle_vm_writeout altogether.  It is not very helpful
      anymore (a rough sketch of the removed logic follows this entry).
      
      I have tried to test a potential writeback IO runaway similar to the one
      described in the original patch which introduced the throttling [1]: a
      small virtual machine (512MB RAM, 4 CPUs, 2G of swap space and the disk
      image on a rather slow NFS mount in sync mode on the host) with 8
      parallel writers, each writing 1G worth of data.  As soon as the
      pagecache fills up and direct reclaim kicks in, I start an anon memory
      consumer in a loop (allocating 300M and exiting after populating it) in
      the background to make the memory pressure even stronger and to disrupt
      the steady state of the IO.  Direct reclaim is throttled because of the
      congestion, and kswapd hits congestion_wait due to nr_immediate, but
      throttle_vm_writeout never triggers the sleep throughout the test.
      Dirty+writeback stay close to nr_dirty_threshold, with some fluctuations
      caused by the anon consumer.
      
      [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
      Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ondrej Kozina <okozina@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf484383
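
      For reference, roughly what the removed function did (simplified from
      memory, not the exact mm/page-writeback.c code): any reclaimer,
      including the kcryptd worker above, was parked until global writeback
      dropped below a slightly inflated dirty threshold.

        static void sketch_throttle_vm_writeout(void)
        {
                unsigned long background_thresh, dirty_thresh;

                for (;;) {
                        global_dirty_limits(&background_thresh, &dirty_thresh);
                        /* allow some headroom above the dirty threshold */
                        dirty_thresh += dirty_thresh / 10;

                        if (global_node_page_state(NR_WRITEBACK) <= dirty_thresh)
                                break;

                        /* this is the wait the kcryptd worker got stuck in */
                        congestion_wait(BLK_RW_ASYNC, HZ / 10);
                }
        }
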
    • mm, vmscan: make compaction_ready() more accurate and readable · fdd4c614
      Committed by Vlastimil Babka
      compaction_ready() is used during direct reclaim for costly-order
      allocations to skip reclaim for zones where compaction should be
      attempted instead.  It combines the standard compaction_suitable() check
      with its own watermark check based on the high watermark plus an extra
      gap, and the result is confusing at best.
      
      This patch attempts to better structure and document the checks
      involved.  First, compaction_suitable() can determine that the
      allocation should either succeed already, or that compaction doesn't
      have enough free pages to proceed.  The third possibility is that
      compaction has enough free pages, but we still decide to reclaim first -
      unless we are already above the high watermark plus the gap.  This does
      not mean that reclaim will actually reach this watermark during a single
      attempt; it is rather an over-reclaim protection.  So document the code
      as such.  The check for compaction_deferred() is removed completely, as
      it in fact had no proper role here.
      
      The result after this patch is mainly less confusing code (a sketch of
      the restructured check follows this entry).  We also skip some
      over-reclaim in cases where the allocation should already succeed.
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-12-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdd4c614
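
      A sketch of the restructured check, close to (though not guaranteed to
      be verbatim) the post-patch compaction_ready():

        static bool sketch_compaction_ready(struct zone *zone,
                                            struct scan_control *sc)
        {
                unsigned long watermark;

                switch (compaction_suitable(zone, sc->order, 0, sc->reclaim_idx)) {
                case COMPACT_SUCCESS:
                        return true;    /* allocation should already succeed */
                case COMPACT_SKIPPED:
                        return false;   /* compaction lacks free pages, reclaim */
                default:
                        break;          /* compaction could run, decide below */
                }

                /* Reclaim a buffer of free pages for compaction first, unless
                 * we are already above high watermark + gap.  The gap is
                 * over-reclaim protection, not a target a single reclaim
                 * attempt must reach. */
                watermark = high_wmark_pages(zone) + compact_gap(sc->order);

                return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
        }
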
    • mm, compaction: create compact_gap wrapper · 9861a62c
      Committed by Vlastimil Babka
      Compaction uses a watermark gap of (2UL << order) pages at various
      places and it's not immediately obvious why.  Abstract it through a
      compact_gap() wrapper to create a single place with a thorough
      explanation (sketched after this entry).
      
      [vbabka@suse.cz: clarify the comment of compact_gap()]
       Link: http://lkml.kernel.org/r/7b6aed1f-fdf8-2063-9ff4-bbe4de712d37@suse.cz
      Link: http://lkml.kernel.org/r/20160810091226.6709-9-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9861a62c
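
      The wrapper itself is tiny; a sketch matching the description above (the
      thorough comment in the real patch is condensed here):

        static inline unsigned long compact_gap(unsigned int order)
        {
                /* Compaction needs room both for the free pages it scans as
                 * migration targets and for the pages being migrated, so the
                 * gap above the watermark is twice the allocation size. */
                return 2UL << order;
        }
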
    • mm, compaction: rename COMPACT_PARTIAL to COMPACT_SUCCESS · cf378319
      Committed by Vlastimil Babka
      COMPACT_PARTIAL has historically meant that compaction returned after
      doing some work without fully compacting a zone.  It did not, however,
      distinguish whether compaction terminated because it succeeded in
      creating the requested high-order page.  This has changed recently, and
      now we only return COMPACT_PARTIAL when compaction thinks it succeeded,
      or when the high-order watermark check in compaction_suitable() passes
      and no compaction needs to be done.
      
      So at this point we can make the return value clearer by renaming it to
      COMPACT_SUCCESS (see the trimmed enum sketch after this entry).  The next
      patch will remove some redundant tests for success where compaction just
      returned COMPACT_SUCCESS.
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf378319
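
      A trimmed, illustrative view of the rename (the real enum compact_result
      has more members than shown here):

        enum sketch_compact_result {
                COMPACT_SKIPPED,        /* not enough free pages to compact */
                COMPACT_CONTINUE,       /* compaction should keep going */
                COMPACT_SUCCESS,        /* formerly COMPACT_PARTIAL: the requested
                                         * high-order page exists or is within reach */
        };
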
  5. 25 September 2016, 1 commit
    • mm: delete unnecessary and unsafe init_tlb_ubc() · b385d21f
      Committed by Hugh Dickins
      init_tlb_ubc() looked unnecessary to me: tlb_ubc is statically
      initialized with zeroes in the init_task, and copied from parent to
      child while it is quiescent in arch_dup_task_struct(); so I went to
      delete it.
      
      But I inserted temporary debug WARN_ONs in place of init_tlb_ubc() to
      check that it was always empty at that point, and found them firing:
      memcg reclaim can recurse into global reclaim (when allocating biosets
      for swapout, in my case) and arrive back at the init_tlb_ubc() in
      shrink_node_memcg().
      
      Resetting tlb_ubc.flush_required at that point is wrong: if the upper
      level needs a deferred TLB flush, but the lower level turns out not to,
      we miss a TLB flush.  But fortunately, that's the only part of the
      protocol that does not nest: with the initialization removed, the cpumask
      collects bits from the upper and lower levels, and the TLB is flushed
      when needed (a small sketch follows this entry).
      
      Fixes: 72b252ae ("mm: send one IPI per CPU to TLB flush all entries after unmapping pages")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: stable@vger.kernel.org # 4.3+
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b385d21f
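
      A sketch of the batch state involved and of the reset this patch removes
      (field names follow the 4.x struct tlbflush_unmap_batch; the function
      only illustrates what init_tlb_ubc() effectively did):

        struct sketch_tlb_ubc {
                struct cpumask cpumask;  /* CPUs that may hold stale TLB entries */
                bool flush_required;     /* a deferred flush is still pending */
                bool writable;
        };

        static void sketch_init_tlb_ubc(struct sketch_tlb_ubc *ubc)
        {
                /* The problem: when memcg reclaim recurses into global
                 * reclaim, the outer level may already have flush_required
                 * set; clearing it here loses a TLB flush that is still
                 * needed.  Without the reset, the cpumask simply accumulates
                 * bits from both levels and the flush happens whenever either
                 * level requires it. */
                ubc->flush_required = false;
        }
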
  6. 02 September 2016, 1 commit
  7. 03 August 2016, 1 commit
  8. 29 July 2016, 29 commits