1. 23 February 2017, 4 commits
    • oom, trace: add oom detection tracepoints · d379f01d
      Committed by Michal Hocko
      should_reclaim_retry is the central decision point for declaring the
      OOM.  It might be really useful to expose the data used for this
      decision making when debugging unexpected OOM situations.
      
      Say we have an OOM report:
      [   52.264001] mem_eater invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
      [   52.267549] CPU: 3 PID: 3148 Comm: mem_eater Tainted: G        W       4.8.0-oomtrace3-00006-gb21338b386d2 #1024
      
      Now we can check the tracepoint data to see how we have ended up in this
      situation:
             mem_eater-3148  [003] ....    52.432801: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11134 min_wmark=11084 no_progress_loops=1 wmark_check=1
             mem_eater-3148  [003] ....    52.433269: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11103 min_wmark=11084 no_progress_loops=1 wmark_check=1
             mem_eater-3148  [003] ....    52.433712: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11100 min_wmark=11084 no_progress_loops=2 wmark_check=1
             mem_eater-3148  [003] ....    52.434067: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11097 min_wmark=11084 no_progress_loops=3 wmark_check=1
             mem_eater-3148  [003] ....    52.434414: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11094 min_wmark=11084 no_progress_loops=4 wmark_check=1
             mem_eater-3148  [003] ....    52.434761: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11091 min_wmark=11084 no_progress_loops=5 wmark_check=1
             mem_eater-3148  [003] ....    52.435108: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11087 min_wmark=11084 no_progress_loops=6 wmark_check=1
             mem_eater-3148  [003] ....    52.435478: reclaim_retry_zone: node=0 zone=DMA32 order=0 reclaimable=51 available=11084 min_wmark=11084 no_progress_loops=7 wmark_check=0
             mem_eater-3148  [003] ....    52.435478: reclaim_retry_zone: node=0 zone=DMA order=0 reclaimable=0 available=1126 min_wmark=179 no_progress_loops=7 wmark_check=0
      
      The above shows that we can quickly deduce that reclaim stopped
      making any progress (see no_progress_loops increasing in each round)
      and that while there were still some 51 reclaimable pages, they
      couldn't be dropped for some reason (the vmscan tracepoints would
      tell us more about that part).  available represents reclaimable +
      free_pages scaled down by the no_progress_loops factor.  This is
      essentially an optimistic estimate of how much memory we would have
      after reclaiming everything.  It can be compared to min_wmark to get
      a rough idea, but wmark_check gives the result of the watermark
      check, which is more precise (it includes lowmem reserves, considers
      the order, etc.).  As we can see, no zone is eligible in the end,
      and that is why the OOM was triggered in this situation.
      
      Please note that higher-order requests might fail the wmark_check
      even when there is much more memory available than min_wmark - e.g.
      when the memory is fragmented.  A follow-up tracepoint will help to
      debug those situations.
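      
      As a hedged sketch (simplified, not the exact mm/page_alloc.c code),
      the relationship between the traced fields and the retry decision is
      roughly the following:
      
        /* Illustrative only: "available" is an optimistic estimate of
         * memory after reclaiming everything, scaled down the longer we
         * loop without making progress. */
        #define MAX_RECLAIM_RETRIES 16
      
        static bool zone_allows_retry(unsigned long reclaimable,
                                      unsigned long free,
                                      unsigned long min_wmark,
                                      int no_progress_loops)
        {
                unsigned long available = reclaimable + free;
      
                available -= reclaimable * no_progress_loops /
                                MAX_RECLAIM_RETRIES;
      
                /* The real wmark_check uses __zone_watermark_ok(), which
                 * also folds in lowmem reserves and the order. */
                return available > min_wmark;
        }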
      
      Link: http://lkml.kernel.org/r/20161220130135.15719-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d379f01d
    • mm, page_alloc: avoid page_to_pfn() when merging buddies · 13ad59df
      Committed by Vlastimil Babka
      On architectures that allow memory holes, page_is_buddy() has to perform
      page_to_pfn() to check for the memory hole.  After the previous patch,
      we have the pfn already available in __free_one_page(), which is the
      only caller of page_is_buddy(), so move the check there and avoid
      page_to_pfn().
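      
      A hedged sketch of the resulting shape in __free_one_page()
      (simplified from the real merge loop):
      
        /* The caller already knows pfn, so derive the buddy's pfn
         * arithmetically and do the memory-hole check here instead of
         * calling page_to_pfn() inside page_is_buddy(). */
        buddy_pfn = pfn ^ (1UL << order);  /* buddies differ in one bit */
        buddy = page + (buddy_pfn - pfn);
      
        if (!pfn_valid_within(buddy_pfn))  /* memory-hole check moved here */
                goto done_merging;
        if (!page_is_buddy(page, buddy, order))
                goto done_merging;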
      
      Link: http://lkml.kernel.org/r/20161216120009.20064-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      13ad59df
    • mm, page_alloc: don't convert pfn to idx when merging · 76741e77
      Committed by Vlastimil Babka
      In __free_one_page() we do the buddy merging arithmetic on the
      "page/buddy index", which is just the lower MAX_ORDER bits of the
      pfn.  The operations we do that affect the higher bits are bitwise
      AND and subtraction (in that order), and the final result will be
      the same with the higher bits left unmasked, as long as those bits
      are equal for both buddies - which must be true by the definition
      of a buddy.
      
      We can therefore use pfns directly instead of the "index" and skip
      the zeroing of the >MAX_ORDER bits.  This can help a bit by itself,
      although the compiler might be smart enough already.  It also helps
      the next patch avoid page_to_pfn() for the memory hole checks.
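      
      A hedged sketch of the arithmetic as used in the merge loop:
      
        /* For true buddies the bits above MAX_ORDER are identical, so
         * XOR and AND on raw pfns give the same result as on masked
         * "page indexes" -- no need to zero the high bits first. */
        buddy_pfn = __find_buddy_pfn(pfn, order);  /* pfn ^ (1UL << order) */
        combined_pfn = buddy_pfn & pfn;            /* pfn of merged block */
        page = page + (combined_pfn - pfn);        /* subtraction step */
        pfn = combined_pfn;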
      
      Link: http://lkml.kernel.org/r/20161216120009.20064-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76741e77
    • mm: throttle show_mem() from warn_alloc() · aa187507
      Committed by Michal Hocko
      Tetsuo has been stressing the OOM killer path with many parallel
      allocation requests and noticed that it is not all that hard to
      swamp the kernel logs with warn_alloc messages caused by allocation
      stalls.  Even though the allocation stall message is triggered only
      once per 10s, many different tasks might hit it at roughly the same
      time.
      
      A big part of the output is show_mem(), which can generate a lot of
      output even on small machines.  There is no reason to dump the state
      of the memory counters for every allocation stall, especially when
      multiple of them are reported in a short time period.  Chances are
      that not much has changed since the last report.  This patch simply
      rate limits show_mem() called from warn_alloc() to dump something at
      most once per second.  This should be enough to give us a clue why
      an allocation might be stalling, while a burst of warnings will not
      swamp the log with too much data.
      
      While we are at it, extract all the show_mem related handling
      (filters) into a separate function, warn_alloc_show_mem.  This makes
      the code cleaner and, as a bonus, lets us distinguish which part of
      warn_alloc got throttled by the rate limiting, since ___ratelimit
      dumps the caller.
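      
      A hedged sketch of the throttling pattern (the filter handling that
      the real warn_alloc_show_mem carries is elided here):
      
        static void warn_alloc_show_mem(gfp_t gfp_mask)
        {
                static DEFINE_RATELIMIT_STATE(show_mem_rs, HZ, 1);
      
                if (!__ratelimit(&show_mem_rs))
                        return;  /* throttled; ___ratelimit() names the caller */
      
                show_mem(SHOW_MEM_FILTER_NODES);
        }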
      
      [akpm@linux-foundation.org: reduce scope of the ratelimit_states]
      Link: http://lkml.kernel.org/r/20161215101510.9030-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa187507
  2. 25 January 2017, 5 commits
  3. 11 January 2017, 5 commits
  4. 15 December 2016, 1 commit
    • mm: add support for releasing multiple instances of a page · 44fdffd7
      Committed by Alexander Duyck
      Add a function that allows us to batch free a page that has multiple
      references outstanding.  Specifically, this function can be used to
      drop a page being used in the page frag alloc cache.  With this,
      drivers can make use of functionality similar to the page frag alloc
      cache without having to work around the fact that there is no
      function that frees multiple references.
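      
      A hedged usage sketch.  The helper is known in today's kernels as
      __page_frag_cache_drain(page, count); whether this patch introduced
      it under exactly that spelling is an assumption, and the driver
      context below is illustrative:
      
        /* Drop "refs_held" outstanding references to the cache's backing
         * page in one call instead of looping over put_page(). */
        if (cache_page) {
                __page_frag_cache_drain(cache_page, refs_held);
                cache_page = NULL;
        }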
      
      Link: http://lkml.kernel.org/r/20161110113606.76501.70752.stgit@ahduyck-blue-test.jf.intel.com
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Hans-Christian Noren Egtvedt <egtvedt@samfundet.no>
      Cc: Helge Deller <deller@gmx.de>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Keguang Zhang <keguang.zhang@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Tobias Klauser <tklauser@distanz.ch>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44fdffd7
  5. 13 December 2016, 5 commits
    • mm, page_alloc: keep pcp count and list contents in sync if struct page is corrupted · a6de734b
      Committed by Mel Gorman
      Vlastimil Babka pointed out that commit 479f854a ("mm, page_alloc:
      defer debugging checks of pages allocated from the PCP") will allow the
      per-cpu list counter to be out of sync with the per-cpu list contents if
      a struct page is corrupted.
      
      The consequence is an infinite loop if the per-cpu lists get fully
      drained by free_pcppages_bulk because all the lists are empty but the
      count is positive.  The infinite loop occurs here
      
                      do {
                              batch_free++;
                              if (++migratetype == MIGRATE_PCPTYPES)
                                      migratetype = 0;
                              list = &pcp->lists[migratetype];
                      } while (list_empty(list));
      
      What the user sees is a bad page warning followed by a soft lockup with
      interrupts disabled in free_pcppages_bulk().
      
      This patch keeps the accounting in sync.
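      
      A hedged sketch of the invariant being restored (not the literal
      diff): the counter moves in lockstep with the list.
      
        page = list_last_entry(list, struct page, lru);
        list_del(&page->lru);
        pcp->count--;                   /* decrement as the page leaves the
                                           list, not in bulk afterwards */
        if (bulkfree_pcp_prepare(page))
                continue;               /* corrupt page skipped, but count
                                           and list stay in sync */
        __free_one_page(page, page_to_pfn(page), zone, 0, mt);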
      
      Fixes: 479f854a ("mm, page_alloc: defer debugging checks of pages allocated from the PCP")
      Link: http://lkml.kernel.org/r/20161202112951.23346-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>	[4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6de734b
    • mm: make unreserve highatomic functions reliable · 29fac03b
      Committed by Minchan Kim
      Currently, unreserve_highatomic_pageblock bails out as soon as it
      finds a highatomic pageblock, regardless of whether it actually
      moved any free pages out of it.  This undermines the goal of the
      unreserve logic, which is to save a process from OOM.
      
      This patch makes the unreserve functions bail out only once they
      have moved some pages out to the !highatomic free list, to avoid
      such false positives.
      
      Another potential problem is that, due to a race between page
      freeing and the reserve highatomic function, pages can sit on the
      highatomic free list even though the pageblock's migratetype is
      !highatomic.  In that case, unreserve_highatomic_pageblock can be a
      no-op if the highatomic reserve count is less than
      pageblock_nr_pages.  We can solve this simply by draining all of the
      reserved pages before the OOM; it acts as a safeguard to exhaust the
      reserved pages before converging to OOM.
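      
      A hedged sketch of the reworked flow (heavily simplified; the
      iteration macro below is illustrative, not a real kernel API):
      
        static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
                                                   bool force)
        {
                bool drained = false;
      
                for_each_candidate_pageblock(ac, zone, page) {
                        int moved = move_freepages_block(zone, page,
                                                         ac->migratetype);
                        if (moved)
                                drained = true;
                        /* Only a real move counts as success; "force"
                         * keeps draining everything before the OOM. */
                        if (drained && !force)
                                return true;
                }
                return drained;
        }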
      
      Link: http://lkml.kernel.org/r/1476259429-18279-5-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29fac03b
    • mm: try to exhaust highatomic reserve before the OOM · 04c8716f
      Committed by Minchan Kim
      I got an OOM report from the production team on a v4.4 kernel.  The
      machine had enough free memory but failed to allocate a GFP_KERNEL
      order-0 page and finally hit the OOM kill.  It occurred during a QA
      process which launches several apps, switches between them, and so
      on.  It happened rarely; IOW, in the normal situation it was not a
      problem, but if we are unlucky enough that several apps use peak
      memory at the same time, it can happen.  If we manage to get past
      that phase, the system keeps working well.
      
      I could reproduce it easily with my test (a memory spike).  Look at
      below.
      
      The reason is that the free pages (19M) of the DMA32 zone are
      reserved for HIGHORDERATOMIC and are not unreserved before the OOM.
      
        balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
        balloon cpuset=/ mems_allowed=0
        CPU: 1 PID: 8473 Comm: balloon Tainted: G        W  OE   4.8.0-rc7-00219-g3f74c9559583-dirty #3161
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          dump_header+0x5c/0x1ce
          oom_kill_process+0x22e/0x400
          out_of_memory+0x1ac/0x210
          __alloc_pages_nodemask+0x101e/0x1040
          handle_mm_fault+0xa0a/0xbf0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:383949 inactive_anon:106724 isolated_anon:0
         active_file:15 inactive_file:44 isolated_file:0
         unevictable:0 dirty:0 writeback:24 unstable:0
         slab_reclaimable:2483 slab_unreclaimable:3326
         mapped:0 shmem:0 pagetables:1906 bounce:0
         free:6898 free_pcp:291 free_cma:0
        Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
        DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
        DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
        51131 total pagecache pages
        50795 pages in swap cache
        Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
        Free swap  = 8kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12658 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      Another example that exceeded the limit due to the race is:
      
        in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
        CPU: 0 PID: 476 Comm: in:imklog Tainted: G            E   4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          warn_alloc_failed+0xdb/0x130
          __alloc_pages_nodemask+0x4d6/0xdb0
          new_slab+0x339/0x490
          ___slab_alloc.constprop.74+0x367/0x480
          __slab_alloc.constprop.73+0x20/0x40
          __kmalloc+0x1a4/0x1e0
          alloc_indirect.isra.14+0x1d/0x50
          virtqueue_add_sgs+0x1c4/0x470
          __virtblk_add_req+0xae/0x1f0
          virtio_queue_rq+0x12d/0x290
          __blk_mq_run_hw_queue+0x239/0x370
          blk_mq_run_hw_queue+0x8f/0xb0
          blk_mq_insert_requests+0x18c/0x1a0
          blk_mq_flush_plug_list+0x125/0x140
          blk_flush_plug_list+0xc7/0x220
          blk_finish_plug+0x2c/0x40
          __do_page_cache_readahead+0x196/0x230
          filemap_fault+0x448/0x4f0
          ext4_filemap_fault+0x36/0x50
          __do_fault+0x75/0x140
          handle_mm_fault+0x84d/0xbe0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:363826 inactive_anon:121283 isolated_anon:32
         active_file:65 inactive_file:152 isolated_file:0
         unevictable:0 dirty:0 writeback:46 unstable:0
         slab_reclaimable:2778 slab_unreclaimable:3070
         mapped:112 shmem:0 pagetables:1822 bounce:0
         free:9469 free_pcp:231 free_cma:0
        Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
        DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
        DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
        2775 total pagecache pages
        2536 pages in swap cache
        Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
        Free swap  = 108744kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12648 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      It is weird for a zone to show enough free memory above the min
      watermark yet OOM on a 4K GFP_KERNEL allocation due to reserved
      highatomic pages.  As a last resort, try to unreserve the highatomic
      pages again and, if that moved pages to the non-highatomic free
      list, retry reclaim once more.
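      
      A hedged sketch of where the last resort hooks in, following the
      shape of should_reclaim_retry() in kernels of this era:
      
        /* Make sure we converge to OOM if we cannot make any progress
         * several times in a row.  Before declaring OOM, exhaust the
         * highatomic reserves and retry if anything was unreserved. */
        if (*no_progress_loops > MAX_RECLAIM_RETRIES)
                return unreserve_highatomic_pageblock(ac, true);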
      
      Link: http://lkml.kernel.org/r/1476259429-18279-4-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      04c8716f
    • mm: prevent double decrease of nr_reserved_highatomic · 4855e4a7
      Committed by Minchan Kim
      There is a race between page freeing and the unreserve highatomic path.
      
       CPU 0				    CPU 1
      
          free_hot_cold_page
            mt = get_pfnblock_migratetype
            set_pcppage_migratetype(page, mt)
          				    unreserve_highatomic_pageblock
          				    spin_lock_irqsave(&zone->lock)
          				    move_freepages_block
          				    set_pageblock_migratetype(page)
          				    spin_unlock_irqrestore(&zone->lock)
            free_pcppages_bulk
              __free_one_page(mt) <- mt is stale
      
      With the above race, a page on CPU 0 can end up on a
      non-highorderatomic free list because the pageblock's type has been
      changed.  As a result, the highorderatomic unreserve logic can
      decrease the reserved count for the same pageblock several times,
      creating a mismatch between nr_reserved_highatomic and the number of
      reserved pageblocks.
      
      So, this patch verifies whether the pageblock is highatomic and
      decreases the count only if it is.
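      
      A hedged sketch of the check (simplified from the unreserve path):
      
        /* Trust the pageblock's current migratetype, not the possibly
         * stale per-page one, before touching the counter. */
        mt = get_pageblock_migratetype(page);
        if (mt == MIGRATE_HIGHATOMIC) {
                zone->nr_reserved_highatomic -=
                        min(pageblock_nr_pages,
                            zone->nr_reserved_highatomic);
                set_pageblock_migratetype(page, ac->migratetype);
        }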
      
      Link: http://lkml.kernel.org/r/1476259429-18279-3-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4855e4a7
    • mm: don't steal highatomic pageblock · 88ed365e
      Committed by Minchan Kim
      Patch series "use up highorder free pages before OOM", v3.
      
      I got an OOM report from the production team on a v4.4 kernel.  The
      machine had enough free memory but failed to allocate a GFP_KERNEL
      order-0 page and finally hit the OOM kill.  It occurred during a QA
      process which launches several apps, switches between them, and so
      on.  It happened rarely; IOW, in the normal situation it was not a
      problem, but if we are unlucky enough that several apps use peak
      memory at the same time, it can happen.  If we manage to get past
      that phase, the system keeps working well.
      
      I could reproduce it easily with my test (a memory spike).  Look at
      below.
      
      The reason is that the free pages (19M) of the DMA32 zone are
      reserved for HIGHORDERATOMIC and are not unreserved before the OOM.
      
        balloon invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
        balloon cpuset=/ mems_allowed=0
        CPU: 1 PID: 8473 Comm: balloon Tainted: G        W  OE   4.8.0-rc7-00219-g3f74c9559583-dirty #3161
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          dump_header+0x5c/0x1ce
          oom_kill_process+0x22e/0x400
          out_of_memory+0x1ac/0x210
          __alloc_pages_nodemask+0x101e/0x1040
          handle_mm_fault+0xa0a/0xbf0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:383949 inactive_anon:106724 isolated_anon:0
         active_file:15 inactive_file:44 isolated_file:0
         unevictable:0 dirty:0 writeback:24 unstable:0
         slab_reclaimable:2483 slab_unreclaimable:3326
         mapped:0 shmem:0 pagetables:1906 bounce:0
         free:6898 free_pcp:291 free_cma:0
        Node 0 active_anon:1535796kB inactive_anon:426896kB active_file:60kB inactive_file:176kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:96kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:1418 all_unreclaimable? no
        DMA free:8188kB min:44kB low:56kB high:68kB active_anon:7648kB inactive_anon:0kB active_file:0kB inactive_file:4kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:20kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:19404kB min:5628kB low:7624kB high:9620kB active_anon:1528148kB inactive_anon:426896kB active_file:60kB inactive_file:420kB unevictable:0kB writepending:96kB present:2080640kB managed:2030092kB mlocked:0kB slab_reclaimable:9932kB slab_unreclaimable:13284kB kernel_stack:2496kB pagetables:7624kB bounce:0kB free_pcp:900kB local_pcp:112kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 2*4096kB (H) = 8192kB
        DMA32: 7*4kB (H) 8*8kB (H) 30*16kB (H) 31*32kB (H) 14*64kB (H) 9*128kB (H) 2*256kB (H) 2*512kB (H) 4*1024kB (H) 5*2048kB (H) 0*4096kB = 19484kB
        51131 total pagecache pages
        50795 pages in swap cache
        Swap cache stats: add 3532405601, delete 3532354806, find 124289150/1822712228
        Free swap  = 8kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12658 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      Another example that exceeded the limit due to the race is:
      
        in:imklog: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
        CPU: 0 PID: 476 Comm: in:imklog Tainted: G            E   4.8.0-rc7-00217-g266ef83c51e5-dirty #3135
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
        Call Trace:
          dump_stack+0x63/0x90
          warn_alloc_failed+0xdb/0x130
          __alloc_pages_nodemask+0x4d6/0xdb0
          new_slab+0x339/0x490
          ___slab_alloc.constprop.74+0x367/0x480
          __slab_alloc.constprop.73+0x20/0x40
          __kmalloc+0x1a4/0x1e0
          alloc_indirect.isra.14+0x1d/0x50
          virtqueue_add_sgs+0x1c4/0x470
          __virtblk_add_req+0xae/0x1f0
          virtio_queue_rq+0x12d/0x290
          __blk_mq_run_hw_queue+0x239/0x370
          blk_mq_run_hw_queue+0x8f/0xb0
          blk_mq_insert_requests+0x18c/0x1a0
          blk_mq_flush_plug_list+0x125/0x140
          blk_flush_plug_list+0xc7/0x220
          blk_finish_plug+0x2c/0x40
          __do_page_cache_readahead+0x196/0x230
          filemap_fault+0x448/0x4f0
          ext4_filemap_fault+0x36/0x50
          __do_fault+0x75/0x140
          handle_mm_fault+0x84d/0xbe0
          __do_page_fault+0x1dd/0x4d0
          trace_do_page_fault+0x43/0x130
          do_async_page_fault+0x1a/0xa0
          async_page_fault+0x28/0x30
        Mem-Info:
        active_anon:363826 inactive_anon:121283 isolated_anon:32
         active_file:65 inactive_file:152 isolated_file:0
         unevictable:0 dirty:0 writeback:46 unstable:0
         slab_reclaimable:2778 slab_unreclaimable:3070
         mapped:112 shmem:0 pagetables:1822 bounce:0
         free:9469 free_pcp:231 free_cma:0
        Node 0 active_anon:1455304kB inactive_anon:485132kB active_file:260kB inactive_file:608kB unevictable:0kB isolated(anon):128kB isolated(file):0kB mapped:448kB dirty:0kB writeback:184kB shmem:0kB writeback_tmp:0kB unstable:0kB pages_scanned:13641 all_unreclaimable? no
        DMA free:7748kB min:44kB low:56kB high:68kB active_anon:7944kB inactive_anon:104kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15908kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:108kB kernel_stack:0kB pagetables:4kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
        lowmem_reserve[]: 0 1952 1952 1952
        DMA32 free:30128kB min:5628kB low:7624kB high:9620kB active_anon:1447360kB inactive_anon:485028kB active_file:260kB inactive_file:608kB unevictable:0kB writepending:184kB present:2080640kB managed:2030132kB mlocked:0kB slab_reclaimable:11112kB slab_unreclaimable:12172kB kernel_stack:2400kB pagetables:7284kB bounce:0kB free_pcp:924kB local_pcp:72kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0
        DMA: 7*4kB (UE) 3*8kB (UH) 1*16kB (M) 0*32kB 2*64kB (U) 1*128kB (M) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (U) 1*4096kB (H) = 7748kB
        DMA32: 10*4kB (H) 3*8kB (H) 47*16kB (H) 38*32kB (H) 5*64kB (H) 1*128kB (H) 2*256kB (H) 3*512kB (H) 3*1024kB (H) 3*2048kB (H) 4*4096kB (H) = 30128kB
        2775 total pagecache pages
        2536 pages in swap cache
        Swap cache stats: add 206786828, delete 206784292, find 7323106/106686077
        Free swap  = 108744kB
        Total swap = 255996kB
        524158 pages RAM
        0 pages HighMem/MovableOnly
        12648 pages reserved
        0 pages cma reserved
        0 pages hwpoisoned
      
      During the investigation, I found some problems with highatomic, so
      this patch series aims to solve them; the final goal is to unreserve
      every highatomic free page before the OOM kill.
      
      This patch (of 4):
      
      In the page freeing path, the migratetype is racy, so a
      highorderatomic page can be freed into a non-highorderatomic free
      list.  If that page is later allocated, the VM can change the
      pageblock from highorderatomic to some other type.  In that case,
      highatomic pageblock accounting is broken and stops working (e.g.,
      the VM cannot reserve highorderatomic pageblocks any more even
      though it hasn't reached the 1% limit).
      
      So, this patch prohibits changing a pageblock from highatomic to any
      other type.  This is not a problem because MIGRATE_HIGHATOMIC is not
      listed in the fallback array, so stealing can only happen through
      unexpected races, which are really rare.  Also, this prohibition
      keeps highatomic pageblocks around longer, which is better for
      highorderatomic page allocation.
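      
      A hedged sketch of the rule applied in the stealing path:
      
        /* Never rewrite a highatomic pageblock's type when claiming
         * blocks for another migratetype. */
        if (get_pageblock_migratetype(page) != MIGRATE_HIGHATOMIC)
                set_pageblock_migratetype(page, start_migratetype);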
      
      Link: http://lkml.kernel.org/r/1476259429-18279-2-git-send-email-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sangseok Lee <sangseok.lee@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88ed365e
  6. 12 November 2016, 1 commit
  7. 10 November 2016, 1 commit
  8. 01 November 2016, 1 commit
    • latent_entropy: Fix wrong gcc code generation with 64 bit variables · 58bea414
      Committed by Kees Cook
      The stack frame size could grow too large when the plugin used long
      long on 32-bit architectures and the given function had too many
      basic blocks.
      
      The gcc warning was:
      
      drivers/pci/hotplug/ibmphp_ebda.c: In function 'ibmphp_access_ebda':
      drivers/pci/hotplug/ibmphp_ebda.c:409:1: warning: the frame size of 1108 bytes is larger than 1024 bytes [-Wframe-larger-than=]
      
      This switches latent_entropy from u64 to unsigned long.
      
      Thanks to PaX Team and Emese Revfy for the patch.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      58bea414
  9. 28 October 2016, 2 commits
  10. 26 October 2016, 1 commit
    • mm/page_alloc: Remove kernel address exposure in free_reserved_area() · adb1fe9a
      Committed by Josh Poimboeuf
      Linus suggested we try to remove some of the low-hanging fruit related
      to kernel address exposure in dmesg.  The only leaks I see on my local
      system are:
      
        Freeing SMP alternatives memory: 32K (ffffffff9e309000 - ffffffff9e311000)
        Freeing initrd memory: 10588K (ffffa0b736b42000 - ffffa0b737599000)
        Freeing unused kernel memory: 3592K (ffffffff9df87000 - ffffffff9e309000)
        Freeing unused kernel memory: 1352K (ffffa0b7288ae000 - ffffa0b728a00000)
        Freeing unused kernel memory: 632K (ffffa0b728d62000 - ffffa0b728e00000)
      
      Linus says:
      
        "I suspect we should just remove [the addresses in the 'Freeing'
         messages]. I'm sure they are useful in theory, but I suspect they
         were more useful back when the whole "free init memory" was
         originally done.
      
         These days, if we have a use-after-free, I suspect the init-mem
         situation is the easiest situation by far. Compared to all the dynamic
         allocations which are much more likely to show it anyway. So having
         debug output for that case is likely not all that productive."
      
      With this patch the freeing messages now look like this:
      
        Freeing SMP alternatives memory: 32K
        Freeing initrd memory: 10588K
        Freeing unused kernel memory: 3592K
        Freeing unused kernel memory: 1352K
        Freeing unused kernel memory: 632K
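      
      A hedged sketch of the resulting printk in free_reserved_area()
      (s is the region name, pages the number of freed pages):
      
        pr_info("Freeing %s memory: %ldK\n",
                s, pages << (PAGE_SHIFT - 10));
      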
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/6836ff90c45b71d38e5d4405aec56fa9e5d1d4b2.1477405374.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      adb1fe9a
  11. 11 October 2016, 2 commits
    • latent_entropy: Mark functions with __latent_entropy · 0766f788
      Committed by Emese Revfy
      The __latent_entropy gcc attribute can be used only on functions and
      variables.  If it is on a function then the plugin will instrument it for
      gathering control-flow entropy. If the attribute is on a variable then
      the plugin will initialize it with random contents.  The variable must
      be an integer, an integer array type or a structure with integer fields.
      
      These specific functions have been selected because they are init
      functions (to help gather boot-time entropy), are called at
      unpredictable times, or have variable loops, each of which provides
      some level of latent entropy.
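      
      A hedged illustration of both forms (the identifiers below are made
      up for the example):
      
        /* On a variable: the plugin fills it with build-time random
         * contents. */
        static unsigned long entropy_seed[4] __latent_entropy;
      
        /* On a function: its basic blocks are instrumented so that
         * executing them permutes the global entropy variable. */
        static int __init __latent_entropy my_early_setup(void)
        {
                return 0;
        }
      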
      Signed-off-by: Emese Revfy <re.emese@gmail.com>
      [kees: expanded commit message]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      0766f788
    • gcc-plugins: Add latent_entropy plugin · 38addce8
      Committed by Emese Revfy
      This adds a new gcc plugin named "latent_entropy".  It is designed
      to extract as much uncertainty from a running system at boot time as
      possible, hoping to capitalize on any possible variation in CPU
      operation (due to runtime data differences, hardware differences,
      SMP ordering, thermal timing variation, cache behavior, etc).
      
      At the very least, this plugin is a much more comprehensive example for
      how to manipulate kernel code using the gcc plugin internals.
      
      The need for very-early boot entropy tends to be very architecture or
      system design specific, so this plugin is more suited for those sorts
      of special cases. The existing kernel RNG already attempts to extract
      entropy from reliable runtime variation, but this plugin takes the idea to
      a logical extreme by permuting a global variable based on any variation
      in code execution (e.g. a different value (and permutation function)
      is used to permute the global based on loop count, case statement,
      if/then/else branching, etc).
      
      To do this, the plugin starts by inserting a local variable in every
      marked function. The plugin then adds logic so that the value of this
      variable is modified by randomly chosen operations (add, xor and rol) and
      random values (gcc generates separate static values for each location at
      compile time and also injects the stack pointer at runtime). The resulting
      value depends on the control flow path (e.g., loops and branches taken).
      
      Before the function returns, the plugin mixes this local variable into
      the latent_entropy global variable. The value of this global variable
      is added to the kernel entropy pool in do_one_initcall() and _do_fork(),
      though it does not credit any bytes of entropy to the pool; the contents
      of the global are just used to mix the pool.
      
      Additionally, the plugin can pre-initialize arrays with build-time
      random contents, so that two different kernel builds running on identical
      hardware will not have the same starting values.
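      
      A hedged sketch of what an instrumented function conceptually
      becomes.  The plugin works on gcc's internal representation, not C
      source, and the constants and operations (add, xor, rol) are chosen
      randomly per location; the values below are stand-ins:
      
        extern volatile unsigned long latent_entropy;  /* global pool */
      
        void marked_function(int flag)
        {
                unsigned long local_entropy = 0x8b3f92a1UL; /* build-time random */
      
                local_entropy ^= (unsigned long)&flag;  /* stack address input */
      
                if (flag)                       /* one random op per block */
                        local_entropy += 0x1b873593UL;
                else
                        local_entropy ^= 0xcc9e2d51UL;
      
                latent_entropy ^= local_entropy; /* mix back into the global */
        }
      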
      Signed-off-by: Emese Revfy <re.emese@gmail.com>
      [kees: expanded commit message and code comments]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      38addce8
  12. 08 October 2016, 12 commits