1. 19 Dec 2012: 2 commits
  2. 13 Dec 2012: 5 commits
  3. 12 Dec 2012: 8 commits
    • mm: cma: remove watermark hacks · bc357f43
      Authored by Marek Szyprowski
      Commits 2139cbe6 ("cma: fix counting of isolated pages") and
      d95ea5d1 ("cma: fix watermark checking") introduced a reliable
      method of free page accounting when memory is being allocated from
      CMA regions, so the workaround introduced earlier by commit
      49f223a9 ("mm: trigger page reclaim in alloc_contig_range() to
      stabilise watermarks") can finally be removed.
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: cma: skip watermarks check for already isolated blocks in split_free_page() · 2e30abd1
      Authored by Marek Szyprowski
      Since commit 2139cbe6 ("cma: fix counting of isolated pages"), free
      pages in isolated pageblocks are not accounted to the NR_FREE_PAGES
      counter, so the watermarks check is not required when operating on
      a free page in an isolated pageblock.
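      A minimal sketch of the resulting check in split_free_page() (shape
      inferred from the description above; the exact upstream code may
      differ):

        mt = get_pageblock_migratetype(page);
        if (mt != MIGRATE_ISOLATE) {
                /* Obey watermarks as if the page was being allocated */
                watermark = low_wmark_pages(zone) + (1 << order);
                if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
                        return 0;
        }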
      Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce putback_movable_pages() · 5733c7d1
      Authored by Rafael Aquini
      The patch "mm: introduce compaction and migration for virtio
      ballooned pages" hacks around putback_lru_pages() to allow
      ballooned pages to be re-inserted on the balloon page list as if a
      ballooned page were an LRU page.
      
      As ballooned pages are not legitimate LRU pages, this patch
      introduces putback_movable_pages() to properly cope with cases
      where the isolated pageset contains both ballooned pages and LRU
      pages, thus fixing the aforementioned inelegant hack around
      putback_lru_pages().
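      A hedged sketch of the new helper, pieced together from the
      description (treat the balloon-page predicates as assumptions):

        void putback_movable_pages(struct list_head *l)
        {
                struct page *page, *page2;

                list_for_each_entry_safe(page, page2, l, lru) {
                        list_del(&page->lru);
                        dec_zone_page_state(page, NR_ISOLATED_ANON +
                                            page_is_file_cache(page));
                        if (unlikely(balloon_page_movable(page)))
                                balloon_page_putback(page); /* balloon list */
                        else
                                putback_lru_page(page);     /* regular LRU */
                }
        }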
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug: allocate zone's pcp before onlining pages · 6dcd73d7
      Authored by Wen Congyang
      We use __free_page() to put a page into the buddy system when
      onlining pages.  __free_page() stores its NR_FREE_PAGES delta in
      the zone's pcp->vm_stat_diff, so we should allocate the zone's pcp
      before onlining pages; otherwise we will lose some free pages.
      
      [mhocko@suse.cz: make zone_pcp_reset independent of MEMORY_HOTREMOVE]
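      The ordering this enforces, as a rough sketch (the helper calls
      here are illustrative, not the literal upstream diff):

        /* 1) make sure the zone has per-cpu pagesets first */
        if (!populated_zone(zone))
                setup_zone_pageset(zone);

        /*
         * 2) only then release pages into the buddy allocator:
         * __free_page() folds its NR_FREE_PAGES delta into
         * pcp->vm_stat_diff, which must exist by this point.
         */
        online_pages_range(start_pfn, nr_pages);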
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug: fix NR_FREE_PAGES mismatch · 97d0da22
      Authored by Wen Congyang
      NR_FREE_PAGES will be wrong after offlining pages.  We currently
      adjust NR_FREE_PAGES like this:

      1. move all free pages in the buddy system to MIGRATE_ISOLATE, and
         decrease NR_FREE_PAGES

      2. don't increase NR_FREE_PAGES when a page is freed and its
         migratetype is MIGRATE_ISOLATE

      3. decrease NR_FREE_PAGES when offlining isolated pages

      4. increase NR_FREE_PAGES when undoing the isolation

      By step 3, all pages are on the MIGRATE_ISOLATE list and
      NR_FREE_PAGES is correct.  By step 4, the pages are no longer in
      the buddy system, so NR_FREE_PAGES is not changed there, yet it was
      decreased in step 3.  So NR_FREE_PAGES is wrong after offlining
      pages, and there is no need to change it in step 3.

      This patch also fixes a problem in step 2: if the migratetype is
      MIGRATE_ISOLATE, we should not increase NR_FREE_PAGES when we
      remove pages from the pcp lists.
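      A sketch of the step-2 fix in the pcp drain path (shape inferred
      from the description, not the literal diff):

        mt = get_pageblock_migratetype(page);
        __free_one_page(page, zone, 0, mt);
        /* isolated pageblocks must not count toward NR_FREE_PAGES */
        if (likely(mt != MIGRATE_ISOLATE))
                __mod_zone_page_state(zone, NR_FREE_PAGES, 1);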
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Jianguo Wu <wujianguo106@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug: skip HWPoisoned page when offlining pages · b023f468
      Authored by Wen Congyang
      PageHWPoison may be set when we offline a page via the sysfs
      interfaces /sys/devices/system/memory/soft_offline_page or
      /sys/devices/system/memory/hard_offline_page.  If we don't clear
      this flag when onlining pages, such a page cannot be freed and will
      never appear on a free list, so we can't offline it again.  We
      should therefore skip such pages when offlining.
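      Conceptually, the offline scan now steps over poisoned pages; a
      minimal sketch (the loop shape is assumed for illustration):

        for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                page = pfn_to_page(pfn);
                if (PageHWPoison(page))
                        continue;       /* leave poisoned pages alone */
                /* ... isolate and migrate the page as before ... */
        }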
      Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use IS_ENABLED(CONFIG_NUMA) instead of NUMA_BUILD · e5adfffc
      Authored by Kirill A. Shutemov
      We don't need the custom NUMA_BUILD macro anymore, since we have
      the handy IS_ENABLED().
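      One illustrative call site (the conversion is mechanical; the
      compiler still eliminates the dead branch when CONFIG_NUMA is off):

        /* before */
        if (NUMA_BUILD)
                zlc_mark_zone_full(zonelist, z);

        /* after */
        if (IS_ENABLED(CONFIG_NUMA))
                zlc_mark_zone_full(zonelist, z);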
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: show migration types in show_mem · 377e4f16
      Authored by Rabin Vincent
      This is useful for diagnosing page allocation failures in cases
      where there appear to be several free pages.

      For example, with this alloc_pages(GFP_ATOMIC) failure:
      
       swapper/0: page allocation failure: order:0, mode:0x0
       ...
       Mem-info:
       Normal per-cpu:
       CPU    0: hi:   90, btch:  15 usd:  48
       CPU    1: hi:   90, btch:  15 usd:  21
       active_anon:0 inactive_anon:0 isolated_anon:0
        active_file:0 inactive_file:84 isolated_file:0
        unevictable:0 dirty:0 writeback:0 unstable:0
        free:4026 slab_reclaimable:75 slab_unreclaimable:484
        mapped:0 shmem:0 pagetables:0 bounce:0
       Normal free:16104kB min:2296kB low:2868kB high:3444kB active_anon:0kB
       inactive_anon:0kB active_file:0kB inactive_file:336kB unevictable:0kB
       isolated(anon):0kB isolated(file):0kB present:331776kB mlocked:0kB
       dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:300kB
       slab_unreclaimable:1936kB kernel_stack:328kB pagetables:0kB unstable:0kB
       bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
       lowmem_reserve[]: 0 0
      
      Before the patch, it's hard (for me, at least) to say why all these free
      chunks weren't considered for allocation:
      
       Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 1*256kB 1*512kB
       1*1024kB 1*2048kB 3*4096kB = 16128kB
      
      After the patch, it's obvious that the reason is that all of these are
      in the MIGRATE_CMA (C) freelist:
      
       Normal: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 1*256kB (C) 1*512kB
       (C) 1*1024kB (C) 1*2048kB (C) 3*4096kB (C) = 16128kB
      Signed-off-by: Rabin Vincent <rabin.vincent@stericsson.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 11 Dec 2012: 5 commits
  5. 06 Dec 2012: 1 commit
  6. 01 Dec 2012: 3 commits
    • mm: avoid waking kswapd for THP allocations when compaction is deferred or contended · 782fd304
      Authored by Mel Gorman
      With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
      based on failures" reverted, Zdenek Kabelac reported the following
      
        Hmm,  so it's just took longer to hit the problem and observe
        kswapd0 spinning on my CPU again - it's not as endless like before -
        but still it easily eats minutes - it helps to turn off  Firefox
        or TB  (memory hungry apps) so kswapd0 stops soon - and restart
        those apps again.  (And I still have like >1GB of cached memory)
      
        kswapd0         R  running task        0    30      2 0x00000000
        Call Trace:
          preempt_schedule+0x42/0x60
          _raw_spin_unlock+0x55/0x60
          put_super+0x31/0x40
          drop_super+0x22/0x30
          prune_super+0x149/0x1b0
          shrink_slab+0xba/0x510
      
      The sysrq+m indicates the system has no swap so it'll never reclaim
      anonymous pages as part of reclaim/compaction.  That is one part of the
      problem but not the root cause as file-backed pages could also be
      reclaimed.
      
      The likely underlying problem is that kswapd is woken up or kept awake
      for each THP allocation request in the page allocator slow path.
      
      If compaction fails for the requesting process then compaction will
      be deferred for a time and direct reclaim is avoided.  However, if
      there is a storm of THP requests that are simply rejected, it will
      still be the case that kswapd is awake for a prolonged period of
      time as pgdat->kswapd_max_order is updated each time.  This is
      noticed by the main kswapd() loop and it will not call
      kswapd_try_to_sleep().  Instead it will loop, shrinking a small
      number of pages and calling shrink_slab() on each iteration.
      
      This patch defers when kswapd gets woken up for THP allocations.  For
      !THP allocations, kswapd is always woken up.  For THP allocations,
      kswapd is woken up iff the process is willing to enter into direct
      reclaim/compaction.
      
      [akpm@linux-foundation.org: fix typo in comment]
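      A rough sketch of the gating in the allocator slow path (the two
      helper names below are assumptions for illustration, not the
      upstream diff):

        /*
         * Wake kswapd for every !THP request; for THP, only when the
         * caller is actually willing to direct-reclaim/compact, so a
         * storm of rejected THP faults cannot keep kswapd spinning.
         */
        if (!is_thp_allocation(gfp_mask, order) ||
            will_enter_direct_reclaim(gfp_mask))
                wake_all_kswapd(order, zonelist, high_zoneidx,
                                zone_idx(preferred_zone));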
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Zdenek Kabelac <zkabelac@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • revert "Revert "mm: remove __GFP_NO_KSWAPD"" · a5091539
      Authored by Andrew Morton
      It appears that this patch was innocent, and we hope that "mm:
      avoid waking kswapd for THP allocations when compaction is deferred
      or contended" will fix the final kswapd-spinning cause.
      
      Cc: Zdenek Kabelac <zkabelac@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: fix return value of capture_free_page() · 58d00209
      Authored by Mel Gorman
      Commit ef6c5be6 ("fix incorrect NR_FREE_PAGES accounting (appears
      like memory leak)") fixes an NR_FREE_PAGES accounting leak but
      missed the return value, which this reviewer also missed until
      today.

      That return value is used by compaction when adding pages to a list
      of isolated free pages, and without this follow-up fix there is a
      risk of free-list corruption.
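      The follow-up is essentially a one-liner; a sketch of the corrected
      return, consistent with the accounting fixed in ef6c5be6 (quoted
      below under 22 Nov):

        /*
         * The accounting removed 1UL << alloc_order pages, so report
         * exactly that many back to the caller.
         */
        return 1UL << alloc_order;      /* was: 1UL << order */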
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 27 Nov 2012: 1 commit
    • Revert "mm: remove __GFP_NO_KSWAPD" · 82b212f4
      Authored by Mel Gorman
      With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
      based on failures" reverted, Zdenek Kabelac reported the following
      
        Hmm,  so it's just took longer to hit the problem and observe
        kswapd0 spinning on my CPU again - it's not as endless like before -
        but still it easily eats minutes - it helps to turn off  Firefox
        or TB  (memory hungry apps) so kswapd0 stops soon - and restart
        those apps again.  (And I still have like >1GB of cached memory)
      
        kswapd0         R  running task        0    30      2 0x00000000
        Call Trace:
          preempt_schedule+0x42/0x60
          _raw_spin_unlock+0x55/0x60
          put_super+0x31/0x40
          drop_super+0x22/0x30
          prune_super+0x149/0x1b0
          shrink_slab+0xba/0x510
      
      The sysrq+m indicates the system has no swap so it'll never reclaim
      anonymous pages as part of reclaim/compaction.  That is one part of the
      problem but not the root cause as file-backed pages could also be
      reclaimed.
      
      The likely underlying problem is that kswapd is woken up or kept awake
      for each THP allocation request in the page allocator slow path.
      
      If compaction fails for the requesting process then compaction will
      be deferred for a time and direct reclaim is avoided.  However, if
      there is a storm of THP requests that are simply rejected, it will
      still be the case that kswapd is awake for a prolonged period of
      time as pgdat->kswapd_max_order is updated each time.  This is
      noticed by the main kswapd() loop and it will not call
      kswapd_try_to_sleep().  Instead it will loop, shrinking a small
      number of pages and calling shrink_slab() on each iteration.
      
      The temptation is to supply a patch that checks whether kswapd was
      woken for THP and, if so, ignores pgdat->kswapd_max_order, but that
      would be a hack not backed up by proper testing.  As 3.7 is very
      close to release and this is not a bug we should release with, the
      safer path is to revert "mm: remove __GFP_NO_KSWAPD" for now and
      revisit it with a view to ironing out the balance_pgdat() logic in
      general.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Zdenek Kabelac <zkabelac@redhat.com>
      Cc: Seth Jennings <sjenning@linux.vnet.ibm.com>
      Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 22 Nov 2012: 1 commit
    • fix incorrect NR_FREE_PAGES accounting (appears like memory leak) · ef6c5be6
      Authored by Dave Hansen
      There have been some 3.7-rc reports of vm issues, including some kswapd
      bugs and, more importantly, some memory "leaks":
      
      	http://www.spinics.net/lists/linux-mm/msg46187.html
      	https://bugzilla.kernel.org/show_bug.cgi?id=50181
      
      Commit 1fb3f8ca ("mm: compaction: capture a suitable high-order page
      immediately when it is made available") took split_free_page() and
      reused it for the compaction code.  It does something curious with
      capture_free_page() (previously known as split_free_page()):
      
        int capture_free_page(struct page *page, int alloc_order,
        ...
                __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
      
        -       /* Split into individual pages */
        -       set_page_refcounted(page);
        -       split_page(page, order);
        +       if (alloc_order != order)
        +               expand(zone, page, alloc_order, order,
        +                       &zone->free_area[order], migratetype);
      
      Note that expand() puts the pages _back_ in the allocator, but it
      does not bump NR_FREE_PAGES.  We "return" 'alloc_order' worth of
      pages, but we accounted for removing 'order' in the
      __mod_zone_page_state() call.

      For the old split_page()-style use (order == alloc_order) the bug
      will not trigger.  But when called from the compaction code, where
      we occasionally get a larger page out of the buddy allocator than
      we need, we will run into this.
      
      This patch simply changes the NR_FREE_PAGES manipulation to the correct
      'alloc_order' instead of 'order'.
      
      I've been able to repeatedly trigger this in my testing
      environment.  The amount "leaked" very closely tracks the imbalance
      I see in buddy pages vs. NR_FREE_PAGES.  I have confirmed that this
      patch fixes the imbalance.
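      In diff form, the fix described above amounts to:

        -       __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << order));
        +       __mod_zone_page_state(zone, NR_FREE_PAGES, -(1UL << alloc_order));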
      Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 17 Nov 2012: 2 commits
    • revert "mm: fix-up zone present pages" · 5576646f
      Authored by Andrew Morton
      Revert commit 7f1290f2 ("mm: fix-up zone present pages")
      
      That patch tried to fix an issue in calculating
      zone->present_pages, but it caused a regression on 32-bit systems
      with HIGHMEM.  With that change, reset_zone_present_pages() resets
      all zone->present_pages to zero, and fixup_zone_present_pages() is
      called to recalculate zone->present_pages when the boot allocator
      frees core memory pages into the buddy allocator.  Because highmem
      pages are not freed by the bootmem allocator, all highmem zones'
      present_pages become zero.
      
      Various options for improving the situation are being discussed but for
      now, let's return to the 3.6 code.
      
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Tested-by: Chris Clayton <chris2553@googlemail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix hotplugged memory zone oops · bea8c150
      Authored by Hugh Dickins
      When MEMCG is configured on (even when it's disabled by boot option),
      when adding or removing a page to/from its lru list, the zone pointer
      used for stats updates is nowadays taken from the struct lruvec.  (On
      many configurations, calculating zone from page is slower.)
      
      But we have no code to update all the lruvecs (per zone, per memcg) when
      a memory node is hotadded.  Here's an extract from the oops which
      results when running numactl to bind a program to a newly onlined node:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60
        IP:  __mod_zone_page_state+0x9/0x60
        Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs
        Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0)
        Call Trace:
          __pagevec_lru_add_fn+0xdf/0x140
          pagevec_lru_move_fn+0xb1/0x100
          __pagevec_lru_add+0x1c/0x30
          lru_add_drain_cpu+0xa3/0x130
          lru_add_drain+0x2f/0x40
         ...
      
      The natural solution might be to use a memcg callback whenever memory is
      hotadded; but that solution has not been scoped out, and it happens that
      we do have an easy location at which to update lruvec->zone.  The lruvec
      pointer is discovered either by mem_cgroup_zone_lruvec() or by
      mem_cgroup_page_lruvec(), and both of those do know the right zone.
      
      So check and set lruvec->zone in those; and remove the inadequate
      attempt to set lruvec->zone from lruvec_init(), which is called before
      NODE_DATA(node) has been allocated in such cases.
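      The check-and-set in those lookup helpers is tiny; a sketch (the
      comment wording is mine):

        /*
         * A node can be onlined after the mem_cgroup was created, in
         * which case lruvec_init() ran too early; repair the
         * back-pointer at lookup time, when the right zone is known.
         */
        if (unlikely(lruvec->zone != zone))
                lruvec->zone = zone;
        return lruvec;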
      
      Ah, there was one exception.  For no particularly good reason,
      mem_cgroup_force_empty_list() has its own code for deciding the
      lruvec.  Change it to use the standard mem_cgroup_zone_lruvec() and
      mem_cgroup_get_lru_size() too.  In fact it was already safe against
      such an oops (the lru lists in danger could only be empty), but we
      are better proofed against future changes this way.
      
      I've marked this for stable (3.6) since we introduced the problem in 3.5
      (now closed to stable); but I have no idea if this is the only fix
      needed to get memory hotadd working with memcg in 3.6, and received no
      answer when I enquired twice before.
      Reported-by: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 26 Oct 2012: 2 commits
  11. 23 Oct 2012: 1 commit
  12. 09 Oct 2012: 9 commits
    • cma: decrease cc.nr_migratepages after reclaiming pagelist · beb51eaa
      Authored by Minchan Kim
      reclaim_clean_pages_from_list() reclaims clean pages before
      migration, so cc.nr_migratepages should be updated.  Currently this
      causes no problem, but the value could be wrong if we try to use it
      in the future.
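      A sketch of the update in the CMA allocation loop (assuming
      reclaim_clean_pages_from_list() returns the number of pages it
      reclaimed):

        nr_reclaimed = reclaim_clean_pages_from_list(cc.zone,
                                                     &cc.migratepages);
        /* the reclaimed pages left the list, so stop counting them */
        cc.nr_migratepages -= nr_reclaimed;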
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • CMA: migrate mlocked pages · e46a2879
      Authored by Minchan Kim
      Presently CMA cannot migrate mlocked pages, so it ends up failing
      to allocate contiguous memory space.

      This patch makes mlocked pages be migrated out.  Of course, it can
      affect realtime processes, but in the CMA use case a failure to
      allocate contiguous memory is far worse than variable access
      latency to an mlocked page while CMA is running.  If someone wants
      to make the system realtime, he shouldn't enable CMA, because
      stalls can still happen at random times.
      
      [akpm@linux-foundation.org: tweak comment text, per Mel]
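      Roughly, the CMA isolation pass now asks for unevictable pages as
      well; a sketch (the extra parameter is an assumption based on the
      description):

        /* in the CMA migration loop: also pick up mlocked pages */
        pfn = isolate_migratepages_range(cc.zone, &cc, pfn,
                                         block_end_pfn,
                                         true /* unevictable too */);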
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory-hotplug: fix zone stat mismatch · 5a883813
      Authored by Minchan Kim
      During memory hotplug I found NR_ISOLATED_[ANON|FILE] increasing,
      causing the kernel to hang.  When the system doesn't have enough
      free pages, it enters reclaim but never reclaims any pages, due to
      too_many_isolated() == true, and loops forever.

      The cause is that when we do memory hotadd after memory remove,
      __zone_pcp_update() clears the zone's ZONE_STAT_ITEMS in
      setup_pageset() although the vm_stat_diff of all CPUs still holds
      values.

      In addition, when we offline all pages of the zone, we reset them
      in zone_pcp_reset() without draining, so we lose some zone stat
      items.
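      A sketch of the reset path with draining (drain_zonestat() here is
      assumed to fold a pageset's vm_stat_diff back into the zone):

        /* fold each CPU's pending deltas back before zeroing */
        for_each_online_cpu(cpu) {
                struct per_cpu_pageset *pset;

                pset = per_cpu_ptr(zone->pageset, cpu);
                drain_zonestat(zone, pset);
        }
        /* ... now the zone counters can be reset safely ... */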
      Reviewed-by: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, numa: reclaim from all nodes within reclaim distance · 957f822a
      Authored by David Rientjes
      RECLAIM_DISTANCE represents the distance between nodes at which it
      is deemed too costly to allocate from; it's preferred to try to
      reclaim from a local zone before falling back to allocating on a
      remote node at such a distance.

      To do this, zone_reclaim_mode is set if the distance between any
      two nodes in the system is greater than this distance.  This,
      however, ends up causing the page allocator to reclaim from every
      zone regardless of its affinity.
      
      What we really want is to reclaim only from zones that are closer than
      RECLAIM_DISTANCE.  This patch adds a nodemask to each node that
      represents the set of nodes that are within this distance.  During the
      zone iteration, if the bit for a zone's node is set for the local node,
      then reclaim is attempted; otherwise, the zone is skipped.
      
      [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
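      The zone-iteration test reduces to a one-line predicate; a sketch
      consistent with the description (the nodemask field name is an
      assumption):

        static bool zone_allows_reclaim(struct zone *local_zone,
                                        struct zone *zone)
        {
                /* reclaim only from nodes recorded as close enough */
                return node_isset(local_zone->node,
                                  zone->zone_pgdat->reclaim_nodes);
        }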
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: remove free_page_mlock · a0c5e813
      Authored by Hugh Dickins
      We should no longer be seeing non-zero unevictable_pgs_mlockfreed.
      So remove free_page_mlock() from the page-freeing paths:
      __PG_MLOCKED is already in PAGE_FLAGS_CHECK_AT_FREE, so
      free_pages_check() will now be checking it, reporting "BUG: Bad
      page state" if it's ever found set.  Comment that
      UNEVICTABLE_MLOCKFREED and unevictable_pgs_mlockfreed are now
      always 0.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix-up zone present pages · 7f1290f2
      Authored by Jianguo Wu
      I think zone->present_pages indicates the pages that the buddy
      system can manage.  It should be:

      	zone->present_pages = spanned pages - absent pages - bootmem pages,

      but is now:
      	zone->present_pages = spanned pages - absent pages - memmap pages.
      
      spanned pages: total size, including holes.
      absent pages: holes.
      bootmem pages: pages used in system boot, managed by bootmem allocator.
      memmap pages: pages used by page structs.
      
      This may make zone->present_pages less than it should be.  For
      example, NUMA node 1 has ZONE_NORMAL and ZONE_MOVABLE; its memmap
      and other bootmem will be allocated from ZONE_MOVABLE, so
      ZONE_NORMAL's present_pages should be spanned pages - absent pages,
      but it currently also subtracts the memmap pages
      (free_area_init_core()), which are actually allocated from
      ZONE_MOVABLE.  When offlining all memory of such a zone,
      zone->present_pages drops below 0; because present_pages is an
      unsigned long, it actually becomes a very large integer.  That
      indirectly makes zone->watermark[WMARK_MIN] a large integer
      (setup_per_zone_wmarks()), then makes totalreserve_pages a large
      integer (calculate_totalreserve_pages()), and finally causes memory
      allocation failure when forking processes (__vm_enough_memory()).
      
      [root@localhost ~]# dmesg
      -bash: fork: Cannot allocate memory
      
      I think the bug described in
      
        http://marc.info/?l=linux-mm&m=134502182714186&w=2
      
      is also caused by wrong zone present pages.
      
      This patch intends to fix up zone->present_pages when memory is
      freed to the buddy system on the x86_64 and IA64 platforms.
      Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Reported-by: Petr Tesarik <ptesarik@suse.cz>
      Tested-by: Petr Tesarik <ptesarik@suse.cz>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: refactor out __alloc_contig_migrate_alloc() · 723a0644
      Authored by Minchan Kim
      __alloc_contig_migrate_alloc() can also be used by memory hotplug,
      so refactor it out (move it and give it a common name) into
      page_isolation.c.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: clear PG_migrate_skip based on compaction and reclaim activity · 62997027
      Authored by Mel Gorman
      Compaction caches whether a pageblock was scanned with no pages
      isolated so that such pageblocks can be skipped in the future,
      reducing scanning.  This information is not cleared by the page
      allocator based on activity, due to the impact it would have on the
      page allocator fast paths.  Hence something must clear the cache,
      or pageblocks will be skipped forever.  Currently the cache is
      cleared if there were a number of recent allocation failures and it
      has not been cleared within the last 5 seconds.  Time-based
      decisions like this are terrible, as they have no relationship to
      VM activity; it is basically a big hammer.
      
      Unfortunately, accurate heuristics would add cost to some hot paths so
      this patch implements a rough heuristic.  There are two cases where the
      cache is cleared.
      
      1. If a !kswapd process completes a compaction cycle (migrate and free
         scanner meet), the zone is marked compact_blockskip_flush. When kswapd
         goes to sleep, it will clear the cache. This is expected to be the
         common case where the cache is cleared. It does not really matter if
         kswapd happens to be asleep or going to sleep when the flag is set as
         it will be woken on the next allocation request.
      
      2. If there have been multiple failures recently and compaction just
         finished being deferred then a process will clear the cache and start a
         full scan.  This situation happens if there are multiple high-order
         allocation requests under heavy memory pressure.
      
      The clearing of the PG_migrate_skip bits and other scans is inherently
      racy but the race is harmless.  For allocations that can fail such as THP,
      they will simply fail.  For requests that cannot fail, they will retry the
      allocation.  Tests indicated that scanning rates were roughly similar to
      when the time-based heuristic was used and the allocation success rates
      were similar.
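      Case 1 boils down to kswapd flushing the cache as it goes to sleep;
      a sketch (helper names follow the compaction convention but are
      assumptions here):

        /*
         * kswapd, just before sleeping: reset the skip bits of any
         * zone that a completed compaction cycle flagged for it.
         */
        for_each_populated_zone(zone)
                if (zone->compact_blockskip_flush)
                        __reset_isolation_suitable(zone);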
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: cache if a pageblock was scanned and no pages were isolated · bb13ffeb
      Authored by Mel Gorman
      When compaction was implemented it was known that scanning could
      potentially be excessive.  The ideal was that a counter be maintained for
      each pageblock but maintaining this information would incur a severe
      penalty due to a shared writable cache line.  It has reached the point
      where the scanning costs are a serious problem, particularly on
      long-lived systems where a large process starts and allocates a large
      number of THPs at the same time.
      
      Instead of using a shared counter, this patch adds another bit to the
      pageblock flags called PG_migrate_skip.  If a pageblock is scanned by
      either migrate or free scanner and 0 pages were isolated, the pageblock is
      marked to be skipped in the future.  When scanning, this bit is checked
      before any scanning takes place and the block skipped if set.
      
      The main difficulty with a patch like this is deciding when to
      ignore the cached information.  If it's ignored too often, the
      scanning rates will still be excessive.  If the information is too
      stale, allocations that might otherwise have succeeded will fail.
      In this patch:
      
      o CMA always ignores the information
      o If the migrate and free scanner meet then the cached information will
        be discarded if it's at least 5 seconds since the last time the cache
        was discarded
      o If there are a large number of allocation failures, discard the cache.
      
      The time-based heuristic is very clumsy, but there are few choices
      for a better event.  Depending solely on multiple allocation
      failures still allows excessive scanning when THP allocations fail
      in quick succession due to memory pressure.  Waiting until memory
      pressure is relieved would cause compaction to continually fail,
      instead of using reclaim/compaction to try to allocate the page.
      The time-based mechanism is clumsy, but a better option is not
      obvious.
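      The scanner-side check is cheap; a sketch (treat the exact helper
      names as assumptions):

        static inline bool isolation_suitable(struct compact_control *cc,
                                              struct page *page)
        {
                if (cc->ignore_skip_hint)       /* e.g. CMA */
                        return true;

                return !get_pageblock_skip(page);
        }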
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Richard Davies <richard@arachsys.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Mark Brown <broonie@opensource.wolfsonmicro.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>