1. 03 Feb 2011, 4 commits
  2. 02 Feb 2011, 1 commit
  3. 28 Jan 2011, 2 commits
  4. 26 Jan 2011, 9 commits
    • memcg: fix race at move_parent around compound_order() · 52dbb905
      KAMEZAWA Hiroyuki committed
      Fix up mem_cgroup_move_parent(), which uses compound_order() in an
      asynchronous manner.  That compound_order() call may return a bogus value
      because we don't take the compound lock.  Use PageTransHuge() and
      HPAGE_SIZE instead.
      
      Also clean up mem_cgroup_move_parent():
       - remove the unnecessary initialization of a local variable.
       - rename charge_size -> page_size
       - remove an unnecessary (wrong) comment.
       - add a comment about THP.
      
      Note:
       The current design takes compound_page_lock() in the caller of
       move_account().  This should be revisited when we implement direct
       move_task of a hugepage without splitting.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      52dbb905
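      A minimal sketch of the check described in the entry above; the helper
      name stable_page_size() is illustrative, not part of the actual patch:

        /* Pick the charge size without calling compound_order(), whose
         * result can change under us when the compound lock is not held. */
        static inline unsigned long stable_page_size(struct page *page)
        {
                return PageTransHuge(page) ? HPAGE_SIZE : PAGE_SIZE;
        }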
    • memcg: bugfix check mem_cgroup_disabled() at split fixup · 3d37c4a9
      KAMEZAWA Hiroyuki committed
      mem_cgroup_disabled() should be checked when splitting a huge page.  If
      the controller is disabled, no heavy work is necessary.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d37c4a9
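      A hedged sketch of the shape of the check added by the fix above (the
      surrounding split-fixup function body is elided):

        /* THP split fixup: if the memory controller is disabled there is no
         * per-page accounting to adjust, so bail out immediately. */
        if (mem_cgroup_disabled())
                return;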
    • memcg: fix account leak at failure of memsw accounting · 01c88e2d
      KAMEZAWA Hiroyuki committed
      Commit 4b534334 ("memcg: clean up try_charge main loop") removed the
      cancellation of the charge for the case where the memory charge succeeds
      but the mem+swap charge then fails.
      
      This leaks memory usage.  Fix it.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: <stable@kernel.org>	[2.6.36+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01c88e2d
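      A hedged sketch of the pattern the fix above restores, written against
      the generic res_counter API for illustration; do_memsw, csize and the
      bypass label are placeholder names, not the exact try_charge code:

        if (res_counter_charge(&mem->res, csize, &fail_res))
                goto bypass;                             /* memory charge failed */
        if (do_memsw && res_counter_charge(&mem->memsw, csize, &fail_res)) {
                res_counter_uncharge(&mem->res, csize);  /* undo the memory charge */
                goto bypass;                             /* mem+swap charge failed */
        }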
    • mm: migration: clarify migrate_pages() comment · 28bd6578
      Minchan Kim committed
      Callers of migrate_pages() should call putback_lru_pages() to return the
      isolated pages to the LRU or to the free list.  The current comment is
      rather confusing: it says the caller always has to call it.
      
      It is clearer to point out that the caller only has to call it if
      migrate_pages()'s return value isn't zero.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28bd6578
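      A hedged usage sketch of the contract being documented above; the
      migrate_pages() argument list shown is roughly the one from this era and
      new_page() stands for a caller-supplied allocation callback:

        err = migrate_pages(&pagelist, new_page, private, false, true);
        if (err)
                /* some pages are still on the list: give them back */
                putback_lru_pages(&pagelist);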
    • mm: compaction: don't depend on HUGETLB_PAGE · 33a93877
      Andrea Arcangeli committed
      Commit 5d689240 ("thp: select CONFIG_COMPACTION if TRANSPARENT_HUGEPAGE
      enabled") causes this warning during the configuration process:
      
        warning: (TRANSPARENT_HUGEPAGE) selects COMPACTION which has unmet
        direct dependencies (EXPERIMENTAL && HUGETLB_PAGE && MMU)
      
      COMPACTION doesn't depend on HUGETLB_PAGE, and it doesn't depend on THP
      either; it is also useful for regular alloc_pages(order > 0) allocations,
      including the kernel stack during fork (THREAD_ORDER = 1).  It's always
      better to enable COMPACTION.
      
      The warning should really be an error, because we would end up with
      MIGRATION not selected, and COMPACTION doesn't work without migration
      (although it appears to build, with an inline migrate_pages() returning
      -ENOSYS).
      
      I'd also like to remove EXPERIMENTAL: compaction has been in the kernel
      for some releases (for full safety the default remains disabled, which I
      think is enough).
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Luca Tettamanti <kronos.it@gmail.com>
      Tested-by: Luca Tettamanti <kronos.it@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33a93877
    • mm/memcontrol.c: fix uninitialized variable use in mem_cgroup_move_parent() · 8dba474f
      Jesper Juhl committed
      In mm/memcontrol.c::mem_cgroup_move_parent() there's a path that jumps
      to the 'put_back' label
      
        	ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, charge);
        	if (ret || !parent)
        		goto put_back;
      
      where we'll
      
        	if (charge > PAGE_SIZE)
        		compound_unlock_irqrestore(page, flags);
      
      but we have not assigned anything to 'flags' at this point, nor have we
      called compound_lock_irqsave() (which is what sets 'flags').  The
      'put_back' label should be moved below the call to
      compound_unlock_irqrestore(), as this patch does.
      Signed-off-by: Jesper Juhl <jj@chaosbits.net>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8dba474f
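      A hedged sketch of the corrected control flow (abridged from the shape of
      mem_cgroup_move_parent(), not the verbatim diff):

        ret = __mem_cgroup_try_charge(NULL, gfp_mask, &parent, false, charge);
        if (ret || !parent)
                goto put_back;          /* the compound lock was never taken */

        if (charge > PAGE_SIZE)
                flags = compound_lock_irqsave(page);
        /* ... move the charge while the compound page is locked ... */
        if (charge > PAGE_SIZE)
                compound_unlock_irqrestore(page, flags);
        put_back:
        putback_lru_page(page);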
    • mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator · 2ff754fa
      David Rientjes committed
      Commit 0e093d99 ("writeback: do not sleep on the congestion queue if
      there are no congested BDIs or if significant congestion is not being
      encountered in the current zone") uncovered a livelock in the page
      allocator that resulted in tasks infinitely looping trying to find
      memory and kswapd running at 100% cpu.
      
      The issue occurs because drain_all_pages() is called immediately
      following direct reclaim when no memory is freed and try_to_free_pages()
      returns non-zero because all zones in the zonelist do not have their
      all_unreclaimable flag set.
      
      When draining the per-cpu pagesets back to the buddy allocator for each
      zone, the zone->pages_scanned counter is cleared to avoid erroneously
      setting zone->all_unreclaimable later.  The problem is that no pages may
      actually be drained and, thus, the unreclaimable logic never fails direct
      reclaim, so the oom killer is never invoked.
      
      This apparently only manifested after wait_iff_congested() was
      introduced and the zone was full of anonymous memory that would not
      congest the backing store.  If there were no other tasks waiting to be
      scheduled, the page allocator would loop infinitely, clearing
      zone->pages_scanned via drain_all_pages() before kswapd could scan
      enough pages to trigger the reclaim logic.  Additionally, with every
      loop of the page allocator and in the reclaim path, kswapd would be
      kicked and would end up running at 100% cpu.  In this scenario, current
      and kswapd both run continuously, with kswapd incrementing
      zone->pages_scanned and current clearing it.
      
      The problem is even more pronounced when current swaps some of its
      memory to swap cache: the all_unreclaimable logic then considers all
      active anonymous memory as well, which requires a much higher
      zone->pages_scanned value for try_to_free_pages() to return zero, a
      value that is never attainable in this scenario.
      
      Before wait_iff_congested(), the page allocator would incur an
      unconditional timeout and allow kswapd to elevate zone->pages_scanned to
      a level at which the oom killer would be called the next time it loops.
      
      The fix is to only attempt to drain pcp pages if there is actually a
      quantity to be drained.  The unconditional clearing of
      zone->pages_scanned in free_pcppages_bulk() need not be changed since
      other callers already ensure that draining will occur.  This patch
      ensures that free_pcppages_bulk() will actually free memory before
      calling into it from drain_all_pages() so zone->pages_scanned is only
      cleared if appropriate.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ff754fa
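      A hedged sketch of the idea in the drain path (illustrative, not the
      exact hunk): only call into the free path when the per-cpu list actually
      holds pages, so zone->pages_scanned is cleared only when pages really go
      back to the buddy allocator.

        local_irq_save(flags);
        if (pcp->count) {
                free_pcppages_bulk(zone, pcp->count, pcp);
                pcp->count = 0;
        }
        local_irq_restore(flags);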
    • mm: fix deferred congestion timeout if preferred zone is not allowed · f33261d7
      David Rientjes committed
      Before 0e093d99 ("writeback: do not sleep on the congestion queue if
      there are no congested BDIs or if significant congestion is not being
      encountered in the current zone"), preferred_zone was only used for NUMA
      statistics, for determining the zoneidx to allocate from given the
      requested type, and for deciding whether to utilize memory compaction.
      
      wait_iff_congested(), though, uses preferred_zone to determine if the
      congestion wait should be deferred because its dirty pages are backed by
      a congested bdi.  This incorrectly defers the timeout and busy loops in
      the page allocator with various cond_resched() calls if preferred_zone
      is not allowed in the current context, usually consuming 100% of a cpu.
      
      This patch ensures preferred_zone is an allowed zone in the fastpath
      depending on whether current is constrained by its cpuset or nodes in
      its mempolicy (when the nodemask passed is non-NULL).  This is correct
      since the fastpath allocation always passes ALLOC_CPUSET when trying to
      allocate memory.  In the slowpath, this patch resets preferred_zone to
      the first zone of the allowed type when the allocation is not
      constrained by current's cpuset, i.e.  it does not pass ALLOC_CPUSET.
      
      This patch also ensures preferred_zone is from the set of allowed nodes
      when called from within direct reclaim since allocations are always
      constrained by cpusets in this context (it is blockable).
      
      Both of these uses of cpuset_current_mems_allowed are protected by
      get_mems_allowed().
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f33261d7
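      A hedged sketch of the recomputation described above (not the verbatim
      patch): preferred_zone is picked from the nodemask the allocation is
      actually allowed to use before any congestion-wait decision is made.

        get_mems_allowed();
        first_zones_zonelist(zonelist, high_zoneidx,
                             nodemask ? : &cpuset_current_mems_allowed,
                             &preferred_zone);
        put_mems_allowed();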
    • mm/pgtable-generic.c: fix CONFIG_SWAP=n build · f95ba941
      Andrew Morton committed
      mips (and sparc32):
      
        In file included from arch/mips/include/asm/tlb.h:21,
                         from mm/pgtable-generic.c:9:
        include/asm-generic/tlb.h: In function `tlb_flush_mmu':
        include/asm-generic/tlb.h:76: error: implicit declaration of function `release_pages'
        include/asm-generic/tlb.h: In function `tlb_remove_page':
        include/asm-generic/tlb.h:105: error: implicit declaration of function `page_cache_release'
      
      free_pages_and_swap_cache() and free_page_and_swap_cache() are macros
      which call release_pages() and page_cache_release().  The obvious fix is
      to include pagemap.h in swap.h, where those macros are defined.  But that
      breaks sparc for weird reasons.
      
      So fix it within mm/pgtable-generic.c instead.
      Reported-by: Yoichi Yuasa <yuasa@linux-mips.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Sam Ravnborg <sam@ravnborg.org>
      Cc: Sergei Shtylyov <sshtylyov@mvista.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f95ba941
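      A hedged sketch of the shape of the fix inside mm/pgtable-generic.c:
      declare release_pages() and page_cache_release() for the macros that
      asm-generic/tlb.h expands, instead of pulling pagemap.h in from swap.h.

        #include <linux/pagemap.h>      /* release_pages(), page_cache_release() */
        #include <asm/tlb.h>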
  5. 21 Jan 2011, 10 commits
  6. 18 Jan 2011, 2 commits
    • Revert "mm: simplify code of swap.c" · 83896fb5
      Linus Torvalds committed
      This reverts commit d8505dee.
      
      Chris Mason ended up chasing down some page allocation errors and pages
      stuck waiting on the IO scheduler, and was able to narrow it down to two
      commits: commit 744ed144 ("mm: batch activate_page() to reduce lock
      contention") and d8505dee ("mm: simplify code of swap.c").
      
      This reverts the second one.
      Reported-and-debugged-by: Chris Mason <chris.mason@oracle.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83896fb5
    • Revert "mm: batch activate_page() to reduce lock contention" · 7a608572
      Linus Torvalds committed
      This reverts commit 744ed144.
      
      Chris Mason ended up chasing down some page allocation errors and pages
      stuck waiting on the IO scheduler, and was able to narrow it down to two
      commits: commit 744ed144 ("mm: batch activate_page() to reduce lock
      contention") and d8505dee ("mm: simplify code of swap.c").
      
      This reverts the first of them.
      Reported-and-debugged-by: Chris Mason <chris.mason@oracle.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jens Axboe <jaxboe@fusionio.com>
      Cc: linux-mm <linux-mm@kvack.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a608572
  7. 17 Jan 2011, 1 commit
    • fix non-x86 build failure in pmdp_get_and_clear · b3697c02
      Andrea Arcangeli committed
      pmdp_get_and_clear/pmdp_clear_flush/pmdp_splitting_flush were trapped as
      BUG(); they were defined only to diminish the risk of build issues on
      non-x86 archs and to be consistent with the generic pte methods
      previously defined in include/asm-generic/pgtable.h.
      
      But they are causing more trouble than they were supposed to solve, so
      it's simpler not to define them when THP is off.
      
      This also corrects the export of pmdp_splitting_flush, which is currently
      unused (x86 isn't using the generic implementation in
      mm/pgtable-generic.c, and no other arch needs it [yet]).
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3697c02
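      A hedged sketch of the direction described above (an illustration, not
      the actual header hunk): declare the pmd helpers only when THP is
      configured, rather than providing BUG() stubs.

        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
        extern pmd_t pmdp_get_and_clear(struct mm_struct *mm,
                                        unsigned long address, pmd_t *pmdp);
        extern void pmdp_splitting_flush(struct vm_area_struct *vma,
                                         unsigned long address, pmd_t *pmdp);
        #endif /* CONFIG_TRANSPARENT_HUGEPAGE */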
  8. 15 Jan 2011, 1 commit
  9. 14 Jan 2011, 10 commits
    • memcg: fix memory migration of shmem swapcache · 50de1dd9
      Daisuke Nishimura committed
      In the current implementation mem_cgroup_end_migration() decides whether
      the page migration has succeeded or not by checking "oldpage->mapping".
      
      But if we are trying to migrate a shmem swapcache page, its page->mapping
      is NULL from the beginning, so the check is invalid.  As a result,
      mem_cgroup_end_migration() assumes the migration has succeeded even when
      it hasn't, so "newpage" can be freed without being uncharged.
      
      This patch fixes it by passing mem_cgroup_end_migration() the result of
      the page migration.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      50de1dd9
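      A hedged sketch of the interface change at the migration call site: the
      caller reports whether the migration succeeded instead of memcg inferring
      it from oldpage->mapping (rc is the migration return code):

        /* rc == 0 means the page was successfully migrated */
        mem_cgroup_end_migration(mem, page, newpage, rc == 0);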
    • memcg: use [kv]zalloc[_node] rather than [kv]malloc+memset · 17295c88
      Jesper Juhl committed
      In mem_cgroup_alloc() we currently do either kmalloc() or vmalloc(),
      followed by memset() to zero the memory.  This can be achieved more
      efficiently by using kzalloc() and vzalloc().  There's also one situation
      where we can use kzalloc_node(); this is what's new in this version of
      the patch.
      Signed-off-by: Jesper Juhl <jj@chaosbits.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17295c88
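      A hedged before/after sketch of the pattern being replaced (a generic
      illustration, not the mem_cgroup_alloc() hunk itself):

        /* before: allocate, then zero by hand */
        mem = kmalloc(size, GFP_KERNEL);
        if (mem)
                memset(mem, 0, size);

        /* after: the allocator zeroes for us */
        mem = kzalloc(size, GFP_KERNEL);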
    • memcg: fix deadlock between cpuset and memcg · dfe076b0
      Daisuke Nishimura committed
      Commit b1dd693e ("memcg: avoid deadlock between move charge and
      try_charge()") can cause another deadlock on mmap_sem during task
      migration if cpuset and memcg are mounted onto the same mount point.
      
      After that commit, cgroup_attach_task() has a sequence like:
      
      cgroup_attach_task()
        ss->can_attach()
          cpuset_can_attach()
          mem_cgroup_can_attach()
            down_read(&mmap_sem)        (1)
        ss->attach()
          cpuset_attach()
            mpol_rebind_mm()
              down_write(&mmap_sem)     (2)
              up_write(&mmap_sem)
            cpuset_migrate_mm()
              do_migrate_pages()
                down_read(&mmap_sem)
                up_read(&mmap_sem)
          mem_cgroup_move_task()
            mem_cgroup_clear_mc()
              up_read(&mmap_sem)
      
      We can cause a deadlock at (2) because we've already acquired the mmap_sem at (1).
      
      But the commit itself is necessary to fix deadlocks which have existed
      before the commit like:
      
      Ex.1)
                      move charge             |        try charge
        --------------------------------------+------------------------------
          mem_cgroup_can_attach()             |  down_write(&mmap_sem)
            mc.moving_task = current          |    ..
            mem_cgroup_precharge_mc()         |  __mem_cgroup_try_charge()
              mem_cgroup_count_precharge()    |    prepare_to_wait()
                down_read(&mmap_sem)          |    if (mc.moving_task)
                -> cannot acquire the lock    |    -> true
                                              |      schedule()
                                              |      -> move charge should wake it up
      
      Ex.2)
                      move charge             |        try charge
        --------------------------------------+------------------------------
          mem_cgroup_can_attach()             |
            mc.moving_task = current          |
            mem_cgroup_precharge_mc()         |
              mem_cgroup_count_precharge()    |
                down_read(&mmap_sem)          |
                ..                            |
                up_read(&mmap_sem)            |
                                              |  down_write(&mmap_sem)
          mem_cgroup_move_task()              |    ..
            mem_cgroup_move_charge()          |  __mem_cgroup_try_charge()
              down_read(&mmap_sem)            |    prepare_to_wait()
              -> cannot acquire the lock      |    if (mc.moving_task)
                                              |    -> true
                                              |      schedule()
                                              |      -> move charge should wake it up
      
      This patch fixes all of these problems by:
      1. reverting the commit.
      2. To fix Ex.1, setting mc.moving_task after mem_cgroup_count_precharge()
         has released the mmap_sem.
      3. To fix Ex.2, using down_read_trylock() instead of down_read() in
         mem_cgroup_move_charge() and, if it fails to acquire the lock,
         cancelling all extra charges, waking up all waiters, and retrying the
         trylock.
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reported-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dfe076b0
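      A hedged sketch of fix 3 above; the cancel-and-wake step is abbreviated
      into a hypothetically named helper rather than the real one:

        retry:
        if (!down_read_trylock(&mm->mmap_sem)) {
                /* drop the precharges and wake anyone waiting on the move,
                 * so a writer holding mmap_sem cannot deadlock against us */
                cancel_precharges_and_wake_waiters();    /* hypothetical name */
                cond_resched();
                goto retry;
        }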
    • memcg: fix unit mismatch in memcg oom limit calculation · f3e8eb70
      Johannes Weiner committed
      Adding the number of swap pages to the byte limit of a memory control
      group makes no sense.  Convert the pages to bytes before adding them.
      
      The only user of this code is the OOM killer, and the way it is used means
      that the error results in a higher OOM badness value.  Since the cgroup
      limit is the same for all tasks in the cgroup, the error should have no
      practical impact at the moment.
      
      But let's not wait for future or changing users to trip over it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3e8eb70
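      A hedged sketch of the unit conversion (illustrative, not the exact
      mem_cgroup_get_limit() hunk): total_swap_pages counts pages while the
      memcg limit is in bytes, so shift before adding.

        limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
        limit += (u64)total_swap_pages << PAGE_SHIFT;    /* pages -> bytes */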
    • memcg: add lock to synchronize page accounting and migration · dbd4ea78
      KAMEZAWA Hiroyuki committed
      Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page
      accounting and migration code.  This reworks the locking scheme of
      _update_stat() and _move_account() by adding new lock bit PCG_MOVE_LOCK,
      which is always taken under IRQ disable.
      
      1. If pages are being migrated from a memcg, then updates to that
         memcg page statistics are protected by grabbing PCG_MOVE_LOCK using
         move_lock_page_cgroup().  In an upcoming commit, memcg dirty page
         accounting will be updating memcg page accounting (specifically: num
         writeback pages) from IRQ context (softirq).  Avoid a deadlocking
         nested spin lock attempt by disabling irq on the local processor when
         grabbing the PCG_MOVE_LOCK.
      
      2. The lock for update_page_stat is used only to avoid racing with
         move_account().  So IRQ awareness of lock_page_cgroup() itself is not
         a problem.  The problem is between mem_cgroup_update_page_stat() and
         mem_cgroup_move_account_page().
      
      Trade-off:
        * Changing lock_page_cgroup() to always disable IRQ (or
          local_bh) has some impacts on performance and I think
          it's bad to disable IRQ when it's not necessary.
        * adding a new lock makes move_account() slower.  Score is
          here.
      
      Performance Impact: moving a 8G anon process.
      
      Before:
      	real    0m0.792s
      	user    0m0.000s
      	sys     0m0.780s
      
      After:
      	real    0m0.854s
      	user    0m0.000s
      	sys     0m0.842s
      
      This score is bad but planned patches for optimization can reduce
      this impact.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Andrea Righi <arighi@develer.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dbd4ea78
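      A hedged sketch of what such a move-lock helper can look like (an
      illustration of the IRQ-disabled bit spin lock, not the verbatim
      page_cgroup code):

        static inline void move_lock_page_cgroup(struct page_cgroup *pc,
                                                 unsigned long *flags)
        {
                /* IRQs must be off so a stat update from softirq context on
                 * this CPU cannot spin on a lock this CPU already holds. */
                local_irq_save(*flags);
                bit_spin_lock(PCG_MOVE_LOCK, &pc->flags);
        }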
    • memcg: create extensible page stat update routines · 2a7106f2
      Greg Thelen committed
      Replace usage of the mem_cgroup_update_file_mapped() memcg
      statistic update routine with two new routines:
      * mem_cgroup_inc_page_stat()
      * mem_cgroup_dec_page_stat()
      
      As before, only the file_mapped statistic is managed.  However, these more
      general interfaces allow for new statistics to be more easily added.  New
      statistics are added with memcg dirty page accounting.
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Andrea Righi <arighi@develer.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a7106f2
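      A hedged sketch of the new call sites; the statistic item name follows
      the patch's file_mapped accounting and is an assumption here:

        /* page_add_file_rmap() side */
        mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
        /* page_remove_rmap() side */
        mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED);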
    • mm: batch activate_page() to reduce lock contention · 744ed144
      Shaohua Li committed
      The zone->lru_lock is heavily contended in workloads where
      activate_page() is used frequently.  We can batch activate_page() calls
      to reduce the lock contention.  The batched pages are added to the zone
      lists when the pool is full or when page reclaim is trying to drain them.
      
      For example, on a 4-socket, 64-CPU system, create a sparse file and 64
      processes that share-map the file.  Each process reads the whole file and
      then exits.  The process exit does unmap_vmas() and causes a lot of
      activate_page() calls.  In such a workload, we saw about a 58% reduction
      in total time with the patch below.  Other workloads with a lot of
      activate_page() calls benefit a lot too.
      
      I tested some microbenchmarks:
      case-anon-cow-rand-mt		0.58%
      case-anon-cow-rand		-3.30%
      case-anon-cow-seq-mt		-0.51%
      case-anon-cow-seq		-5.68%
      case-anon-r-rand-mt		0.23%
      case-anon-r-rand		0.81%
      case-anon-r-seq-mt		-0.71%
      case-anon-r-seq			-1.99%
      case-anon-rx-rand-mt		2.11%
      case-anon-rx-seq-mt		3.46%
      case-anon-w-rand-mt		-0.03%
      case-anon-w-rand		-0.50%
      case-anon-w-seq-mt		-1.08%
      case-anon-w-seq			-0.12%
      case-anon-wx-rand-mt		-5.02%
      case-anon-wx-seq-mt		-1.43%
      case-fork			1.65%
      case-fork-sleep			-0.07%
      case-fork-withmem		1.39%
      case-hugetlb			-0.59%
      case-lru-file-mmap-read-mt	-0.54%
      case-lru-file-mmap-read		0.61%
      case-lru-file-mmap-read-rand	-2.24%
      case-lru-file-readonce		-0.64%
      case-lru-file-readtwice		-11.69%
      case-lru-memcg			-1.35%
      case-mmap-pread-rand-mt		1.88%
      case-mmap-pread-rand		-15.26%
      case-mmap-pread-seq-mt		0.89%
      case-mmap-pread-seq		-69.72%
      case-mmap-xread-rand-mt		0.71%
      case-mmap-xread-seq-mt		0.38%
      
      The most significant are:
      case-lru-file-readtwice		-11.69%
      case-mmap-pread-rand		-15.26%
      case-mmap-pread-seq		-69.72%
      
      which use activate_page() a lot.  The others are basically within
      run-to-run variation.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      744ed144
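      A hedged sketch of the batching idea (the general pattern, not the patch
      as merged and later reverted): activations are collected in a per-CPU
      pagevec and the LRU lists are only touched once the vector fills up.

        static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);

        void activate_page(struct page *page)
        {
                struct pagevec *pvec = &get_cpu_var(activate_page_pvecs);

                page_cache_get(page);
                if (!pagevec_add(pvec, page))    /* vector full: drain it */
                        pagevec_lru_move_fn(pvec, __activate_page, NULL);
                put_cpu_var(activate_page_pvecs);
        }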
    • mm: simplify code of swap.c · d8505dee
      Shaohua Li committed
      Clean up the code and remove duplicated code.  The next patch will also
      use pagevec_lru_move_fn(), which is introduced here.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8505dee
    • mm/page_alloc.c: don't cache `current' in a local · c06b1fca
      Andrew Morton committed
      It's old-fashioned and unneeded.
      
      akpm:/usr/src/25> size mm/page_alloc.o
         text    data     bss     dec     hex filename
        39884 1241317   18808 1300009  13d629 mm/page_alloc.o (before)
        39838 1241317   18808 1299963  13d5fb mm/page_alloc.o (after)
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c06b1fca