1. 09 January 2009: 16 commits
    • memcg: fix reclaim result checks · 88700756
      Committed by KAMEZAWA Hiroyuki
      The check_under_limit logic was wrong: the check should be made against
      mem_over_limit rather than mem.
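      As a rough illustration (a sketch only, not the literal diff; it reuses the
      res_counter_check_under_limit() helper mentioned elsewhere in this log):

      	/* the retry loop must test the cgroup whose counter actually failed
      	 * (mem_over_limit), not the cgroup we started charging from (mem) */
      	if (res_counter_check_under_limit(&mem_over_limit->res))
      		continue;	/* usage dropped below the limit, retry the charge */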
      Reported-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Jan Blunck <jblunck@suse.de>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88700756
    • memcg: avoid unnecessary system-wide-oom-killer · a636b327
      Committed by KAMEZAWA Hiroyuki
      Current mmotm has a new OOM function, pagefault_out_of_memory().  It was
      added to select a bad process rather than killing current.

      When a memcg hits its limit and calls OOM at page fault time, this handler
      is called and system-wide OOM handling happens.  (This means the kernel
      panics if panic_on_oom is true....)

      To avoid this overkill, check the memcg's recent behavior before starting
      a system-wide OOM.

      This patch also fixes things to guarantee "don't account against a process
      with TIF_MEMDIE".  This is necessary for smooth OOM handling.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Jan Blunck <jblunck@suse.de>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a636b327
    • memcg: memory cgroup hierarchy feature selector · 18f59ea7
      Committed by Balbir Singh
      Don't enable multiple hierarchy support by default.  This patch introduces
      a features element that can be set to enable the nested-depth hierarchy
      feature.  The feature can only be enabled when the cgroup for which it is
      enabled has no children.
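      A sketch of the guard described above (handler name and locking are
      illustrative; the knob itself is the memory.use_hierarchy file):

      	static int mem_cgroup_hierarchy_write(struct cgroup *cont,
      					      struct cftype *cft, u64 val)
      	{
      		struct mem_cgroup *mem = mem_cgroup_from_cont(cont);

      		/* the flag may only be flipped while this cgroup is childless */
      		if (!list_empty(&cont->children))
      			return -EBUSY;
      		mem->use_hierarchy = val;
      		return 0;
      	}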
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18f59ea7
    • memcg: memory cgroup hierarchical reclaim · 6d61ef40
      Committed by Balbir Singh
      This patch introduces hierarchical reclaim.  When an ancestor goes over
      its limit, the charging routine points to the parent that is above its
      limit.  The reclaim process then starts from the last scanned child of the
      ancestor and reclaims until the ancestor goes below its limit.
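      Conceptually the walk looks something like this (a sketch only; the child
      iterator and retry bound are illustrative names, and the real reclaim helper
      has a few more parameters):

      	/* reclaim from the subtree of the cgroup that hit its limit, resuming
      	 * from the child that was scanned last time */
      	int loops = 0;

      	while (!res_counter_check_under_limit(&mem_over_limit->res)) {
      		struct mem_cgroup *victim = mem_cgroup_get_next_child(mem_over_limit);

      		try_to_free_mem_cgroup_pages(victim, gfp_mask);
      		if (++loops > MAX_HIERARCHY_RECLAIM_RETRIES)
      			break;		/* give up and let the charge fail */
      	}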
      
      [akpm@linux-foundation.org: coding-style fixes]
      [d-nishimura@mtf.biglobe.ne.jp: mem_cgroup_from_res_counter should handle both mem->res and mem->memsw]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d61ef40
    • memcg: memory cgroup resource counters for hierarchy · 28dbc4b6
      Committed by Balbir Singh
      Add support for building hierarchies in resource counters.  Cgroups allows
      us to build a deep hierarchy, but we currently don't link the resource
      counters belonging to the memory controller control groups in the same
      fashion as the corresponding cgroup entries in the cgroup hierarchy.  This
      patch provides the infrastructure for resource counters that have the same
      hierarchy as their cgroup counterparts.

      This set of patches is based on the resource counter hierarchy patches
      posted by Pavel Emelianov.

      NOTE: Building hierarchies is expensive; deeper hierarchies imply charging
      all the way up to the root.  It is known that hierarchies are expensive,
      so the user needs to be careful and aware of the trade-offs before
      creating very deep ones.
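      A sketch of what the linked counters imply for charging (simplified; the
      real interface also reports which counter in the chain failed):

      	int res_counter_charge(struct res_counter *counter, unsigned long val)
      	{
      		struct res_counter *c, *undo;
      		int ret = 0;

      		/* charging a child propagates to every ancestor, so a deep
      		 * hierarchy pays one locked update per level up to the root */
      		for (c = counter; c != NULL; c = c->parent) {
      			spin_lock(&c->lock);
      			if (c->usage + val > c->limit) {
      				spin_unlock(&c->lock);
      				ret = -ENOMEM;
      				break;
      			}
      			c->usage += val;
      			spin_unlock(&c->lock);
      		}
      		if (ret)	/* roll back the levels that were already charged */
      			for (undo = counter; undo != c; undo = undo->parent) {
      				spin_lock(&undo->lock);
      				undo->usage -= val;
      				spin_unlock(&undo->lock);
      			}
      		return ret;
      	}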
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28dbc4b6
    • memcg: add mem_cgroup_disabled() · f8d66542
      Committed by Hirokazu Takahashi
      We check whether mem_cgroup is disabled by looking at
      mem_cgroup_subsys.disabled.  I think there are now more such references
      than expected.

      Replacing
         if (mem_cgroup_subsys.disabled)
      with
         if (mem_cgroup_disabled())

      gives us a better look, I think.
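      The helper itself is tiny; roughly (a sketch of the obvious wrapper):

      	static inline int mem_cgroup_disabled(void)
      	{
      		return mem_cgroup_subsys.disabled;
      	}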
      
      [kamezawa.hiroyu@jp.fujitsu.com: fix typo]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8d66542
    • memcg: synchronized LRU · 08e552c6
      Committed by KAMEZAWA Hiroyuki
      A big patch for changing memcg's LRU semantics.

      Now,
        - page_cgroup is linked to the mem_cgroup's own LRU (per zone).

        - the LRU of page_cgroup is not synchronous with the global LRU.

        - page and page_cgroup are one-to-one and statically allocated.

        - To find which LRU a page_cgroup is on, you have to check pc->mem_cgroup, as in
          - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

        - SwapCache is handled.

      And when we handle the LRU list of page_cgroup, we do the following.
      
      	pc = lookup_page_cgroup(page);
      	lock_page_cgroup(pc); .....................(1)
      	mz = page_cgroup_zoneinfo(pc);
      	spin_lock(&mz->lru_lock);
      	.....add to LRU
      	spin_unlock(&mz->lru_lock);
      	unlock_page_cgroup(pc);
      
      But (1) is a spin lock and we have to worry about deadlock with
      zone->lru_lock.  So trylock() is used at (1) for now.  Without (1), we
      can't trust that "mz" is correct.

      This is an attempt to remove this dirty nesting of locks.
      This patch changes mz->lru_lock to be zone->lru_lock.
      Then the above sequence can be written as
      
              spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
      	mem_cgroup_add/remove/etc_lru() {
      		pc = lookup_page_cgroup(page);
      		mz = page_cgroup_zoneinfo(pc);
      		if (PageCgroupUsed(pc)) {
      			....add to LRU
      		}
              spin_unlock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
      
      This is much simpler.
      (*) We're safe even if we don't take lock_page_cgroup(pc). Because..
          1. When pc->mem_cgroup can be modified.
             - at charge.
             - at account_move().
          2. at charge
             the PCG_USED bit is not set before pc->mem_cgroup is fixed.
          3. at account_move()
             the page is isolated and not on LRU.
      
      Pros.
        - easier maintenance.
        - memcg can make use of the laziness of pagevec.
        - we don't have to duplicate the LRU/Active/Unevictable bits in page_cgroup.
        - the LRU status of memcg is synchronized with the global LRU's.
        - the number of locks is reduced.
        - account_move() is simplified very much.
      Cons.
        - may increase the cost of LRU rotation.
          (no impact if memcg is not configured.)
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08e552c6
    • memcg: mem+swap controller core · 8c7c6e34
      Committed by KAMEZAWA Hiroyuki
      This patch implements a per-cgroup limit for the usage of memory+swap.
      Because there is SwapCache, double counting of swap-cache and swap-entry
      is avoided.

      The mem+swap controller works as follows.
        - memory usage is limited by memory.limit_in_bytes.
        - memory + swap usage is limited by memory.memsw.limit_in_bytes.

      This has the following benefits.
        - A user can limit the total resource usage of mem+swap.

          Without this, because the memory resource controller doesn't take care
          of swap usage, a process can exhaust all the swap (by a memory leak).
          We can avoid this case.

          Also, swap is a shared resource but it cannot be reclaimed (go back to
          memory) until it's used.  This characteristic can be trouble when the
          memory is divided into parts by cpuset or memcg.
          Assume group A and group B.
          After some application executes, the system can be..
      
          Group A -- very large free memory space but occupy 99% of swap.
          Group B -- under memory shortage but cannot use swap...it's nearly full.
      
          Ability to set appropriate swap limit for each group is required.
      
      Maybe someone wonders, "why not just swap, but mem+swap?"
      
        - The global LRU(kswapd) can swap out arbitrary pages. Swap-out means
          to move account from memory to swap...there is no change in usage of
          mem+swap.
      
          In other words, when we want to limit the usage of swap without affecting
          global LRU, mem+swap limit is better than just limiting swap.
      
      Accounting target information is stored in swap_cgroup, which is a
      per-swap-entry record.

      Charge is done as follows.
        map
          - charge  page and memsw.
      
        unmap
          - uncharge page/memsw if not SwapCache.
      
        swap-out (__delete_from_swap_cache)
          - uncharge page
          - record mem_cgroup information to swap_cgroup.
      
        swap-in (do_swap_page)
          - charged as page and memsw.
            record in swap_cgroup is cleared.
            memsw accounting is decremented.
      
        swap-free (swap_free())
          - if swap entry is freed, memsw is uncharged by PAGE_SIZE.
      
      Some people work in never-swap environments and consider swap to be
      something bad.  For such people, this mem+swap controller extension is just
      overhead, so the overhead can be avoided by a config or boot option.
      (See Kconfig; the details are not in this patch.)

      TODO:
       - maybe more optimization can be done in the swap-in path (but it is not
         very safe).  We just do simple accounting at this stage.
      
      [nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
      [hugh@veritas.com: memswap controller core swapcache fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c7c6e34
    • memcg: mem+swap controller Kconfig · c077719b
      Committed by KAMEZAWA Hiroyuki
      Config and control variable for mem+swap controller.
      
      This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
      (memory resource controller swap extension.)
      
      For accounting swap, it's obvious that we have to use additional memory to
      remember "who uses swap".  This adds more overhead, so it's better to offer
      users a choice.  This patch adds two ways to enable or disable the swap
      extension:
        - CONFIG
        - boot option
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c077719b
    • memcg: handle swap caches · d13d1443
      Committed by KAMEZAWA Hiroyuki
      SwapCache support for the memory resource controller (memcg).

      Before the mem+swap controller, memcg itself should handle SwapCache in a
      proper way.  This is cut out from that work.

      In the current memcg, SwapCache is simply leaked and the user can create
      tons of SwapCache.  This is a leak of accounting and should be handled.

      SwapCache accounting is done as follows.
      
        charge (anon)
      	- charged when it's mapped.
      	  (because of readahead, charge at add_to_swap_cache() is not sane)
        uncharge (anon)
      	- uncharged when it's dropped from swapcache and fully unmapped.
      	  means it's not uncharged at unmap.
      	  Note: delete from swap cache at swap-in is done after rmap information
      	        is established.
        charge (shmem)
      	- charged at swap-in. this prevents charge at add_to_page_cache().
      
        uncharge (shmem)
      	- uncharged when it's dropped from swapcache and not on shmem's
      	  radix-tree.
      
        At migration, the check against the 'old page' is modified to handle shmem.

      Compared with the old version that was discussed (and caused trouble), we
      have the advantages of
        - the PCG_USED bit.
        - simple migration handling.

      So the situation is much easier than several months ago, maybe.
      
      [hugh@veritas.com: memcg: handle swap caches build fix]
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d13d1443
    • memcg: new force_empty to free pages under group · c1e862c1
      Committed by KAMEZAWA Hiroyuki
      With memcg-move-all-accounts-to-parent-at-rmdir.patch, there is no leak of
      memory usage and force_empty was removed.

      This patch adds "force_empty" again, in a reasonable manner.

      The memory.force_empty file works when

        #echo 0 (or some value) > memory.force_empty

      is written, and has the following behavior.

        1. It only works when there are no tasks in this cgroup.
        2. It frees as many pages under this cgroup as possible.
        3. Pages which cannot be freed will be moved up to the parent.
        4. The memcg will therefore be empty after the above echo returns.

      This is much better behavior than the old "force_empty", which just forgot
      all accounts.  This patch also checks signal_pending(), so the above "echo"
      can be stopped with Ctrl-C.
      
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1e862c1
    • memcg: reduce size of mem_cgroup by using nr_cpu_ids · c8dad2bb
      Committed by Jan Blunck
      As Jan Blunck <jblunck@suse.de> pointed out, allocating the per-cpu stat
      for memcg at NR_CPUS size is not good.

      This patch bases mem_cgroup's cpustat allocation on nr_cpu_ids rather than
      NR_CPUS.
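      Illustratively (struct and field names simplified; the real allocator also
      switches between kmalloc and vmalloc based on size):

      	/* before: one stat slot for every compile-time possible CPU */
      	size = sizeof(struct mem_cgroup) +
      	       NR_CPUS * sizeof(struct mem_cgroup_stat_cpu);

      	/* after: only as many slots as this machine can ever bring online */
      	size = sizeof(struct mem_cgroup) +
      	       nr_cpu_ids * sizeof(struct mem_cgroup_stat_cpu);
      	mem = kzalloc(size, GFP_KERNEL);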
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8dad2bb
    • memcg: move all acccounting to parent at rmdir() · f817ed48
      Committed by KAMEZAWA Hiroyuki
      This patch provides a function to move account information of a page
      between mem_cgroups and rewrite force_empty to make use of this.
      
      This moving of page_cgroup is done under
       - lru_lock of source/destination mem_cgroup is held.
       - lock_page_cgroup() is held.
      
      Then, a routine which touches pc->mem_cgroup without lock_page_cgroup()
      should confirm whether pc->mem_cgroup is still valid.  Typical code can be
      as follows.
      
      (while page is not under lock_page())
      	mem = pc->mem_cgroup;
      	mz = page_cgroup_zoneinfo(pc)
      	spin_lock_irqsave(&mz->lru_lock);
      	if (pc->mem_cgroup == mem)
      		...../* some list handling */
      	spin_unlock_irqrestore(&mz->lru_lock);
      
      Of course, better way is
      	lock_page_cgroup(pc);
      	....
      	unlock_page_cgroup(pc);
      
      But you should confirm the nest of lock and avoid deadlock.
      
      If you handle page_cgroup from mem_cgroup's LRU under mz->lru_lock,
      you don't have to worry about what pc->mem_cgroup points to.
      Moved pages are added to the head of the LRU, not to the tail.
      
      Expected users of this routine is:
        - force_empty (rmdir)
        - moving tasks between cgroup (for moving account information.)
        - hierarchy (maybe useful.)
      
      force_empty (rmdir) uses this move_account() and moves pages to its parent.
      This "move" will not cause OOM (I added an "oom" parameter to try_charge()).

      If the parent is busy (not enough memory), force_empty calls
      try_to_free_page() and reduces usage.

      The purpose of this behavior is to
        - fix the "forget all" behavior of force_empty and avoid a leak of
          accounting.
        - keep pages in memory as much as possible, by "moving first, freeing
          only if necessary".

      Adding a switch to change the behavior of force_empty to
        - free first, move if necessary
        - free all; if there are mlocked/busy pages, return -EBUSY
      is under consideration.  (I'll add it if someone requests.)
      
      This patch also removes memory.force_empty file, a brutal debug-only interface.
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f817ed48
    • memcg: simple migration handling · 01b1ae63
      Committed by KAMEZAWA Hiroyuki
      Currently, management of the "charge" under page migration is done in the
      following manner.  (Assume we migrate page contents from oldpage to
      newpage.)

       before
        - "newpage" is charged before migration.
       on success
        - "oldpage" is uncharged somewhere (unmap, radix-tree replace).
       on failure
        - "newpage" is uncharged.
        - "oldpage" is charged if necessary (*1).

      But (*1) is not reliable....because of GFP_ATOMIC.

      This patch tries to change the behavior as follows, using
      charge/commit/cancel ops.
      
       before
        - charge PAGE_SIZE (no target page)
       success
        - commit charge against "newpage".
       failure
        - commit charge against "oldpage".
          (PCG_USED bit works effectively to avoid double-counting)
        - if "oldpage" is obsolete, cancel charge of PAGE_SIZE.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01b1ae63
    • memcg: fix gfp_mask of callers of charge · bced0520
      Committed by KAMEZAWA Hiroyuki
      Fix misuse of GFP_KERNEL.

      Now, most callers of the mem_cgroup_charge_xxx functions use GFP_KERNEL.

      I think this comes from the fact that page_cgroup *was* dynamically
      allocated.

      But now we allocate all page_cgroups at boot.  And
      mem_cgroup_try_to_free_pages() reclaims memory with GFP_HIGHUSER_MOVABLE +
      the specified GFP_RECLAIM_MASK.

        * This is because we just want to reduce memory usage.
          "Where should we reclaim from?" is not a problem for memcg.

      This patch modifies the gfp masks to be GFP_HIGHUSER_MOVABLE where possible.
      
      Note: This patch is not for fixing behavior but for showing sane information
            in source code.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bced0520
    • memcg: introduce charge-commit-cancel style of functions · 7a81b88c
      Committed by KAMEZAWA Hiroyuki
      There is a small race in do_swap_page().  When the page being swapped in is
      charged, its mapcount can be greater than 0.  But at the same time, some
      process that shares it can call unmap, taking the mapcount from 1 to 0, and
      the page is uncharged.
      
            CPUA 			CPUB
             mapcount == 1.
         (1) charge if mapcount==0     zap_pte_range()
                                      (2) mapcount 1 => 0.
      			        (3) uncharge(). (success)
         (4) set page's rmap()
             mapcount 0=>1
      
      Then this swap page's accounting is leaked.

      To fix this, I added a new interface.
        - charge
         account to res_counter by PAGE_SIZE and try to free pages if necessary.
        - commit
         register page_cgroup and add to LRU if necessary.
        - cancel
         uncharge PAGE_SIZE because of do_swap_page failure.
      
           CPUA
        (1) charge (always)
        (2) set page's rmap (mapcount > 0)
        (3) commit charge was necessary or not after set_pte().
      
      This protocol uses the PCG_USED bit on page_cgroup to avoid over-accounting.
      The usual mem_cgroup_charge_common() does charge -> commit in one step.
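      In code, the swap-in side of the protocol looks roughly like this (a sketch;
      the helper names follow the memcg swap-in API of that time and the
      surrounding fault handling is omitted):

      	struct mem_cgroup *ptr = NULL;

      	if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &ptr))
      		goto out_fail;				/* (1) charge failed */
      	page_add_anon_rmap(page, vma, address);		/* (2) rmap set, mapcount > 0 */
      	mem_cgroup_commit_charge_swapin(page, ptr);	/* (3) commit; PCG_USED guards
      							 *     against double accounting */

      	/* if the fault fails after (1), the charge is undone instead: */
      	mem_cgroup_cancel_charge_swapin(ptr);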
      
      This patch also adds the following functions to clarify all charges.
      
        - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
      	called against newly allocated anon pages.
      
        - mem_cgroup_charge_migrate_fixup()
              called only from remove_migration_ptes().
      	we'll have to rewrite this later.(this patch just keeps old behavior)
      	This function will be removed by additional patch to make migration
      	clearer.
      
      Good for clarifying "what we do"
      
      Then we have the following 4 charge points.
        - newpage
        - swap-in
        - add-to-cache.
        - migration.
      
      [akpm@linux-foundation.org: add missing inline directives to stubs]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a81b88c
  2. 07 January 2009: 1 commit
  3. 23 October 2008: 1 commit
    • memcg: fix page_cgroup allocation · 94b6da5a
      Committed by KAMEZAWA Hiroyuki
      page_cgroup_init() is called from mem_cgroup_init().  But at this point, we
      cannot call alloc_bootmem().
      (And this caused a panic at boot.)

      This patch moves page_cgroup_init() to init/main.c.

      The time table is as follows:
      ==
        parse_args(). # we can trust mem_cgroup_subsys.disabled bit after this.
        ....
        cgroup_init_early()  # "early" init of cgroup.
        ....
        setup_arch()         # memmap is allocated.
        ...
        page_cgroup_init();
        mem_init();   # we cannot call alloc_bootmem after this.
        ....
        cgroup_init() # mem_cgroup is initialized.
      ==
      
      Before page_cgroup_init(), mem_map must be initialized. So,
      I added page_cgroup_init() to init/main.c directly.
      
      (*) Maybe this is not very clean, but
          - cgroup_init_early() is too early, and
          - in cgroup_init(), we would have to use vmalloc instead of alloc_bootmem().
          The vmalloc area on x86-32 is precious and we should avoid very large
          vmalloc() there.  So we want to use alloc_bootmem(), and page_cgroup_init()
          was added directly to init/main.c.
      
      [akpm@linux-foundation.org: remove unneeded/bad mem_cgroup_subsys declaration]
      [akpm@linux-foundation.org: fix build]
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94b6da5a
  4. 20 October 2008: 8 commits
  5. 29 September 2008: 1 commit
    • mm owner: fix race between swapoff and exit · 31a78f23
      Committed by Balbir Singh
      There's a race between mm->owner assignment and swapoff, more easily
      seen when task slab poisoning is turned on.  The condition occurs when
      try_to_unuse() runs in parallel with an exiting task.  A similar race
      can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
      or ptrace or page migration.
      
      CPU0                                    CPU1
                                              try_to_unuse
                                              looks at mm = task0->mm
                                              increments mm->mm_users
      task 0 exits
      mm->owner needs to be updated, but no
      new owner is found (mm_users > 1, but
      no other task has task->mm = task0->mm)
      mm_update_next_owner() leaves
                                              mmput(mm) decrements mm->mm_users
      task0 freed
                                              dereferencing mm->owner fails
      
      The fix is to notify the subsystem via the mm_owner_changed callback when
      no new owner is found, by specifying the new task as NULL.
      
      Jiri Slaby:
      mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
      must be set after that, so as not to pass NULL as old owner causing oops.
      
      Daisuke Nishimura:
      mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
      and its callers need to take account of this situation to avoid oops.
      
      Hugh Dickins:
      Lockdep warning and hang below exec_mmap() when testing these patches.
      exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
      so exec_mmap() now needs to do the same.  And with that repositioning,
      there's now no point in mm_need_new_owner() allowing for NULL mm.
      Reported-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31a78f23
  6. 23 September 2008: 1 commit
    • memcg: check under limit at shrink_usage · a10cebf5
      Committed by Daisuke Nishimura
      The current memory cgroup (both in mainline and -mm) doesn't account swap
      caches as memory (swap cache support is temporarily dropped now).
      
      So try_to_free_mem_cgroup_pages doesn't reflect the count of pages that
      have been moved to swap cache.
      
      But this makes mem_cgroup_shrink_usage fail easily if most of the pages
      are anon/shmem, and then shmem_getpage returns -ENOMEM and the process
      will be killed.
      
      This patch adds res_counter_check_under_limit to avoid these cases.
      
      BTW, even if swap cache support is enabled again, if a process is moved to
      another cgroup that has just been created, between precharge and
      shrink_usage in shmem_getpage, shrink_usage may fail simply because there
      are no pages to reclaim.

      So this change would make sense anyway.
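      The idea in code, roughly (a sketch of the core of mem_cgroup_shrink_usage()'s
      retry loop; the surrounding setup is simplified):

      	/* treat "usage fell back under the limit" as progress too, even if
      	 * try_to_free_mem_cgroup_pages() could not reclaim pages directly */
      	do {
      		progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
      		progress += res_counter_check_under_limit(&mem->res);
      	} while (!progress && retry--);

      	if (!progress)
      		return -ENOMEM;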
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a10cebf5
  7. 13 August 2008: 1 commit
  8. 31 July 2008: 1 commit
  9. 26 July 2008: 10 commits
    • memcg: limit change shrink usage · 628f4235
      Committed by KAMEZAWA Hiroyuki
      Shrinking memory usage at limit change.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      628f4235
    • memcg: clean up checking of the disabled flag · cede86ac
      Committed by Li Zefan
      Those checks are unnecessary, because when the subsystem is disabled it
      can't be mounted, so those functions won't get called.

      The check is only needed in functions which can be called from places
      other than cgroup.
      
      [hugh@veritas.com: further checking of disabled flag]
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cede86ac
    • memcg: remove a redundant check · accf163e
      Committed by KAMEZAWA Hiroyuki
      Because of the remove-refcnt patch, it is now a very rare case that
      mem_cgroup_charge_common() is called against a page which is already
      accounted.

      mem_cgroup_charge_common() is called when:
       1. a page is added into the file cache.
       2. an anon page is _newly_ mapped.

      A racy case is when a newly-swapped-in anonymous page is referenced from
      multiple threads in do_swap_page() at the same time.
      (The page is not locked when mem_cgroup_charge() is called from do_swap_page.)

      Another case is shmem.  It charges its page before calling
      add_to_page_cache().  Then, mem_cgroup_charge_cache() is called twice.
      This case is handled in mem_cgroup_cache_charge().  But this check may be
      too hacky...

      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      accf163e
    • memcg: add hints for branch · b76734e5
      Committed by KAMEZAWA Hiroyuki
      Show the branch direction for obvious conditions.
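      For example (illustrative only; the condition is the disabled check that
      already appears in this code):

      	/* the controller is usually enabled, so hint the hot path */
      	if (unlikely(mem_cgroup_subsys.disabled))
      		return 0;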
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b76734e5
    • memcg: helper function for relcaim from shmem. · c9b0ed51
      Committed by KAMEZAWA Hiroyuki
      A new call, mem_cgroup_shrink_usage(), is added for shmem handling,
      replacing non-standard usage of mem_cgroup_charge/uncharge.

      Now, shmem calls mem_cgroup_charge() just to reclaim some pages from a
      mem_cgroup.  In general, shmem is used by some process group and not as a
      global resource (like file caches).  So it's reasonable to reclaim pages
      from the mem_cgroup where shmem is mainly used.
      
      [hugh@veritas.com: shmem_getpage release page sooner]
      [hugh@veritas.com: mem_cgroup_shrink_usage css_put]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9b0ed51
    • memcg: remove refcnt from page_cgroup · 69029cd5
      Committed by KAMEZAWA Hiroyuki
      memcg: performance improvements

      Patch description
       1/5 ... remove refcnt from page_cgroup patch (shmem handling is fixed)
       2/5 ... swapcache handling patch
       3/5 ... add helper function for shmem's memory reclaim patch
       4/5 ... optimize by likely/unlikely patch
       5/5 ... remove redundant check patch (shmem handling is fixed.)
      
      Unix bench result.
      
      == 2.6.26-rc2-mm1 + memory resource controller
      Execl Throughput                           2915.4 lps   (29.6 secs, 3 samples)
      C Compiler Throughput                      1019.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5796.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1097.7 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               565.3 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1022128.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   544057.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    346481.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      319325.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     148788.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks       99051.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2058917.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1606109.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    854789.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         126145.2 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     2915.4      678.0
      File Copy 1024 bufsize 2000 maxblocks         3960.0   346481.0      875.0
      File Copy 256 bufsize 500 maxblocks           1655.0    99051.0      598.5
      File Copy 4096 bufsize 8000 maxblocks         5800.0   854789.0     1473.8
      Shell Scripts (8 concurrent)                     6.0     1097.7     1829.5
                                                                       =========
           FINAL SCORE                                                     991.3
      
      == 2.6.26-rc2-mm1 + this set ==
      Execl Throughput                           3012.9 lps   (29.9 secs, 3 samples)
      C Compiler Throughput                       981.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (1 concurrent)               5872.0 lpm   (60.0 secs, 3 samples)
      Shell Scripts (8 concurrent)               1120.3 lpm   (60.0 secs, 3 samples)
      Shell Scripts (16 concurrent)               578.0 lpm   (60.0 secs, 3 samples)
      File Read 1024 bufsize 2000 maxblocks    1003993.0 KBps  (30.0 secs, 3 samples)
      File Write 1024 bufsize 2000 maxblocks   550452.0 KBps  (30.0 secs, 3 samples)
      File Copy 1024 bufsize 2000 maxblocks    347159.0 KBps  (30.0 secs, 3 samples)
      File Read 256 bufsize 500 maxblocks      314644.0 KBps  (30.0 secs, 3 samples)
      File Write 256 bufsize 500 maxblocks     151852.0 KBps  (30.0 secs, 3 samples)
      File Copy 256 bufsize 500 maxblocks      101000.0 KBps  (30.0 secs, 3 samples)
      File Read 4096 bufsize 8000 maxblocks    2033256.0 KBps  (30.0 secs, 3 samples)
      File Write 4096 bufsize 8000 maxblocks   1611814.0 KBps  (30.0 secs, 3 samples)
      File Copy 4096 bufsize 8000 maxblocks    847979.0 KBps  (30.0 secs, 3 samples)
      Dc: sqrt(2) to 99 decimal places         128148.7 lpm   (30.0 secs, 3 samples)
      
                           INDEX VALUES
      TEST                                        BASELINE     RESULT      INDEX
      
      Execl Throughput                                43.0     3012.9      700.7
      File Copy 1024 bufsize 2000 maxblocks         3960.0   347159.0      876.7
      File Copy 256 bufsize 500 maxblocks           1655.0   101000.0      610.3
      File Copy 4096 bufsize 8000 maxblocks         5800.0   847979.0     1462.0
      Shell Scripts (8 concurrent)                     6.0     1120.3     1867.2
                                                                       =========
           FINAL SCORE                                                    1004.6
      
      This patch:
      
      Remove refcnt from page_cgroup().
      
      After this,
      
       * A page is charged only when !page_mapped() && no page_cgroup is assigned.
      	* Anon page is newly mapped.
      	* File page is added to mapping->tree.
      
       * A page is uncharged only when
      	* Anon page is fully unmapped.
      	* File page is removed from LRU.
      
      There is no change in behavior from the user's point of view.

      This patch also removes unnecessary calls in rmap.c which were used only
      for refcnt management.
      
      [akpm@linux-foundation.org: fix warning]
      [hugh@veritas.com: fix shmem_unuse_inode charging]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69029cd5
    • memcg: better migration handling · e8589cc1
      Committed by KAMEZAWA Hiroyuki
      This patch changes page migration under the memory controller to use a
      different algorithm.  (Thanks to Christoph for the new idea.)

      Before:
       - page_cgroup is migrated from an old page to a new page.
      After:
       - a new page is accounted; no reuse of page_cgroup.

      Pros:

       - We can avoid complicated lock dependencies and races in migration.
      
      Cons:
      
       - new param to mem_cgroup_charge_common().
      
       - mem_cgroup_getref() is added for handling ref_cnt ping-pong.
      
      This version simplifies the complicated lock dependency in page migration
      under the memory resource controller.

        The new refcnt sequence is as follows.

      a mapped page:
        prepare_migration() ..... +1 to NEW page
        try_to_unmap()      ..... all refs to OLD page are gone.
        move_pages()        ..... +1 to NEW page if page cache.
        remap...            ..... all refs from *map* are added to NEW one.
        end_migration()     ..... -1 to NEW page.

        page's mapcount + (page_is_cache) refs are added to the NEW one.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8589cc1
    • memcg: avoid unnecessary initialization · 508b7be0
      Committed by KAMEZAWA Hiroyuki
      * remove over-killing initialization (in the fast path)
      * make the condition for PAGE_CGROUP_FLAG_ACTIVE more obvious.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      508b7be0
    • memcg: make global var read_mostly · a181b0e8
      Committed by KAMEZAWA Hiroyuki
      mem_cgroup_subsys and page_cgroup_cache should be read_mostly and
      MEM_CGROUP_RECLAIM_RETRIES can be just a fixed number.
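      Illustratively (the retry value is an example, not necessarily the one in
      the patch):

      	/* frequently read, rarely written globals */
      	struct cgroup_subsys mem_cgroup_subsys __read_mostly;
      	static struct kmem_cache *page_cgroup_cache __read_mostly;

      	/* a fixed retry count instead of a variable */
      	#define MEM_CGROUP_RECLAIM_RETRIES	5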
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a181b0e8
    • cgroup files: convert res_counter_write() to be a cgroups write_string() handler · 856c13aa
      Committed by Paul Menage
      Currently res_counter_write() is a raw file handler even though it's
      ultimately taking a number, since in some cases it wants to
      pre-process the string when converting it to a number.
      
      This patch converts res_counter_write() from a raw file handler to a
      write_string() handler; this allows some of the boilerplate
      copying/locking/checking to be removed and simplifies the cleanup path,
      since these functions are now performed by the cgroups framework.
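      In cftype terms the change is roughly the following (a sketch; the field
      values are illustrative and the handler body is omitted):

      	/* the handler now receives a NUL-terminated kernel string from the
      	 * framework instead of copying a raw user buffer itself */
      	static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
      				    const char *buffer);

      	static struct cftype mem_cgroup_files[] = {
      		{
      			.name = "limit_in_bytes",
      			.private = RES_LIMIT,
      			.write_string = mem_cgroup_write,	/* was .write */
      		},
      	};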
      
      [lizf@cn.fujitsu.com: build fix]
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      856c13aa