1. 09 Jan, 2009 24 commits
    • inactive_anon_is_low: move to vmscan · f89eb90e
      KOSAKI Motohiro committed
      inactive_anon_is_low() is called only from vmscan, so it can move to
      vmscan.c.
      
      This patch doesn't have any functional change.
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: hierarchy avoid unnecessary reclaim · 670ec2f1
      Daisuke Nishimura committed
      If hierarchy is not used, no tree-walk is necessary.
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: swapout refcnt fix · a7fe942e
      KAMEZAWA Hiroyuki committed
      The css's refcnt is dropped before the end of the access that follows.
      Hold it until the access ends.
      Reported-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: memory swap controller: fix limit check · b85a96c0
      Daisuke Nishimura committed
      There are scattered calls of res_counter_check_under_limit(), and most of
      them don't take mem+swap accounting into account.

      Define mem_cgroup_check_under_limit() and avoid direct use of
      res_counter_check_under_limit().
      Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
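The limit check described in the commit above can be sketched in userspace C. This is only an illustrative model, not the kernel code: the struct layouts, the `mem_cgroup_model` name and the `do_swap_account` field are assumptions made for the example. The wrapper consults the memory counter and, when mem+swap accounting is enabled, the mem+swap counter as well.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's res_counter (illustrative only). */
struct res_counter {
    unsigned long long usage;
    unsigned long long limit;
};

/* Hypothetical model of a memory cgroup holding two counters. */
struct mem_cgroup_model {
    struct res_counter res;    /* memory usage */
    struct res_counter memsw;  /* memory + swap usage */
    bool do_swap_account;      /* assumed switch for mem+swap accounting */
};

static bool res_counter_check_under_limit(const struct res_counter *c)
{
    return c->usage < c->limit;
}

/* Under limit only if memory is and, when enabled, mem+swap is too. */
static bool mem_cgroup_check_under_limit(const struct mem_cgroup_model *m)
{
    assert(m != NULL);
    if (!res_counter_check_under_limit(&m->res))
        return false;
    if (m->do_swap_account && !res_counter_check_under_limit(&m->memsw))
        return false;
    return true;
}
```

Callers that go through the wrapper cannot forget the mem+swap counter, which is the failure mode the commit describes for the scattered direct calls.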
    • memcg: check group leader fix · f9717d28
      Nikanth Karthikesan committed
      Remove unnecessary code (fragments of not-implemented
      functionality).
      Reported-by: Nikanth Karthikesan <knikanth@suse.de>
      Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: revert gfp mask fix · 2c26fdd7
      KAMEZAWA Hiroyuki committed
      My patch, memcg-fix-gfp_mask-of-callers-of-charge.patch, changed the gfp_mask
      of callers of charge to GFP_HIGHUSER_MOVABLE to show what will
      happen at memory reclaim.

      But in recent discussion it was NACKed because it sounds ugly.

      This patch reverts it and adds some cleanup to the gfp_mask of
      callers of charge.  No behavior change, but it needs review before generating
      a HUNK deep in the queue.

      This patch also adds an explanation of the meaning of the gfp_mask passed to
      charge functions in memcontrol.h.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix reclaim result checks · 88700756
      KAMEZAWA Hiroyuki committed
      The check_under_limit logic was wrong; this check should be against
      mem_over_limit rather than mem.
      Reported-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Jan Blunck <jblunck@suse.de>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: avoid unnecessary system-wide-oom-killer · a636b327
      KAMEZAWA Hiroyuki committed
      Current mmotm has a new OOM function, pagefault_out_of_memory().  It was
      added to select a bad process rather than killing current.

      When memcg hits its limit and calls OOM at page fault, this handler is called
      and system-wide OOM handling happens.  (This means the kernel panics if
      panic_on_oom is true.)

      To avoid overkill, check memcg's recent behavior before starting
      system-wide OOM.

      This patch also guarantees "don't account against a process with
      TIF_MEMDIE".  This is necessary for smooth OOM.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Jan Blunck <jblunck@suse.de>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: memory cgroup hierarchy feature selector · 18f59ea7
      Balbir Singh committed
      Don't enable multiple hierarchy support by default.  This patch introduces
      a features element that can be set to enable the nested depth hierarchy
      feature.  The feature can only be enabled for a cgroup that has no
      children.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: memory cgroup hierarchical reclaim · 6d61ef40
      Balbir Singh committed
      This patch introduces hierarchical reclaim.  When an ancestor goes over
      its limit, the charging routine points to the parent that is above its
      limit.  The reclaim process then starts from the last scanned child of the
      ancestor and reclaims until the ancestor goes below its limit.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [d-nishimura@mtf.biglobe.ne.jp: mem_cgroup_from_res_counter should handle both mem->res and mem->memsw]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: memory cgroup resource counters for hierarchy · 28dbc4b6
      Balbir Singh committed
      Add support for building hierarchies in resource counters.  Cgroups allows
      us to build a deep hierarchy, but we currently don't link the resource
      counters belonging to the memory controller control groups in the same
      fashion as the corresponding cgroup entries in the cgroup hierarchy.  This
      patch provides the infrastructure for resource counters that have the same
      hierarchy as their cgroup counterparts.

      This set of patches is based on the resource counter hierarchy patches
      posted by Pavel Emelianov.

      NOTE: Building hierarchies is expensive; deeper hierarchies imply charging
      all the way up to the root.  It is known that hierarchies are
      expensive, so the user needs to be careful and aware of the trade-offs
      before creating very deep ones.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
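The cost noted above, charging all the way up to the root, can be illustrated with a small userspace sketch. It is an assumption-laden model rather than the kernel implementation: the struct fields, the `-1` error return and the `fail_at` output parameter are chosen for the example. A charge walks from the leaf to the root, and on failure it reports the ancestor that is over its limit and rolls back the levels already charged.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified hierarchical counter (illustrative only). */
struct res_counter {
    unsigned long long usage;
    unsigned long long limit;
    struct res_counter *parent; /* NULL at the root */
};

/*
 * Charge val to c and to every ancestor. On failure, *fail_at points to
 * the counter that would exceed its limit, earlier charges are undone,
 * and -1 is returned. Returns 0 on success.
 */
static int res_counter_charge(struct res_counter *c, unsigned long long val,
                              struct res_counter **fail_at)
{
    struct res_counter *cur, *undo;

    assert(c != NULL && fail_at != NULL);
    *fail_at = NULL;
    for (cur = c; cur != NULL; cur = cur->parent) {
        if (cur->usage + val > cur->limit) {
            *fail_at = cur;
            /* Roll back the levels below the failing one. */
            for (undo = c; undo != cur; undo = undo->parent)
                undo->usage -= val;
            return -1;
        }
        cur->usage += val;
    }
    return 0;
}
```

Each charge costs one step per level, which is why very deep hierarchies are expensive. The reported failing ancestor is the natural starting point for a hierarchical reclaim pass.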
    • memcg: add mem_cgroup_disabled() · f8d66542
      Hirokazu Takahashi committed
      We check whether mem_cgroup is disabled by checking
      mem_cgroup_subsys.disabled.  I think it has more references than expected
      now.

      Replacing
         if (mem_cgroup_subsys.disabled)
      with
         if (mem_cgroup_disabled())

      makes the code look better, I think.
      
      [kamezawa.hiroyu@jp.fujitsu.com: fix typo]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
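A minimal sketch of the accessor pattern from the commit above, written as standalone C. The `cgroup_subsys_model` type and the `charge_page()` caller are hypothetical stand-ins for the example; only the names `mem_cgroup_subsys` and `mem_cgroup_disabled()` come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the subsystem descriptor. */
struct cgroup_subsys_model {
    int disabled;
};

static struct cgroup_subsys_model mem_cgroup_subsys = { 0 };

/* One place that knows how "disabled" is represented. */
static inline bool mem_cgroup_disabled(void)
{
    return mem_cgroup_subsys.disabled != 0;
}

/* Example caller (hypothetical): skip accounting when disabled. */
static int charge_page(unsigned long *usage)
{
    assert(usage != NULL);
    if (mem_cgroup_disabled())
        return 0; /* nothing to account */
    *usage += 1;
    return 0;
}
```

With the accessor in place, a later change to how the disabled state is stored touches one function instead of every call site.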
    • memcg: synchronized LRU · 08e552c6
      KAMEZAWA Hiroyuki committed
      A big patch for changing memcg's LRU semantics.
      
      Now,
        - page_cgroup is linked to its mem_cgroup's own LRU (per zone).

        - LRU of page_cgroup is not synchronous with global LRU.

        - page and page_cgroup are one-to-one and statically allocated.

        - To find which LRU a page_cgroup is on, you have to check pc->mem_cgroup, as in
          - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

        - SwapCache is handled.

      And when we handle the LRU list of page_cgroup, we do the following.
      
      	pc = lookup_page_cgroup(page);
      	lock_page_cgroup(pc); .....................(1)
      	mz = page_cgroup_zoneinfo(pc);
      	spin_lock(&mz->lru_lock);
      	.....add to LRU
      	spin_unlock(&mz->lru_lock);
      	unlock_page_cgroup(pc);
      
      But (1) is a spin_lock and we have to be afraid of deadlock with zone->lru_lock.
      So trylock() is now used at (1). Without (1), we can't trust that "mz" is correct.
      
      This is an attempt to remove this dirty nesting of locks.
      This patch changes mz->lru_lock to be zone->lru_lock.
      Then the above sequence will be written as

      	spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
      	mem_cgroup_add/remove/etc_lru() {
      		pc = lookup_page_cgroup(page);
      		mz = page_cgroup_zoneinfo(pc);
      		if (PageCgroupUsed(pc)) {
      			....add to LRU
      		}
      	}
      	spin_unlock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
      
      This is much simpler.
      (*) We're safe even if we don't take lock_page_cgroup(pc), because:
          1. When pc->mem_cgroup can be modified.
             - at charge.
             - at account_move().
          2. at charge
             the PCG_USED bit is not set before pc->mem_cgroup is fixed.
          3. at account_move()
             the page is isolated and not on LRU.
      
      Pros.
        - easier to maintain.
        - memcg can make use of the laziness of pagevec.
        - we don't have to duplicate the LRU/Active/Unevictable bits in page_cgroup.
        - LRU status of memcg will be synchronized with the global LRU's.
        - the number of locks is reduced.
        - account_move() is simplified very much.
      Cons.
        - may increase the cost of LRU rotation.
          (no impact if memcg is not configured.)
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: mem+swap controller core · 8c7c6e34
      KAMEZAWA Hiroyuki committed
      This patch implements a per-cgroup limit for usage of memory+swap.  Although
      there is SwapCache, double counting of swap-cache and swap-entry is
      avoided.

      The mem+swap controller works as follows.
        - memory usage is limited by memory.limit_in_bytes.
        - memory + swap usage is limited by memory.memsw_limit_in_bytes.
      
      This has the following benefits.
        - A user can limit total resource usage of mem+swap.

          Without this, because the memory resource controller doesn't take care of
          swap usage, a process can exhaust all the swap (through a memory leak).
          We can avoid this case.

          Swap is a shared resource, but it cannot be reclaimed (brought back to memory)
          until it's used. This characteristic can cause trouble when the memory
          is divided into parts by cpuset or memcg.
          Assume group A and group B.
          After some applications execute, the system can end up as follows:

          Group A -- very large free memory space, but occupies 99% of swap.
          Group B -- under memory shortage, but cannot use swap because it's nearly full.

          The ability to set an appropriate swap limit for each group is required.

      Someone may wonder, "why not swap but mem+swap?"

        - The global LRU (kswapd) can swap out arbitrary pages. Swap-out means
          moving the account from memory to swap; there is no change in usage of
          mem+swap.

          In other words, when we want to limit the usage of swap without affecting
          the global LRU, a mem+swap limit is better than just limiting swap.

      Accounting target information is stored in swap_cgroup, which is
      a per-swap-entry record.

      Charging is done as follows.
        map
          - charge  page and memsw.
      
        unmap
          - uncharge page/memsw if not SwapCache.
      
        swap-out (__delete_from_swap_cache)
          - uncharge page
          - record mem_cgroup information to swap_cgroup.
      
        swap-in (do_swap_page)
          - charged as page and memsw.
            record in swap_cgroup is cleared.
            memsw accounting is decremented.
      
        swap-free (swap_free())
          - if swap entry is freed, memsw is uncharged by PAGE_SIZE.
      
      There are people who work in never-swap environments and consider swap
      something bad. For such people, this mem+swap controller extension is just
      overhead.  This overhead is avoided by a config or boot option.
      (See Kconfig; the details are not in this patch.)

      TODO:
       - maybe more optimization can be done in the swap-in path. (but not very safe.)
         But we just do simple accounting at this stage.
      
      [nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
      [hugh@veritas.com: memswap controller core swapcache fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
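The charge rules listed in the commit above (map, unmap, swap-out, swap-in, swap-free) can be checked with a toy model of the two counters. This is a sketch under stated assumptions: the `memsw_model` struct, the `model_*` function names and the 4096-byte page size are invented for the example, and SwapCache handling is left out.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096ULL /* assumed page size for the model */

/* Two usage counters of one group (illustrative only). */
struct memsw_model {
    unsigned long long mem;   /* counted against the memory limit */
    unsigned long long memsw; /* counted against the mem+swap limit */
};

/* map: charge page and memsw. */
static void model_map(struct memsw_model *g)
{
    g->mem += MODEL_PAGE_SIZE;
    g->memsw += MODEL_PAGE_SIZE;
}

/* unmap (page is not SwapCache): uncharge page and memsw. */
static void model_unmap(struct memsw_model *g)
{
    assert(g->mem >= MODEL_PAGE_SIZE && g->memsw >= MODEL_PAGE_SIZE);
    g->mem -= MODEL_PAGE_SIZE;
    g->memsw -= MODEL_PAGE_SIZE;
}

/* swap-out: uncharge the page only; the swap entry keeps the memsw charge. */
static void model_swap_out(struct memsw_model *g)
{
    assert(g->mem >= MODEL_PAGE_SIZE);
    g->mem -= MODEL_PAGE_SIZE;
}

/* swap-in: charge page and memsw, then drop the swap entry's memsw charge. */
static void model_swap_in(struct memsw_model *g)
{
    g->mem += MODEL_PAGE_SIZE;
    g->memsw += MODEL_PAGE_SIZE;
    g->memsw -= MODEL_PAGE_SIZE;
}

/* swap-free: the swap entry is freed, so memsw is uncharged. */
static void model_swap_free(struct memsw_model *g)
{
    assert(g->memsw >= MODEL_PAGE_SIZE);
    g->memsw -= MODEL_PAGE_SIZE;
}
```

In this model a swap-out followed by a swap-in leaves the mem+swap counter unchanged, which matches the argument that swapping only moves the account between memory and swap.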
    • memcg: swap cgroup for remembering usage · 27a7faa0
      KAMEZAWA Hiroyuki committed
      For accounting swap, we need at least one record per swap entry.

      This patch adds the following functions.
        - swap_cgroup_swapon() .... called from swapon
        - swap_cgroup_swapoff() ... called at the end of swapoff.

        - swap_cgroup_record() .... record information of swap entry.
        - swap_cgroup_lookup() .... lookup information of swap entry.

      This patch just implements "how to record information".  It has no actual
      method for limiting the usage of swap.  These routines use a flat table to
      record and look up.  A "wise" lookup system like a radix-tree requires memory
      allocation for new records, but swap-out is usually called under memory
      shortage (or when memcg hits its limit), so I used static allocation.  (Maybe
      dynamic allocation is not very hard, but it adds additional memory
      allocation in the memory shortage path.)

      Note1: Here we use a pointer to record information, which means
            8 bytes per swap entry. I think we can reduce this when we
            create an "id of cgroup" in the range of 0-65535 or 0-255.
      Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reported-by: Hugh Dickins <hugh@veritas.com>
      Reported-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Reported-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
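The flat-table idea above can be sketched as follows. The four function names come from the commit message, but the signatures, the `swap_cgroup_table` type and the stand-in `struct mem_cgroup` are assumptions made so the example compiles on its own. All memory is allocated once at swapon, so record and lookup never allocate.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the real mem_cgroup (illustrative only). */
struct mem_cgroup {
    int id;
};

/* One pointer slot per swap entry, allocated up front. */
struct swap_cgroup_table {
    struct mem_cgroup **slot;
    size_t nr_entries;
};

/* Called at swapon: allocate the whole table at once. */
static int swap_cgroup_swapon(struct swap_cgroup_table *t, size_t max_entries)
{
    t->slot = calloc(max_entries, sizeof(*t->slot));
    if (t->slot == NULL)
        return -1;
    t->nr_entries = max_entries;
    return 0;
}

/* Called at the end of swapoff: release the table. */
static void swap_cgroup_swapoff(struct swap_cgroup_table *t)
{
    free(t->slot);
    t->slot = NULL;
    t->nr_entries = 0;
}

/* Store the owner of a swap entry and return the previous owner. */
static struct mem_cgroup *swap_cgroup_record(struct swap_cgroup_table *t,
                                             size_t offset,
                                             struct mem_cgroup *mem)
{
    struct mem_cgroup *old;

    assert(offset < t->nr_entries);
    old = t->slot[offset];
    t->slot[offset] = mem;
    return old;
}

/* Return the owner of a swap entry, or NULL if none is recorded. */
static struct mem_cgroup *swap_cgroup_lookup(const struct swap_cgroup_table *t,
                                             size_t offset)
{
    assert(offset < t->nr_entries);
    return t->slot[offset];
}
```

The trade-off is the one named in Note1: a pointer per swap entry costs memory even for unused entries, in exchange for an allocation-free swap-out path.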
    • memcg: mem+swap controller Kconfig · c077719b
      KAMEZAWA Hiroyuki committed
      Config and control variable for mem+swap controller.
      
      This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP
      (memory resource controller swap extension.)
      
      For accounting swap, it's obvious that we have to use additional memory to
      remember "who uses swap".  This adds more overhead, so it's better to
      offer a choice to users.

      This patch adds 2 ways to enable or disable the swap extension:
        - CONFIG
        - boot option
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: handle swap caches · d13d1443
      KAMEZAWA Hiroyuki committed
      SwapCache support for memory resource controller (memcg)
      
      Before the mem+swap controller, memcg itself should handle SwapCache
      properly.  This patch is cut out from it.

      In current memcg, SwapCache is just leaked and the user can create tons of
      SwapCache.  This is a leak of accounting and should be handled.

      SwapCache accounting is done as follows.
      
        charge (anon)
      	- charged when it's mapped.
      	  (because of readahead, charge at add_to_swap_cache() is not sane)
        uncharge (anon)
      	- uncharged when it's dropped from swapcache and fully unmapped.
      	  means it's not uncharged at unmap.
      	  Note: delete from swap cache at swap-in is done after rmap information
      	        is established.
        charge (shmem)
      	- charged at swap-in. this prevents charge at add_to_page_cache().
      
        uncharge (shmem)
      	- uncharged when it's dropped from swapcache and not on shmem's
      	  radix-tree.
      
        at migration, check against 'old page' is modified to handle shmem.
      
      Compared to the old version discussed (which caused trouble), we have
      the advantages of
        - PCG_USED bit.
        - simple migration handling.

      So the situation is much easier than several months ago, maybe.
      
      [hugh@veritas.com: memcg: handle swap caches build fix]
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: new force_empty to free pages under group · c1e862c1
      KAMEZAWA Hiroyuki committed
      With memcg-move-all-accounts-to-parent-at-rmdir.patch, there is no leak of
      memory usage and force_empty is removed.

      This patch adds "force_empty" again, in a reasonable manner.

      The memory.force_empty file works when you run

        #echo 0 (or some) > memory.force_empty
        and has the following behavior.

        1. It only works when there are no tasks in this cgroup.
        2. It frees all pages under this cgroup as far as possible.
        3. Pages which cannot be freed are moved up to the parent.
        4. The memcg will then be empty after the above echo returns.

      This is much better behavior than the old "force_empty", which just forgot
      all accounts. This patch also checks signal_pending(), so the above "echo"
      can be stopped by "Ctrl-C".
      
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: reduce size of mem_cgroup by using nr_cpu_ids · c8dad2bb
      Jan Blunck committed
      As Jan Blunck <jblunck@suse.de> pointed out, allocating per-cpu stat for
      memcg to the size of NR_CPUS is not good.
      
      This patch changes mem_cgroup's cpustat allocation not based on NR_CPUS
      but based on nr_cpu_ids.
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
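The sizing change above can be illustrated with a C flexible array member. This is a userspace sketch, not the kernel's allocation code: `mem_cgroup_model`, `cpu_stat`, `NR_CPUS_MODEL` and the helper names are assumptions for the example. The per-CPU array is sized from the runtime CPU count instead of the compile-time maximum.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS_MODEL 4096 /* assumed compile-time maximum */

/* Per-CPU statistics slot (illustrative only). */
struct cpu_stat {
    long count[2];
};

struct mem_cgroup_model {
    int id;
    struct cpu_stat cpustat[]; /* flexible array, sized at allocation */
};

/* Bytes needed for a group with nr_cpu_ids per-CPU slots. */
static size_t mem_cgroup_size(unsigned int nr_cpu_ids)
{
    return sizeof(struct mem_cgroup_model)
           + (size_t)nr_cpu_ids * sizeof(struct cpu_stat);
}

/* Allocate a zeroed group with exactly nr_cpu_ids per-CPU slots. */
static struct mem_cgroup_model *mem_cgroup_alloc(unsigned int nr_cpu_ids)
{
    assert(nr_cpu_ids > 0 && nr_cpu_ids <= NR_CPUS_MODEL);
    return calloc(1, mem_cgroup_size(nr_cpu_ids));
}
```

On a machine with 8 possible CPUs this allocates 8 slots rather than `NR_CPUS_MODEL` of them, which is the saving the commit is after.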
    • memcg: move all accounting to parent at rmdir() · f817ed48
      KAMEZAWA Hiroyuki committed
      This patch provides a function to move account information of a page
      between mem_cgroups and rewrites force_empty to make use of it.

      This moving of page_cgroup is done while
       - lru_lock of the source/destination mem_cgroup is held.
       - lock_page_cgroup() is held.

      Then, a routine which touches pc->mem_cgroup without lock_page_cgroup()
      should confirm whether pc->mem_cgroup is still valid.  Typical code can be
      the following.

      (while page is not under lock_page())
      	mem = pc->mem_cgroup;
      	mz = page_cgroup_zoneinfo(pc);
      	spin_lock_irqsave(&mz->lru_lock, flags);
      	if (pc->mem_cgroup == mem)
      		...../* some list handling */
      	spin_unlock_irqrestore(&mz->lru_lock, flags);
      
      Of course, a better way is
      	lock_page_cgroup(pc);
      	....
      	unlock_page_cgroup(pc);

      But you should check the lock nesting and avoid deadlock.

      If you handle a page_cgroup from mem_cgroup's LRU under mz->lru_lock,
      you don't have to worry about what pc->mem_cgroup points to.
      Moved pages are added to the head of the LRU, not to the tail.

      Expected users of this routine are:
        - force_empty (rmdir)
        - moving tasks between cgroup (for moving account information.)
        - hierarchy (maybe useful.)
      
      force_empty (rmdir) uses this move_account and moves pages to its parent.
      This "move" will not cause OOM (I added an "oom" parameter to try_charge()).

      If the parent is busy (not enough memory), force_empty calls try_to_free_page()
      and reduces usage.

      The purpose of this behavior is to
        - Fix the "forget all" behavior of force_empty and avoid leaking accounting.
        - By "moving first, free if necessary", keep pages in memory as much as
          possible.

      Adding a switch to change the behavior of force_empty to
        - free first, move if necessary
        - free all, if there are mlocked/busy pages, return -EBUSY.
      is under consideration. (I'll add it if someone requests.)

      This patch also removes the memory.force_empty file, a brutal debug-only interface.
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: do not recalculate section unnecessarily in init_section_page_cgroup · 0753b0ef
      Fernando Luis Vazquez Cao committed
      In init_section_page_cgroup() the section a given pfn belongs to is
      calculated at the top of the function and, despite the fact that the
      pfn/section correspondence does not change, it is recalculated further
      down the same function.  By computing this just once and reusing that
      value we save some bytes in the object file and do not waste CPU cycles.
      Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: simple migration handling · 01b1ae63
      KAMEZAWA Hiroyuki committed
      Now, management of "charge" under page migration is done in the following
      manner. (Assume page contents are migrated from oldpage to newpage.)

       before
        - "newpage" is charged before migration.
       at success
        - "oldpage" is uncharged somewhere (unmap, radix-tree-replace)
       at failure
        - "newpage" is uncharged.
        - "oldpage" is charged if necessary (*1)

      But (*1) is not reliable because of GFP_ATOMIC.

      This patch changes the behavior as follows, using charge/commit/cancel ops.
      
       before
        - charge PAGE_SIZE (no target page)
       success
        - commit charge against "newpage".
       failure
        - commit charge against "oldpage".
          (PCG_USED bit works effectively to avoid double-counting)
        - if "oldpage" is obsolete, cancel charge of PAGE_SIZE.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix gfp_mask of callers of charge · bced0520
      KAMEZAWA Hiroyuki committed
      Fix misuse of GFP_KERNEL.

      Now, most callers of mem_cgroup_charge_xxx functions use GFP_KERNEL.

      I think that this comes from the fact that page_cgroup *was* dynamically
      allocated.

      But now we allocate all page_cgroup at boot, and
      mem_cgroup_try_to_free_pages() reclaims memory from GFP_HIGHUSER_MOVABLE +
      the specified GFP_RECLAIM_MASK.

        * This is because we just want to reduce memory usage.
          "Where should we reclaim from?" is not a problem in memcg.

      This patch modifies gfp masks to be GFP_HIGHUSER_MOVABLE if possible.

      Note: This patch is not for fixing behavior but for showing sane information
            in source code.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: introduce charge-commit-cancel style of functions · 7a81b88c
      KAMEZAWA Hiroyuki committed
      There is a small race in do_swap_page().  When the swapped-in page is
      charged, the mapcount can be greater than 0.  But at the same time some
      process that shares it can call unmap and make mapcount 1->0, and the page is
      uncharged.

            CPUA                          CPUB
             mapcount == 1.
         (1) charge if mapcount==0     zap_pte_range()
                                       (2) mapcount 1 => 0.
                                       (3) uncharge(). (success)
         (4) set page's rmap()
             mapcount 0=>1

      Then this swap page's account is leaked.

      To fix this, I added a new interface.
        - charge
         account to res_counter by PAGE_SIZE and try to free pages if necessary.
        - commit
         register page_cgroup and add to LRU if necessary.
        - cancel
         uncharge PAGE_SIZE because of do_swap_page failure.
      
           CPUA
        (1) charge (always)
        (2) set page's rmap (mapcount > 0)
        (3) commit charge was necessary or not after set_pte().
      
      This protocol uses the PCG_USED bit on page_cgroup to avoid over-accounting.
      The usual mem_cgroup_charge_common() does charge -> commit in one step.

      This patch also adds the following functions to clarify all charges.

        - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
      	called against newly allocated anon pages.

        - mem_cgroup_charge_migrate_fixup()
              called only from remove_migration_ptes().
      	we'll have to rewrite this later. (This patch just keeps the old behavior.)
      	This function will be removed by an additional patch to make migration
      	clearer.

      Good for clarifying "what we do".

      Then, we have the following 4 charge points.
        - newpage
        - swap-in
        - add-to-cache.
        - migration.
      
      [akpm@linux-foundation.org: add missing inline directives to stubs]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
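The charge/commit/cancel protocol from the commit above can be sketched as a small userspace model. The function names, the `page_cgroup_model` type with its `used` flag (standing in for the PCG_USED bit) and the `-1` error return are assumptions for illustration, and reclaim on a failed charge is omitted.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096ULL /* assumed page size for the model */

struct res_counter {
    unsigned long long usage;
    unsigned long long limit;
};

/* Per-page record; 'used' stands in for the PCG_USED bit. */
struct page_cgroup_model {
    bool used;
};

/* charge: account one page up front; no page is bound yet. */
static int try_charge(struct res_counter *c)
{
    if (c->usage + MODEL_PAGE_SIZE > c->limit)
        return -1;
    c->usage += MODEL_PAGE_SIZE;
    return 0;
}

/* cancel: give back a charge that was never committed. */
static void cancel_charge(struct res_counter *c)
{
    assert(c->usage >= MODEL_PAGE_SIZE);
    c->usage -= MODEL_PAGE_SIZE;
}

/*
 * commit: bind the charge to the page. If the page is already
 * accounted, drop the extra charge so it is never counted twice.
 */
static void commit_charge(struct res_counter *c, struct page_cgroup_model *pc)
{
    if (pc->used) {
        cancel_charge(c);
        return;
    }
    pc->used = true;
}
```

Because the charge is always taken first and only settled at commit time, a concurrent unmap between the two steps cannot leave the page mapped but unaccounted, which is the leak the commit describes.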
  2. 07 Jan, 2009 16 commits