1. 08 Feb, 2008 26 commits
    • per-zone and reclaim enhancements for memory controller: nid/zid helper function for cgroup · c0149530
      Committed by KAMEZAWA Hiroyuki
      Add a macro to get the node_id and zone_id of a page_cgroup.  It will be used in
      the per-zone-xxx patches and others.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory cgroup enhancements: implicit force_empty() at rmdir · df878fb0
      Committed by KAMEZAWA Hiroyuki
      Add a pre_destroy handler for mem_cgroup and try to make the mem_cgroup empty at
      rmdir().
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory cgroup enhancements: add memory.stat file · d2ceb9b7
      Committed by KAMEZAWA Hiroyuki
      Show the accounting information of a memory cgroup through the memory.stat file.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix printk warning]
      Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory cgroup enhancements: add status accounting function for memory cgroup · d52aa412
      Committed by KAMEZAWA Hiroyuki
      Add statistics accounting infrastructure for the memory controller.  All accounting
      information is stored per-cpu, so callers do not have to take a lock or use
      atomic ops.  This will be used by the memory.stat file later.

      CACHE includes swapcache now. I'd like to divide it into
      PAGECACHE and SWAPCACHE later.

      This patch adds 3 functions for accounting and renames one counter.
       * __mem_cgroup_stat_add() ... for usual routine.
       * __mem_cgroup_stat_add_safe() ... for calling under irq_disabled section.
       * mem_cgroup_read_stat() ... for reading stat value.
       * renamed PAGECACHE to CACHE (because it may include swapcache *now*)
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix smp_processor_id-in-preemptible]
      [akpm@linux-foundation.org: uninline things]
      [akpm@linux-foundation.org: remove dead code]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Paul Menage <menage@google.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
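The per-cpu accounting idea described in the commit above can be illustrated with a small user-space model. This is only a sketch, not the kernel code: the type and function names (`stat_sketch`, `stat_add_sketch`, `stat_read_sketch`), the fixed CPU count, and the explicit `cpu` argument are all assumptions made for the example. Each CPU updates only its own slot, so the update path needs no lock or atomic operation, and a reader sums the slots.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 4 /* hypothetical fixed CPU count for this model */

/* Hypothetical counter indices; CACHE covers page cache and swap cache. */
enum stat_index_sketch { STAT_CACHE_SKETCH, STAT_RSS_SKETCH, STAT_NSTATS_SKETCH };

struct stat_cpu_sketch {
    int64_t count[STAT_NSTATS_SKETCH];
};

struct stat_sketch {
    struct stat_cpu_sketch cpustat[NR_CPUS_SKETCH];
};

/*
 * Update the slot owned by `cpu` only. No lock or atomic op is used: the
 * model assumes the caller cannot be moved to another CPU or interrupted
 * in the middle of the update.
 */
static void stat_add_sketch(struct stat_sketch *s, int cpu,
                            enum stat_index_sketch idx, int val)
{
    assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
    s->cpustat[cpu].count[idx] += val;
}

/* Read a counter by summing every CPU's slot. */
static int64_t stat_read_sketch(const struct stat_sketch *s,
                                enum stat_index_sketch idx)
{
    int64_t sum = 0;

    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
        sum += s->cpustat[cpu].count[idx];
    return sum;
}
```

A single slot may go negative (a page charged on one CPU and uncharged on another); only the sum is meaningful.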
    • memory cgroup enhancements: remember "a page is on active list of cgroup or not" · 3564c7c4
      Committed by KAMEZAWA Hiroyuki
      Remember in page_cgroup->flags whether a page_cgroup is on the active_list.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcgroup: fix hang with shmem/tmpfs · 82369553
      Committed by Hugh Dickins
      The memcgroup regime relies upon a cgroup reclaiming pages from itself within
      add_to_page_cache: which may involve some waiting.  Whereas shmem and tmpfs
      rely upon using add_to_page_cache while holding a spinlock: when it cannot
      wait.  The consequence is that when a cgroup reaches its limit, shmem_getpage
      just hangs - unless there is outside memory pressure too, neither kswapd nor
      radix_tree_preload get it out of the retry loop.
      
      In most cases we can mem_cgroup_cache_charge the page waitably first, to
      attach the page_cgroup in advance, so add_to_page_cache will do no more than
      increment a count; then mem_cgroup_uncharge_page after (in both success and
      failure cases) to balance the books again.
      
      And where there used to be a congestion_wait for kswapd (recently made
      redundant by radix_tree_preload), use mem_cgroup_cache_charge with NULL page
      to go through a cycle of allocation and freeing, without accounting to any
      particular page, and without updating the statistics vector.  This brings the
      cgroup below its limit so the next try usually succeeds.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcgroup: tidy up mem_cgroup_charge_common · 3be91277
      Committed by Hugh Dickins
      Tidy up mem_cgroup_charge_common before extending it.  Adjust some comments,
      but mainly clean up its loop: I've an aversion to loops full of continues,
      then a break or a goto at the bottom.  And the is_atomic test should be on the
      __GFP_WAIT bit, not GFP_ATOMIC bits.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
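The point about testing the __GFP_WAIT bit rather than the GFP_ATOMIC bits can be shown with a small stand-alone model. The `SK_`-prefixed flag names and their bit values below are placeholders chosen for this sketch, not definitions taken from the kernel headers; the only property that matters here is that the "atomic" mask is a priority bit and says nothing about whether the caller may sleep.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder flag bits for the sketch (assumed values, not kernel ones). */
#define SK_GFP_WAIT   0x10u /* caller is allowed to sleep */
#define SK_GFP_HIGH   0x20u /* caller may dip into emergency reserves */
#define SK_GFP_IO     0x40u
#define SK_GFP_FS     0x80u

/* Composite masks modeled for illustration. */
#define SK_GFP_ATOMIC (SK_GFP_HIGH)
#define SK_GFP_NOWAIT (0u)
#define SK_GFP_KERNEL (SK_GFP_WAIT | SK_GFP_IO | SK_GFP_FS)

/* The test the commit message asks for: may this allocation sleep? */
static bool may_sleep_sketch(unsigned int gfp_mask)
{
    return (gfp_mask & SK_GFP_WAIT) != 0;
}

/* The test the commit message argues against: look at the ATOMIC bits. */
static bool has_atomic_bits_sketch(unsigned int gfp_mask)
{
    return (gfp_mask & SK_GFP_ATOMIC) != 0;
}
```

In this model a no-wait mask carries none of the "atomic" bits although it must not sleep, and a sleeping mask with the high-priority bit set would be misread as atomic. Checking the wait bit classifies both correctly.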
    • Memory controller use rcu_read_lock() in mem_cgroup_cache_charge() · ac44d354
      Committed by Balbir Singh
      Hugh Dickins noticed that we were using rcu_dereference() without
      rcu_read_lock() in the cache charging routine. The patch below fixes
      this problem.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory cgroup enhancements: remember "a page is charged as page cache" · 217bc319
      Committed by KAMEZAWA Hiroyuki
      Add a flag to page_cgroup to remember "this page is
      charged as cache."
      Cache here includes page cache and swap cache.
      This is useful for implementing precise accounting in memory cgroup.
      TODO:
        distinguish page-cache and swap-cache
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memory cgroup enhancements: force_empty interface for dropping all account in empty cgroup · cc847582
      Committed by KAMEZAWA Hiroyuki
      This patch adds an interface "memory.force_empty".  Any write to this file
      will drop all charges in this cgroup if it contains no tasks.

      % echo 1 > /....../memory.force_empty

      will drop all charges of the memory cgroup if the cgroup's task list is empty.

      This is useful for making rmdir() against a memory cgroup succeed.

      Tested and worked well on x86_64/fake-NUMA system.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcgroup: fix zone isolation OOM · 436c6541
      Committed by Hugh Dickins
      mem_cgroup_charge_common shows a tendency to OOM without good reason, when
      a memhog goes well beyond its rss limit but with plenty of swap available.
      Seen on x86 but not on PowerPC; seen when the next patch omits swapcache
      from memcgroup, but we presume it can happen without.
      
      mem_cgroup_isolate_pages is not quite satisfying reclaim's criteria for OOM
      avoidance.  Already it has to scan beyond the nr_to_scan limit when it
      finds a !LRU page or an active page when handling inactive or an inactive
      page when handling active.  It needs to do exactly the same when it finds a
      page from the wrong zone (the x86 tests had two zones, the PowerPC tests
      had only one).
      
      Don't increment scan and then decrement it in these cases, just move the
      incrementation down.  Fix recent off-by-one when checking against
      nr_to_scan.  Cut out "Check if the meta page went away from under us",
      presumably left over from early debugging: no amount of such checks could
      save us if this list really were being updated without locking.
      
      This change does make the unlimited scan while holding two spinlocks
      even worse - bad for latency and bad for containment; but that's a
      separate issue which is better left to be fixed a little later.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
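The scan-counting rule described in the commit above (ineligible pages must not consume the nr_to_scan budget, so the increment moves below the eligibility checks) can be sketched as a simplified user-space loop. The structure `page_sketch`, its fields, and the function `isolate_pages_sketch` are hypothetical stand-ins for this example; the real function walks a cgroup LRU list under locks, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened view of what the scan inspects per entry. */
struct page_sketch {
    bool on_lru; /* models PageLRU */
    bool active; /* models the active/inactive state */
    int zone;    /* models page_zone(page) */
    bool busy;   /* models an isolation attempt that fails */
};

/*
 * Walk the array and isolate pages from `zone` on the requested list.
 * `scan` is incremented only after a page has passed every eligibility
 * check, so !LRU pages, pages on the other list, and pages from another
 * zone never eat into nr_to_scan.
 */
static size_t isolate_pages_sketch(const struct page_sketch *pages, size_t n,
                                   size_t nr_to_scan, int zone,
                                   bool want_active, size_t *scanned)
{
    size_t scan = 0, taken = 0;

    assert(scanned != NULL);
    for (size_t i = 0; i < n && scan < nr_to_scan; i++) {
        const struct page_sketch *p = &pages[i];

        if (!p->on_lru)
            continue;
        if (p->active != want_active)
            continue;
        if (p->zone != zone)
            continue;

        scan++; /* counted only once the page is known to be eligible */
        if (!p->busy)
            taken++;
    }
    *scanned = scan;
    return taken;
}
```

As the commit message itself notes, the walk over ineligible entries is unbounded in this scheme; the sketch shares that property, bounded only by the array length.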
    • bugfix for memory cgroup controller: avoid !PageLRU page in mem_cgroup_isolate_pages · ff7283fa
      Committed by KAMEZAWA Hiroyuki
      This patch makes mem_cgroup_isolate_pages()

        - ignore !PageLRU pages.
        - fix the bug that isolation makes no progress once a page with
          page_zone(page) != zone is found (just increment scan in this case).

      kswapd and memory migration remove a page from the list when they handle
      it for reclaim/migration.

      Because __isolate_lru_page() doesn't move !PageLRU pages, it is
      safe to avoid touching a !PageLRU() page and its page_cgroup.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bugfix for memory cgroup controller: migration under memory controller fix · ae41be37
      Committed by KAMEZAWA Hiroyuki
      While using the memory control cgroup, page migration under it works as follows.
      ==
       1. uncharge all refs at try to unmap.
       2. charge refs again at remove_migration_ptes()
      ==
      This is simple but has the following problems.
      ==
       The page is uncharged and charged back again if *mapped*.
          - This means that the cgroup before migration can be different from the one after
            migration.
          - If the page is not mapped but charged as page cache, the charge is just ignored
            (because it is not mapped, it will not be uncharged before migration).
            This is a memory leak.
      ==
      This patch tries to keep the memory cgroup across page migration by increasing
      one refcnt during it. 3 functions are added.

       mem_cgroup_prepare_migration() --- increase refcnt of page->page_cgroup
       mem_cgroup_end_migration()     --- decrease refcnt of page->page_cgroup
       mem_cgroup_page_migration() --- copy page->page_cgroup from old page to
                                       new page.

      During migration
        - old page is under PG_locked.
        - new page is under PG_locked, too.
        - both old page and new page are not on LRU.

      These 3 facts guarantee that page_cgroup() migration has no race.

      Tested and worked well in x86_64/fake-NUMA box.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
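The refcount scheme from the commit above can be modeled in a few lines of user-space C. This is a sketch under stated assumptions, not the kernel implementation: the `_sketch` names are invented for the example, plain integers replace atomic counters, and locking is omitted because the model runs in a single thread. It shows why holding one extra reference during migration keeps the charge (and the owning cgroup) attached while it moves from the old page to the new one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for page_cgroup and struct page. */
struct page_cgroup_sketch {
    int ref_cnt;   /* number of charges holding this page_cgroup */
    int cgroup_id; /* which cgroup the page is charged to */
};

struct page_sketch {
    struct page_cgroup_sketch *pc;
};

/* Drop one reference; detach the page_cgroup when the last one goes. */
static void uncharge_sketch(struct page_sketch *page)
{
    if (page->pc && --page->pc->ref_cnt == 0)
        page->pc = NULL;
}

/* Take an extra reference so the charge survives the unmap step. */
static int prepare_migration_sketch(struct page_sketch *page)
{
    if (!page->pc)
        return 0;
    page->pc->ref_cnt++;
    return 1;
}

/* Move the page_cgroup from the old page to the new page. */
static void page_migration_sketch(struct page_sketch *oldpage,
                                  struct page_sketch *newpage)
{
    assert(newpage->pc == NULL);
    newpage->pc = oldpage->pc;
    oldpage->pc = NULL;
}

/* Drop the extra reference taken by prepare_migration_sketch(). */
static void end_migration_sketch(struct page_sketch *page)
{
    uncharge_sketch(page);
}
```

For an unmapped page charged as page cache, the sequence prepare, migrate, end leaves the new page with the same page_cgroup and the same reference count as before, which is the leak-free outcome the commit message describes.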
    • bugfix for memory controller: add helper function for assigning cgroup to page · 9175e031
      Committed by KAMEZAWA Hiroyuki
      This patch adds the following functions.
         - clear_page_cgroup(page, pc)
         - page_cgroup_assign_new_page_group(page, pc)

      Mainly for cleanup.

      The "check page->cgroup again after lock_page_cgroup()" pattern is
      implemented in a straightforward way.

      A comment in mem_cgroup_uncharge() will be removed by the force-empty patch.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcontrol: move oom task exclusion to tasklist scan · 4c4a2214
      Committed by David Rientjes
      Creates a helper function to return non-zero if a task is a member of a
      memory controller:
      
      	int task_in_mem_cgroup(const struct task_struct *task,
      			       const struct mem_cgroup *mem);
      
      When the OOM killer is constrained by the memory controller, the exclusion
      of tasks that are not a member of that controller was previously misplaced
      and appeared in the badness scoring function.  It should be excluded
      during the tasklist scan in select_bad_process() instead.
      
      [akpm@linux-foundation.org: build fix]
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
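The restructuring described in the commit above, filtering by cgroup membership during the task scan instead of inside the scoring function, can be sketched as follows. Everything here is a user-space model: the `_sketch` types, the precomputed `points` field (standing in for whatever the badness function would return), and the array-based task list are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct mem_cgroup_sketch {
    int id;
};

/* Hypothetical task record; `points` stands in for a badness score. */
struct task_sketch {
    const char *comm;
    const struct mem_cgroup_sketch *mem;
    unsigned long points;
};

/* Return non-zero if the task is a member of the given memory cgroup. */
static int task_in_mem_cgroup_sketch(const struct task_sketch *task,
                                     const struct mem_cgroup_sketch *mem)
{
    return task->mem == mem;
}

/*
 * Pick the highest-scoring task. When `mem` is non-NULL the OOM kill is
 * constrained to that cgroup, and non-members are skipped here, during
 * the scan, rather than inside the scoring logic.
 */
static const struct task_sketch *
select_bad_process_sketch(const struct task_sketch *tasks, size_t n,
                          const struct mem_cgroup_sketch *mem)
{
    const struct task_sketch *chosen = NULL;

    for (size_t i = 0; i < n; i++) {
        const struct task_sketch *t = &tasks[i];

        if (mem && !task_in_mem_cgroup_sketch(t, mem))
            continue;
        if (!chosen || t->points > chosen->points)
            chosen = t;
    }
    return chosen;
}
```

Keeping the membership test in the scan leaves the scoring step free of cgroup-specific special cases, which appears to be the motivation stated in the commit message.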
    • memcontrol: move mm_cgroup to header file · 3062fc67
      Committed by David Rientjes
      Inline functions must precede their use, so mm_cgroup() should be defined
      in linux/memcontrol.h.
      
      include/linux/memcontrol.h:48: warning: 'mm_cgroup' declared inline after
      	being called
      include/linux/memcontrol.h:48: warning: previous declaration of
      	'mm_cgroup' was here
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: nuther build fix]
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: make charging gfp mask aware · e1a1cd59
      Committed by Balbir Singh
      Nick Piggin pointed out that swap cache and page cache addition routines
      could be called from non GFP_KERNEL contexts.  This patch makes the
      charging routine aware of the gfp context.  Charging might fail if the
      cgroup is over its limit, in which case a suitable error is returned.
      
      This patch was tested on a Powerpc box.  I am still looking at being able
      to test the path, through which allocations happen in non GFP_KERNEL
      contexts.
      
      [kamezawa.hiroyu@jp.fujitsu.com: problem with ZONE_MOVABLE]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: make page_referenced() cgroup aware · bed7161a
      Committed by Balbir Singh
      Make page_referenced() cgroup aware.  Without this patch, page_referenced()
      can cause a page to be skipped while reclaiming pages.  This patch ensures
      that other cgroups do not hold pages in a particular cgroup hostage.  It
      is required to ensure that shared pages are freed from a cgroup when they
      are not actively referenced from the cgroup that brought them in.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: add switch to control what type of pages to limit · 8697d331
      Committed by Balbir Singh
      Choose whether cached pages are accounted or not.  By default both mapped
      and cached pages are accounted for.  A new set of tunables is added.

      echo -n 1 > mem_control_type

      switches the accounting to account for only mapped pages

      echo -n 3 > mem_control_type

      switches the behaviour back

      [bunk@kernel.org: mm/memcontrol.c: cleanups]
      [akpm@linux-foundation.org: fix sparc32 build]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: OOM handling · c7ba5c9e
      Committed by Pavel Emelianov
      Out of memory handling for cgroups over their limit. A task from the
      cgroup over limit is chosen using the existing OOM logic and killed.
      
      TODO:
      1. As discussed in the OLS BOF session, consider implementing a user
      space policy for OOM handling.
      
      [akpm@linux-foundation.org: fix build due to oom-killer changes]
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller improve user interface · 0eea1030
      Committed by Balbir Singh
      Change the interface to use bytes instead of pages.  Page sizes can vary
      across platforms and configurations.  A new strategy routine has been added
      to the resource counters infrastructure to format the data as desired.

      Suggested by David Rientjes, Andrew Morton and Herbert Poetzl.
      
      Tested on a UML setup with the config for memory control enabled.
      
      [kamezawa.hiroyu@jp.fujitsu.com: possible race fix in res_counter]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
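A byte-based interface still has to be reconciled with page-granular accounting. The commit message does not say how values are rounded, so the following is only one plausible approach, written as an assumption: round a user-supplied byte limit up to a whole number of pages. The 4 KiB page size and the `_sketch` names are placeholders for the example.

```c
#include <assert.h>

/* Assumed page size for the sketch; real page sizes vary by platform. */
#define SK_PAGE_SHIFT 12
#define SK_PAGE_SIZE  (1ULL << SK_PAGE_SHIFT)

/*
 * Round a byte count up to the next multiple of the page size.
 * Overflow for values near the top of the range is not handled here.
 */
static unsigned long long round_up_to_page_sketch(unsigned long long bytes)
{
    return ((bytes + SK_PAGE_SIZE - 1) >> SK_PAGE_SHIFT) << SK_PAGE_SHIFT;
}

/* Convert a page count to bytes for reporting. */
static unsigned long long pages_to_bytes_sketch(unsigned long long pages)
{
    return pages << SK_PAGE_SHIFT;
}
```

Exposing bytes keeps the same written limit meaningful on systems with different page sizes, which is the portability concern the commit message raises.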
    • Memory controller: add per cgroup LRU and reclaim · 66e1707b
      Committed by Balbir Singh
      Add the page_cgroup to the per cgroup LRU.  The reclaim algorithm has
      been modified to make isolate_lru_pages() a pluggable component.  The
      scan_control data structure now accepts the cgroup on behalf of which
      reclaims are carried out.  try_to_free_pages() has been extended to become
      cgroup aware.
      
      [akpm@linux-foundation.org: fix warning]
      [Lee.Schermerhorn@hp.com: initialize all scan_control's isolate_pages member]
      [bunk@kernel.org: make do_try_to_free_pages() static]
      [hugh@veritas.com: memcgroup: fix try_to_free order]
      [kamezawa.hiroyu@jp.fujitsu.com: this unlock_page_cgroup() is unnecessary]
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: task migration · 67e465a7
      Committed by Balbir Singh
      Allow tasks to migrate from one cgroup to the other.  We migrate
      mm_struct's mem_cgroup only when the thread group id migrates.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: memory accounting · 8a9f3ccd
      Committed by Balbir Singh
      Add the accounting hooks.  The accounting is carried out for RSS and Page
      Cache (unmapped) pages.  There is now a common limit and accounting for both.
      The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
      time.  Page cache is accounted at add_to_page_cache(),
      __delete_from_page_cache().  Swap cache is also accounted for.
      
      Each page's page_cgroup is protected with the last bit of the
      page_cgroup pointer, this makes handling of race conditions involving
      simultaneous mappings of a page easier.  A reference count is kept in the
      page_cgroup to deal with cases where a page might be unmapped from the RSS
      of all tasks, but still lives in the page cache.
      
      Credits go to Vaidyanathan Srinivasan for helping with reference counting work
      of the page cgroup.  Almost all of the page cache accounting code has help
      from Vaidyanathan Srinivasan.
      
      [hugh@veritas.com: fix swapoff breakage]
      [akpm@linux-foundation.org: fix locking]
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <Valdis.Kletnieks@vt.edu>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
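The commit above protects each page's page_cgroup with the last bit of the page_cgroup pointer. A minimal user-space sketch of that pointer-tagging idea follows, using C11 atomics. The `_sketch` names are invented for the example, the lock is modeled as a try-lock rather than a spinning lock, and the code assumes the pointed-to structure is at least 2-byte aligned so that bit 0 of its address is always free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PC_LOCK_BIT_SKETCH ((uintptr_t)1) /* lowest pointer bit is the lock */

struct page_cgroup_sketch {
    int ref_cnt;
};

/* Hypothetical page: one word holds both the pointer and the lock bit. */
struct page_sketch {
    atomic_uintptr_t page_cgroup;
};

/* Try to take the lock; returns true if the bit was clear before. */
static bool try_lock_page_cgroup_sketch(struct page_sketch *page)
{
    uintptr_t old = atomic_fetch_or(&page->page_cgroup, PC_LOCK_BIT_SKETCH);

    return (old & PC_LOCK_BIT_SKETCH) == 0;
}

/* Release the lock by clearing the bit, leaving the pointer intact. */
static void unlock_page_cgroup_sketch(struct page_sketch *page)
{
    atomic_fetch_and(&page->page_cgroup, ~PC_LOCK_BIT_SKETCH);
}

/* Read the pointer with the lock bit masked off. */
static struct page_cgroup_sketch *
page_get_page_cgroup_sketch(struct page_sketch *page)
{
    uintptr_t v = atomic_load(&page->page_cgroup);

    return (struct page_cgroup_sketch *)(v & ~PC_LOCK_BIT_SKETCH);
}

/* Assign a new pointer; the caller must already hold the lock. */
static void page_assign_page_cgroup_sketch(struct page_sketch *page,
                                           struct page_cgroup_sketch *pc)
{
    assert(((uintptr_t)pc & PC_LOCK_BIT_SKETCH) == 0);
    atomic_store(&page->page_cgroup, (uintptr_t)pc | PC_LOCK_BIT_SKETCH);
}
```

Packing the lock into the pointer word means the lock costs no extra space per page; the trade-off is that every read of the pointer has to mask the bit off.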
    • Memory controller: accounting setup · 78fb7466
      Committed by Pavel Emelianov
      Basic setup routines: the mm_struct has a pointer to the cgroup that
      it belongs to, and the page has a page_cgroup associated with it.
      Signed-off-by: Pavel Emelianov <xemul@openvz.org>
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Memory controller: cgroups setup · 8cdea7c0
      Committed by Balbir Singh
      Set up the memory cgroup and add basic hooks and controls to integrate
      and work with the cgroup.
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>