1. 30 May 2012 (20 commits)
  2. 20 May 2012 (1 commit)
    • memcg,thp: fix res_counter:96 regression · 62ade86a
      Committed by Hugh Dickins
      Occasionally, testing memcg's move_charge_at_immigrate on rc7 shows
      a flurry of hundreds of warnings at kernel/res_counter.c:96, where
      res_counter_uncharge_locked() does WARN_ON(counter->usage < val).
      
      The first trace of each flurry implicates __mem_cgroup_cancel_charge()
      of mc.precharge, and an audit of mc.precharge handling points to
      mem_cgroup_move_charge_pte_range()'s THP handling in commit 12724850
      ("memcg: avoid THP split in task migration").
      
      Checking !mc.precharge is good everywhere else, when a single page is
      to be charged; but here the "mc.precharge -= HPAGE_PMD_NR" that is
      likely to follow is liable to result in underflow (a lot can change
      since the precharge was estimated).
      
      Simply check against HPAGE_PMD_NR: there is probably a better
      alternative - try to precharge for more and split if unsuccessful -
      but this one-liner is safer for now; no kernel/res_counter.c:96
      warnings have been seen in 26 hours.
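
      A minimal user-space sketch of that one-liner's effect (HPAGE_PMD_NR is
      the real constant; the rest is an illustrative model, not the kernel
      code): requiring a full THP's worth of precharge, rather than merely a
      non-zero value, is what keeps the unsigned counter from underflowing.

          #include <stdio.h>

          #define HPAGE_PMD_NR 512        /* pages per 2MB THP with 4KB pages */

          static unsigned long precharge; /* stands in for mc.precharge */

          static int move_huge_page(void)
          {
              /* Old check: precharge != 0 was enough to proceed, so the
               * subtraction below could wrap past zero.  New check mirrors
               * the fix: require a full THP's worth before moving. */
              if (precharge < HPAGE_PMD_NR)
                  return -1;              /* skip the THP, no underflow */
              precharge -= HPAGE_PMD_NR;
              return 0;
          }

          int main(void)
          {
              precharge = 100;            /* a lot changed since the estimate */
              if (move_huge_page())
                  printf("THP move skipped, precharge=%lu\n", precharge);
              return 0;
          }
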
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 11 May 2012 (1 commit)
  4. 26 April 2012 (1 commit)
  5. 19 April 2012 (1 commit)
    • memcg: fix Bad page state after replace_page_cache · 9b7f43af
      Committed by Hugh Dickins
      My 9ce70c02 "memcg: fix deadlock by inverting lrucare nesting" put a
      nasty little bug into v3.3's version of mem_cgroup_replace_page_cache(),
      sometimes used for FUSE.  When replacing __mem_cgroup_commit_charge_lrucare()
      by __mem_cgroup_commit_charge(), I used the "pc" pointer set up earlier:
      but it is for oldpage, and now needs to be for newpage.  Once oldpage was
      freed, its PageCgroupUsed bit (cleared above but set again here) caused
      "Bad page state" messages - and, perhaps worse, the bit was missing from
      newpage.  (I didn't find this by using FUSE, but by reusing the function
      for tmpfs.)
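
      A user-space model of the pointer mix-up (the struct and helper names
      only mimic the kernel's; this is an illustrative sketch, not the actual
      patch): the per-page state has to be looked up for the page actually
      being committed, newpage, instead of reusing the pointer that was set
      up for oldpage.

          #include <stdbool.h>
          #include <stdio.h>

          struct page { int id; };
          struct page_cgroup { bool used; };   /* models PageCgroupUsed */

          static struct page oldpage = { 0 }, newpage = { 1 };
          static struct page_cgroup pc_table[2];

          static struct page_cgroup *lookup_pc(struct page *page)
          {
              return &pc_table[page->id];
          }

          static void commit_charge(struct page *page)
          {
              /* The essence of the fix: derive "pc" from the page being
               * charged rather than reusing the earlier oldpage pointer. */
              struct page_cgroup *pc = lookup_pc(page);
              pc->used = true;
          }

          int main(void)
          {
              lookup_pc(&oldpage)->used = false;  /* oldpage uncharged, freed */
              commit_charge(&newpage);            /* must mark newpage only */
              printf("old used=%d, new used=%d\n",
                     pc_table[0].used, pc_table[1].used);
              return 0;
          }
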
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org [v3.3 only]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 13 April 2012 (2 commits)
  7. 11 April 2012 (2 commits)
  8. 02 April 2012 (3 commits)
    • cgroup: make css->refcnt clearing on cgroup removal optional · 48ddbe19
      Committed by Tejun Heo
      Currently, cgroup removal tries to drain all css references.  If there
      are active css references, the removal logic waits and retries
      ->pre_destroy() until either all refs drop to zero or removal is
      cancelled.
      
      These semantics are unusual, add non-trivial complexity to the cgroup
      core, and IMHO are fundamentally misguided in that they couple
      internal implementation details (references to an internal data
      structure) with an externally visible operation (rmdir).  To userland,
      this is an unnecessary and hard-to-anticipate behavior peculiarity
      (css refs are otherwise invisible from userland), and, to policy
      implementations, it is an unnecessary restriction (e.g. blkcg wants to
      hold css refs for caching purposes but can't, as that becomes visible
      as an rmdir hang).
      
      Unfortunately, memcg currently depends on ->pre_destroy() retrials and
      cgroup removal vetoing and can't be immediately switched to the new
      behavior.  This patch introduces the new behavior of not waiting for
      css refs to drain and maintains the old behavior for subsystems which
      have __DEPRECATED_clear_css_refs set.
      
      Once memcg is updated, we can drop the code paths for the old
      behavior, as proposed in the following patch.  Note that the following
      patch is incorrect in that its dput work item lives in the cgroup and
      may lose some dputs when multiple css's are released back-to-back, and
      that __css_put() triggers check_for_release() when the refcnt reaches
      0 instead of 1; however, it shows which parts can be removed.
      
        http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251
      
      Note that, in the not-too-distant future, cgroup core will start
      emitting warning messages for subsystems which require the old
      behavior, so please get moving.
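
      A user-space model of the two removal behaviors described above (the
      struct, field names and return values are invented for illustration;
      only the __DEPRECATED_clear_css_refs flag name comes from the patch):

          #include <errno.h>
          #include <stdbool.h>
          #include <stdio.h>

          struct css { int refcnt; bool deprecated_clear_refs; };

          static int try_remove(struct css *css)
          {
              if (!css->deprecated_clear_refs)
                  return 0;         /* new behavior: don't wait on css refs */
              if (css->refcnt > 0)
                  return -EBUSY;    /* old behavior: veto, retry pre_destroy */
              return 0;
          }

          int main(void)
          {
              struct css blkcg = { .refcnt = 3, .deprecated_clear_refs = false };
              struct css memcg = { .refcnt = 3, .deprecated_clear_refs = true };

              printf("blkcg rmdir -> %d (cached refs no longer block it)\n",
                     try_remove(&blkcg));
              printf("memcg rmdir -> %d (still vetoed until refs drain)\n",
                     try_remove(&memcg));
              return 0;
          }
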
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    • cgroup: convert memcg controller to the new cftype interface · 6bc10349
      Committed by Tejun Heo
      Convert memcg to use the new cftype-based interface.  kmem support
      abuses ->populate() for mem_cgroup_sockets_init(), so ->populate()
      can't be removed at the moment.
      
      tcp_memcontrol is updated so that tcp_files[] is registered via an
      __initcall.  This change also allows the forward declaration of
      tcp_files[] to be removed, which is done here.
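
      A user-space model of the registration pattern (the table contents and
      the helper are illustrative stand-ins, not the kernel's cftype API):
      the control files are described once in a static, NULL-terminated
      table and registered from an init call, instead of being created from
      a per-hierarchy ->populate() callback.

          #include <stdio.h>

          struct cftype { const char *name; };

          /* Hypothetical table modelled on tcp_files[]. */
          static struct cftype tcp_files_model[] = {
              { .name = "kmem.tcp.limit_in_bytes" },
              { .name = "kmem.tcp.usage_in_bytes" },
              { NULL },                       /* NULL-terminated */
          };

          static int register_cftypes(struct cftype *cfts)
          {
              for (; cfts->name; cfts++)      /* done once, at init time */
                  printf("registered %s\n", cfts->name);
              return 0;
          }

          int main(void)                      /* stands in for an __initcall */
          {
              return register_cftypes(tcp_files_model);
          }
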
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Greg Thelen <gthelen@google.com>
    • memcg: always create memsw files if CONFIG_CGROUP_MEM_RES_CTLR_SWAP · af36f906
      Committed by Tejun Heo
      Instead of conditioning creation of memsw files on do_swap_account,
      always create the files if compiled-in and fail read/write attempts
      with -EOPNOTSUPP if !do_swap_account.
      
      This was suggested by KAMEZAWA to simplify memcg file creation so that
      it can use cgroup->subsys_cftypes.
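
      A user-space model of that behavior (the names are illustrative; only
      do_swap_account and -EOPNOTSUPP come from the patch): the file always
      exists, but access is rejected while swap accounting is disabled.

          #include <errno.h>
          #include <stdbool.h>
          #include <stdio.h>

          static bool do_swap_account;   /* set by a boot parameter in memcg */

          static long memsw_usage_read(void)
          {
              if (!do_swap_account)
                  return -EOPNOTSUPP;    /* file exists, but reads fail */
              return 4096;               /* some usage value */
          }

          int main(void)
          {
              printf("read -> %ld (-EOPNOTSUPP is %d)\n",
                     memsw_usage_read(), -EOPNOTSUPP);
              do_swap_account = true;
              printf("read -> %ld\n", memsw_usage_read());
              return 0;
          }
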
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
  9. 29 March 2012 (1 commit)
  10. 22 March 2012 (8 commits)
    • memcg: avoid THP split in task migration · 12724850
      Committed by Naoya Horiguchi
      Currently we can't do task migration among memory cgroups without
      splitting THPs, which means processes heavily using THP experience
      large overhead during task migration.  This patch introduces code for
      moving the charge of a THP as a whole and makes THP more valuable.
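
      A heavily simplified user-space model of the idea (everything here is
      illustrative except HPAGE_PMD_NR): when the range is backed by a THP,
      move the whole HPAGE_PMD_NR-page charge in one step instead of
      splitting the huge page first.

          #include <stdbool.h>
          #include <stdio.h>

          #define HPAGE_PMD_NR 512

          struct memcg { long pages; };

          static void move_charge(struct memcg *from, struct memcg *to,
                                  bool is_thp)
          {
              long nr = is_thp ? HPAGE_PMD_NR : 1;  /* no split required */

              from->pages -= nr;
              to->pages += nr;
          }

          int main(void)
          {
              struct memcg src = { .pages = 1024 }, dst = { .pages = 0 };

              move_charge(&src, &dst, true);
              printf("src=%ld dst=%ld\n", src.pages, dst.pages);
              return 0;
          }
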
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: clean up existing move charge code · 8d32ff84
      Committed by Naoya Horiguchi
      - Replace the lengthy function name is_target_pte_for_mc() with a
        shorter one in order to avoid ugly line breaks.

      - Explicitly use MC_TARGET_* values instead of bare integers
        (illustrated below).
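
      A small illustration of the second point (assuming the
      MC_TARGET_NONE/PAGE/SWAP values used in mm/memcontrol.c; the
      classify() helper is invented for the example): named values make the
      callers self-documenting compared with bare 0/1/2 return codes.

          #include <stdio.h>

          enum mc_target_type {
              MC_TARGET_NONE = 0,
              MC_TARGET_PAGE,
              MC_TARGET_SWAP,
          };

          static enum mc_target_type classify(int has_page, int has_swap)
          {
              if (has_page)
                  return MC_TARGET_PAGE;
              if (has_swap)
                  return MC_TARGET_SWAP;
              return MC_TARGET_NONE;
          }

          int main(void)
          {
              switch (classify(1, 0)) {
              case MC_TARGET_PAGE:
                  printf("move the page charge\n");
                  break;
              case MC_TARGET_SWAP:
                  printf("move the swap charge\n");
                  break;
              case MC_TARGET_NONE:
                  printf("nothing to move\n");
                  break;
              }
              return 0;
          }
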
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event() · 45f3e385
      Committed by Anton Vorontsov
      In the following code:
      
      	if (type == _MEM)
      		thresholds = &memcg->thresholds;
      	else if (type == _MEMSWAP)
      		thresholds = &memcg->memsw_thresholds;
      	else
      		BUG();
      
      	BUG_ON(!thresholds);
      
      The BUG_ON() seems redundant.
      Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol.c: s/stealed/stolen/ · 13fd1dd9
      Committed by Andrew Morton
      A grammatical fix.
      
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix performance of mem_cgroup_begin_update_page_stat() · 4331f7d3
      Committed by KAMEZAWA Hiroyuki
      mem_cgroup_begin_update_page_stat() should be very fast because it's
      called very frequently.  Currently, it needs to look up the
      page_cgroup and its memcg, which is slow.
      
      This patch adds a global variable that records whether any memcg is
      moving charges.  With this, the caller usually doesn't need to visit
      the page_cgroup and memcg at all.
      
      Here is a test result.  A test program maps a file MAP_SHARED, takes
      page faults so that each page's page_mapcount(page) > 1, frees the
      range with madvise(), and faults it in again.  This causes 26214400
      page faults onto a 1G file and shows the cost of
      mem_cgroup_begin_update_page_stat().
      
      Before this patch for mem_cgroup_begin_update_page_stat()
      
          [kamezawa@bluextal test]$ time ./mmap 1G
      
          real    0m21.765s
          user    0m5.999s
          sys     0m15.434s
      
          27.46%     mmap  mmap               [.] reader
          21.15%     mmap  [kernel.kallsyms]  [k] page_fault
           9.17%     mmap  [kernel.kallsyms]  [k] filemap_fault
           2.96%     mmap  [kernel.kallsyms]  [k] __do_fault
           2.83%     mmap  [kernel.kallsyms]  [k] __mem_cgroup_begin_update_page_stat
      
      After this patch
      
          [root@bluextal test]# time ./mmap 1G
      
          real    0m21.373s
          user    0m6.113s
          sys     0m15.016s
      
      In the usual path, the calls to __mem_cgroup_begin_update_page_stat()
      go away.
      
      Note: we may be able to remove this optimization in the future if
            we can get a pointer to the memcg directly from struct page.
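
      A user-space model of the fast path (names are illustrative; in the
      kernel the check lives in an inline wrapper): a global "is any memcg
      moving" counter lets the common case skip the page_cgroup and memcg
      lookup entirely.

          #include <stdatomic.h>
          #include <stdio.h>

          static atomic_int memcg_moving;   /* bumped while a move runs */
          static long slow_path_calls;

          static void slow_begin_update_page_stat(void)
          {
              /* stands in for __mem_cgroup_begin_update_page_stat():
               * look up page_cgroup, its memcg, maybe take the move lock */
              slow_path_calls++;
          }

          static void begin_update_page_stat(void)
          {
              if (atomic_load(&memcg_moving))   /* usually 0, so skipped */
                  slow_begin_update_page_stat();
          }

          int main(void)
          {
              for (int i = 0; i < 1000; i++)
                  begin_update_page_stat();
              atomic_fetch_add(&memcg_moving, 1);   /* a move starts */
              begin_update_page_stat();
              printf("slow path taken %ld of 1001 times\n", slow_path_calls);
              return 0;
          }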
      
      [akpm@linux-foundation.org: don't return a void]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Greg Thelen <gthelen@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: remove PCG_FILE_MAPPED · 2ff76f11
      Committed by KAMEZAWA Hiroyuki
      With the new lock scheme for updating memcg's page stats, we no longer
      need the PCG_FILE_MAPPED flag, which duplicated the information in
      page_mapped().
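
      A loose user-space model of the idea (all names are illustrative):
      with stat updates now serialized against "move", the file-mapped count
      can be driven by the page's own mapcount, i.e. page_mapped(), with no
      duplicated per-page_cgroup bit to keep in sync.

          #include <stdbool.h>
          #include <stdio.h>

          struct page { int mapcount; };

          static long nr_file_mapped;

          static bool page_mapped(struct page *page)
          {
              return page->mapcount > 0;
          }

          static void map_page(struct page *page)
          {
              if (page->mapcount++ == 0)   /* first mapping of this page */
                  nr_file_mapped++;        /* no PCG flag to test-and-set */
          }

          static void unmap_page(struct page *page)
          {
              page->mapcount--;
              if (!page_mapped(page))      /* decision from the page itself */
                  nr_file_mapped--;
          }

          int main(void)
          {
              struct page p = { 0 };

              map_page(&p);
              unmap_page(&p);
              printf("FILE_MAPPED=%ld\n", nr_file_mapped);
              return 0;
          }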
      
      [hughd@google.com: cosmetic fix]
      [hughd@google.com: add comment to MEM_CGROUP_CHARGE_TYPE_MAPPED case in __mem_cgroup_uncharge_common()]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Greg Thelen <gthelen@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: use new logic for page stat accounting · 89c06bd5
      Committed by KAMEZAWA Hiroyuki
      Currently, per-memcg page stats are recorded in per-page_cgroup flags
      by duplicating the page's status into the flag.  The reason is that
      memcg has a feature to move a page from one group to another, and
      there is a race between "move" and "page stat accounting".

      Under the current logic, assume CPU-A and CPU-B.  CPU-A does "move"
      and CPU-B does "page stat accounting".
      
      When CPU-A goes 1st,
      
                  CPU-A                           CPU-B
                                          update "struct page" info.
          move_lock_mem_cgroup(memcg)
          see pc->flags
          copy page stat to new group
          overwrite pc->mem_cgroup.
          move_unlock_mem_cgroup(memcg)
                                          move_lock_mem_cgroup(mem)
                                          set pc->flags
                                          update page stat accounting
                                          move_unlock_mem_cgroup(mem)
      
      Here, stat accounting is guarded by move_lock_mem_cgroup(), and the
      "move" logic (CPU-A) doesn't see the changes to the "struct page"
      information.

      But it's costly to keep the same information in both 'struct page' and
      'struct page_cgroup'.  And there is a potential problem.
      
      For example, assume we have PG_dirty accounting in memcg.
      PG_* is a flag in struct page.
      PCG_* is a flag in struct page_cgroup.
      (This is just an example; the same problem can be found in any
       kind of page stat accounting.)
      
      	  CPU-A                               CPU-B
            TestSet PG_dirty
            (delay)                        TestClear PG_dirty
                                           if (TestClear(PCG_dirty))
                                                memcg->nr_dirty--
            if (TestSet(PCG_dirty))
                memcg->nr_dirty++
      
      Here memcg->nr_dirty ends up at +1, which is wrong.  This race was
      reported by Greg Thelen <gthelen@google.com>.  Currently only
      FILE_MAPPED is supported and, fortunately, it's serialized by the page
      table lock, so this is not a real bug, _now_.

      Since this potential problem comes from having duplicated information
      in struct page and struct page_cgroup, we might be able to fix it by
      using the original 'struct page' information.  But then we'd have a
      problem in "move account":
      
      Assume we use only PG_dirty.
      
               CPU-A                   CPU-B
          TestSet PG_dirty
          (delay)                    move_lock_mem_cgroup()
                                     if (PageDirty(page))
                                            new_memcg->nr_dirty++
                                     pc->mem_cgroup = new_memcg;
                                     move_unlock_mem_cgroup()
          move_lock_mem_cgroup()
          memcg = pc->mem_cgroup
          new_memcg->nr_dirty++
      
      The accounting information may be double-counted.  This was the
      original reason to have the PCG_xxx flags, but it seems PCG_xxx has
      another problem.

      I think we need a bigger lock, as in:
      
           move_lock_mem_cgroup(page)
           TestSetPageDirty(page)
           update page stats (without any checks)
           move_unlock_mem_cgroup(page)
      
      This fixes both problems, and we don't have to duplicate the page flag
      into page_cgroup.  Please note: move_lock_mem_cgroup() is held only
      when there is a possibility of "account move" in the system, so in
      most paths the status update will go without atomic locks.
      
      This patch introduces mem_cgroup_begin_update_page_stat() and
      mem_cgroup_end_update_page_stat(); both should be called when
      modifying 'struct page' information that memcg takes care of, as:
      
           mem_cgroup_begin_update_page_stat()
           modify page information
           mem_cgroup_update_page_stat()
           => never check any 'struct page' info, just update counters.
           mem_cgroup_end_update_page_stat().
      
      This patch is slow because we need to call begin_update_page_stat()/
      end_update_page_stat() regardless of whether the accounted state will
      change or not.  A following patch adds an easy optimization and
      reduces the cost.
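
      A user-space sketch of the begin/modify/update/end pattern above (a
      plain mutex stands in for move_lock_mem_cgroup(), which the patch
      takes only while an account move is actually possible):

          #include <pthread.h>
          #include <stdbool.h>
          #include <stdio.h>

          static pthread_mutex_t move_lock = PTHREAD_MUTEX_INITIALIZER;
          static bool page_dirty;       /* stands in for a 'struct page' flag */
          static long memcg_nr_dirty;   /* stands in for memcg->nr_dirty */

          static void set_page_dirty_accounted(void)
          {
              pthread_mutex_lock(&move_lock);     /* begin_update_page_stat() */
              if (!page_dirty) {                  /* modify 'struct page' info */
                  page_dirty = true;
                  memcg_nr_dirty++;               /* update counters, no re-check */
              }
              pthread_mutex_unlock(&move_lock);   /* end_update_page_stat() */
          }

          int main(void)
          {
              set_page_dirty_accounted();
              set_page_dirty_accounted();         /* second call is a no-op */
              printf("nr_dirty=%ld\n", memcg_nr_dirty);
              return 0;
          }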
      
      [akpm@linux-foundation.org: s/lock/locked/]
      [hughd@google.com: fix deadlock by avoiding stat lock when anon]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Greg Thelen <gthelen@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Ying Han <yinghan@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>