1. 16 12月, 2009 2 次提交
  2. 04 12月, 2009 1 次提交
  3. 09 11月, 2009 1 次提交
  4. 02 10月, 2009 3 次提交
    • K
      memcg: reduce check for softlimit excess · ef8745c1
      KAMEZAWA Hiroyuki 提交于
      In charge/uncharge/reclaim path, usage_in_excess is calculated repeatedly
      and it takes res_counter's spin_lock every time.
      
      This patch removes unnecessary calls for res_count_soft_limit_excess.
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef8745c1
    • K
      memcg: some modification to softlimit under hierarchical memory reclaim. · 4e649152
      KAMEZAWA Hiroyuki 提交于
      This patch clean up/fixes for memcg's uncharge soft limit path.
      
      Problems:
        Now, res_counter_charge()/uncharge() handles softlimit information at
        charge/uncharge and softlimit-check is done when event counter per memcg
        goes over limit. Now, event counter per memcg is updated only when
        memory usage is over soft limit. Here, considering hierarchical memcg
        management, ancesotors should be taken care of.
      
        Now, ancerstors(hierarchy) are handled in charge() but not in uncharge().
        This is not good.
      
        Prolems:
        1. memcg's event counter incremented only when softlimit hits. That's bad.
           It makes event counter hard to be reused for other purpose.
      
        2. At uncharge, only the lowest level rescounter is handled. This is bug.
           Because ancesotor's event counter is not incremented, children should
           take care of them.
      
        3. res_counter_uncharge()'s 3rd argument is NULL in most case.
           ops under res_counter->lock should be small. No "if" sentense is better.
      
      Fixes:
        * Removed soft_limit_xx poitner and checks in charge and uncharge.
          Do-check-only-when-necessary scheme works enough well without them.
      
        * make event-counter of memcg incremented at every charge/uncharge.
          (per-cpu area will be accessed soon anyway)
      
        * All ancestors are checked at soft-limit-check. This is necessary because
          ancesotor's event counter may never be modified. Then, they should be
          checked at the same time.
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e649152
    • K
      memcg: fix refcnt going negative · 26251eaf
      KAMEZAWA Hiroyuki 提交于
      __mem_cgroup_largest_soft_limit_node() returns a mem_cgroup_per_zone "mz"
      with incremnted mz->mem->css's refcnt.  Then, the caller of this function
      has to call css_put(mz->mem->css).
      
      But, mz can be !NULL even if "not found" i.e.  without css_get().  By
      this, css->refcnt will go down to minus.
      
      This may cause various things...one of results will be
      initite-loop in css_tryget()  as this.
      
      INFO: RCU detected CPU 0 stall (t=10000 jiffies)
      sending NMI to all CPUs:
      NMI backtrace for cpu 0
      CPU 0:
      <snip>
      
       <<EOE>>  <IRQ>  [<ffffffff810884bd>] trace_hardirqs_off+0xd/0x10
        [<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0
        [<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70
        [<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0
        [<ffffffff810bff8e>] __rcu_pending+0x7e/0x370
        [<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130
        [<ffffffff81063a26>] update_process_times+0x46/0x70
        [<ffffffff81085930>] tick_sched_timer+0x60/0x160
        [<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160
        [<ffffffff8107a03a>] __run_hrtimer+0xba/0x150
        [<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0
        [<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c
        [<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b
        [<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20
        <EOI>  [<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180
        [<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180
        [<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180
        [<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110
        [<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330
        [<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60
      
      Above shows CPU0 caught in css_tryget()'s inifinite loop because
      of bad refcnt.
      
      This is a fix to set mz=NULL at the top of retry path.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NPaul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26251eaf
  5. 24 9月, 2009 9 次提交
    • D
      memcg: show swap usage in stat file · 1dd3a273
      Daisuke Nishimura 提交于
      We now count MEM_CGROUP_STAT_SWAPOUT, so we can show swap usage.  It would
      be useful for users to show swap usage in memory.stat file, because they
      don't need calculate memsw.usage - res.usage to know swap usage.
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1dd3a273
    • B
      memcg: improve resource counter scalability · 0c3e73e8
      Balbir Singh 提交于
      Reduce the resource counter overhead (mostly spinlock) associated with the
      root cgroup.  This is a part of the several patches to reduce mem cgroup
      overhead.  I had posted other approaches earlier (including using percpu
      counters).  Those patches will be a natural addition and will be added
      iteratively on top of these.
      
      The patch stops resource counter accounting for the root cgroup.  The data
      for display is derived from the statisitcs we maintain via
      mem_cgroup_charge_statistics (which is more scalable).  What happens today
      is that, we do double accounting, once using res_counter_charge() and once
      using memory_cgroup_charge_statistics().  For the root, since we don't
      implement limits any more, we don't need to track every charge via
      res_counter_charge() and check for limit being exceeded and reclaim.
      
      The main mem->res usage_in_bytes can be derived by summing the cache and
      rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
      MEM_CGROUP_STAT_CACHE).  However, for memsw->res usage_in_bytes, we need
      additional data about swapped out memory.  This patch adds a
      MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
      MEM_CGROUP_STAT_CACHE to derive the memsw data.  This data is computed
      recursively when hierarchy is enabled.
      
      The tests results I see on a 24 way show that
      
      1. The lock contention disappears from /proc/lock_stats
      2. The results of the test are comparable to running with
         cgroup_disable=memory.
      
      Here is a sample of my program runs
      
      Without Patch
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7192804.124144  task-clock-msecs         #     23.937 CPUs
               424691  context-switches         #      0.000 M/sec
                  267  CPU-migrations           #      0.000 M/sec
             28498113  page-faults              #      0.004 M/sec
        5826093739340  cycles                   #    809.989 M/sec
         408883496292  instructions             #      0.070 IPC
           7057079452  cache-references         #      0.981 M/sec
           3036086243  cache-misses             #      0.422 M/sec
      
        300.485365680  seconds time elapsed
      
      With cgroup_disable=memory
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7182183.546587  task-clock-msecs         #     23.915 CPUs
               425458  context-switches         #      0.000 M/sec
                  203  CPU-migrations           #      0.000 M/sec
             92545093  page-faults              #      0.013 M/sec
        6034363609986  cycles                   #    840.185 M/sec
         437204346785  instructions             #      0.072 IPC
           6636073192  cache-references         #      0.924 M/sec
           2358117732  cache-misses             #      0.328 M/sec
      
        300.320905827  seconds time elapsed
      
      With this patch applied
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7191619.223977  task-clock-msecs         #     23.955 CPUs
               422579  context-switches         #      0.000 M/sec
                   88  CPU-migrations           #      0.000 M/sec
             91946060  page-faults              #      0.013 M/sec
        5957054385619  cycles                   #    828.333 M/sec
        1058117350365  instructions             #      0.178 IPC
           9161776218  cache-references         #      1.274 M/sec
           1920494280  cache-misses             #      0.267 M/sec
      
        300.218764862  seconds time elapsed
      
      Data from Prarit (kernel compile with make -j64 on a 64
      CPU/32G machine)
      
      For a single run
      
      Without patch
      
      real 27m8.988s
      user 87m24.916s
      sys 382m6.037s
      
      With patch
      
      real    4m18.607s
      user    84m58.943s
      sys     50m52.682s
      
      With config turned off
      
      real    4m54.972s
      user    90m13.456s
      sys     50m19.711s
      
      NOTE: The data looks counterintuitive due to the increased performance
      with the patch, even over the config being turned off. We probably need
      more runs, but so far all testing has shown that the patches definitely
      help.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0c3e73e8
    • B
      memory controller: soft limit reclaim on contention · 4e416953
      Balbir Singh 提交于
      Implement reclaim from groups over their soft limit
      
      Permit reclaim from memory cgroups on contention (via the direct reclaim
      path).
      
      memory cgroup soft limit reclaim finds the group that exceeds its soft
      limit by the largest number of pages and reclaims pages from it and then
      reinserts the cgroup into its correct place in the rbtree.
      
      Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
      loops in case all swap is turned off.  The code has been refactored and
      the loop check (loop < 2) has been enhanced for soft limits.  For soft
      limits, we try to do more targetted reclaim.  Instead of bailing out after
      two loops, the routine now reclaims memory proportional to the size by
      which the soft limit is exceeded.  The proportion has been empirically
      determined.
      
      [akpm@linux-foundation.org: build fix]
      [kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
      [nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e416953
    • B
      memory controller: soft limit refactor reclaim flags · 75822b44
      Balbir Singh 提交于
      Refactor mem_cgroup_hierarchical_reclaim()
      
      Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into
      flags, so that new parameters don't have to be passed as we make the
      reclaim routine more flexible
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75822b44
    • B
      memory controller: soft limit organize cgroups · f64c3f54
      Balbir Singh 提交于
      Organize cgroups over soft limit in a RB-Tree
      
      Introduce an RB-Tree for storing memory cgroups that are over their soft
      limit.  The overall goal is to
      
      1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
         We are careful about updates, updates take place only after a particular
         time interval has passed
      2. We remove the node from the RB-Tree when the usage goes below the soft
         limit
      
      The next set of patches will exploit the RB-Tree to get the group that is
      over its soft limit by the largest amount and reclaim from it, when we
      face memory contention.
      
      [hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot]
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f64c3f54
    • B
      memory controller: soft limit interface · 296c81d8
      Balbir Singh 提交于
      Add an interface to allow get/set of soft limits.  Soft limits for memory
      plus swap controller (memsw) is currently not supported.  Resource
      counters have been enhanced to support soft limits and new type
      RES_SOFT_LIMIT has been added.  Unlike hard limits, soft limits can be
      directly set and do not need any reclaim or checks before setting them to
      a newer value.
      
      Kamezawa-San raised a question as to whether soft limit should belong to
      res_counter.  Since all resources understand the basic concepts of hard
      and soft limits, it is justified to add soft limits here.  Soft limits are
      a generic resource usage feature, even file system quotas support soft
      limits.
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      296c81d8
    • K
      memcg: add comments explaining memory barriers · 261fb61a
      KAMEZAWA Hiroyuki 提交于
      Add comments for the reason of smp_wmb() in mem_cgroup_commit_charge().
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      261fb61a
    • B
      memcg: remove the overhead associated with the root cgroup · 4b3bde4c
      Balbir Singh 提交于
      Change the memory cgroup to remove the overhead associated with accounting
      all pages in the root cgroup.  As a side-effect, we can no longer set a
      memory hard limit in the root cgroup.
      
      A new flag to track whether the page has been accounted or not has been
      added as well.  Flags are now set atomically for page_cgroup,
      pcg_default_flags is now obsolete and removed.
      
      [akpm@linux-foundation.org: fix a few documentation glitches]
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b3bde4c
    • B
      cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time · be367d09
      Ben Blum 提交于
      Alter the ss->can_attach and ss->attach functions to be able to deal with
      a whole threadgroup at a time, for use in cgroup_attach_proc.  (This is a
      pre-patch to cgroup-procs-writable.patch.)
      
      Currently, new mode of the attach function can only tell the subsystem
      about the old cgroup of the threadgroup leader.  No subsystem currently
      needs that information for each thread that's being moved, but if one were
      to be added (for example, one that counts tasks within a group) this bit
      would need to be reworked a bit to tell the subsystem the right
      information.
      
      [hidave.darkstar@gmail.com: fix build]
      Signed-off-by: NBen Blum <bblum@google.com>
      Signed-off-by: NPaul Menage <menage@google.com>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Reviewed-by: NMatt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dave Young <hidave.darkstar@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be367d09
  6. 22 9月, 2009 1 次提交
  7. 30 7月, 2009 1 次提交
  8. 11 7月, 2009 1 次提交
  9. 19 6月, 2009 5 次提交
  10. 17 6月, 2009 1 次提交
    • R
      vmscan: evict use-once pages first · 56e49d21
      Rik van Riel 提交于
      When the file LRU lists are dominated by streaming IO pages, evict those
      pages first, before considering evicting other pages.
      
      This should be safe from deadlocks or performance problems
      because only three things can happen to an inactive file page:
      
      1) referenced twice and promoted to the active list
      2) evicted by the pageout code
      3) under IO, after which it will get evicted or promoted
      
      The pages freed in this way can either be reused for streaming IO, or
      allocated for something else.  If the pages are used for streaming IO,
      this pageout pattern continues.  Otherwise, we will fall back to the
      normal pageout pattern.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Reported-by: NElladan <elladan@eskimo.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56e49d21
  11. 29 5月, 2009 2 次提交
  12. 03 5月, 2009 2 次提交
  13. 14 4月, 2009 1 次提交
  14. 03 4月, 2009 10 次提交
    • D
      memcg: cleanup cache_charge · 83aae4c7
      Daisuke Nishimura 提交于
      Current mem_cgroup_cache_charge is a bit complicated especially
      in the case of shmem's swap-in.
      
      This patch cleans it up by using try_charge_swapin and commit_charge_swapin.
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83aae4c7
    • K
      cgroups: use css id in swap cgroup for saving memory v5 · a3b2d692
      KAMEZAWA Hiroyuki 提交于
      Try to use CSS ID for records in swap_cgroup.  By this, on 64bit machine,
      size of swap_cgroup goes down to 2 bytes from 8bytes.
      
      This means, when 2GB of swap is equipped, (assume the page size is 4096bytes)
      
      	From size of swap_cgroup = 2G/4k * 8 = 4Mbytes.
      	To   size of swap_cgroup = 2G/4k * 2 = 1Mbytes.
      
      Reduction is large.  Of course, there are trade-offs.  This CSS ID will
      add overhead to swap-in/swap-out/swap-free.
      
      But in general,
        - swap is a resource which the user tend to avoid use.
        - If swap is never used, swap_cgroup area is not used.
        - Reading traditional manuals, size of swap should be proportional to
          size of memory. Memory size of machine is increasing now.
      
      I think reducing size of swap_cgroup makes sense.
      
      Note:
        - ID->CSS lookup routine has no locks, it's under RCU-Read-Side.
        - memcg can be obsolete at rmdir() but not freed while refcnt from
          swap_cgroup is available.
      
      Changelog v4->v5:
       - reworked on to memcg-charge-swapcache-to-proper-memcg.patch
      Changlog ->v4:
       - fixed not configured case.
       - deleted unnecessary comments.
       - fixed NULL pointer bug.
       - fixed message in dmesg.
      
      [nishimura@mxp.nes.nec.co.jp: css_tryget can be called twice in !PageCgroupUsed case]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a3b2d692
    • D
      memcg: charge swapcache to proper memcg · 3c776e64
      Daisuke Nishimura 提交于
      memcg_test.txt says at 4.1:
      
      	This swap-in is one of the most complicated work. In do_swap_page(),
      	following events occur when pte is unchanged.
      
      	(1) the page (SwapCache) is looked up.
      	(2) lock_page()
      	(3) try_charge_swapin()
      	(4) reuse_swap_page() (may call delete_swap_cache())
      	(5) commit_charge_swapin()
      	(6) swap_free().
      
      	Considering following situation for example.
      
      	(A) The page has not been charged before (2) and reuse_swap_page()
      	    doesn't call delete_from_swap_cache().
      	(B) The page has not been charged before (2) and reuse_swap_page()
      	    calls delete_from_swap_cache().
      	(C) The page has been charged before (2) and reuse_swap_page() doesn't
      	    call delete_from_swap_cache().
      	(D) The page has been charged before (2) and reuse_swap_page() calls
      	    delete_from_swap_cache().
      
      	    memory.usage/memsw.usage changes to this page/swp_entry will be
      	 Case          (A)      (B)       (C)     (D)
               Event
             Before (2)     0/ 1     0/ 1      1/ 1    1/ 1
                ===========================================
                (3)        +1/+1    +1/+1     +1/+1   +1/+1
                (4)          -       0/ 0       -     -1/ 0
                (5)         0/-1     0/ 0     -1/-1    0/ 0
                (6)          -       0/-1       -      0/-1
                ===========================================
             Result         1/ 1     1/ 1      1/ 1    1/ 1
      
             In any cases, charges to this page should be 1/ 1.
      
      In case of (D), mem_cgroup_try_get_from_swapcache() returns NULL
      (because lookup_swap_cgroup() returns NULL), so "+1/+1" at (3) means
      charges to the memcg("foo") to which the "current" belongs.
      OTOH, "-1/0" at (4) and "0/-1" at (6) means uncharges from the memcg("baa")
      to which the page has been charged.
      
      So, if the "foo" and "baa" is different(for example because of task move),
      this charge will be moved from "baa" to "foo".
      
      I think this is an unexpected behavior.
      
      This patch fixes this by modifying mem_cgroup_try_get_from_swapcache()
      to return the memcg to which the swapcache has been charged if PCG_USED bit
      is set.
      IIUC, checking PCG_USED bit of swapcache is safe under page lock.
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3c776e64
    • K
      memcg: remove mem_cgroup_calc_mapped_ratio() · c137b5ec
      KOSAKI Motohiro 提交于
      Currently, mem_cgroup_calc_mapped_ratio() is unused at all.  it can be
      removed and KAMEZAWA-san suggested it.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c137b5ec
    • B
      memcg: show memcg information during OOM · e222432b
      Balbir Singh 提交于
      Add RSS and swap to OOM output from memcg
      
      Display memcg values like failcnt, usage and limit when an OOM occurs due
      to memcg.
      
      Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki,
      Daisuke Nishimura and KOSAKI Motohiro for review.
      
      Sample output
      -------------
      
      Task in /a/x killed as a result of limit of /a
      memory: usage 1048576kB, limit 1048576kB, failcnt 4183
      memory+swap: usage 1400964akB, limit 9007199254740991kB, failcnt 0
      
      [akpm@linux-foundation.org: compilation fix]
      [akpm@linux-foundation.org: fix kerneldoc and whitespace]
      [akpm@linux-foundation.org: add printk facility level]
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e222432b
    • K
      memcg: fix OOM killer under memcg · 0b7f569e
      KAMEZAWA Hiroyuki 提交于
      This patch tries to fix OOM Killer problems caused by hierarchy.
      Now, memcg itself has OOM KILL function (in oom_kill.c) and tries to
      kill a task in memcg.
      
      But, when hierarchy is used, it's broken and correct task cannot
      be killed. For example, in following cgroup
      
      	/groupA/	hierarchy=1, limit=1G,
      		01	nolimit
      		02	nolimit
      All tasks' memory usage under /groupA, /groupA/01, groupA/02 is limited to
      groupA's 1Gbytes but OOM Killer just kills tasks in groupA.
      
      This patch provides makes the bad process be selected from all tasks
      under hierarchy. BTW, currently, oom_jiffies is updated against groupA
      in above case. oom_jiffies of tree should be updated.
      
      To see how oom_jiffies is used, please check mem_cgroup_oom_called()
      callers.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: const fix]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b7f569e
    • K
      memcg: fix shrinking memory to return -EBUSY by fixing retry algorithm · 81d39c20
      KAMEZAWA Hiroyuki 提交于
      As pointed out, shrinking memcg's limit should return -EBUSY after
      reasonable retries.  This patch tries to fix the current behavior of
      shrink_usage.
      
      Before looking into "shrink should return -EBUSY" problem, we should fix
      hierarchical reclaim code.  It compares current usage and current limit,
      but it only makes sense when the kernel reclaims memory because hit
      limits.  This is also a problem.
      
      What this patch does are.
      
        1. add new argument "shrink" to hierarchical reclaim. If "shrink==true",
           hierarchical reclaim returns immediately and the caller checks the kernel
           should shrink more or not.
           (At shrinking memory, usage is always smaller than limit. So check for
            usage < limit is useless.)
      
        2. For adjusting to above change, 2 changes in "shrink"'s retry path.
           2-a. retry_count depends on # of children because the kernel visits
      	  the children under hierarchy one by one.
           2-b. rather than checking return value of hierarchical_reclaim's progress,
      	  compares usage-before-shrink and usage-after-shrink.
      	  If usage-before-shrink <= usage-after-shrink, retry_count is
      	  decremented.
      Reported-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81d39c20
    • K
      memcg: hierarchical stat · 14067bb3
      KAMEZAWA Hiroyuki 提交于
      Clean up memory.stat file routine and show "total" hierarchical stat.
      
      This patch does
        - renamed get_all_zonestat to be get_local_zonestat.
        - remove old mem_cgroup_stat_desc, which is only for per-cpu stat.
        - add mcs_stat to cover both of per-cpu/per-lru stat.
        - add "total" stat of hierarchy (*)
        - add a callback system to scan all memcg under a root.
      == "total" is added.
      [kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat
      cache 0
      rss 0
      pgpgin 0
      pgpgout 0
      inactive_anon 0
      active_anon 0
      inactive_file 0
      active_file 0
      unevictable 0
      hierarchical_memory_limit 50331648
      hierarchical_memsw_limit 9223372036854775807
      total_cache 65536
      total_rss 192512
      total_pgpgin 218
      total_pgpgout 155
      total_inactive_anon 0
      total_active_anon 135168
      total_inactive_file 61440
      total_active_file 4096
      total_unevictable 0
      ==
      (*) maybe the user can do calc hierarchical stat by his own program
         in userland but if it can be written in clean way, it's worth to be
         shown, I think.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      14067bb3
    • K
      memcg: use CSS ID · 04046e1a
      KAMEZAWA Hiroyuki 提交于
      Assigning CSS ID for each memcg and use css_get_next() for scanning hierarchy.
      
      	Assume folloing tree.
      
      	group_A (ID=3)
      		/01 (ID=4)
      		   /0A (ID=7)
      		/02 (ID=10)
      	group_B (ID=5)
      	and task in group_A/01/0A hits limit at group_A.
      
      	reclaim will be done in following order (round-robin).
      	group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
      	-> group_A -> .....
      
      	Round robin by ID. The last visited cgroup is recorded and restart
      	from it when it start reclaim again.
      	(More smart algorithm can be implemented..)
      
      	No cgroup_mutex or hierarchy_mutex is required.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04046e1a
    • K
      cgroup: fix frequent -EBUSY at rmdir · ec64f515
      KAMEZAWA Hiroyuki 提交于
      In following situation, with memory subsystem,
      
      	/groupA use_hierarchy==1
      		/01 some tasks
      		/02 some tasks
      		/03 some tasks
      		/04 empty
      
      When tasks under 01/02/03 hit limit on /groupA, hierarchical reclaim
      is triggered and the kernel walks tree under groupA. In this case,
      rmdir /groupA/04 fails with -EBUSY frequently because of temporal
      refcnt from the kernel.
      
      In general. cgroup can be rmdir'd if there are no children groups and
      no tasks. Frequent fails of rmdir() is not useful to users.
      (And the reason for -EBUSY is unknown to users.....in most cases)
      
      This patch tries to modify above behavior, by
      	- retries if css_refcnt is got by someone.
      	- add "return value" to pre_destroy() and allows subsystem to
      	  say "we're really busy!"
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec64f515