1. 16 12月, 2009 12 次提交
  2. 04 12月, 2009 1 次提交
  3. 09 11月, 2009 1 次提交
  4. 02 10月, 2009 3 次提交
    • K
      memcg: reduce check for softlimit excess · ef8745c1
      KAMEZAWA Hiroyuki 提交于
      In charge/uncharge/reclaim path, usage_in_excess is calculated repeatedly
      and it takes res_counter's spin_lock every time.
      
      This patch removes unnecessary calls for res_count_soft_limit_excess.
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef8745c1
    • K
      memcg: some modification to softlimit under hierarchical memory reclaim. · 4e649152
      KAMEZAWA Hiroyuki 提交于
      This patch clean up/fixes for memcg's uncharge soft limit path.
      
      Problems:
        Now, res_counter_charge()/uncharge() handles softlimit information at
        charge/uncharge and softlimit-check is done when event counter per memcg
        goes over limit. Now, event counter per memcg is updated only when
        memory usage is over soft limit. Here, considering hierarchical memcg
        management, ancesotors should be taken care of.
      
        Now, ancerstors(hierarchy) are handled in charge() but not in uncharge().
        This is not good.
      
        Prolems:
        1. memcg's event counter incremented only when softlimit hits. That's bad.
           It makes event counter hard to be reused for other purpose.
      
        2. At uncharge, only the lowest level rescounter is handled. This is bug.
           Because ancesotor's event counter is not incremented, children should
           take care of them.
      
        3. res_counter_uncharge()'s 3rd argument is NULL in most case.
           ops under res_counter->lock should be small. No "if" sentense is better.
      
      Fixes:
        * Removed soft_limit_xx poitner and checks in charge and uncharge.
          Do-check-only-when-necessary scheme works enough well without them.
      
        * make event-counter of memcg incremented at every charge/uncharge.
          (per-cpu area will be accessed soon anyway)
      
        * All ancestors are checked at soft-limit-check. This is necessary because
          ancesotor's event counter may never be modified. Then, they should be
          checked at the same time.
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e649152
    • K
      memcg: fix refcnt going negative · 26251eaf
      KAMEZAWA Hiroyuki 提交于
      __mem_cgroup_largest_soft_limit_node() returns a mem_cgroup_per_zone "mz"
      with incremnted mz->mem->css's refcnt.  Then, the caller of this function
      has to call css_put(mz->mem->css).
      
      But, mz can be !NULL even if "not found" i.e.  without css_get().  By
      this, css->refcnt will go down to minus.
      
      This may cause various things...one of results will be
      initite-loop in css_tryget()  as this.
      
      INFO: RCU detected CPU 0 stall (t=10000 jiffies)
      sending NMI to all CPUs:
      NMI backtrace for cpu 0
      CPU 0:
      <snip>
      
       <<EOE>>  <IRQ>  [<ffffffff810884bd>] trace_hardirqs_off+0xd/0x10
        [<ffffffff8102a940>] flat_send_IPI_mask+0x90/0xb0
        [<ffffffff8102a9c9>] flat_send_IPI_all+0x69/0x70
        [<ffffffff81027372>] arch_trigger_all_cpu_backtrace+0x62/0xa0
        [<ffffffff810bff8e>] __rcu_pending+0x7e/0x370
        [<ffffffff810c02c7>] rcu_check_callbacks+0x47/0x130
        [<ffffffff81063a26>] update_process_times+0x46/0x70
        [<ffffffff81085930>] tick_sched_timer+0x60/0x160
        [<ffffffff810858d0>] ? tick_sched_timer+0x0/0x160
        [<ffffffff8107a03a>] __run_hrtimer+0xba/0x150
        [<ffffffff8107a325>] hrtimer_interrupt+0xd5/0x1b0
        [<ffffffff81426dfe>] ? trace_hardirqs_off_thunk+0x3a/0x3c
        [<ffffffff8142cacd>] smp_apic_timer_interrupt+0x6d/0x9b
        [<ffffffff8100cb33>] apic_timer_interrupt+0x13/0x20
        <EOI>  [<ffffffff811317b6>] ? mem_cgroup_walk_tree+0x156/0x180
        [<ffffffff811316d3>] ? mem_cgroup_walk_tree+0x73/0x180
        [<ffffffff81131692>] ? mem_cgroup_walk_tree+0x32/0x180
        [<ffffffff81131a00>] ? mem_cgroup_get_local_stat+0x0/0x110
        [<ffffffff81131d5b>] ? mem_control_stat_show+0x14b/0x330
        [<ffffffff810a57fd>] ? cgroup_seqfile_show+0x3d/0x60
      
      Above shows CPU0 caught in css_tryget()'s inifinite loop because
      of bad refcnt.
      
      This is a fix to set mz=NULL at the top of retry path.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NPaul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26251eaf
  5. 24 9月, 2009 9 次提交
    • D
      memcg: show swap usage in stat file · 1dd3a273
      Daisuke Nishimura 提交于
      We now count MEM_CGROUP_STAT_SWAPOUT, so we can show swap usage.  It would
      be useful for users to show swap usage in memory.stat file, because they
      don't need calculate memsw.usage - res.usage to know swap usage.
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1dd3a273
    • B
      memcg: improve resource counter scalability · 0c3e73e8
      Balbir Singh 提交于
      Reduce the resource counter overhead (mostly spinlock) associated with the
      root cgroup.  This is a part of the several patches to reduce mem cgroup
      overhead.  I had posted other approaches earlier (including using percpu
      counters).  Those patches will be a natural addition and will be added
      iteratively on top of these.
      
      The patch stops resource counter accounting for the root cgroup.  The data
      for display is derived from the statisitcs we maintain via
      mem_cgroup_charge_statistics (which is more scalable).  What happens today
      is that, we do double accounting, once using res_counter_charge() and once
      using memory_cgroup_charge_statistics().  For the root, since we don't
      implement limits any more, we don't need to track every charge via
      res_counter_charge() and check for limit being exceeded and reclaim.
      
      The main mem->res usage_in_bytes can be derived by summing the cache and
      rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
      MEM_CGROUP_STAT_CACHE).  However, for memsw->res usage_in_bytes, we need
      additional data about swapped out memory.  This patch adds a
      MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
      MEM_CGROUP_STAT_CACHE to derive the memsw data.  This data is computed
      recursively when hierarchy is enabled.
      
      The tests results I see on a 24 way show that
      
      1. The lock contention disappears from /proc/lock_stats
      2. The results of the test are comparable to running with
         cgroup_disable=memory.
      
      Here is a sample of my program runs
      
      Without Patch
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7192804.124144  task-clock-msecs         #     23.937 CPUs
               424691  context-switches         #      0.000 M/sec
                  267  CPU-migrations           #      0.000 M/sec
             28498113  page-faults              #      0.004 M/sec
        5826093739340  cycles                   #    809.989 M/sec
         408883496292  instructions             #      0.070 IPC
           7057079452  cache-references         #      0.981 M/sec
           3036086243  cache-misses             #      0.422 M/sec
      
        300.485365680  seconds time elapsed
      
      With cgroup_disable=memory
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7182183.546587  task-clock-msecs         #     23.915 CPUs
               425458  context-switches         #      0.000 M/sec
                  203  CPU-migrations           #      0.000 M/sec
             92545093  page-faults              #      0.013 M/sec
        6034363609986  cycles                   #    840.185 M/sec
         437204346785  instructions             #      0.072 IPC
           6636073192  cache-references         #      0.924 M/sec
           2358117732  cache-misses             #      0.328 M/sec
      
        300.320905827  seconds time elapsed
      
      With this patch applied
      
       Performance counter stats for '/home/balbir/parallel_pagefault':
      
       7191619.223977  task-clock-msecs         #     23.955 CPUs
               422579  context-switches         #      0.000 M/sec
                   88  CPU-migrations           #      0.000 M/sec
             91946060  page-faults              #      0.013 M/sec
        5957054385619  cycles                   #    828.333 M/sec
        1058117350365  instructions             #      0.178 IPC
           9161776218  cache-references         #      1.274 M/sec
           1920494280  cache-misses             #      0.267 M/sec
      
        300.218764862  seconds time elapsed
      
      Data from Prarit (kernel compile with make -j64 on a 64
      CPU/32G machine)
      
      For a single run
      
      Without patch
      
      real 27m8.988s
      user 87m24.916s
      sys 382m6.037s
      
      With patch
      
      real    4m18.607s
      user    84m58.943s
      sys     50m52.682s
      
      With config turned off
      
      real    4m54.972s
      user    90m13.456s
      sys     50m19.711s
      
      NOTE: The data looks counterintuitive due to the increased performance
      with the patch, even over the config being turned off. We probably need
      more runs, but so far all testing has shown that the patches definitely
      help.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0c3e73e8
    • B
      memory controller: soft limit reclaim on contention · 4e416953
      Balbir Singh 提交于
      Implement reclaim from groups over their soft limit
      
      Permit reclaim from memory cgroups on contention (via the direct reclaim
      path).
      
      memory cgroup soft limit reclaim finds the group that exceeds its soft
      limit by the largest number of pages and reclaims pages from it and then
      reinserts the cgroup into its correct place in the rbtree.
      
      Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
      loops in case all swap is turned off.  The code has been refactored and
      the loop check (loop < 2) has been enhanced for soft limits.  For soft
      limits, we try to do more targetted reclaim.  Instead of bailing out after
      two loops, the routine now reclaims memory proportional to the size by
      which the soft limit is exceeded.  The proportion has been empirically
      determined.
      
      [akpm@linux-foundation.org: build fix]
      [kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
      [nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e416953
    • B
      memory controller: soft limit refactor reclaim flags · 75822b44
      Balbir Singh 提交于
      Refactor mem_cgroup_hierarchical_reclaim()
      
      Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into
      flags, so that new parameters don't have to be passed as we make the
      reclaim routine more flexible
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75822b44
    • B
      memory controller: soft limit organize cgroups · f64c3f54
      Balbir Singh 提交于
      Organize cgroups over soft limit in a RB-Tree
      
      Introduce an RB-Tree for storing memory cgroups that are over their soft
      limit.  The overall goal is to
      
      1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
         We are careful about updates, updates take place only after a particular
         time interval has passed
      2. We remove the node from the RB-Tree when the usage goes below the soft
         limit
      
      The next set of patches will exploit the RB-Tree to get the group that is
      over its soft limit by the largest amount and reclaim from it, when we
      face memory contention.
      
      [hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot]
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f64c3f54
    • B
      memory controller: soft limit interface · 296c81d8
      Balbir Singh 提交于
      Add an interface to allow get/set of soft limits.  Soft limits for memory
      plus swap controller (memsw) is currently not supported.  Resource
      counters have been enhanced to support soft limits and new type
      RES_SOFT_LIMIT has been added.  Unlike hard limits, soft limits can be
      directly set and do not need any reclaim or checks before setting them to
      a newer value.
      
      Kamezawa-San raised a question as to whether soft limit should belong to
      res_counter.  Since all resources understand the basic concepts of hard
      and soft limits, it is justified to add soft limits here.  Soft limits are
      a generic resource usage feature, even file system quotas support soft
      limits.
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      296c81d8
    • K
      memcg: add comments explaining memory barriers · 261fb61a
      KAMEZAWA Hiroyuki 提交于
      Add comments for the reason of smp_wmb() in mem_cgroup_commit_charge().
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      261fb61a
    • B
      memcg: remove the overhead associated with the root cgroup · 4b3bde4c
      Balbir Singh 提交于
      Change the memory cgroup to remove the overhead associated with accounting
      all pages in the root cgroup.  As a side-effect, we can no longer set a
      memory hard limit in the root cgroup.
      
      A new flag to track whether the page has been accounted or not has been
      added as well.  Flags are now set atomically for page_cgroup,
      pcg_default_flags is now obsolete and removed.
      
      [akpm@linux-foundation.org: fix a few documentation glitches]
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b3bde4c
    • B
      cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time · be367d09
      Ben Blum 提交于
      Alter the ss->can_attach and ss->attach functions to be able to deal with
      a whole threadgroup at a time, for use in cgroup_attach_proc.  (This is a
      pre-patch to cgroup-procs-writable.patch.)
      
      Currently, new mode of the attach function can only tell the subsystem
      about the old cgroup of the threadgroup leader.  No subsystem currently
      needs that information for each thread that's being moved, but if one were
      to be added (for example, one that counts tasks within a group) this bit
      would need to be reworked a bit to tell the subsystem the right
      information.
      
      [hidave.darkstar@gmail.com: fix build]
      Signed-off-by: NBen Blum <bblum@google.com>
      Signed-off-by: NPaul Menage <menage@google.com>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Reviewed-by: NMatt Helsley <matthltc@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Dave Young <hidave.darkstar@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      be367d09
  6. 22 9月, 2009 1 次提交
  7. 30 7月, 2009 1 次提交
  8. 11 7月, 2009 1 次提交
  9. 19 6月, 2009 5 次提交
  10. 17 6月, 2009 1 次提交
    • R
      vmscan: evict use-once pages first · 56e49d21
      Rik van Riel 提交于
      When the file LRU lists are dominated by streaming IO pages, evict those
      pages first, before considering evicting other pages.
      
      This should be safe from deadlocks or performance problems
      because only three things can happen to an inactive file page:
      
      1) referenced twice and promoted to the active list
      2) evicted by the pageout code
      3) under IO, after which it will get evicted or promoted
      
      The pages freed in this way can either be reused for streaming IO, or
      allocated for something else.  If the pages are used for streaming IO,
      this pageout pattern continues.  Otherwise, we will fall back to the
      normal pageout pattern.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Reported-by: NElladan <elladan@eskimo.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56e49d21
  11. 29 5月, 2009 2 次提交
  12. 03 5月, 2009 2 次提交
  13. 14 4月, 2009 1 次提交