- 24 2月, 2013 1 次提交
-
-
由 Sha Zhengju 提交于
Currently when a memcg oom is happening the oom dump messages is still global state and provides few useful info for users. This patch prints more pointed memcg page statistics for memcg-oom and take hierarchy into consideration: Based on Michal's advice, we take hierarchy into consideration: supppose we trigger an OOM on A's limit root_memcg | A (use_hierachy=1) / \ B C | D then the printed info will be: Memory cgroup stats for /A:... Memory cgroup stats for /A/B:... Memory cgroup stats for /A/C:... Memory cgroup stats for /A/B/D:... Following are samples of oom output: (1) Before change: mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fbfb>] dump_header+0x83/0x1ca ..... (call trace) [<ffffffff8168a818>] page_fault+0x28/0x30 <<<<<<<<<<<<<<<<<<<<< memcg specific information Task in /A/B/D killed as a result of limit of /A memory: usage 101376kB, limit 101376kB, failcnt 57 memory+swap: usage 101376kB, limit 101376kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 ...... CPU 3: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 173 ...... CPU 3: hi: 186, btch: 31 usd: 130 <<<<<<<<<<<<<<<<<<<<< print global page state active_anon:92963 inactive_anon:40777 isolated_anon:0 active_file:33027 inactive_file:51718 isolated_file:0 unevictable:0 dirty:3 writeback:0 unstable:0 free:729995 slab_reclaimable:6897 slab_unreclaimable:6263 mapped:20278 shmem:35971 pagetables:5885 bounce:0 free_cma:0 <<<<<<<<<<<<<<<<<<<<< print per zone page state Node 0 DMA free:15836kB ... all_unreclaimable? no lowmem_reserve[]: 0 3175 3899 3899 Node 0 DMA32 free:2888564kB ... all_unrelaimable? no lowmem_reserve[]: 0 0 724 724 lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB 120710 total pagecache pages 0 pages in swap cache <<<<<<<<<<<<<<<<<<<<< print global swap cache stat Swap cache stats: add 0, delete 0, find 0/0 Free swap = 499708kB Total swap = 499708kB 1040368 pages RAM 58678 pages reserved 169065 pages shared 173632 pages non-shared [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2693] 0 2693 6005 1324 17 0 0 god [ 2754] 0 2754 6003 1320 16 0 0 god [ 2811] 0 2811 5992 1304 18 0 0 god [ 2874] 0 2874 6005 1323 18 0 0 god [ 2935] 0 2935 8720 7742 21 0 0 mal-30 [ 2976] 0 2976 21520 17577 42 0 0 mal-80 Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB We can see that messages dumped by show_free_areas() are longsome and can provide so limited info for memcg that just happen oom. (2) After change mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fd0b>] dump_header+0x83/0x1d1 .......(call trace) [<ffffffff8168a918>] page_fault+0x28/0x30 Task in /A/B/D killed as a result of limit of /A <<<<<<<<<<<<<<<<<<<<< memcg specific information memory: usage 102400kB, limit 102400kB, failcnt 140 memory+swap: usage 102400kB, limit 102400kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2260] 0 2260 6006 1325 18 0 0 god [ 2383] 0 2383 6003 1319 17 0 0 god [ 2503] 0 2503 6004 1321 18 0 0 god [ 2622] 0 2622 6004 1321 16 0 0 god [ 2695] 0 2695 8720 7741 22 0 0 mal-30 [ 2704] 0 2704 21520 17839 43 0 0 mal-80 Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB This version provides more pointed info for memcg in "Memory cgroup stats for XXX" section. Signed-off-by: NSha Zhengju <handai.szj@taobao.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 2月, 2013 1 次提交
-
-
由 Glauber Costa 提交于
The designed workflow for the caches in kmemcg is: register it with memcg_register_cache() if kmemcg is already available or later on when a new kmemcg appears at memcg_update_cache_sizes() which will handle all caches in the system. The caches created at boot time will be handled by the later, and the memcg-caches as well as any system caches that are registered later on by the former. There is a bug, however, in memcg_register_cache: we correctly set up the array size, but do not mark the cache as a root cache. This means that allocations for any cache appearing late in the game will see memcg->memcg_params->is_root_cache == false, and in particular, trigger VM_BUG_ON(!cachep->memcg_params->is_root_cache) in __memcg_kmem_cache_get. The obvious fix is to include the missing assignment. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 12月, 2012 1 次提交
-
-
由 Tejun Heo 提交于
Commit 648bb56d ("cgroup: lock cgroup_mutex in cgroup_init_subsys()") made cgroup_init_subsys() grab cgroup_mutex before invoking ->css_alloc() for the root css. Because memcg registers hotcpu notifier from ->css_alloc() for the root css, this introduced circular locking dependency between cgroup_mutex and cpu hotplug. Fix it by moving hotcpu notifier registration to a subsys initcall. ====================================================== [ INFO: possible circular locking dependency detected ] 3.7.0-rc4-work+ #42 Not tainted ------------------------------------------------------- bash/645 is trying to acquire lock: (cgroup_mutex){+.+.+.}, at: [<ffffffff8110c5b7>] cgroup_lock+0x17/0x20 but task is already holding lock: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109300f>] cpu_hotplug_begin+0x2f/0x60 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (cpu_hotplug.lock){+.+.+.}: lock_acquire+0x97/0x1e0 mutex_lock_nested+0x61/0x3b0 get_online_cpus+0x3c/0x60 rebuild_sched_domains_locked+0x1b/0x70 cpuset_write_resmask+0x298/0x2c0 cgroup_file_write+0x1ef/0x300 vfs_write+0xa8/0x160 sys_write+0x52/0xa0 system_call_fastpath+0x16/0x1b -> #0 (cgroup_mutex){+.+.+.}: __lock_acquire+0x14ce/0x1d20 lock_acquire+0x97/0x1e0 mutex_lock_nested+0x61/0x3b0 cgroup_lock+0x17/0x20 cpuset_handle_hotplug+0x1b/0x560 cpuset_update_active_cpus+0xe/0x10 cpuset_cpu_inactive+0x47/0x50 notifier_call_chain+0x66/0x150 __raw_notifier_call_chain+0xe/0x10 __cpu_notify+0x20/0x40 _cpu_down+0x7e/0x2f0 cpu_down+0x36/0x50 store_online+0x5d/0xe0 dev_attr_store+0x18/0x30 sysfs_write_file+0xe0/0x150 vfs_write+0xa8/0x160 sys_write+0x52/0xa0 system_call_fastpath+0x16/0x1b other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpu_hotplug.lock); lock(cgroup_mutex); lock(cpu_hotplug.lock); lock(cgroup_mutex); *** DEADLOCK *** 5 locks held by bash/645: #0: (&buffer->mutex){+.+.+.}, at: [<ffffffff8123bab8>] sysfs_write_file+0x48/0x150 #1: (s_active#42){.+.+.+}, at: [<ffffffff8123bb38>] sysfs_write_file+0xc8/0x150 #2: (x86_cpu_hotplug_driver_mutex){+.+...}, at: [<ffffffff81079277>] cpu_hotplug_driver_lock+0x1 +7/0x20 #3: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff81093157>] cpu_maps_update_begin+0x17/0x20 #4: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109300f>] cpu_hotplug_begin+0x2f/0x60 stack backtrace: Pid: 645, comm: bash Not tainted 3.7.0-rc4-work+ #42 Call Trace: print_circular_bug+0x28e/0x29f __lock_acquire+0x14ce/0x1d20 lock_acquire+0x97/0x1e0 mutex_lock_nested+0x61/0x3b0 cgroup_lock+0x17/0x20 cpuset_handle_hotplug+0x1b/0x560 cpuset_update_active_cpus+0xe/0x10 cpuset_cpu_inactive+0x47/0x50 notifier_call_chain+0x66/0x150 __raw_notifier_call_chain+0xe/0x10 __cpu_notify+0x20/0x40 _cpu_down+0x7e/0x2f0 cpu_down+0x36/0x50 store_online+0x5d/0xe0 dev_attr_store+0x18/0x30 sysfs_write_file+0xe0/0x150 vfs_write+0xa8/0x160 sys_write+0x52/0xa0 system_call_fastpath+0x16/0x1b Signed-off-by: NTejun Heo <tj@kernel.org> Reported-by: NFengguang Wu <fengguang.wu@intel.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 19 12月, 2012 19 次提交
-
-
由 Glauber Costa 提交于
SLAB allows us to tune a particular cache behavior with tunables. When creating a new memcg cache copy, we'd like to preserve any tunables the parent cache already had. This could be done by an explicit call to do_tune_cpucache() after the cache is created. But this is not very convenient now that the caches are created from common code, since this function is SLAB-specific. Another method of doing that is taking advantage of the fact that do_tune_cpucache() is always called from enable_cpucache(), which is called at cache initialization. We can just preset the values, and then things work as expected. It can also happen that a root cache has its tunables updated during normal system operation. In this case, we will propagate the change to all caches that are already active. This change will require us to move the assignment of root_cache in memcg_params a bit earlier. We need this to be already set - which memcg_kmem_register_cache will do - when we reach __kmem_cache_create() Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
When we create caches in memcgs, we need to display their usage information somewhere. We'll adopt a scheme similar to /proc/meminfo, with aggregate totals shown in the global file, and per-group information stored in the group itself. For the time being, only reads are allowed in the per-group cache. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
This means that when we destroy a memcg cache that happened to be empty, those caches may take a lot of time to go away: removing the memcg reference won't destroy them - because there are pending references, and the empty pages will stay there, until a shrinker is called upon for any reason. In this patch, we will call kmem_cache_shrink() for all dead caches that cannot be destroyed because of remaining pages. After shrinking, it is possible that it could be freed. If this is not the case, we'll schedule a lazy worker to keep trying. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
This enables us to remove all the children of a kmem_cache being destroyed, if for example the kernel module it's being used in gets unloaded. Otherwise, the children will still point to the destroyed parent. Signed-off-by: NSuleiman Souhlal <suleiman@google.com> Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Implement destruction of memcg caches. Right now, only caches where our reference counter is the last remaining are deleted. If there are any other reference counters around, we just leave the caches lying around until they go away. When that happens, a destruction function is called from the cache code. Caches are only destroyed in process context, so we queue them up for later processing in the general case. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
We are able to match a cache allocation to a particular memcg. If the task doesn't change groups during the allocation itself - a rare event, this will give us a good picture about who is the first group to touch a cache page. This patch uses the now available infrastructure by calling memcg_kmem_get_cache() before all the cache allocations. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Create a mechanism that skip memcg allocations during certain pieces of our core code. It basically works in the same way as preempt_disable()/preempt_enable(): By marking a region under which all allocations will be accounted to the root memcg. We need this to prevent races in early cache creation, when we allocate data using caches that are not necessarily created already. Signed-off-by: NGlauber Costa <glommer@parallels.com> yCc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
The page allocator is able to bind a page to a memcg when it is allocated. But for the caches, we'd like to have as many objects as possible in a page belonging to the same cache. This is done in this patch by calling memcg_kmem_get_cache in the beginning of every allocation function. This function is patched out by static branches when kernel memory controller is not being used. It assumes that the task allocating, which determines the memcg in the page allocator, belongs to the same cgroup throughout the whole process. Misaccounting can happen if the task calls memcg_kmem_get_cache() while belonging to a cgroup, and later on changes. This is considered acceptable, and should only happen upon task migration. Before the cache is created by the memcg core, there is also a possible imbalance: the task belongs to a memcg, but the cache being allocated from is the global cache, since the child cache is not yet guaranteed to be ready. This case is also fine, since in this case the GFP_KMEMCG will not be passed and the page allocator will not attempt any cgroup accounting. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Every cache that is considered a root cache (basically the "original" caches, tied to the root memcg/no-memcg) will have an array that should be large enough to store a cache pointer per each memcg in the system. Theoreticaly, this is as high as 1 << sizeof(css_id), which is currently in the 64k pointers range. Most of the time, we won't be using that much. What goes in this patch, is a simple scheme to dynamically allocate such an array, in order to minimize memory usage for memcg caches. Because we would also like to avoid allocations all the time, at least for now, the array will only grow. It will tend to be big enough to hold the maximum number of kmem-limited memcgs ever achieved. We'll allocate it to be a minimum of 64 kmem-limited memcgs. When we have more than that, we'll start doubling the size of this array every time the limit is reached. Because we are only considering kmem limited memcgs, a natural point for this to happen is when we write to the limit. At that point, we already have set_limit_mutex held, so that will become our natural synchronization mechanism. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Allow a memcg parameter to be passed during cache creation. When the slub allocator is being used, it will only merge caches that belong to the same memcg. We'll do this by scanning the global list, and then translating the cache to a memcg-specific cache Default function is created as a wrapper, passing NULL to the memcg version. We only merge caches that belong to the same memcg. A helper is provided, memcg_css_id: because slub needs a unique cache name for sysfs. Since this is visible, but not the canonical location for slab data, the cache name is not used, the css_id should suffice. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
A lot of the initialization we do in mem_cgroup_create() is done with softirqs enabled. This include grabbing a css id, which holds &ss->id_lock->rlock, and the per-zone trees, which holds rtpz->lock->rlock. All of those signal to the lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This means that the freeing of memcg structure must happen in a compatible context, otherwise we'll get a deadlock, like the one below, caught by lockdep: free_accounted_pages+0x47/0x4c free_task+0x31/0x5c __put_task_struct+0xc2/0xdb put_task_struct+0x1e/0x22 delayed_put_task_struct+0x7a/0x98 __rcu_process_callbacks+0x269/0x3df rcu_process_callbacks+0x31/0x5b __do_softirq+0x122/0x277 This usage pattern could not be triggered before kmem came into play. With the introduction of kmem stack handling, it is possible that we call the last mem_cgroup_put() from the task destructor, which is run in an rcu callback. Such callbacks are run with softirqs disabled, leading to the offensive usage pattern. In general, we have little, if any, means to guarantee in which context the last memcg_put will happen. The best we can do is test it and try to make sure no invalid context releases are happening. But as we add more code to memcg, the possible interactions grow in number and expose more ways to get context conflicts. One thing to keep in mind, is that part of the freeing process is already deferred to a worker, such as vfree(), that can only be called from process context. For the moment, the only two functions we really need moved away are: * free_css_id(), and * mem_cgroup_remove_from_trees(). But because the later accesses per-zone info, free_mem_cgroup_per_zone_info() needs to be moved as well. With that, we are left with the per_cpu stats only. Better move it all. Signed-off-by: NGlauber Costa <glommer@parallels.com> Tested-by: NGreg Thelen <gthelen@google.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Because the ultimate goal of the kmem tracking in memcg is to track slab pages as well, we can't guarantee that we'll always be able to point a page to a particular process, and migrate the charges along with it - since in the common case, a page will contain data belonging to multiple processes. Because of that, when we destroy a memcg, we only make sure the destruction will succeed by discounting the kmem charges from the user charges when we try to empty the cgroup. Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
We can use static branches to patch the code in or out when not used. Because the _ACTIVE bit on kmem_accounted is only set after the increment is done, we guarantee that the root memcg will always be selected for kmem charges until all call sites are patched (see memcg_kmem_enabled). This guarantees that no mischarges are applied. Static branch decrement happens when the last reference count from the kmem accounting in memcg dies. This will only happen when the charges drop down to 0. When that happens, we need to disable the static branch only on those memcgs that enabled it. To achieve this, we would be forced to complicate the code by keeping track of which memcgs were the ones that actually enabled limits, and which ones got it from its parents. It is a lot simpler just to do static_key_slow_inc() on every child that is accounted. Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Because kmem charges can outlive the cgroup, we need to make sure that we won't free the memcg structure while charges are still in flight. For reviewing simplicity, the charge functions will issue mem_cgroup_get() at every charge, and mem_cgroup_put() at every uncharge. This can get expensive, however, and we can do better. mem_cgroup_get() only really needs to be issued once: when the first limit is set. In the same spirit, we only need to issue mem_cgroup_put() when the last charge is gone. We'll need an extra bit in kmem_account_flags for that: KMEM_ACCOUNTED_DEAD. it will be set when the cgroup dies, if there are charges in the group. If there aren't, we can proceed right away. Our uncharge function will have to test that bit every time the charges drop to 0. Because that is not the likely output of res_counter_uncharge, this should not impose a big hit on us: it is certainly much better than a reference count decrease at every operation. Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Introduce infrastructure for tracking kernel memory pages to a given memcg. This will happen whenever the caller includes the flag __GFP_KMEMCG flag, and the task belong to a memcg other than the root. In memcontrol.h those functions are wrapped in inline acessors. The idea is to later on, patch those with static branches, so we don't incur any overhead when no mem cgroups with limited kmem are being used. Users of this functionality shall interact with the memcg core code through the following functions: memcg_kmem_newpage_charge: will return true if the group can handle the allocation. At this point, struct page is not yet allocated. memcg_kmem_commit_charge: will either revert the charge, if struct page allocation failed, or embed memcg information into page_cgroup. memcg_kmem_uncharge_page: called at free time, will revert the charge. Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Add the basic infrastructure for the accounting of kernel memory. To control that, the following files are created: * memory.kmem.usage_in_bytes * memory.kmem.limit_in_bytes * memory.kmem.failcnt * memory.kmem.max_usage_in_bytes They have the same meaning of their user memory counterparts. They reflect the state of the "kmem" res_counter. Per cgroup kmem memory accounting is not enabled until a limit is set for the group. Once the limit is set the accounting cannot be disabled for that group. This means that after the patch is applied, no behavioral changes exists for whoever is still using memcg to control their memory usage, until memory.kmem.limit_in_bytes is set for the first time. We always account to both user and kernel resource_counters. This effectively means that an independent kernel limit is in place when the limit is set to a lower value than the user memory. A equal or higher value means that the user limit will always hit first, meaning that kmem is effectively unlimited. People who want to track kernel memory but not limit it, can set this limit to a very high number (like RESOURCE_MAX - 1page - that no one will ever hit, or equal to the user memory) [akpm@linux-foundation.org: MEMCG_MMEM only works with slab and slub] Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
This is just a cleanup patch for clarity of expression. In earlier submissions, people asked it to be in a separate patch, so here it is. Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Suleiman Souhlal 提交于
mem_cgroup_do_charge() was written before kmem accounting, and expects three cases: being called for 1 page, being called for a stock of 32 pages, or being called for a hugepage. If we call for 2 or 3 pages (and both the stack and several slabs used in process creation are such, at least with the debug options I had), it assumed it's being called for stock and just retried without reclaiming. Fix that by passing down a minsize argument in addition to the csize. And what to do about that (csize == PAGE_SIZE && ret) retry? If it's needed at all (and presumably is since it's there, perhaps to handle races), then it should be extended to more than PAGE_SIZE, yet how far? And should there be a retry count limit, of what? For now retry up to COSTLY_ORDER (as page_alloc.c does) and make sure not to do it if __GFP_NORETRY. v4: fixed nr pages calculation pointed out by Christoph Lameter. Signed-off-by: NSuleiman Souhlal <suleiman@google.com> Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Suleiman Souhlal 提交于
We currently have a percpu stock cache scheme that charges one page at a time from memcg->res, the user counter. When the kernel memory controller comes into play, we'll need to charge more than that. This is because kernel memory allocations will also draw from the user counter, and can be bigger than a single page, as it is the case with the stack (usually 2 pages) or some higher order slabs. [glommer@parallels.com: added a changelog ] Signed-off-by: NSuleiman Souhlal <suleiman@google.com> Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 12月, 2012 3 次提交
-
-
由 Michal Hocko 提交于
The mm given to __mem_cgroup_count_vm_event() cannot be NULL because the function is either called from the page fault path or vma->vm_mm is used. So the check can be dropped. The check was introduced by commit 456f998e ("memcg: add the pagefault count into memcg stats") because the originally proposed patch used current->mm for shmem but this has been changed to vma->vm_mm later on without the check being removed (thanks to Hugh for this recollection). Signed-off-by: NMichal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
While profiling numa/core v16 with cgroup_disable=memory on the command line, I noticed mem_cgroup_count_vm_event() still showed up as high as 0.60% in perftop. This occurs because the function is called extremely often even when memcg is disabled. To fix this, inline the check for mem_cgroup_disabled() so we avoid the unnecessary function call if memcg is disabled. Signed-off-by: NDavid Rientjes <rientjes@google.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NGlauber Costa <glommer@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lai Jiangshan 提交于
N_HIGH_MEMORY stands for the nodes that has normal or high memory. N_MEMORY stands for the nodes that has any memory. The code here need to handle with the nodes which have memory, we should use N_MEMORY instead. Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: NWen Congyang <wency@cn.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Lin Feng <linfeng@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 12月, 2012 1 次提交
-
-
由 David Rientjes 提交于
mem_cgroup_out_of_memory() is only referenced from within file scope, so it can be marked static. Signed-off-by: NDavid Rientjes <rientjes@google.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 12月, 2012 1 次提交
-
-
由 Mel Gorman 提交于
Note: This is very heavily based on a patch from Peter Zijlstra with fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch put a lot of migration logic into mm/huge_memory.c where it does not belong. This version puts tries to share some of the migration logic with migrate_misplaced_page. However, it should be noted that now migrate.c is doing more with the pagetable manipulation than is preferred. The end result is barely recognisable so as before, the signed-offs had to be removed but will be re-added if the original authors are ok with it. Add THP migration for the NUMA working set scanning fault case. It uses the page lock to serialize. No migration pte dance is necessary because the pte is already unmapped when we decide to migrate. [dhillf@gmail.com: Fix memory leak on isolation failure] [dhillf@gmail.com: Fix transfer of last_nid information] Signed-off-by: NMel Gorman <mgorman@suse.de>
-
- 20 11月, 2012 1 次提交
-
-
由 Tejun Heo 提交于
Rename cgroup_subsys css lifetime related callbacks to better describe what their roles are. Also, update documentation. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NLi Zefan <lizefan@huawei.com>
-
- 17 11月, 2012 2 次提交
-
-
由 Hugh Dickins 提交于
When MEMCG is configured on (even when it's disabled by boot option), when adding or removing a page to/from its lru list, the zone pointer used for stats updates is nowadays taken from the struct lruvec. (On many configurations, calculating zone from page is slower.) But we have no code to update all the lruvecs (per zone, per memcg) when a memory node is hotadded. Here's an extract from the oops which results when running numactl to bind a program to a newly onlined node: BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60 IP: __mod_zone_page_state+0x9/0x60 Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0) Call Trace: __pagevec_lru_add_fn+0xdf/0x140 pagevec_lru_move_fn+0xb1/0x100 __pagevec_lru_add+0x1c/0x30 lru_add_drain_cpu+0xa3/0x130 lru_add_drain+0x2f/0x40 ... The natural solution might be to use a memcg callback whenever memory is hotadded; but that solution has not been scoped out, and it happens that we do have an easy location at which to update lruvec->zone. The lruvec pointer is discovered either by mem_cgroup_zone_lruvec() or by mem_cgroup_page_lruvec(), and both of those do know the right zone. So check and set lruvec->zone in those; and remove the inadequate attempt to set lruvec->zone from lruvec_init(), which is called before NODE_DATA(node) has been allocated in such cases. Ah, there was one exceptionr. For no particularly good reason, mem_cgroup_force_empty_list() has its own code for deciding lruvec. Change it to use the standard mem_cgroup_zone_lruvec() and mem_cgroup_get_lru_size() too. In fact it was already safe against such an oops (the lru lists in danger could only be empty), but we're better proofed against future changes this way. I've marked this for stable (3.6) since we introduced the problem in 3.5 (now closed to stable); but I have no idea if this is the only fix needed to get memory hotadd working with memcg in 3.6, and received no answer when I enquired twice before. Reported-by: NTang Chen <tangchen@cn.fujitsu.com> Signed-off-by: NHugh Dickins <hughd@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
oom_badness() takes a totalpages argument which says how many pages are available and it uses it as a base for the score calculation. The value is calculated by mem_cgroup_get_limit which considers both limit and total_swap_pages (resp. memsw portion of it). This is usually correct but since fe35004f ("mm: avoid swapping out with swappiness==0") we do not swap when swappiness is 0 which means that we cannot really use up all the totalpages pages. This in turn confuses oom score calculation if the memcg limit is much smaller than the available swap because the used memory (capped by the limit) is negligible comparing to totalpages so the resulting score is too small if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj). A wrong process might be selected as result. The problem can be worked around by checking mem_cgroup_swappiness==0 and not considering swap at all in such a case. Signed-off-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 11月, 2012 4 次提交
-
-
由 Tejun Heo 提交于
All ->pre_destory() implementations return 0 now, which is the only allowed return value. Make it return void. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NLi Zefan <lizefan@huawei.com> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com>
-
由 Michal Hocko 提交于
Now that pre_destroy callbacks are called from the context where neither any task can attach the group nor any children group can be added there is no other way to fail from mem_cgroup_pre_destroy. mem_cgroup_pre_destroy doesn't have to take a reference to memcg's css because all css' are marked dead already. tj: Remove now unused local variable @cgrp from mem_cgroup_reparent_charges(). Signed-off-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NGlauber Costa <glommer@parallels.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Tejun Heo 提交于
CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup destruction rollback somewhat working. cgroup_rmdir() used to drain CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and helpers were used to allow the task performing rmdir to wait for the next relevant event. Unfortunately, the wait is visible to controllers too and the mechanism got exposed to memcg by 88703267 ("cgroup avoid permanent sleep at rmdir"). Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is unnecessary. Remove it and all the mechanisms supporting it. Note that memcontrol.c changes are essentially revert of 88703267 ("cgroup avoid permanent sleep at rmdir"). Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NLi Zefan <lizefan@huawei.com> Cc: Balbir Singh <bsingharora@gmail.com>
-
由 Tejun Heo 提交于
CSS_REMOVED is one of the several contortions which were necessary to support css reference draining on cgroup removal. All css->refcnts which need draining should be deactivated and verified to equal zero atomically w.r.t. css_tryget(). If any one isn't zero, all refcnts needed to be re-activated and css_tryget() shouldn't fail in the process. This was achieved by letting css_tryget() busy-loop until either the refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set (committing to removal). Now that css refcnt draining is no longer used, there's no need for atomic rollback mechanism. css_tryget() simply can look at the reference count and fail if it's deactivated - it's never getting re-activated. This patch removes CSS_REMOVED and updates __css_tryget() to fail if the refcnt is deactivated. As deactivation and removal are a single step now, they no longer need to be protected against css_tryget() happening from irq context. Remove local_irq_disable/enable() from cgroup_rmdir(). Note that this removes css_is_removed() whose only user is VM_BUG_ON() in memcontrol.c. We can replace it with a check on the refcnt but given that the only use case is a debug assert, I think it's better to simply unexport it. v2: Comment updated and explanation on local_irq_disable/enable() added per Michal Hocko. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NLi Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com>
-
- 30 10月, 2012 3 次提交
-
-
由 Michal Hocko 提交于
mem_cgroup_force_empty_list currently tries to remove all pages from the given LRU. To prevent from temoporary failures (EBUSY returned by mem_cgroup_move_parent) it uses a margin to the current LRU pages and returns the true if there are still some pages left on the list. If we consider that mem_cgroup_move_parent fails only when it is racing with somebody else removing (uncharging) the page or when the page is migrated then it is obvious that all those failures are only temporal and so we can safely retry later. Let's get rid of the safety margin and make the loop really wait for the empty LRU. The caller should still make sure that all charges have been removed from the res_counter because mem_cgroup_replace_page_cache might add a page to the LRU after the list_empty check (it doesn't touch res_counter though). This catches most of the cases except for shmem which might call mem_cgroup_replace_page_cache with a page which is not charged and on the LRU yet but this was the case also without this patch. In order to fix this we need a guarantee that try_get_mem_cgroup_from_page falls back to the current mm's cgroup so it needs css_tryget to fail. This will be fixed up in a later patch because it needs a help from cgroup core (pre_destroy has to be called after css is cleared). Although mem_cgroup_pre_destroy can still fail (if a new task or a new sub-group appears) there is no reason to retry pre_destroy callback from the cgroup core. This means that __DEPRECATED_clear_css_refs has lost its meaning and it can be removed. Changes since v2 - remove __DEPRECATED_clear_css_refs Changes since v1 - use kerndoc - be more specific about mem_cgroup_move_parent possible failures Signed-off-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NTejun Heo <tj@kernel.org> Reviewed-by: NGlauber Costa <glommer@parallels.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Michal Hocko 提交于
The root cgroup cannot be destroyed so we never hit it down the mem_cgroup_pre_destroy path and mem_cgroup_force_empty_write shouldn't even try to do anything if called for the root. This means that mem_cgroup_move_parent doesn't have to bother with the root cgroup and it can assume it can always move charges upwards. Signed-off-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NTejun Heo <tj@kernel.org> Reviewed-by: NGlauber Costa <glommer@parallels.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Michal Hocko 提交于
mem_cgroup_force_empty did two separate things depending on free_all parameter from the very beginning. It either reclaimed as many pages as possible and moved the rest to the parent or just moved charges to the parent. The first variant is used as memory.force_empty callback while the later is used from the mem_cgroup_pre_destroy. The whole games around gotos are far from being nice and there is no reason to keep those two functions inside one. Let's split them and also move the responsibility for css reference counting to their callers to make to code easier. This patch doesn't have any functional changes. Signed-off-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NTejun Heo <tj@kernel.org> Reviewed-by: NGlauber Costa <glommer@parallels.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 09 10月, 2012 2 次提交
-
-
由 Michal Hocko 提交于
kmem code uses this function and it is better to not use forward declarations for static inline functions as some (older) compilers don't like it: gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) mm/memcontrol.c:421: warning: `mem_cgroup_is_root' declared inline after being called mm/memcontrol.c:421: warning: previous declaration of `mem_cgroup_is_root' was here Signed-off-by: NMichal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Cc: Sachin Kamat <sachin.kamat@linaro.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
TCP kmem accounting is currently guarded by CONFIG_MEMCG_KMEM ifdefs but the code is not used if !CONFIG_INET so we should rather test for both. The same applies to net/sock.h, net/ip.h and net/tcp_memcontrol.h but let's keep those outside of any ifdefs because it is considered safer wrt. future maintainability. Tested with - CONFIG_INET && CONFIG_MEMCG_KMEM - !CONFIG_INET && CONFIG_MEMCG_KMEM - CONFIG_INET && !CONFIG_MEMCG_KMEM - !CONFIG_INET && !CONFIG_MEMCG_KMEM Signed-off-by: NSachin Kamat <sachin.kamat@linaro.org> Signed-off-by: NMichal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 9月, 2012 1 次提交
-
-
由 Tejun Heo 提交于
Currently, cgroup hierarchy support is a mess. cpu related subsystems behave correctly - configuration, accounting and control on a parent properly cover its children. blkio and freezer completely ignore hierarchy and treat all cgroups as if they're directly under the root cgroup. Others show yet different behaviors. These differing interpretations of cgroup hierarchy make using cgroup confusing and it impossible to co-mount controllers into the same hierarchy and obtain sane behavior. Eventually, we want full hierarchy support from all subsystems and probably a unified hierarchy. Users using separate hierarchies expecting completely different behaviors depending on the mounted subsystem is deterimental to making any progress on this front. This patch adds cgroup_subsys.broken_hierarchy and sets it to %true for controllers which are lacking in hierarchy support. The goal of this patch is two-fold. * Move users away from using hierarchy on currently non-hierarchical subsystems, so that implementing proper hierarchy support on those doesn't surprise them. * Keep track of which controllers are broken how and nudge the subsystems to implement proper hierarchy support. For now, start with a single warning message. We can whine louder later on. v2: Fixed a typo spotted by Michal. Warning message updated. v3: Updated memcg part so that it doesn't generate warning in the cases where .use_hierarchy=false doesn't make the behavior different from root.use_hierarchy=true. Fixed a typo spotted by Glauber. v4: Check ->broken_hierarchy after cgroup creation is complete so that ->create() can affect the result per Michal. Dropped unnecessary memcg root handling per Michal. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NLi Zefan <lizefan@huawei.com> Acked-by: NSerge E. Hallyn <serue@us.ibm.com> Cc: Glauber Costa <glommer@parallels.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Thomas Graf <tgraf@suug.ch> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Neil Horman <nhorman@tuxdriver.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
-