- 24 2月, 2013 12 次提交
-
-
由 Glauber Costa 提交于
When use_hierarchy is enabled, we acquire an extra reference count in our parent during cgroup creation. We don't release it, though, if any failure exist in the creation process. Signed-off-by: NGlauber Costa <glommer@parallels.com> Reported-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
We were deferring the kmemcg static branch increment to a later time, due to a nasty dependency between the cpu_hotplug lock, taken by the jump label update, and the cgroup_lock. Now we no longer take the cgroup lock, and we can save ourselves the trouble. Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
After the preparation work done in earlier patches, the cgroup_lock can be trivially replaced with a memcg-specific lock. This is an automatic translation at every site where the values involved were queried. The sites where values are written, however, used to be naturally called under cgroup_lock. This is the case for instance in the css_online callback. For those, we now need to explicitly add the memcg lock. With this, all the calls to cgroup_lock outside cgroup core are gone. Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Currently, we use cgroups' provided list of children to verify if it is safe to proceed with any value change that is dependent on the cgroup being empty. This is less than ideal, because it enforces a dependency over cgroup core that we would be better off without. The solution proposed here is to iterate over the child cgroups and if any is found that is already online, we bounce and return: we don't really care how many children we have, only if we have any. This is also made to be hierarchy aware. IOW, cgroups with hierarchy disabled, while they still exist, will be considered for the purpose of this interface as having no children. [akpm@linux-foundation.org: tweak comments] Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
This patch is a preparatory work for later locking rework to get rid of big cgroup lock from memory controller code. The memory controller uses some tunables to adjust its operation. Those tunables are inherited from parent to children upon children intialization. For most of them, the value cannot be changed after the parent has a new children. cgroup core splits initialization in two phases: css_alloc and css_online. After css_alloc, the memory allocation and basic initialization are done. But the new group is not yet visible anywhere, not even for cgroup core code. It is only somewhere between css_alloc and css_online that it is inserted into the internal children lists. Copying tunable values in css_alloc will lead to inconsistent values: the children will copy the old parent values, that can change between the copy and the moment in which the groups is linked to any data structure that can indicate the presence of children. Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
In memcg, we use the cgroup_lock basically to synchronize against attaching new children to a cgroup. We do this because we rely on cgroup core to provide us with this information. We need to guarantee that upon child creation, our tunables are consistent. For those, the calls to cgroup_lock() all live in handlers like mem_cgroup_hierarchy_write(), where we change a tunable in the group that is hierarchy-related. For instance, the use_hierarchy flag cannot be changed if the cgroup already have children. Furthermore, those values are propagated from the parent to the child when a new child is created. So if we don't lock like this, we can end up with the following situation: A B memcg_css_alloc() mem_cgroup_hierarchy_write() copy use hierarchy from parent change use hierarchy in parent finish creation. This is mainly because during create, we are still not fully connected to the css tree. So all iterators and the such that we could use, will fail to show that the group has children. My observation is that all of creation can proceed in parallel with those tasks, except value assignment. So what this patch series does is to first move all value assignment that is dependent on parent values from css_alloc to css_online, where the iterators all work, and then we lock only the value assignment. This will guarantee that parent and children always have consistent values. Together with an online test, that can be derived from the observation that the refcount of an online memcg can be made to be always positive, we should be able to synchronize our side without the cgroup lock. This patch: Currently, we rely on the cgroup_lock() to prevent changes to move_charge_at_immigrate during task migration. However, this is only needed because the current strategy keeps checking this value throughout the whole process. Since all we need is serialization, one needs only to guarantee that whatever decision we made in the beginning of a specific migration is respected throughout the process. We can achieve this by just saving it in mc. By doing this, no kind of locking is needed. Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Hiroyuki Kamezawa <kamezawa.hiroyuki@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
In order to maintain all the memcg bookkeeping, we need per-node descriptors, which will in turn contain a per-zone descriptor. Because we want to statically allocate those, this array ends up being very big. Part of the reason is that we allocate something large enough to hold MAX_NUMNODES, the compile time constant that holds the maximum number of nodes we would ever consider. However, we can do better in some cases if the firmware help us. This is true for modern x86 machines; coincidentally one of the architectures in which MAX_NUMNODES tends to be very big. By using the firmware-provided maximum number of nodes instead of MAX_NUMNODES, we can reduce the memory footprint of struct memcg considerably. In the extreme case in which we have only one node, this reduces the size of the structure from ~ 64k to ~2k. This is particularly important because it means that we will no longer resort to the vmalloc area for the struct memcg on defconfigs. We also have enough room for an extra node and still be outside vmalloc. One also has to keep in mind that with the industry's ability to fit more processors in a die as fast as the FED prints money, a nodes = 2 configuration is already respectably big. [akpm@linux-foundation.org: add check for invalid nid, remove inline] Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: NGreg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Memcg swap accounting is currently enabled by enable_swap_cgroup when the root cgroup is created. mem_cgroup_init acts as a memcg subsystem initializer which sounds like a much better place for enable_swap_cgroup as well. We already register memsw files from there so it makes a lot of sense to merge those two into a single enable_swap_cgroup function. This patch doesn't introduce any semantic changes. Signed-off-by: NMichal Hocko <mhocko@suse.cz> Cc: Zhouping Liu <zliu@redhat.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Li Zefan <lizefan@huawei.com> Cc: CAI Qian <caiqian@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
Zhouping Liu has reported that memsw files are exported even though swap accounting is runtime disabled if MEMCG_SWAP is enabled. This behavior has been introduced by commit af36f906 ("memcg: always create memsw files if CGROUP_MEM_RES_CTLR_SWAP") and it causes any attempt to open the file to return EOPNOTSUPP. Although EOPNOTSUPP should say be clear that memsw operations are not supported in the given configuration it is fair to say that this behavior could be quite confusing. Let's tear memsw files out of default cgroup files and add them only if the swap accounting is really enabled (either by MEMCG_SWAP_ENABLED or swapaccount=1 boot parameter). We can hook into mem_cgroup_init which is called when the memcg subsystem is initialized and which happens after boot command line is processed. Signed-off-by: NMichal Hocko <mhocko@suse.cz> Reported-by: NZhouping Liu <zliu@redhat.com> Tested-by: NZhouping Liu <zliu@redhat.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Li Zefan <lizefan@huawei.com> Cc: CAI Qian <caiqian@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shaohua Li 提交于
When I use several fast SSD to do swap, swapper_space.tree_lock is heavily contended. This makes each swap partition have one address_space to reduce the lock contention. There is an array of address_space for swap. The swap entry type is the index to the array. In my test with 3 SSD, this increases the swapout throughput 20%. [akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache] Signed-off-by: NShaohua Li <shli@fusionio.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
Acked-by: NSha Zhengju <handai.szj@taobao.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sha Zhengju 提交于
Currently when a memcg oom is happening the oom dump messages is still global state and provides few useful info for users. This patch prints more pointed memcg page statistics for memcg-oom and take hierarchy into consideration: Based on Michal's advice, we take hierarchy into consideration: supppose we trigger an OOM on A's limit root_memcg | A (use_hierachy=1) / \ B C | D then the printed info will be: Memory cgroup stats for /A:... Memory cgroup stats for /A/B:... Memory cgroup stats for /A/C:... Memory cgroup stats for /A/B/D:... Following are samples of oom output: (1) Before change: mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fbfb>] dump_header+0x83/0x1ca ..... (call trace) [<ffffffff8168a818>] page_fault+0x28/0x30 <<<<<<<<<<<<<<<<<<<<< memcg specific information Task in /A/B/D killed as a result of limit of /A memory: usage 101376kB, limit 101376kB, failcnt 57 memory+swap: usage 101376kB, limit 101376kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat Mem-Info: Node 0 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 ...... CPU 3: hi: 0, btch: 1 usd: 0 Node 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 173 ...... CPU 3: hi: 186, btch: 31 usd: 130 <<<<<<<<<<<<<<<<<<<<< print global page state active_anon:92963 inactive_anon:40777 isolated_anon:0 active_file:33027 inactive_file:51718 isolated_file:0 unevictable:0 dirty:3 writeback:0 unstable:0 free:729995 slab_reclaimable:6897 slab_unreclaimable:6263 mapped:20278 shmem:35971 pagetables:5885 bounce:0 free_cma:0 <<<<<<<<<<<<<<<<<<<<< print per zone page state Node 0 DMA free:15836kB ... all_unreclaimable? no lowmem_reserve[]: 0 3175 3899 3899 Node 0 DMA32 free:2888564kB ... all_unrelaimable? no lowmem_reserve[]: 0 0 724 724 lowmem_reserve[]: 0 0 0 0 Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB 120710 total pagecache pages 0 pages in swap cache <<<<<<<<<<<<<<<<<<<<< print global swap cache stat Swap cache stats: add 0, delete 0, find 0/0 Free swap = 499708kB Total swap = 499708kB 1040368 pages RAM 58678 pages reserved 169065 pages shared 173632 pages non-shared [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2693] 0 2693 6005 1324 17 0 0 god [ 2754] 0 2754 6003 1320 16 0 0 god [ 2811] 0 2811 5992 1304 18 0 0 god [ 2874] 0 2874 6005 1323 18 0 0 god [ 2935] 0 2935 8720 7742 21 0 0 mal-30 [ 2976] 0 2976 21520 17577 42 0 0 mal-80 Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB We can see that messages dumped by show_free_areas() are longsome and can provide so limited info for memcg that just happen oom. (2) After change mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0 mal-80 cpuset=/ mems_allowed=0 Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10 Call Trace: [<ffffffff8167fd0b>] dump_header+0x83/0x1d1 .......(call trace) [<ffffffff8168a918>] page_fault+0x28/0x30 Task in /A/B/D killed as a result of limit of /A <<<<<<<<<<<<<<<<<<<<< memcg specific information memory: usage 102400kB, limit 102400kB, failcnt 140 memory+swap: usage 102400kB, limit 102400kB, failcnt 0 kmem: usage 0kB, limit 9007199254740991kB, failcnt 0 Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name [ 2260] 0 2260 6006 1325 18 0 0 god [ 2383] 0 2383 6003 1319 17 0 0 god [ 2503] 0 2503 6004 1321 18 0 0 god [ 2622] 0 2622 6004 1321 16 0 0 god [ 2695] 0 2695 8720 7741 22 0 0 mal-30 [ 2704] 0 2704 21520 17839 43 0 0 mal-80 Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB This version provides more pointed info for memcg in "Memory cgroup stats for XXX" section. Signed-off-by: NSha Zhengju <handai.szj@taobao.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 2月, 2013 1 次提交
-
-
由 Glauber Costa 提交于
The designed workflow for the caches in kmemcg is: register it with memcg_register_cache() if kmemcg is already available or later on when a new kmemcg appears at memcg_update_cache_sizes() which will handle all caches in the system. The caches created at boot time will be handled by the later, and the memcg-caches as well as any system caches that are registered later on by the former. There is a bug, however, in memcg_register_cache: we correctly set up the array size, but do not mark the cache as a root cache. This means that allocations for any cache appearing late in the game will see memcg->memcg_params->is_root_cache == false, and in particular, trigger VM_BUG_ON(!cachep->memcg_params->is_root_cache) in __memcg_kmem_cache_get. The obvious fix is to include the missing assignment. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 12月, 2012 1 次提交
-
-
由 Tejun Heo 提交于
Commit 648bb56d ("cgroup: lock cgroup_mutex in cgroup_init_subsys()") made cgroup_init_subsys() grab cgroup_mutex before invoking ->css_alloc() for the root css. Because memcg registers hotcpu notifier from ->css_alloc() for the root css, this introduced circular locking dependency between cgroup_mutex and cpu hotplug. Fix it by moving hotcpu notifier registration to a subsys initcall. ====================================================== [ INFO: possible circular locking dependency detected ] 3.7.0-rc4-work+ #42 Not tainted ------------------------------------------------------- bash/645 is trying to acquire lock: (cgroup_mutex){+.+.+.}, at: [<ffffffff8110c5b7>] cgroup_lock+0x17/0x20 but task is already holding lock: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109300f>] cpu_hotplug_begin+0x2f/0x60 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (cpu_hotplug.lock){+.+.+.}: lock_acquire+0x97/0x1e0 mutex_lock_nested+0x61/0x3b0 get_online_cpus+0x3c/0x60 rebuild_sched_domains_locked+0x1b/0x70 cpuset_write_resmask+0x298/0x2c0 cgroup_file_write+0x1ef/0x300 vfs_write+0xa8/0x160 sys_write+0x52/0xa0 system_call_fastpath+0x16/0x1b -> #0 (cgroup_mutex){+.+.+.}: __lock_acquire+0x14ce/0x1d20 lock_acquire+0x97/0x1e0 mutex_lock_nested+0x61/0x3b0 cgroup_lock+0x17/0x20 cpuset_handle_hotplug+0x1b/0x560 cpuset_update_active_cpus+0xe/0x10 cpuset_cpu_inactive+0x47/0x50 notifier_call_chain+0x66/0x150 __raw_notifier_call_chain+0xe/0x10 __cpu_notify+0x20/0x40 _cpu_down+0x7e/0x2f0 cpu_down+0x36/0x50 store_online+0x5d/0xe0 dev_attr_store+0x18/0x30 sysfs_write_file+0xe0/0x150 vfs_write+0xa8/0x160 sys_write+0x52/0xa0 system_call_fastpath+0x16/0x1b other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(cpu_hotplug.lock); lock(cgroup_mutex); lock(cpu_hotplug.lock); lock(cgroup_mutex); *** DEADLOCK *** 5 locks held by bash/645: #0: (&buffer->mutex){+.+.+.}, at: [<ffffffff8123bab8>] sysfs_write_file+0x48/0x150 #1: (s_active#42){.+.+.+}, at: [<ffffffff8123bb38>] sysfs_write_file+0xc8/0x150 #2: (x86_cpu_hotplug_driver_mutex){+.+...}, at: [<ffffffff81079277>] cpu_hotplug_driver_lock+0x1 +7/0x20 #3: (cpu_add_remove_lock){+.+.+.}, at: [<ffffffff81093157>] cpu_maps_update_begin+0x17/0x20 #4: (cpu_hotplug.lock){+.+.+.}, at: [<ffffffff8109300f>] cpu_hotplug_begin+0x2f/0x60 stack backtrace: Pid: 645, comm: bash Not tainted 3.7.0-rc4-work+ #42 Call Trace: print_circular_bug+0x28e/0x29f __lock_acquire+0x14ce/0x1d20 lock_acquire+0x97/0x1e0 mutex_lock_nested+0x61/0x3b0 cgroup_lock+0x17/0x20 cpuset_handle_hotplug+0x1b/0x560 cpuset_update_active_cpus+0xe/0x10 cpuset_cpu_inactive+0x47/0x50 notifier_call_chain+0x66/0x150 __raw_notifier_call_chain+0xe/0x10 __cpu_notify+0x20/0x40 _cpu_down+0x7e/0x2f0 cpu_down+0x36/0x50 store_online+0x5d/0xe0 dev_attr_store+0x18/0x30 sysfs_write_file+0xe0/0x150 vfs_write+0xa8/0x160 sys_write+0x52/0xa0 system_call_fastpath+0x16/0x1b Signed-off-by: NTejun Heo <tj@kernel.org> Reported-by: NFengguang Wu <fengguang.wu@intel.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 19 12月, 2012 19 次提交
-
-
由 Glauber Costa 提交于
SLAB allows us to tune a particular cache behavior with tunables. When creating a new memcg cache copy, we'd like to preserve any tunables the parent cache already had. This could be done by an explicit call to do_tune_cpucache() after the cache is created. But this is not very convenient now that the caches are created from common code, since this function is SLAB-specific. Another method of doing that is taking advantage of the fact that do_tune_cpucache() is always called from enable_cpucache(), which is called at cache initialization. We can just preset the values, and then things work as expected. It can also happen that a root cache has its tunables updated during normal system operation. In this case, we will propagate the change to all caches that are already active. This change will require us to move the assignment of root_cache in memcg_params a bit earlier. We need this to be already set - which memcg_kmem_register_cache will do - when we reach __kmem_cache_create() Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
When we create caches in memcgs, we need to display their usage information somewhere. We'll adopt a scheme similar to /proc/meminfo, with aggregate totals shown in the global file, and per-group information stored in the group itself. For the time being, only reads are allowed in the per-group cache. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
This means that when we destroy a memcg cache that happened to be empty, those caches may take a lot of time to go away: removing the memcg reference won't destroy them - because there are pending references, and the empty pages will stay there, until a shrinker is called upon for any reason. In this patch, we will call kmem_cache_shrink() for all dead caches that cannot be destroyed because of remaining pages. After shrinking, it is possible that it could be freed. If this is not the case, we'll schedule a lazy worker to keep trying. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
This enables us to remove all the children of a kmem_cache being destroyed, if for example the kernel module it's being used in gets unloaded. Otherwise, the children will still point to the destroyed parent. Signed-off-by: NSuleiman Souhlal <suleiman@google.com> Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Implement destruction of memcg caches. Right now, only caches where our reference counter is the last remaining are deleted. If there are any other reference counters around, we just leave the caches lying around until they go away. When that happens, a destruction function is called from the cache code. Caches are only destroyed in process context, so we queue them up for later processing in the general case. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
We are able to match a cache allocation to a particular memcg. If the task doesn't change groups during the allocation itself - a rare event, this will give us a good picture about who is the first group to touch a cache page. This patch uses the now available infrastructure by calling memcg_kmem_get_cache() before all the cache allocations. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Create a mechanism that skip memcg allocations during certain pieces of our core code. It basically works in the same way as preempt_disable()/preempt_enable(): By marking a region under which all allocations will be accounted to the root memcg. We need this to prevent races in early cache creation, when we allocate data using caches that are not necessarily created already. Signed-off-by: NGlauber Costa <glommer@parallels.com> yCc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
The page allocator is able to bind a page to a memcg when it is allocated. But for the caches, we'd like to have as many objects as possible in a page belonging to the same cache. This is done in this patch by calling memcg_kmem_get_cache in the beginning of every allocation function. This function is patched out by static branches when kernel memory controller is not being used. It assumes that the task allocating, which determines the memcg in the page allocator, belongs to the same cgroup throughout the whole process. Misaccounting can happen if the task calls memcg_kmem_get_cache() while belonging to a cgroup, and later on changes. This is considered acceptable, and should only happen upon task migration. Before the cache is created by the memcg core, there is also a possible imbalance: the task belongs to a memcg, but the cache being allocated from is the global cache, since the child cache is not yet guaranteed to be ready. This case is also fine, since in this case the GFP_KMEMCG will not be passed and the page allocator will not attempt any cgroup accounting. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Every cache that is considered a root cache (basically the "original" caches, tied to the root memcg/no-memcg) will have an array that should be large enough to store a cache pointer per each memcg in the system. Theoreticaly, this is as high as 1 << sizeof(css_id), which is currently in the 64k pointers range. Most of the time, we won't be using that much. What goes in this patch, is a simple scheme to dynamically allocate such an array, in order to minimize memory usage for memcg caches. Because we would also like to avoid allocations all the time, at least for now, the array will only grow. It will tend to be big enough to hold the maximum number of kmem-limited memcgs ever achieved. We'll allocate it to be a minimum of 64 kmem-limited memcgs. When we have more than that, we'll start doubling the size of this array every time the limit is reached. Because we are only considering kmem limited memcgs, a natural point for this to happen is when we write to the limit. At that point, we already have set_limit_mutex held, so that will become our natural synchronization mechanism. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Allow a memcg parameter to be passed during cache creation. When the slub allocator is being used, it will only merge caches that belong to the same memcg. We'll do this by scanning the global list, and then translating the cache to a memcg-specific cache Default function is created as a wrapper, passing NULL to the memcg version. We only merge caches that belong to the same memcg. A helper is provided, memcg_css_id: because slub needs a unique cache name for sysfs. Since this is visible, but not the canonical location for slab data, the cache name is not used, the css_id should suffice. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
A lot of the initialization we do in mem_cgroup_create() is done with softirqs enabled. This include grabbing a css id, which holds &ss->id_lock->rlock, and the per-zone trees, which holds rtpz->lock->rlock. All of those signal to the lockdep mechanism that those locks can be used in SOFTIRQ-ON-W context. This means that the freeing of memcg structure must happen in a compatible context, otherwise we'll get a deadlock, like the one below, caught by lockdep: free_accounted_pages+0x47/0x4c free_task+0x31/0x5c __put_task_struct+0xc2/0xdb put_task_struct+0x1e/0x22 delayed_put_task_struct+0x7a/0x98 __rcu_process_callbacks+0x269/0x3df rcu_process_callbacks+0x31/0x5b __do_softirq+0x122/0x277 This usage pattern could not be triggered before kmem came into play. With the introduction of kmem stack handling, it is possible that we call the last mem_cgroup_put() from the task destructor, which is run in an rcu callback. Such callbacks are run with softirqs disabled, leading to the offensive usage pattern. In general, we have little, if any, means to guarantee in which context the last memcg_put will happen. The best we can do is test it and try to make sure no invalid context releases are happening. But as we add more code to memcg, the possible interactions grow in number and expose more ways to get context conflicts. One thing to keep in mind, is that part of the freeing process is already deferred to a worker, such as vfree(), that can only be called from process context. For the moment, the only two functions we really need moved away are: * free_css_id(), and * mem_cgroup_remove_from_trees(). But because the later accesses per-zone info, free_mem_cgroup_per_zone_info() needs to be moved as well. With that, we are left with the per_cpu stats only. Better move it all. Signed-off-by: NGlauber Costa <glommer@parallels.com> Tested-by: NGreg Thelen <gthelen@google.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Because the ultimate goal of the kmem tracking in memcg is to track slab pages as well, we can't guarantee that we'll always be able to point a page to a particular process, and migrate the charges along with it - since in the common case, a page will contain data belonging to multiple processes. Because of that, when we destroy a memcg, we only make sure the destruction will succeed by discounting the kmem charges from the user charges when we try to empty the cgroup. Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
We can use static branches to patch the code in or out when not used. Because the _ACTIVE bit on kmem_accounted is only set after the increment is done, we guarantee that the root memcg will always be selected for kmem charges until all call sites are patched (see memcg_kmem_enabled). This guarantees that no mischarges are applied. Static branch decrement happens when the last reference count from the kmem accounting in memcg dies. This will only happen when the charges drop down to 0. When that happens, we need to disable the static branch only on those memcgs that enabled it. To achieve this, we would be forced to complicate the code by keeping track of which memcgs were the ones that actually enabled limits, and which ones got it from its parents. It is a lot simpler just to do static_key_slow_inc() on every child that is accounted. Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Because kmem charges can outlive the cgroup, we need to make sure that we won't free the memcg structure while charges are still in flight. For reviewing simplicity, the charge functions will issue mem_cgroup_get() at every charge, and mem_cgroup_put() at every uncharge. This can get expensive, however, and we can do better. mem_cgroup_get() only really needs to be issued once: when the first limit is set. In the same spirit, we only need to issue mem_cgroup_put() when the last charge is gone. We'll need an extra bit in kmem_account_flags for that: KMEM_ACCOUNTED_DEAD. it will be set when the cgroup dies, if there are charges in the group. If there aren't, we can proceed right away. Our uncharge function will have to test that bit every time the charges drop to 0. Because that is not the likely output of res_counter_uncharge, this should not impose a big hit on us: it is certainly much better than a reference count decrease at every operation. Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Introduce infrastructure for tracking kernel memory pages to a given memcg. This will happen whenever the caller includes the flag __GFP_KMEMCG flag, and the task belong to a memcg other than the root. In memcontrol.h those functions are wrapped in inline acessors. The idea is to later on, patch those with static branches, so we don't incur any overhead when no mem cgroups with limited kmem are being used. Users of this functionality shall interact with the memcg core code through the following functions: memcg_kmem_newpage_charge: will return true if the group can handle the allocation. At this point, struct page is not yet allocated. memcg_kmem_commit_charge: will either revert the charge, if struct page allocation failed, or embed memcg information into page_cgroup. memcg_kmem_uncharge_page: called at free time, will revert the charge. Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Add the basic infrastructure for the accounting of kernel memory. To control that, the following files are created: * memory.kmem.usage_in_bytes * memory.kmem.limit_in_bytes * memory.kmem.failcnt * memory.kmem.max_usage_in_bytes They have the same meaning of their user memory counterparts. They reflect the state of the "kmem" res_counter. Per cgroup kmem memory accounting is not enabled until a limit is set for the group. Once the limit is set the accounting cannot be disabled for that group. This means that after the patch is applied, no behavioral changes exists for whoever is still using memcg to control their memory usage, until memory.kmem.limit_in_bytes is set for the first time. We always account to both user and kernel resource_counters. This effectively means that an independent kernel limit is in place when the limit is set to a lower value than the user memory. A equal or higher value means that the user limit will always hit first, meaning that kmem is effectively unlimited. People who want to track kernel memory but not limit it, can set this limit to a very high number (like RESOURCE_MAX - 1page - that no one will ever hit, or equal to the user memory) [akpm@linux-foundation.org: MEMCG_MMEM only works with slab and slub] Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
This is just a cleanup patch for clarity of expression. In earlier submissions, people asked it to be in a separate patch, so here it is. Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Suleiman Souhlal 提交于
mem_cgroup_do_charge() was written before kmem accounting, and expects three cases: being called for 1 page, being called for a stock of 32 pages, or being called for a hugepage. If we call for 2 or 3 pages (and both the stack and several slabs used in process creation are such, at least with the debug options I had), it assumed it's being called for stock and just retried without reclaiming. Fix that by passing down a minsize argument in addition to the csize. And what to do about that (csize == PAGE_SIZE && ret) retry? If it's needed at all (and presumably is since it's there, perhaps to handle races), then it should be extended to more than PAGE_SIZE, yet how far? And should there be a retry count limit, of what? For now retry up to COSTLY_ORDER (as page_alloc.c does) and make sure not to do it if __GFP_NORETRY. v4: fixed nr pages calculation pointed out by Christoph Lameter. Signed-off-by: NSuleiman Souhlal <suleiman@google.com> Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Suleiman Souhlal 提交于
We currently have a percpu stock cache scheme that charges one page at a time from memcg->res, the user counter. When the kernel memory controller comes into play, we'll need to charge more than that. This is because kernel memory allocations will also draw from the user counter, and can be bigger than a single page, as it is the case with the stack (usually 2 pages) or some higher order slabs. [glommer@parallels.com: added a changelog ] Signed-off-by: NSuleiman Souhlal <suleiman@google.com> Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 12月, 2012 3 次提交
-
-
由 Michal Hocko 提交于
The mm given to __mem_cgroup_count_vm_event() cannot be NULL because the function is either called from the page fault path or vma->vm_mm is used. So the check can be dropped. The check was introduced by commit 456f998e ("memcg: add the pagefault count into memcg stats") because the originally proposed patch used current->mm for shmem but this has been changed to vma->vm_mm later on without the check being removed (thanks to Hugh for this recollection). Signed-off-by: NMichal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
While profiling numa/core v16 with cgroup_disable=memory on the command line, I noticed mem_cgroup_count_vm_event() still showed up as high as 0.60% in perftop. This occurs because the function is called extremely often even when memcg is disabled. To fix this, inline the check for mem_cgroup_disabled() so we avoid the unnecessary function call if memcg is disabled. Signed-off-by: NDavid Rientjes <rientjes@google.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NGlauber Costa <glommer@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lai Jiangshan 提交于
N_HIGH_MEMORY stands for the nodes that has normal or high memory. N_MEMORY stands for the nodes that has any memory. The code here need to handle with the nodes which have memory, we should use N_MEMORY instead. Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: NWen Congyang <wency@cn.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Lin Feng <linfeng@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 12月, 2012 1 次提交
-
-
由 David Rientjes 提交于
mem_cgroup_out_of_memory() is only referenced from within file scope, so it can be marked static. Signed-off-by: NDavid Rientjes <rientjes@google.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 12月, 2012 1 次提交
-
-
由 Mel Gorman 提交于
Note: This is very heavily based on a patch from Peter Zijlstra with fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner. That patch put a lot of migration logic into mm/huge_memory.c where it does not belong. This version puts tries to share some of the migration logic with migrate_misplaced_page. However, it should be noted that now migrate.c is doing more with the pagetable manipulation than is preferred. The end result is barely recognisable so as before, the signed-offs had to be removed but will be re-added if the original authors are ok with it. Add THP migration for the NUMA working set scanning fault case. It uses the page lock to serialize. No migration pte dance is necessary because the pte is already unmapped when we decide to migrate. [dhillf@gmail.com: Fix memory leak on isolation failure] [dhillf@gmail.com: Fix transfer of last_nid information] Signed-off-by: NMel Gorman <mgorman@suse.de>
-
- 20 11月, 2012 1 次提交
-
-
由 Tejun Heo 提交于
Rename cgroup_subsys css lifetime related callbacks to better describe what their roles are. Also, update documentation. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NLi Zefan <lizefan@huawei.com>
-
- 17 11月, 2012 1 次提交
-
-
由 Hugh Dickins 提交于
When MEMCG is configured on (even when it's disabled by boot option), when adding or removing a page to/from its lru list, the zone pointer used for stats updates is nowadays taken from the struct lruvec. (On many configurations, calculating zone from page is slower.) But we have no code to update all the lruvecs (per zone, per memcg) when a memory node is hotadded. Here's an extract from the oops which results when running numactl to bind a program to a newly onlined node: BUG: unable to handle kernel NULL pointer dereference at 0000000000000f60 IP: __mod_zone_page_state+0x9/0x60 Pid: 1219, comm: numactl Not tainted 3.6.0-rc5+ #180 Bochs Bochs Process numactl (pid: 1219, threadinfo ffff880039abc000, task ffff8800383c4ce0) Call Trace: __pagevec_lru_add_fn+0xdf/0x140 pagevec_lru_move_fn+0xb1/0x100 __pagevec_lru_add+0x1c/0x30 lru_add_drain_cpu+0xa3/0x130 lru_add_drain+0x2f/0x40 ... The natural solution might be to use a memcg callback whenever memory is hotadded; but that solution has not been scoped out, and it happens that we do have an easy location at which to update lruvec->zone. The lruvec pointer is discovered either by mem_cgroup_zone_lruvec() or by mem_cgroup_page_lruvec(), and both of those do know the right zone. So check and set lruvec->zone in those; and remove the inadequate attempt to set lruvec->zone from lruvec_init(), which is called before NODE_DATA(node) has been allocated in such cases. Ah, there was one exceptionr. For no particularly good reason, mem_cgroup_force_empty_list() has its own code for deciding lruvec. Change it to use the standard mem_cgroup_zone_lruvec() and mem_cgroup_get_lru_size() too. In fact it was already safe against such an oops (the lru lists in danger could only be empty), but we're better proofed against future changes this way. I've marked this for stable (3.6) since we introduced the problem in 3.5 (now closed to stable); but I have no idea if this is the only fix needed to get memory hotadd working with memcg in 3.6, and received no answer when I enquired twice before. Reported-by: NTang Chen <tangchen@cn.fujitsu.com> Signed-off-by: NHugh Dickins <hughd@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-