- 15 1月, 2016 10 次提交
-
-
由 Johannes Weiner 提交于
There won't be any separate counters for socket memory consumed by protocols other than TCP in the future. Remove the indirection and link sockets directly to their owning memory cgroup. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
There won't be a tcp control soft limit, so integrating the memcg code into the global skmem limiting scheme complicates things unnecessarily. Replace this with simple and clear charge and uncharge calls--hidden behind a jump label--to account skb memory. Note that this is not purely aesthetic: as a result of shoehorning the per-memcg code into the same memory accounting functions that handle the global level, the old code would compare the per-memcg consumption against the smaller of the per-memcg limit and the global limit. This allowed the total consumption of multiple sockets to exceed the global limit, as long as the individual sockets stayed within bounds. After this change, the code will always compare the per-memcg consumption to the per-memcg limit, and the global consumption to the global limit, and thus close this loophole. Without a soft limit, the per-memcg memory pressure state in sockets is generally questionable. However, we did it until now, so we continue to enter it when the hard limit is hit, and packets are dropped, to let other sockets in the cgroup know that they shouldn't grow their transmit windows, either. However, keep it simple in the new callback model and leave memory pressure lazily when the next packet is accepted (as opposed to doing it synchroneously when packets are processed). When packets are dropped, network performance will already be in the toilet, so that should be a reasonable trade-off. As described above, consumption is now checked on the per-memcg level and the global level separately. Likewise, memory pressure states are maintained on both the per-memcg level and the global level, and a socket is considered under pressure when either level asserts as much. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
tcp_memcontrol replicates the global sysctl_mem limit array per cgroup, but it only ever sets these entries to the value of the memory_allocated page_counter limit. Use the latter directly. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The number of allocated sockets is used for calculations in the soft limit phase, where packets are accepted but the socket is under memory pressure. Since there is no soft limit phase in tcp_memcontrol, and memory pressure is only entered when packets are already dropped, this is actually dead code. Remove it. As this is the last user of parent_cg_proto(), remove that too. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NDavid S. Miller <davem@davemloft.net> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Move the jump-label from sock_update_memcg() and sock_release_memcg() to the callsite, and so eliminate those function calls when socket accounting is not enabled. This also eliminates the need for dummy functions because the calls will be optimized away if the Kconfig options are not enabled. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NDavid S. Miller <davem@davemloft.net> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
A later patch will need this symbol in files other than memcontrol.c, so export it now and replace mem_cgroup_root_css at the same time. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
There are two bits defined for cg_proto->flags - MEMCG_SOCK_ACTIVATED and MEMCG_SOCK_ACTIVE - both are set in tcp_update_limit, but the former is never cleared while the latter can be cleared by unsetting the limit. This allows to disable tcp socket accounting for new sockets after it was enabled by writing -1 to memory.kmem.tcp.limit_in_bytes while still guaranteeing that memcg_socket_limit_enabled static key will be decremented on memcg destruction. This functionality looks dubious, because it is not clear what a use case would be. By enabling tcp accounting a user accepts the price. If they then find the performance degradation unacceptable, they can always restart their workload with tcp accounting disabled. It does not seem there is any need to flip it while the workload is running. Besides, it contradicts to how kmem accounting API works: writing whatever to memory.kmem.limit_in_bytes enables kmem accounting for the cgroup in question, after which it cannot be disabled. Therefore one might expect that writing -1 to memory.kmem.tcp.limit_in_bytes just enables socket accounting w/o limiting it, which might be useful by itself, but it isn't true. Since this API peculiarity is not documented anywhere, I propose to drop it. This will allow to simplify the code by dropping cg_proto->flags. Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Currently, if we want to account all objects of a particular kmem cache, we have to pass __GFP_ACCOUNT to each kmem_cache_alloc call, which is inconvenient. This patch introduces SLAB_ACCOUNT flag which if passed to kmem_cache_create will force accounting for every allocation from this cache even if __GFP_ACCOUNT is not passed. This patch does not make any of the existing caches use this flag - it will be done later in the series. Note, a cache with SLAB_ACCOUNT cannot be merged with a cache w/o SLAB_ACCOUNT, because merged caches share the same kmem_cache struct and hence cannot have different sets of SLAB_* flags. Thus using this flag will probably reduce the number of merged slabs even if kmem accounting is not used (only compiled in). Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Suggested-by: NTejun Heo <tj@kernel.org> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be fragile and difficult to maintain, because there seem to be many more allocations that should not be accounted than those that should be. Besides, false accounting an allocation might result in much worse consequences than not accounting at all, namely increased memory consumption due to pinned dead kmem caches. So this patch switches kmem accounting to the white-policy: now only those kmem allocations that are marked as __GFP_ACCOUNT are accounted to memcg. Currently, no kmem allocations are marked like this. The following patches will mark several kmem allocations that are known to be easily triggered from userspace and therefore should be accounted to memcg. Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
This reverts commit 8f4fc071 ("gfp: add __GFP_NOACCOUNT"). Black-list kmem accounting policy (aka __GFP_NOACCOUNT) turned out to be fragile and difficult to maintain, because there seem to be many more allocations that should not be accounted than those that should be. Besides, false accounting an allocation might result in much worse consequences than not accounting at all, namely increased memory consumption due to pinned dead kmem caches. So it was decided to switch to the white-list policy. This patch reverts bits introducing the black-list policy. The white-list policy will be introduced later in the series. Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Greg Thelen <gthelen@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 11月, 2015 9 次提交
-
-
由 Hugh Dickins 提交于
After v4.3's commit 0610c25d ("memcg: fix dirty page migration") mem_cgroup_migrate() doesn't have much to offer in page migration: convert migrate_misplaced_transhuge_page() to set_page_memcg() instead. Then rename mem_cgroup_migrate() to mem_cgroup_replace_page(), since its remaining callers are replace_page_cache_page() and shmem_replace_page(): both of whom passed lrucare true, so just eliminate that argument. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Before the previous patch ("memcg: unify slab and other kmem pages charging"), __mem_cgroup_from_kmem had to handle two types of kmem - slab pages and pages allocated with alloc_kmem_pages - memcg in the page struct. Now we can unify it. Since after it, this function becomes tiny we can fold it into mem_cgroup_from_kmem. [hughd@google.com: move mem_cgroup_from_kmem into list_lru.c] Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and uncharging kmem pages to memcg, but currently they are not used for charging slab pages (i.e. they are only used for charging pages allocated with alloc_kmem_pages). The only reason why the slab subsystem uses special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it needs to charge to the memcg of kmem cache while memcg_charge_kmem charges to the memcg that the current task belongs to. To remove this diversity, this patch adds an extra argument to __memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is not NULL, the function tries to charge to the memcg it points to, otherwise it charge to the current context. Next, it makes the slab subsystem use this function to charge slab pages. Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since __memcg_kmem_charge stores a pointer to the memcg in the page struct, we don't need memcg_uncharge_slab anymore and can use free_kmem_pages. Besides, one can now detect which memcg a slab page belongs to by reading /proc/kpagecgroup. Note, this patch switches slab to charge-after-alloc design. Since this design is already used for all other memcg charges, it should not make any difference. [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup] Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
Charging kmem pages proceeds in two steps. First, we try to charge the allocation size to the memcg the current task belongs to, then we allocate a page and "commit" the charge storing the pointer to the memcg in the page struct. Such a design looks overcomplicated, because there is not much sense in trying charging the allocation before actually allocating a page: we won't be able to consume much memory over the limit even if we charge after doing the actual allocation, besides we already charge user pages post factum, so being pedantic with kmem pages just looks pointless. So this patch simplifies the design by merging the "charge" and the "commit" steps into the same function, which takes the allocated page. Also, rename the charge and uncharge methods to memcg_kmem_charge and memcg_kmem_uncharge and make the charge method return error code instead of bool to conform to mem_cgroup_try_charge. Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yaowei Bai 提交于
Make mem_cgroup_inactive_anon_is_low return bool due to this particular function only using either one or zero as its return value. No functional change. Signed-off-by: NYaowei Bai <bywxiaobai@163.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tejun Heo 提交于
__memcg_kmem_bypass() decides whether a kmem allocation should be bypassed to the root memcg. Some conditions that it tests are valid criteria regarding who should be held accountable; however, there are a couple unnecessary tests for cold paths - __GFP_FAIL and fatal_signal_pending(). The previous patch updated try_charge() to handle both __GFP_FAIL and dying tasks correctly and the only thing these two tests are doing is making accounting less accurate and sprinkling tests for cold path conditions in the hot paths. There's nothing meaningful gained by these extra tests. This patch removes the two unnecessary tests from __memcg_kmem_bypass(). Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tejun Heo 提交于
memcg_kmem_newpage_charge() and memcg_kmem_get_cache() are testing the same series of conditions to decide whether to bypass kmem accounting. Collect the tests into __memcg_kmem_bypass(). This is pure refactoring. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tejun Heo 提交于
Currently, try_charge() tries to reclaim memory synchronously when the high limit is breached; however, if the allocation doesn't have __GFP_WAIT, synchronous reclaim is skipped. If a process performs only speculative allocations, it can blow way past the high limit. This is actually easily reproducible by simply doing "find /". slab/slub allocator tries speculative allocations first, so as long as there's memory which can be consumed without blocking, it can keep allocating memory regardless of the high limit. This patch makes try_charge() always punt the over-high reclaim to the return-to-userland path. If try_charge() detects that high limit is breached, it adds the overage to current->memcg_nr_pages_over_high and schedules execution of mem_cgroup_handle_over_high() which performs synchronous reclaim from the return-to-userland path. As long as kernel doesn't have a run-away allocation spree, this should provide enough protection while making kmemcg behave more consistently. It also has the following benefits. - All over-high reclaims can use GFP_KERNEL regardless of the specific gfp mask in use, e.g. GFP_NOFS, when the limit was breached. - It copes with prio inversion. Previously, a low-prio task with small memory.high might perform over-high reclaim with a bunch of locks held. If a higher prio task needed any of these locks, it would have to wait until the low prio task finished reclaim and released the locks. By handing over-high reclaim to the task exit path this issue can be avoided. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NMichal Hocko <mhocko@kernel.org> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tejun Heo 提交于
task_struct->memcg_oom is a sub-struct containing fields which are used for async memcg oom handling. Most task_struct fields aren't packaged this way and it can lead to unnecessary alignment paddings. This patch flattens it. * task.memcg_oom.memcg -> task.memcg_in_oom * task.memcg_oom.gfp_mask -> task.memcg_oom_gfp_mask * task.memcg_oom.order -> task.memcg_oom_order * task.memcg_oom.may_oom -> task.memcg_may_oom In addition, task.memcg_may_oom is relocated to where other bitfields are which reduces the size of task_struct. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 10月, 2015 1 次提交
-
-
由 Tejun Heo 提交于
For memcg domains, the amount of available memory was calculated as min(the amount currently in use + headroom according to memcg, total clean memory) This isn't quite correct as what should be capped by the amount of clean memory is the headroom, not the sum of memory in use and headroom. For example, if a memcg domain has a significant amount of dirty memory, the above can lead to a value which is lower than the current amount in use which doesn't make much sense. In most circumstances, the above leads to a number which is somewhat but not drastically lower. As the amount of memory which can be readily allocated to the memcg domain is capped by the amount of system-wide clean memory which is not already assigned to the memcg itself, the number we want is the amount currently in use + min(headroom according to memcg, clean memory elsewhere in the system) This patch updates mem_cgroup_wb_stats() to return the number of filepages and headroom instead of the calculated available pages. mdtc_cap_avail() is renamed to mdtc_calc_avail() and performs the above calculation from file, headroom, dirty and globally clean pages. v2: Dummy mem_cgroup_wb_stats() implementation wasn't updated leading to build failure when !CGROUP_WRITEBACK. Fixed. Signed-off-by: NTejun Heo <tj@kernel.org> Fixes: c2aa723a ("writeback: implement memcg writeback domain based throttling") Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 02 10月, 2015 1 次提交
-
-
由 Greg Thelen 提交于
Commit 733a572e ("memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online") removed the last use of the per memcg pcp_counter_lock but forgot to remove the variable. Kill the vestigial variable. Signed-off-by: NGreg Thelen <gthelen@google.com> Acked-by: NMichal Hocko <mhocko@suse.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 9月, 2015 1 次提交
-
-
由 Tejun Heo 提交于
cgroup core only recently grew generic notification support. Wire up "memory.events" so that it triggers a file modified event whenever its content changes. v2: Refreshed on top of mem_cgroup relocation. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NMichal Hocko <mhocko@kernel.org> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Li Zefan <lizefan@huawei.com>
-
- 18 9月, 2015 1 次提交
-
-
由 Tejun Heo 提交于
Replace cgroup_subsys->disabled tests in controllers with cgroup_subsys_enabled(). cgroup_subsys_enabled() requires literal subsys name as its parameter and thus can't be used for cgroup core which iterates through controllers. For cgroup core, introduce and use cgroup_ssid_enabled() which uses slower static_key_enabled() test and can be indexed by subsys ID. This leaves cgroup_subsys->disabled unused. Removed. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NZefan Li <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org>
-
- 11 9月, 2015 2 次提交
-
-
由 Vladimir Davydov 提交于
It is only used in mem_cgroup_try_charge, so fold it in and zap it. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
This patchset introduces a new user API for tracking user memory pages that have not been used for a given period of time. The purpose of this is to provide the userspace with the means of tracking a workload's working set, i.e. the set of pages that are actively used by the workload. Knowing the working set size can be useful for partitioning the system more efficiently, e.g. by tuning memory cgroup limits appropriately, or for job placement within a compute cluster. ==== USE CASES ==== The unified cgroup hierarchy has memory.low and memory.high knobs, which are defined as the low and high boundaries for the workload working set size. However, the working set size of a workload may be unknown or change in time. With this patch set, one can periodically estimate the amount of memory unused by each cgroup and tune their memory.low and memory.high parameters accordingly, therefore optimizing the overall memory utilization. Another use case is balancing workloads within a compute cluster. Knowing how much memory is not really used by a workload unit may help take a more optimal decision when considering migrating the unit to another node within the cluster. Also, as noted by Minchan, this would be useful for per-process reclaim (https://lwn.net/Articles/545668/). With idle tracking, we could reclaim idle pages only by smart user memory manager. ==== USER API ==== The user API consists of two new files: * /sys/kernel/mm/page_idle/bitmap. This file implements a bitmap where each bit corresponds to a page, indexed by PFN. When the bit is set, the corresponding page is idle. A page is considered idle if it has not been accessed since it was marked idle. To mark a page idle one should set the bit corresponding to the page by writing to the file. A value written to the file is OR-ed with the current bitmap value. Only user memory pages can be marked idle, for other page types input is silently ignored. Writing to this file beyond max PFN results in the ENXIO error. Only available when CONFIG_IDLE_PAGE_TRACKING is set. This file can be used to estimate the amount of pages that are not used by a particular workload as follows: 1. mark all pages of interest idle by setting corresponding bits in the /sys/kernel/mm/page_idle/bitmap 2. wait until the workload accesses its working set 3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set * /proc/kpagecgroup. This file contains a 64-bit inode number of the memory cgroup each page is charged to, indexed by PFN. Only available when CONFIG_MEMCG is set. This file can be used to find all pages (including unmapped file pages) accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one can then estimate the cgroup working set size. For an example of using these files for estimating the amount of unused memory pages per each memory cgroup, please see the script attached below. ==== REASONING ==== The reason to introduce the new user API instead of using /proc/PID/{clear_refs,smaps} is that the latter has two serious drawbacks: - it does not count unmapped file pages - it affects the reclaimer logic The new API attempts to overcome them both. For more details on how it is achieved, please see the comment to patch 6. ==== PATCHSET STRUCTURE ==== The patch set is organized as follows: - patch 1 adds page_cgroup_ino() helper for the sake of /proc/kpagecgroup and patches 2-3 do related cleanup - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is charged to - patch 5 introduces a new mmu notifier callback, clear_young, which is a lightweight version of clear_flush_young; it is used in patch 6 - patch 6 implements the idle page tracking feature, including the userspace API, /sys/kernel/mm/page_idle/bitmap - patch 7 exports idle flag via /proc/kpageflags ==== SIMILAR WORKS ==== Originally, the patch for tracking idle memory was proposed back in 2011 by Michel Lespinasse (see http://lwn.net/Articles/459269/). The main difference between Michel's patch and this one is that Michel implemented a kernel space daemon for estimating idle memory size per cgroup while this patch only provides the userspace with the minimal API for doing the job, leaving the rest up to the userspace. However, they both share the same idea of Idle/Young page flags to avoid affecting the reclaimer logic. ==== PERFORMANCE EVALUATION ==== SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the performance impact introduced by this patch set. Three runs were carried out: - base: kernel without the patch - patched: patched kernel, the feature is not used - patched-active: patched kernel, 1 minute-period daemon is used for tracking idle memory For tracking idle memory, idlememstat utility was used: https://github.com/locker/idlememstat testcase base patched patched-active compiler 537.40 ( 0.00)% 532.26 (-0.96)% 538.31 ( 0.17)% compress 305.47 ( 0.00)% 301.08 (-1.44)% 300.71 (-1.56)% crypto 284.32 ( 0.00)% 282.21 (-0.74)% 284.87 ( 0.19)% derby 411.05 ( 0.00)% 413.44 ( 0.58)% 412.07 ( 0.25)% mpegaudio 189.96 ( 0.00)% 190.87 ( 0.48)% 189.42 (-0.28)% scimark.large 46.85 ( 0.00)% 46.41 (-0.94)% 47.83 ( 2.09)% scimark.small 412.91 ( 0.00)% 415.41 ( 0.61)% 421.17 ( 2.00)% serial 204.23 ( 0.00)% 213.46 ( 4.52)% 203.17 (-0.52)% startup 36.76 ( 0.00)% 35.49 (-3.45)% 35.64 (-3.05)% sunflow 115.34 ( 0.00)% 115.08 (-0.23)% 117.37 ( 1.76)% xml 620.55 ( 0.00)% 619.95 (-0.10)% 620.39 (-0.03)% composite 211.50 ( 0.00)% 211.15 (-0.17)% 211.67 ( 0.08)% time idlememstat: 17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k 448inputs+40outputs (1major+36052minor)pagefaults 0swaps ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ==== #! /usr/bin/python # import os import stat import errno import struct CGROUP_MOUNT = "/sys/fs/cgroup/memory" BUFSIZE = 8 * 1024 # must be multiple of 8 def get_hugepage_size(): with open("/proc/meminfo", "r") as f: for s in f: k, v = s.split(":") if k == "Hugepagesize": return int(v.split()[0]) * 1024 PAGE_SIZE = os.sysconf("SC_PAGE_SIZE") HUGEPAGE_SIZE = get_hugepage_size() def set_idle(): f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE) while True: try: f.write(struct.pack("Q", pow(2, 64) - 1)) except IOError as err: if err.errno == errno.ENXIO: break raise f.close() def count_idle(): f_flags = open("/proc/kpageflags", "rb", BUFSIZE) f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE) with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f: while f.read(BUFSIZE): pass # update idle flag idlememsz = {} while True: s1, s2 = f_flags.read(8), f_cgroup.read(8) if not s1 or not s2: break flags, = struct.unpack('Q', s1) cgino, = struct.unpack('Q', s2) unevictable = (flags >> 18) & 1 huge = (flags >> 22) & 1 idle = (flags >> 25) & 1 if idle and not unevictable: idlememsz[cgino] = idlememsz.get(cgino, 0) + \ (HUGEPAGE_SIZE if huge else PAGE_SIZE) f_flags.close() f_cgroup.close() return idlememsz if __name__ == "__main__": print "Setting the idle flag for each page..." set_idle() raw_input("Wait until the workload accesses its working set, " "then press Enter") print "Counting idle pages..." idlememsz = count_idle() for dir, subdirs, files in os.walk(CGROUP_MOUNT): ino = os.stat(dir)[stat.ST_INO] print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB" ==== END SCRIPT ==== This patch (of 8): Add page_cgroup_ino() helper to memcg. This function returns the inode number of the closest online ancestor of the memory cgroup a page is charged to. It is required for exporting information about which page is charged to which cgroup to userspace, which will be introduced by a following patch. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Michel Lespinasse <walken@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 9月, 2015 3 次提交
-
-
由 Michal Hocko 提交于
Most of the exported functions in this header are not marked extern so change the rest to follow the same style. Signed-off-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
The only user is cgwb_bdi_init and that one depends on CONFIG_CGROUP_WRITEBACK which in turn depends on CONFIG_MEMCG so it doesn't make much sense to definte an empty stub for !CONFIG_MEMCG. Moreover ERR_PTR(-EINVAL) is ugly and would lead to runtime crashes if used in unguarded code paths. Better fail during compilation. Signed-off-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michal Hocko 提交于
mem_cgroup structure is defined in mm/memcontrol.c currently which means that the code outside of this file has to use external API even for trivial access stuff. This patch exports mm_struct with its dependencies and makes some of the exported functions inlines. This even helps to reduce the code size a bit (make defconfig + CONFIG_MEMCG=y) text data bss dec hex filename 12355346 1823792 1089536 15268674 e8fb42 vmlinux.before 12354970 1823792 1089536 15268298 e8f9ca vmlinux.after This is not much (370B) but better than nothing. We also save a function call in some hot paths like callers of mem_cgroup_count_vm_event which is used for accounting. The patch doesn't introduce any functional changes. [vdavykov@parallels.com: inline memcg_kmem_is_active] [vdavykov@parallels.com: do not expose type outside of CONFIG_MEMCG] [akpm@linux-foundation.org: memcontrol.h needs eventfd.h for eventfd_ctx] [akpm@linux-foundation.org: export mem_cgroup_from_task() to modules] Signed-off-by: NMichal Hocko <mhocko@suse.cz> Reviewed-by: NVladimir Davydov <vdavydov@parallels.com> Suggested-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 6月, 2015 6 次提交
-
-
由 Tejun Heo 提交于
While cgroup writeback support now connects memcg and blkcg so that writeback IOs are properly attributed and controlled, the IO back pressure propagation mechanism implemented in balance_dirty_pages() and its subroutines wasn't aware of cgroup writeback. Processes belonging to a memcg may have access to only subset of total memory available in the system and not factoring this into dirty throttling rendered it completely ineffective for processes under memcg limits and memcg ended up building a separate ad-hoc degenerate mechanism directly into vmscan code to limit page dirtying. The previous patches updated balance_dirty_pages() and its subroutines so that they can deal with multiple wb_domain's (writeback domains) and defined per-memcg wb_domain. Processes belonging to a non-root memcg are bound to two wb_domains, global wb_domain and memcg wb_domain, and should be throttled according to IO pressures from both domains. This patch updates dirty throttling code so that it repeats similar calculations for the two domains - the differences between the two are few and minor - and applies the lower of the two sets of resulting constraints. wb_over_bg_thresh(), which controls when background writeback terminates, is also updated to consider both global and memcg wb_domains. It returns true if dirty is over bg_thresh for either domain. This makes the dirty throttling mechanism operational for memcg domains including writeback-bandwidth-proportional dirty page distribution inside them but the ad-hoc memcg throttling mechanism in vmscan is still in place. The next patch will rip it out. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Tejun Heo 提交于
Dirtyable memory is distributed to a wb (bdi_writeback) according to the relative bandwidth the wb is writing out in the whole system. This distribution is global - each wb is measured against all other wb's and gets the proportinately sized portion of the memory in the whole system. For cgroup writeback, the amount of dirtyable memory is scoped by memcg and thus each wb would need to be measured and controlled in its memcg. IOW, a wb will belong to two writeback domains - the global and memcg domains. The previous patches laid the groundwork to support the two wb_domains and this patch implements memcg wb_domain. memcg->cgwb_domain is initialized on css online and destroyed on css release, wb->memcg_completions is added, and __wb_writeout_inc() is updated to increment completions against both global and memcg wb_domains. The following patches will update balance_dirty_pages() and its subroutines to actually consider memcg wb_domain for throttling. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Tejun Heo 提交于
For the planned cgroup writeback support, on each bdi (backing_dev_info), each memcg will be served by a separate wb (bdi_writeback). This patch updates bdi so that a bdi can host multiple wbs (bdi_writebacks). On the default hierarchy, blkcg implicitly enables memcg. This allows using memcg's page ownership for attributing writeback IOs, and every memcg - blkcg combination can be served by its own wb by assigning a dedicated wb to each memcg. This means that there may be multiple wb's of a bdi mapped to the same blkcg. As congested state is per blkcg - bdi combination, those wb's should share the same congested state. This is achieved by tracking congested state via bdi_writeback_congested structs which are keyed by blkcg. bdi->wb remains unchanged and will keep serving the root cgroup. cgwb's (cgroup wb's) for non-root cgroups are created on-demand or looked up while dirtying an inode according to the memcg of the page being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree by its memcg id. Once an inode is associated with its wb, it can be retrieved using inode_to_wb(). Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all pages will keep being associated with bdi->wb. v3: inode_attach_wb() in account_page_dirtied() moved inside mapping_cap_account_dirty() block where it's known to be !NULL. Also, an unnecessary NULL check before kfree() removed. Both detected by the kbuild bot. v2: Updated so that wb association is per inode and wb is per memcg rather than blkcg. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: kbuild test robot <fengguang.wu@intel.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Tejun Heo 提交于
Implement mem_cgroup_css_from_page() which returns the cgroup_subsys_state of the memcg associated with a given page on the default hierarchy. This will be used by cgroup writeback support. This function assumes that page->mem_cgroup association doesn't change until the page is released, which is true on the default hierarchy as long as replace_page_cache_page() is not used. As the only user of replace_page_cache_page() is FUSE which won't support cgroup writeback for the time being, this works for now, and replace_page_cache_page() will soon be updated so that the invariant actually holds. Note that the RCU protected page->mem_cgroup access is consistent with other usages across memcg but ultimately incorrect. These unlocked accesses are missing required barriers. page->mem_cgroup should be made an RCU pointer and updated and accessed using RCU operations. v4: Instead of triggering WARN, return the root css on the traditional hierarchies. This makes the function a lot easier to deal with especially as there's no light way to synchronize against hierarchy rebinding. v3: s/mem_cgroup_migrate()/mem_cgroup_css_from_page()/ v2: Trigger WARN if the function is used on the traditional hierarchies and add comment about the assumed invariant. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Tejun Heo 提交于
Add global mem_cgroup_root_css which points to the root memcg css. This will be used by cgroup writeback support. If memcg is disabled, it's defined as ERR_PTR(-EINVAL). Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> aCc: Michal Hocko <mhocko@suse.cz> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Greg Thelen 提交于
When modifying PG_Dirty on cached file pages, update the new MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where global NR_FILE_DIRTY is managed. The new memcg stat is visible in the per memcg memory.stat cgroupfs file. The most recent past attempt at this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632 The new accounting supports future efforts to add per cgroup dirty page throttling and writeback. It also helps an administrator break down a container's memory usage and provides evidence to understand memcg oom kills (the new dirty count is included in memcg oom kill messages). The ability to move page accounting between memcg (memory.move_charge_at_immigrate) makes this accounting more complicated than the global counter. The existing mem_cgroup_{begin,end}_page_stat() lock is used to serialize move accounting with stat updates. Typical update operation: memcg = mem_cgroup_begin_page_stat(page) if (TestSetPageDirty()) { [...] mem_cgroup_update_page_stat(memcg) } mem_cgroup_end_page_stat(memcg) Summary of mem_cgroup_end_page_stat() overhead: - Without CONFIG_MEMCG it's a no-op - With CONFIG_MEMCG and no inter memcg task movement, it's just rcu_read_lock() - With CONFIG_MEMCG and inter memcg task movement, it's rcu_read_lock() + spin_lock_irqsave() A memcg parameter is added to several routines because their callers now grab mem_cgroup_begin_page_stat() which returns the memcg later needed by for mem_cgroup_update_page_stat(). Because mem_cgroup_begin_page_stat() may disable interrupts, some adjustments are needed: - move __mark_inode_dirty() from __set_page_dirty() to its caller. __mark_inode_dirty() locking does not want interrupts disabled. - use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in __delete_from_page_cache(), replace_page_cache_page(), invalidate_complete_page2(), and __remove_mapping(). text data bss dec hex filename 8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before 8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after +192 text bytes 8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before 8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after +773 text bytes Performance tests run on v4.0-rc1-36-g4f671fe2. Lower is better for all metrics, they're all wall clock or cycle counts. The read and write fault benchmarks just measure fault time, they do not include I/O time. * CONFIG_MEMCG not set: baseline patched kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples) dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03% dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99% dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77% read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples) write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples) * CONFIG_MEMCG=y root_memcg: baseline patched kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples) dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90% dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33% dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00% read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples) write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples) * CONFIG_MEMCG=y non-root_memcg: baseline patched kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples) dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82% dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27% dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52% read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples) write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples) As expected anon page faults are not affected by this patch. tj: Updated to apply on top of the recent cancel_dirty_page() changes. Signed-off-by: NSha Zhengju <handai.szj@gmail.com> Signed-off-by: NGreg Thelen <gthelen@google.com> Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 15 5月, 2015 1 次提交
-
-
由 Vladimir Davydov 提交于
Not all kmem allocations should be accounted to memcg. The following patch gives an example when accounting of a certain type of allocations to memcg can effectively result in a memory leak. This patch adds the __GFP_NOACCOUNT flag which if passed to kmalloc and friends will force the allocation to go through the root cgroup. It will be used by the next patch. Note, since in case of kmemleak enabled each kmalloc implies yet another allocation from the kmemleak_object cache, we add __GFP_NOACCOUNT to gfp_kmemleak_mask. Alternatively, we could introduce a per kmem cache flag disabling accounting for all allocations of a particular kind, but (a) we would not be able to bypass accounting for kmalloc then and (b) a kmem cache with this flag set could not be merged with a kmem cache without this flag, which would increase the number of global caches and therefore fragmentation even if the memory cgroup controller is not used. Despite its generic name, currently __GFP_NOACCOUNT disables accounting only for kmem allocations while user page allocations are always charged. To catch abusing of this flag, a warning is issued on an attempt of passing it to mem_cgroup_try_charge. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Cc: Tejun Heo <tj@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Greg Thelen <gthelen@google.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: <stable@vger.kernel.org> [4.0.x] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 2月, 2015 4 次提交
-
-
由 Vladimir Davydov 提交于
There are several FS shrinkers, including super_block::s_shrink, that keep reclaimable objects in the list_lru structure. Hence to turn them to memcg-aware shrinkers, it is enough to make list_lru per-memcg. This patch does the trick. It adds an array of lru lists to the list_lru_node structure (per-node part of the list_lru), one for each kmem-active memcg, and dispatches every item addition or removal to the list corresponding to the memcg which the item is accounted to. So now the list_lru structure is not just per node, but per node and per memcg. Not all list_lrus need this feature, so this patch also adds a new method, list_lru_init_memcg, which initializes a list_lru as memcg aware. Otherwise (i.e. if initialized with old list_lru_init), the list_lru won't have per memcg lists. Just like per memcg caches arrays, the arrays of per-memcg lists are indexed by memcg_cache_id, so we must grow them whenever memcg_nr_cache_ids is increased. So we introduce a callback, memcg_update_all_list_lrus, invoked by memcg_alloc_cache_id if the id space is full. The locking is implemented in a manner similar to lruvecs, i.e. we have one lock per node that protects all lists (both global and per cgroup) on the node. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
We need a stable value of memcg_nr_cache_ids in kmem_cache_create() (memcg_alloc_cache_params() wants it for root caches), where we only hold the slab_mutex and no memcg-related locks. As a result, we have to update memcg_nr_cache_ids under the slab_mutex, which we can only take on the slab's side (see memcg_update_array_size). This looks awkward and will become even worse when per-memcg list_lru is introduced, which also wants stable access to memcg_nr_cache_ids. To get rid of this dependency between the memcg_nr_cache_ids and the slab_mutex, this patch introduces a special rwsem. The rwsem is held for writing during memcg_caches arrays relocation and memcg_nr_cache_ids updates. Therefore one can take it for reading to get a stable access to memcg_caches arrays and/or memcg_nr_cache_ids. Currently the semaphore is taken for reading only from kmem_cache_create, right before taking the slab_mutex, so right now there's no much point in using rwsem instead of mutex. However, once list_lru is made per-memcg it will allow list_lru initializations to proceed concurrently. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
memcg_limited_groups_array_size, which defines the size of memcg_caches arrays, sounds rather cumbersome. Also it doesn't point anyhow that it's related to kmem/caches stuff. So let's rename it to memcg_nr_cache_ids. It's concise and points us directly to memcg_cache_id. Also, rename kmem_limited_groups to memcg_cache_ida. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vladimir Davydov 提交于
This patch adds SHRINKER_MEMCG_AWARE flag. If a shrinker has this flag set, it will be called per memory cgroup. The memory cgroup to scan objects from is passed in shrink_control->memcg. If the memory cgroup is NULL, a memcg aware shrinker is supposed to scan objects from the global list. Unaware shrinkers are only called on global pressure with memcg=NULL. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 2月, 2015 1 次提交
-
-
由 Johannes Weiner 提交于
Introduce the basic control files to account, partition, and limit memory using cgroups in default hierarchy mode. This interface versioning allows us to address fundamental design issues in the existing memory cgroup interface, further explained below. The old interface will be maintained indefinitely, but a clearer model and improved workload performance should encourage existing users to switch over to the new one eventually. The control files are thus: - memory.current shows the current consumption of the cgroup and its descendants, in bytes. - memory.low configures the lower end of the cgroup's expected memory consumption range. The kernel considers memory below that boundary to be a reserve - the minimum that the workload needs in order to make forward progress - and generally avoids reclaiming it, unless there is an imminent risk of entering an OOM situation. - memory.high configures the upper end of the cgroup's expected memory consumption range. A cgroup whose consumption grows beyond this threshold is forced into direct reclaim, to work off the excess and to throttle new allocations heavily, but is generally allowed to continue and the OOM killer is not invoked. - memory.max configures the hard maximum amount of memory that the cgroup is allowed to consume before the OOM killer is invoked. - memory.events shows event counters that indicate how often the cgroup was reclaimed while below memory.low, how often it was forced to reclaim excess beyond memory.high, how often it hit memory.max, and how often it entered OOM due to memory.max. This allows users to identify configuration problems when observing a degradation in workload performance. An overcommitted system will have an increased rate of low boundary breaches, whereas increased rates of high limit breaches, maximum hits, or even OOM situations will indicate internally overcommitted cgroups. For existing users of memory cgroups, the following deviations from the current interface are worth pointing out and explaining: - The original lower boundary, the soft limit, is defined as a limit that is per default unset. As a result, the set of cgroups that global reclaim prefers is opt-in, rather than opt-out. The costs for optimizing these mostly negative lookups are so high that the implementation, despite its enormous size, does not even provide the basic desirable behavior. First off, the soft limit has no hierarchical meaning. All configured groups are organized in a global rbtree and treated like equal peers, regardless where they are located in the hierarchy. This makes subtree delegation impossible. Second, the soft limit reclaim pass is so aggressive that it not just introduces high allocation latencies into the system, but also impacts system performance due to overreclaim, to the point where the feature becomes self-defeating. The memory.low boundary on the other hand is a top-down allocated reserve. A cgroup enjoys reclaim protection when it and all its ancestors are below their low boundaries, which makes delegation of subtrees possible. Secondly, new cgroups have no reserve per default and in the common case most cgroups are eligible for the preferred reclaim pass. This allows the new low boundary to be efficiently implemented with just a minor addition to the generic reclaim code, without the need for out-of-band data structures and reclaim passes. Because the generic reclaim code considers all cgroups except for the ones running low in the preferred first reclaim pass, overreclaim of individual groups is eliminated as well, resulting in much better overall workload performance. - The original high boundary, the hard limit, is defined as a strict limit that can not budge, even if the OOM killer has to be called. But this generally goes against the goal of making the most out of the available memory. The memory consumption of workloads varies during runtime, and that requires users to overcommit. But doing that with a strict upper limit requires either a fairly accurate prediction of the working set size or adding slack to the limit. Since working set size estimation is hard and error prone, and getting it wrong results in OOM kills, most users tend to err on the side of a looser limit and end up wasting precious resources. The memory.high boundary on the other hand can be set much more conservatively. When hit, it throttles allocations by forcing them into direct reclaim to work off the excess, but it never invokes the OOM killer. As a result, a high boundary that is chosen too aggressively will not terminate the processes, but instead it will lead to gradual performance degradation. The user can monitor this and make corrections until the minimal memory footprint that still gives acceptable performance is found. In extreme cases, with many concurrent allocations and a complete breakdown of reclaim progress within the group, the high boundary can be exceeded. But even then it's mostly better to satisfy the allocation from the slack available in other groups or the rest of the system than killing the group. Otherwise, memory.max is there to limit this type of spillover and ultimately contain buggy or even malicious applications. - The original control file names are unwieldy and inconsistent in many different ways. For example, the upper boundary hit count is exported in the memory.failcnt file, but an OOM event count has to be manually counted by listening to memory.oom_control events, and lower boundary / soft limit events have to be counted by first setting a threshold for that value and then counting those events. Also, usage and limit files encode their units in the filename. That makes the filenames very long, even though this is not information that a user needs to be reminded of every time they type out those names. To address these naming issues, as well as to signal clearly that the new interface carries a new configuration model, the naming conventions in it necessarily differ from the old interface. - The original limit files indicate the state of an unset limit with a very high number, and a configured limit can be unset by echoing -1 into those files. But that very high number is implementation and architecture dependent and not very descriptive. And while -1 can be understood as an underflow into the highest possible value, -2 or -10M etc. do not work, so it's not inconsistent. memory.low, memory.high, and memory.max will use the string "infinity" to indicate and set the highest possible value. [akpm@linux-foundation.org: use seq_puts() for basic strings] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-