- 25 9月, 2014 1 次提交
-
-
由 Zefan Li 提交于
When we change cpuset.memory_spread_{page,slab}, cpuset will flip PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset. This should be done using atomic bitops, but currently we don't, which is broken. Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened when one thread tried to clear PF_USED_MATH while at the same time another thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on the same task. Here's the full report: https://lkml.org/lkml/2014/9/19/230 To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags. v4: - updated mm/slab.c. (Fengguang Wu) - updated Documentation. Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: Kees Cook <keescook@chromium.org> Fixes: 950592f7 ("cpusets: update tasks' page/slab spread flags in time") Cc: <stable@vger.kernel.org> # 2.6.31+ Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: NZefan Li <lizefan@huawei.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 09 8月, 2014 2 次提交
-
-
由 Johannes Weiner 提交于
The memcg uncharging code that is involved towards the end of a page's lifetime - truncation, reclaim, swapout, migration - is impressively complicated and fragile. Because anonymous and file pages were always charged before they had their page->mapping established, uncharges had to happen when the page type could still be known from the context; as in unmap for anonymous, page cache removal for file and shmem pages, and swap cache truncation for swap pages. However, these operations happen well before the page is actually freed, and so a lot of synchronization is necessary: - Charging, uncharging, page migration, and charge migration all need to take a per-page bit spinlock as they could race with uncharging. - Swap cache truncation happens during both swap-in and swap-out, and possibly repeatedly before the page is actually freed. This means that the memcg swapout code is called from many contexts that make no sense and it has to figure out the direction from page state to make sure memory and memory+swap are always correctly charged. - On page migration, the old page might be unmapped but then reused, so memcg code has to prevent untimely uncharging in that case. Because this code - which should be a simple charge transfer - is so special-cased, it is not reusable for replace_page_cache(). But now that charged pages always have a page->mapping, introduce mem_cgroup_uncharge(), which is called after the final put_page(), when we know for sure that nobody is looking at the page anymore. For page migration, introduce mem_cgroup_migrate(), which is called after the migration is successful and the new page is fully rmapped. Because the old page is no longer uncharged after migration, prevent double charges by decoupling the page's memcg association (PCG_USED and pc->mem_cgroup) from the page holding an actual charge. The new bits PCG_MEM and PCG_MEMSW represent the respective charges and are transferred to the new page during migration. mem_cgroup_migrate() is suitable for replace_page_cache() as well, which gets rid of mem_cgroup_replace_page_cache(). However, care needs to be taken because both the source and the target page can already be charged and on the LRU when fuse is splicing: grab the page lock on the charge moving side to prevent changing pc->mem_cgroup of a page under migration. Also, the lruvecs of both pages change as we uncharge the old and charge the new during migration, and putback may race with us, so grab the lru lock and isolate the pages iff on LRU to prevent races and ensure the pages are on the right lruvec afterward. Swap accounting is massively simplified: because the page is no longer uncharged as early as swap cache deletion, a new mem_cgroup_swapout() can transfer the page's memory+swap charge (PCG_MEMSW) to the swap entry before the final put_page() in page reclaim. Finally, page_cgroup changes are now protected by whatever protection the page itself offers: anonymous pages are charged under the page table lock, whereas page cache insertions, swapin, and migration hold the page lock. Uncharging happens under full exclusion with no outstanding references. Charging and uncharging also ensure that the page is off-LRU, which serializes against charge migration. Remove the very costly page_cgroup lock and set pc->flags non-atomically. [mhocko@suse.cz: mem_cgroup_charge_statistics needs preempt_disable] [vdavydov@parallels.com: fix flags definition] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Tested-by: NJet Chen <jet.chen@intel.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Tested-by: NFelipe Balbi <balbi@ti.com> Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
These patches rework memcg charge lifetime to integrate more naturally with the lifetime of user pages. This drastically simplifies the code and reduces charging and uncharging overhead. The most expensive part of charging and uncharging is the page_cgroup bit spinlock, which is removed entirely after this series. Here are the top-10 profile entries of a stress test that reads a 128G sparse file on a freshly booted box, without even a dedicated cgroup (i.e. executing in the root memcg). Before: 15.36% cat [kernel.kallsyms] [k] copy_user_generic_string 13.31% cat [kernel.kallsyms] [k] memset 11.48% cat [kernel.kallsyms] [k] do_mpage_readpage 4.23% cat [kernel.kallsyms] [k] get_page_from_freelist 2.38% cat [kernel.kallsyms] [k] put_page 2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge 2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common 1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn After: 15.67% cat [kernel.kallsyms] [k] copy_user_generic_string 13.48% cat [kernel.kallsyms] [k] memset 11.42% cat [kernel.kallsyms] [k] do_mpage_readpage 3.98% cat [kernel.kallsyms] [k] get_page_from_freelist 2.46% cat [kernel.kallsyms] [k] put_page 2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list 1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup 1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn 1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk 1.30% cat [kernel.kallsyms] [k] kfree As you can see, the memcg footprint has shrunk quite a bit. text data bss dec hex filename 37970 9892 400 48262 bc86 mm/memcontrol.o.old 35239 9892 400 45531 b1db mm/memcontrol.o This patch (of 4): The memcg charge API charges pages before they are rmapped - i.e. have an actual "type" - and so every callsite needs its own set of charge and uncharge functions to know what type is being operated on. Worse, uncharge has to happen from a context that is still type-specific, rather than at the end of the page's lifetime with exclusive access, and so requires a lot of synchronization. Rewrite the charge API to provide a generic set of try_charge(), commit_charge() and cancel_charge() transaction operations, much like what's currently done for swap-in: mem_cgroup_try_charge() attempts to reserve a charge, reclaiming pages from the memcg if necessary. mem_cgroup_commit_charge() commits the page to the charge once it has a valid page->mapping and PageAnon() reliably tells the type. mem_cgroup_cancel_charge() aborts the transaction. This reduces the charge API and enables subsequent patches to drastically simplify uncharging. As pages need to be committed after rmap is established but before they are added to the LRU, page_add_new_anon_rmap() must stop doing LRU additions again. Revive lru_cache_add_active_or_unevictable(). [hughd@google.com: fix shmem_unuse] [hughd@google.com: Add comments on the private use of -EAGAIN] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 15 7月, 2014 1 次提交
-
-
由 Tejun Heo 提交于
Until now, cftype arrays carried files for both the default and legacy hierarchies and the files which needed to be used on only one of them were flagged with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE. This gets confusing very quickly and we may end up exposing interface files to the default hierarchy without thinking it through. This patch makes cgroup core provide separate sets of interfaces for cftype handling so that the cftypes for the default and legacy hierarchies are clearly distinguished. The previous two patches renamed the existing ones so that they clearly indicate that they're for the legacy hierarchies. This patch adds the interface for the default hierarchy and apply them selectively depending on the hierarchy type. * cftypes added through cgroup_subsys->dfl_cftypes and cgroup_add_dfl_cftypes() only show up on the default hierarchy. * cftypes added through cgroup_subsys->legacy_cftypes and cgroup_add_legacy_cftypes() only show up on the legacy hierarchies. * cgroup_subsys->dfl_cftypes and ->legacy_cftypes can point to the same array for the cases where the interface files are identical on both types of hierarchies. * This makes all the existing subsystem interface files legacy-only by default and all subsystems will have no interface file created when enabled on the default hierarchy. Each subsystem should explicitly review and compose the interface for the default hierarchy. * A boot param "cgroup__DEVEL__legacy_files_on_dfl" is added which makes subsystems which haven't decided the interface files for the default hierarchy to present the legacy files on the default hierarchy so that its behavior on the default hierarchy can be tested. As the awkward name suggests, this is for development only. * memcg's CFTYPE_INSANE on "use_hierarchy" is noop now as the whole array isn't used on the default hierarchy. The flag is removed. v2: Updated documentation for cgroup__DEVEL__legacy_files_on_dfl. v3: Clear CFTYPE_ONLY_ON_DFL and CFTYPE_INSANE when cfts are removed as suggested by Li. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NNeil Horman <nhorman@tuxdriver.com> Acked-by: NLi Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Aristeu Rozanski <aris@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
-
- 09 7月, 2014 2 次提交
-
-
由 Tejun Heo 提交于
Currently, the blkio subsystem attributes all of writeback IOs to the root. One of the issues is that there's no way to tell who originated a writeback IO from block layer. Those IOs are usually issued asynchronously from a task which didn't have anything to do with actually generating the dirty pages. The memory subsystem, when enabled, already keeps track of the ownership of each dirty page and it's desirable for blkio to piggyback instead of adding its own per-page tag. blkio piggybacking on memory is an implementation detail which preferably should be handled automatically without requiring explicit userland action. To achieve that, this patch implements cgroup_subsys->depends_on which contains the mask of subsystems which should be enabled together when the subsystem is enabled. The previous patches already implemented the support for enabled but invisible subsystems and cgroup_subsys->depends_on can be easily implemented by updating cgroup_refresh_child_subsys_mask() so that it calculates cgroup->child_subsys_mask considering cgroup_subsys->depends_on of the explicitly enabled subsystems. Documentation/cgroups/unified-hierarchy.txt is updated to explain that subsystems may not become immediately available after being unused from userland and that dependency could be a factor in it. As subsystems may already keep residual references, this doesn't significantly change how subsystem rebinding can be used. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NLi Zefan <lizefan@huawei.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
-
由 Tejun Heo 提交于
cgroup is implementing support for subsystem dependency which would require a way to enable a subsystem even when it's not directly configured through "cgroup.subtree_control". The previous patches added support for explicitly and implicitly enabled subsystems and showing/hiding their interface files. An explicitly enabled subsystem may become implicitly enabled if it's turned off through "cgroup.subtree_control" but there are subsystems depending on it. In such cases, the subsystem, as it's turned off when seen from userland, shouldn't enforce any resource control. Also, the subsystem may be explicitly turned on later again and its interface files should be as close to the intial state as possible. This patch adds cgroup_subsys->css_reset() which is invoked when a css is hidden. The callback should disable resource control and reset the state to the vanilla state. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NLi Zefan <lizefan@huawei.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
-
- 07 6月, 2014 1 次提交
-
-
由 Michal Hocko 提交于
Memory reclaim always uses swappiness of the reclaim target memcg (origin of the memory pressure) or vm_swappiness for global memory reclaim. This behavior was consistent (except for difference between global and hard limit reclaim) because swappiness was enforced to be consistent within each memcg hierarchy. After "mm: memcontrol: remove hierarchy restrictions for swappiness and oom_control" each memcg can have its own swappiness independent of hierarchical parents, though, so the consistency guarantee is gone. This can lead to an unexpected behavior. Say that a group is explicitly configured to not swapout by memory.swappiness=0 but its memory gets swapped out anyway when the memory pressure comes from its parent with a It is also unexpected that the knob is meaningless without setting the hard limit which would trigger the reclaim and enforce the swappiness. There are setups where the hard limit is configured higher in the hierarchy by an administrator and children groups are under control of somebody else who is interested in the swapout behavior but not necessarily about the memory limit. From a semantic point of view swappiness is an attribute defining anon vs. file proportional scanning of LRU which is memcg specific (unlike charges which are propagated up the hierarchy) so it should be applied to the particular memcg's LRU regardless where the memory pressure comes from. This patch removes vmscan_swappiness() and stores the swappiness into the scan_control structure. mem_cgroup_swappiness is then used to provide the correct value before shrink_lruvec is called. The global vm_swappiness is used for the root memcg. [hughd@google.com: oopses immediately when booted with cgroup_disable=memory] Signed-off-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 6月, 2014 2 次提交
-
-
由 Vladimir Davydov 提交于
Kmemcg is currently under development and lacks some important features. In particular, it does not have support of kmem reclaim on memory pressure inside cgroup, which practically makes it unusable in real life. Let's warn about it in both Kconfig and Documentation to prevent complaints arising. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Per-memcg swappiness and oom killing can currently not be tweaked on a memcg that is part of a hierarchy, but not the root of that hierarchy. Users have complained that they can't configure this when they turned on hierarchy mode. In fact, with hierarchy mode becoming the default, this restriction disables the tunables entirely. But there is no good reason for this restriction. The settings for swappiness and OOM killing are taken from whatever memcg whose limit triggered reclaim and OOM invocation, regardless of its position in the hierarchy tree. Allow setting swappiness on any group. The knob on the root memcg already reads the global VM swappiness, make it writable as well. Allow disabling the OOM killer on any non-root memcg. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 5月, 2014 1 次提交
-
-
由 Michal Hocko 提交于
Tejun has correctly pointed out that tasks/children test in mem_cgroup_force_empty is not correct because there is no other locking which preserves this state throughout the rest of the function so both new tasks can join the group or new children groups can be added while somebody is writing to memory.force_empty. A new task would break mem_cgroup_reparent_charges expectation that all failures as described by mem_cgroup_force_empty_list are temporal and there is no way out. The main use case for the knob as described by Documentation/cgroups/memory.txt is to: " The typical use case for this interface is before calling rmdir(). Because rmdir() moves all pages to parent, some out-of-use page caches can be moved to the parent. If you want to avoid that, force_empty will be useful. " This means that reparenting is not really required as rmdir will reparent pages implicitly from the safe context. If we remove it from mem_cgroup_force_empty then we are safe even with existing tasks because the number of reclaim attempts is bounded. Moreover the knob still does what the documentation claims (modulo reparenting which doesn't make any difference) and users might expect. Longterm we want to deprecate the whole knob and put the reparented pages to the tail of parent LRU during cgroup removal. tj: Removed unused variable @cgrp from mem_cgroup_force_empty() Signed-off-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NLi Zefan <lizefan@huawei.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 26 4月, 2014 1 次提交
-
-
由 Tejun Heo 提交于
Unified hierarchy will be the new version of cgroup interface. This patch adds Documentation/cgroups/unified-hierarchy.txt which describes the design and rationales of unified hierarchy. v2: Grammatical updates as per Randy Dunlap's review. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org>
-
- 08 4月, 2014 2 次提交
-
-
由 Michal Hocko 提交于
mem_cgroup_newpage_charge is used only for charging anonymous memory so it is better to rename it to mem_cgroup_charge_anon. mem_cgroup_cache_charge is used for file backed memory so rename it to mem_cgroup_charge_file. Signed-off-by: NMichal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
The res_counter_{charge,uncharge}_locked() variants are not used in the kernel outside of the resource counter code itself, so remove the interface. Signed-off-by: NDavid Rientjes <rientjes@google.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Tim Hockin <thockin@google.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 1月, 2014 1 次提交
-
-
由 Daniel Borkmann 提交于
It would be useful e.g. in a server or desktop environment to have a facility in the notion of fine-grained "per application" or "per application group" firewall policies. Probably, users in the mobile, embedded area (e.g. Android based) with different security policy requirements for application groups could have great benefit from that as well. For example, with a little bit of configuration effort, an admin could whitelist well-known applications, and thus block otherwise unwanted "hard-to-track" applications like [1] from a user's machine. Blocking is just one example, but it is not limited to that, meaning we can have much different scenarios/policies that netfilter allows us than just blocking, e.g. fine grained settings where applications are allowed to connect/send traffic to, application traffic marking/conntracking, application-specific packet mangling, and so on. Implementation of PID-based matching would not be appropriate as they frequently change, and child tracking would make that even more complex and ugly. Cgroups would be a perfect candidate for accomplishing that as they associate a set of tasks with a set of parameters for one or more subsystems, in our case the netfilter subsystem, which, of course, can be combined with other cgroup subsystems into something more complex if needed. As mentioned, to overcome this constraint, such processes could be placed into one or multiple cgroups where different fine-grained rules can be defined depending on the application scenario, while e.g. everything else that is not part of that could be dropped (or vice versa), thus making life harder for unwanted processes to communicate to the outside world. So, we make use of cgroups here to track jobs and limit their resources in terms of iptables policies; in other words, limiting, tracking, etc what they are allowed to communicate. In our case we're working on outgoing traffic based on which local socket that originated from. Also, one doesn't even need to have an a-prio knowledge of the application internals regarding their particular use of ports or protocols. Matching is *extremly* lightweight as we just test for the sk_classid marker of sockets, originating from net_cls. net_cls and netfilter do not contradict each other; in fact, each construct can live as standalone or they can be used in combination with each other, which is perfectly fine, plus it serves Tejun's requirement to not introduce a new cgroups subsystem. Through this, we result in a very minimal and efficient module, and don't add anything except netfilter code. One possible, minimal usage example (many other iptables options can be applied obviously): 1) Configuring cgroups if not already done, e.g.: mkdir /sys/fs/cgroup/net_cls mount -t cgroup -o net_cls net_cls /sys/fs/cgroup/net_cls mkdir /sys/fs/cgroup/net_cls/0 echo 1 > /sys/fs/cgroup/net_cls/0/net_cls.classid (resp. a real flow handle id for tc) 2) Configuring netfilter (iptables-nftables), e.g.: iptables -A OUTPUT -m cgroup ! --cgroup 1 -j DROP 3) Running applications, e.g.: ping 208.67.222.222 <pid:1799> echo 1799 > /sys/fs/cgroup/net_cls/0/tasks 64 bytes from 208.67.222.222: icmp_seq=44 ttl=49 time=11.9 ms [...] ping 208.67.220.220 <pid:1804> ping: sendmsg: Operation not permitted [...] echo 1804 > /sys/fs/cgroup/net_cls/0/tasks 64 bytes from 208.67.220.220: icmp_seq=89 ttl=56 time=19.0 ms [...] Of course, real-world deployments would make use of cgroups user space toolsuite, or own custom policy daemons dynamically moving applications from/to various cgroups. [1] http://www.blackhat.com/presentations/bh-europe-06/bh-eu-06-biondi/bh-eu-06-biondi-up.pdfSigned-off-by: NDaniel Borkmann <dborkman@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: cgroups@vger.kernel.org Acked-by: NLi Zefan <lizefan@huawei.com> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
- 31 12月, 2013 1 次提交
-
-
由 Masanari Iida 提交于
Correct spelling typo in memory.txt and resource_counter.txt Signed-off-by: NMasanari Iida <standby24x7@gmail.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 11 12月, 2013 1 次提交
-
-
由 Kees Cook 提交于
Stop that, stop that! You're not going to do a song while I'm here. Signed-off-by: NKees Cook <keescook@chromium.org> Signed-off-by: NMauro Carvalho Chehab <m.chehab@samsung.com>
-
- 23 11月, 2013 1 次提交
-
-
由 Tejun Heo 提交于
cgroup_event is only available in memcg now. Let's brand it that way. While at it, add a comment encouraging deprecation of the feature and remove the respective section from cgroup documentation. This patch is cosmetic. v3: Typo update as per Li Zefan. v2: Index in cgroups.txt updated accordingly as suggested by Li Zefan. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NLi Zefan <lizefan@huawei.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.cz>
-
- 13 11月, 2013 1 次提交
-
-
由 Ying Han 提交于
The memory.numa_stat file was not hierarchical. Memory charged to the children was not shown in parent's numa_stat. This change adds the "hierarchical_" stats to the existing stats. The new hierarchical stats include the sum of all children's values in addition to the value of the memcg. Tested: Create cgroup a, a/b and run workload under b. The values of b are included in the "hierarchical_*" under a. $ cd /sys/fs/cgroup $ echo 1 > memory.use_hierarchy $ mkdir a a/b Run workload in a/b: $ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) & The hierarchical_ fields in parent (a) show use of workload in a/b: $ cat a/memory.numa_stat total=0 N0=0 N1=0 N2=0 N3=0 file=0 N0=0 N1=0 N2=0 N3=0 anon=0 N0=0 N1=0 N2=0 N3=0 unevictable=0 N0=0 N1=0 N2=0 N3=0 hierarchical_total=908 N0=552 N1=317 N2=39 N3=0 hierarchical_file=850 N0=549 N1=301 N2=0 N3=0 hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0 hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0 $ cat a/b/memory.numa_stat total=908 N0=552 N1=317 N2=39 N3=0 file=850 N0=549 N1=301 N2=0 N3=0 anon=58 N0=3 N1=16 N2=39 N3=0 unevictable=0 N0=0 N1=0 N2=0 N3=0 hierarchical_total=908 N0=552 N1=317 N2=39 N3=0 hierarchical_file=850 N0=549 N1=301 N2=0 N3=0 hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0 hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0 Signed-off-by: NYing Han <yinghan@google.com> Signed-off-by: NGreg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 9月, 2013 1 次提交
-
-
由 Sha Zhengju 提交于
Signed-off-by: NSha Zhengju <handai.szj@taobao.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Greg Thelen <gthelen@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 7月, 2013 1 次提交
-
-
由 Li Zefan 提交于
hugetlb cgroup has already been implemented. Signed-off-by: NLi Zefan <lizefan@huawei.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NRob Landley <rob@landley.net> Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 6月, 2013 1 次提交
-
-
由 Aaro Koskinen 提交于
Documentation for inactive_anon / active_anon was mixed up. Fix that. Signed-off-by: NAaro Koskinen <aaro.koskinen@nsn.com> Acked-by: NRob Landley <rob@landley.net> Signed-off-by: NJiri Kosina <jkosina@suse.cz>
-
- 19 6月, 2013 1 次提交
-
-
由 Viresh Kumar 提交于
Most of the stuff from kernel/sched.c was moved to kernel/sched/core.c long time back and the comments/Documentation never got updated. I figured it out when I was going through sched-domains.txt and so thought of fixing it globally. I haven't crossed check if the stuff that is referenced in sched/core.c by all these files is still present and hasn't changed as that wasn't the motive behind this patch. Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/cdff76a265326ab8d71922a1db5be599f20aad45.1370329560.git.viresh.kumar@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 28 5月, 2013 1 次提交
-
-
由 Anatol Pomozov 提交于
Signed-off-by: NAnatol Pomozov <anatol.pomozov@gmail.com> Signed-off-by: NJiri Kosina <jkosina@suse.cz>
-
- 15 5月, 2013 1 次提交
-
-
由 Tejun Heo 提交于
With the recent updates, blk-throttle is finally ready for proper hierarchy support. Dispatching now honors service_queue->parent_sq and propagates correctly. The only thing missing is setting ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup hierarchy. This patch updates throtl_pd_init() such that service_queues form the same hierarchy as the cgroup hierarchy if sane_behavior is enabled. As this concludes proper hierarchy support for blkcg, the shameful .broken_hierarchy tag is removed from blkio_subsys. v2: Updated blkio-controller.txt as suggested by Vivek. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NVivek Goyal <vgoyal@redhat.com> Cc: Li Zefan <lizefan@huawei.com>
-
- 08 5月, 2013 1 次提交
-
-
由 David Rientjes 提交于
This exports the amount of anonymous transparent hugepages for each memcg via the new "rss_huge" stat in memory.stat. The units are in bytes. This is helpful to determine the hugepage utilization for individual jobs on the system in comparison to rss and opportunities where MADV_HUGEPAGE may be helpful. The amount of anonymous transparent hugepages is also included in "rss" for backwards compatibility. Signed-off-by: NDavid Rientjes <rientjes@google.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 4月, 2013 1 次提交
-
-
由 Anton Vorontsov 提交于
With this patch userland applications that want to maintain the interactivity/memory allocation cost can use the pressure level notifications. The levels are defined like this: The "low" level means that the system is reclaiming memory for new allocations. Monitoring this reclaiming activity might be useful for maintaining cache level. Upon notification, the program (typically "Activity Manager") might analyze vmstat and act in advance (i.e. prematurely shutdown unimportant services). The "medium" level means that the system is experiencing medium memory pressure, the system might be making swap, paging out active file caches, etc. Upon this event applications may decide to further analyze vmstat/zoneinfo/memcg or internal memory usage statistics and free any resources that can be easily reconstructed or re-read from a disk. The "critical" level means that the system is actively thrashing, it is about to out of memory (OOM) or even the in-kernel OOM killer is on its way to trigger. Applications should do whatever they can to help the system. It might be too late to consult with vmstat or any other statistics, so it's advisable to take an immediate action. The events are propagated upward until the event is handled, i.e. the events are not pass-through. Here is what this means: for example you have three cgroups: A->B->C. Now you set up an event listener on cgroups A, B and C, and suppose group C experiences some pressure. In this situation, only group C will receive the notification, i.e. groups A and B will not receive it. This is done to avoid excessive "broadcasting" of messages, which disturbs the system and which is especially bad if we are low on memory or thrashing. So, organize the cgroups wisely, or propagate the events manually (or, ask us to implement the pass-through events, explaining why would you need them.) Performance wise, the memory pressure notifications feature itself is lightweight and does not require much of bookkeeping, in contrast to the rest of memcg features. Unfortunately, as of current memcg implementation, pages accounting is an inseparable part and cannot be turned off. The good news is that there are some efforts[1] to improve the situation; plus, implementing the same, fully API-compatible[2] interface for CONFIG_MEMCG=n case (e.g. embedded) is also a viable option, so it will not require any changes on the userland side. [1] http://permalink.gmane.org/gmane.linux.kernel.cgroups/6291 [2] http://lkml.org/lkml/2013/2/21/454 [akpm@linux-foundation.org: coding-style fixes] [akpm@linux-foundation.org: fix CONFIG_CGROPUPS=n warnings] Signed-off-by: NAnton Vorontsov <anton.vorontsov@linaro.org> Acked-by: NKirill A. Shutemov <kirill@shutemov.name> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Glauber Costa <glommer@parallels.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Leonid Moiseichuk <leonid.moiseichuk@nokia.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: John Stultz <john.stultz@linaro.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 4月, 2013 1 次提交
-
-
由 Tejun Heo 提交于
This reverts commit 84cfb6ab. There are scheduled changes which make use of the removed callback. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Rami Rosen <ramirose@gmail.com> Cc: Li Zefan <lizefan@huawei.com>
-
- 11 4月, 2013 1 次提交
-
-
由 Rami Rosen 提交于
The bind() method of cgroup_subsys is not used in any of the controllers (cpuset, freezer, blkio, net_cls, memcg, net_prio, devices, perf, hugetlb, cpu and cpuacct) tj: Removed the entry on ->bind() from Documentation/cgroups/cgroups.txt. Also updated a couple paragraphs which were suggesting that dynamic re-binding may be implemented. It's not gonna. Signed-off-by: NRami Rosen <ramirose@gmail.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 09 4月, 2013 1 次提交
-
-
由 Rami Rosen 提交于
This patch adds a new file, Documentation/cgroups/net_cls.txt, with info about net_cls cgroups, and updates the 00-INDEX accordingly. Signed-off-by: NRami Rosen <ramirose@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 4月, 2013 1 次提交
-
-
由 Rami Rosen 提交于
This trivial patch removes a word which appears twice in Documentation/cgroup/cgroup.txt. Signed-off-by: NRami Rosen <ramirose@gmail.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 27 3月, 2013 1 次提交
-
-
由 Paul Bolle 提交于
Signed-off-by: NPaul Bolle <pebolle@tiscali.nl> Acked-by: NRob Landley <rob@landley.net> Signed-off-by: NJiri Kosina <jkosina@suse.cz>
-
- 20 3月, 2013 1 次提交
-
-
由 Aristeu Rozanski 提交于
This patch makes exception changes to propagate down in hierarchy respecting when possible local exceptions. New exceptions allowing additional access to devices won't be propagated, but it'll be possible to add an exception to access all of part of the newly allowed device(s). New exceptions disallowing access to devices will be propagated down and the local group's exceptions will be revalidated for the new situation. Example: A / \ B group behavior exceptions A allow "b 8:* rwm", "c 116:1 rw" B deny "c 1:3 rwm", "c 116:2 rwm", "b 3:* rwm" If a new exception is added to group A: # echo "c 116:* r" > A/devices.deny it'll propagate down and after revalidating B's local exceptions, the exception "c 116:2 rwm" will be removed. In case parent's exceptions change and local exceptions are not allowed anymore, they'll be deleted. v7: - do not allow behavior change when the cgroup has children - update documentation v6: fixed issues pointed by Serge Hallyn - only copy parent's exceptions while propagating behavior if the local behavior is different - while propagating exceptions, do not clear and copy parent's: it'd be against the premise we don't propagate access to more devices v5: fixed issues pointed by Serge Hallyn - updated documentation - not propagating when an exception is written to devices.allow - when propagating a new behavior, clean the local exceptions list if they're for a different behavior v4: fixed issues pointed by Tejun Heo - separated function to walk the tree and collect valid propagation targets v3: fixed issues pointed by Tejun Heo - update documentation - move css_online/css_offline changes to a new patch - use cgroup_for_each_descendant_pre() instead of own descendant walk - move exception_copy rework to a separared patch - move exception_clean rework to a separated patch v2: fixed issues pointed by Tejun Heo - instead of keeping the local settings that won't apply anymore, remove them Cc: Tejun Heo <tj@kernel.org> Cc: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: NAristeu Rozanski <aris@redhat.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 13 3月, 2013 1 次提交
-
-
由 Li Zefan 提交于
cpuset no longer nests cgroup_mutex inside cpu_hotplug lock, so we don't have to release cgroup_mutex before calling css_offline(). Signed-off-by: NLi Zefan <lizefan@huawei.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 28 2月, 2013 1 次提交
-
-
由 Warren Turkal 提交于
Signed-off-by: NWarren Turkal <wt@ooyala.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 1月, 2013 1 次提交
-
-
由 Tejun Heo 提交于
With the previous two patches, all cfqg scheduling decisions are based on vfraction and ready for hierarchy support. The only thing which keeps the behavior flat is cfqg_flat_parent() which makes vfraction calculation consider all non-root cfqgs children of the root cfqg. Replace it with cfqg_parent() which returns the real parent. This enables full blkcg hierarchy support for cfq-iosched. For example, consider the following hierarchy. root / \ A:500 B:250 / \ AA:500 AB:1000 For simplicity, let's say all the leaf nodes have active tasks and are on service tree. For each leaf node, vfraction would be AA: (500 / 1500) * (500 / 750) =~ 0.2222 AB: (1000 / 1500) * (500 / 750) =~ 0.4444 B: (250 / 750) =~ 0.3333 and vdisktime will be distributed accordingly. For more detail, please refer to Documentation/block/cfq-iosched.txt. v2: cfq-iosched.txt updated to describe group scheduling as suggested by Vivek. v3: blkio-controller.txt updated. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NVivek Goyal <vgoyal@redhat.com>
-
- 08 1月, 2013 1 次提交
-
-
由 Greg Thelen 提交于
Move the cgroup_event_listener.c tool from Documentation into the new tools/cgroup directory. This change involves wiring cgroup_event_listener.c into the tools/ make system so that is can be built with: $ make tools/cgroup Signed-off-by: NGreg Thelen <gthelen@google.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 19 12月, 2012 3 次提交
-
-
由 Glauber Costa 提交于
Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Michal Hocko <mhocko@suse.cz> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
Signed-off-by: NGlauber Costa <glommer@parallels.com> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
It is useful to know how many charges are still left after a call to res_counter_uncharge. While it is possible to issue a res_counter_read after uncharge, this can be racy. If we need, for instance, to take some action when the counters drop down to 0, only one of the callers should see it. This is the same semantics as the atomic variables in the kernel. Since the current return value is void, we don't need to worry about anything breaking due to this change: nobody relied on that, and only users appearing from now on will be checking this value. Signed-off-by: NGlauber Costa <glommer@parallels.com> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Acked-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Suleiman Souhlal <suleiman@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: JoonSoo Kim <js1304@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 12月, 2012 1 次提交
-
-
由 Lai Jiangshan 提交于
N_HIGH_MEMORY stands for the nodes that has normal or high memory. N_MEMORY stands for the nodes that has any memory. The code here need to handle with the nodes which have memory, we should use N_MEMORY instead. Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com> Acked-by: NHillf Danton <dhillf@gmail.com> Signed-off-by: NWen Congyang <wency@cn.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Lin Feng <linfeng@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-