- 30 October 2017, 1 commit
-
-
Committed by Tejun Heo
The local variable @cgrp isn't used if !CONFIG_CGROUP_SCHED. Mark the variable with __maybe_unused to avoid a compile warning.
Reported-by: "kbuild-all@01.org" <kbuild-all@01.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
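A minimal illustration of the annotation; the function and call site below are made up for illustration and are not the ones touched by the patch:

```c
#include <linux/cgroup.h>
#include <linux/printk.h>

static void charge_cpu_example(struct task_struct *task)
{
	/* Only read when the CPU controller is built in; __maybe_unused
	 * keeps the "set but not used" warning quiet otherwise. */
	struct cgroup *cgrp __maybe_unused = task_dfl_cgroup(task);

#ifdef CONFIG_CGROUP_SCHED
	pr_debug("charging cgroup at level %d\n", cgrp->level);
#endif
}
```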
-
- 27 October 2017, 1 commit
-
-
Committed by Tejun Heo
The basic cpu stat is currently shown with the "cpu." prefix in cgroup.stat, and the same information is duplicated in cpu.stat when the cpu controller is enabled. This is ugly and not very scalable as we want to expand the coverage of stat information which is always available.

This patch makes cgroup core always create the "cpu.stat" file and show the basic cpu stats there, calling into the cpu controller to show the extra stats when it is enabled. This ensures that the same information isn't presented in multiple places and makes future expansion of the basic stats easier.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
-
- 26 September 2017, 1 commit
-
-
Committed by Tejun Heo
Like other csets, init_css_set's dfl_cgrp is initialized when the cset gets linked, and init_css_set gets linked in cgroup_init(). This has been fine till now, but the recently added basic CPU usage accounting may end up accessing dfl_cgrp of init before cgroup_init(), leading to the following oops.

  SELinux:  Initializing.
  BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
  IP: account_system_index_time+0x60/0x90
  PGD 0 P4D 0
  Oops: 0000 [#1] SMP
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc2-00003-g041cd640 #10
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
  task: ffffffff81e10480 task.stack: ffffffff81e00000
  RIP: 0010:account_system_index_time+0x60/0x90
  RSP: 0000:ffff880011e03cb8 EFLAGS: 00010002
  RAX: ffffffff81ef8800 RBX: ffffffff81e10480 RCX: 0000000000000003
  RDX: 0000000000000000 RSI: 00000000000f4240 RDI: 0000000000000000
  RBP: ffff880011e03cc0 R08: 0000000000010000 R09: 0000000000000000
  R10: 0000000000000020 R11: 0000003b9aca0000 R12: 000000000001c100
  R13: 0000000000000000 R14: ffffffff81e10480 R15: ffffffff81e03cd8
  FS:  0000000000000000(0000) GS:ffff880011e00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000000000b0 CR3: 0000000001e09000 CR4: 00000000000006b0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <IRQ>
   account_system_time+0x45/0x60
   account_process_tick+0x5a/0x140
   update_process_times+0x22/0x60
   tick_periodic+0x2b/0x90
   tick_handle_periodic+0x25/0x70
   timer_interrupt+0x15/0x20
   __handle_irq_event_percpu+0x7e/0x1b0
   handle_irq_event_percpu+0x23/0x60
   handle_irq_event+0x42/0x70
   handle_level_irq+0x83/0x100
   handle_irq+0x6f/0x110
   do_IRQ+0x46/0xd0
   common_interrupt+0x9d/0x9d

Fix it by statically initializing init_css_set.dfl_cgrp so that init's default cgroup is accessible from the get-go.
Fixes: 041cd640 ("cgroup: Implement cgroup2 basic CPU usage accounting")
Reported-by: "kbuild-all@01.org" <kbuild-all@01.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
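The essence of the fix is a compile-time initializer so that the field is already valid before cgroup_init() runs. A simplified sketch (most fields of the real initializer are omitted; the key line is .dfl_cgrp):

```c
/* kernel/cgroup/cgroup.c (sketch, abbreviated) */
struct css_set init_css_set = {
	.refcount	= REFCOUNT_INIT(1),
	/* init's default cgroup must be usable before cgroup_init(),
	 * because the very first timer tick already accounts CPU time */
	.dfl_cgrp	= &cgrp_dfl_root.cgrp,
	.tasks		= LIST_HEAD_INIT(init_css_set.tasks),
	/* ... remaining list heads initialized as before ... */
};
```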
-
- 25 September 2017, 1 commit
-
-
Committed by Tejun Heo
In cgroup1, while cpuacct isn't actually controlling any resources, it is a separate controller due to a combination of two factors - 1. enabling the cpu controller has significant side effects, and 2. we have to pick one of the hierarchies to account CPU usages on. The cpuacct controller is effectively used to designate a hierarchy to track CPU usages on.

cgroup2's unified hierarchy removes the second reason and we can account basic CPU usages by default. While we could use cpuacct for this purpose, both its interface and implementation leave a lot to be desired - it collects and exposes two sources of truth which don't agree with each other, and some of the exposed statistics don't make much sense. It also propagates all the way up the hierarchy on each accounting event, which is unnecessary.

This patch adds a basic resource accounting mechanism to cgroup2's unified hierarchy and accounts CPU usages using it.

* All accounting is done per-cpu and doesn't propagate immediately. It just bumps the per-cgroup per-cpu counters and links the cgroup to the parent's updated list if it isn't already on it.

* On a read, the per-cpu counters are collected into the global ones and then propagated upwards. Only the per-cpu counters which have changed since the last read are propagated.

* CPU usage stats are collected and shown in "cgroup.stat" with the "cpu." prefix. Total usage is collected from scheduling events. The user/sys breakdown is sourced from tick sampling and adjusted to the usage using cputime_adjust().

This keeps the accounting-side hot path O(1) and per-cpu, and the read side O(nr_updated_since_last_read).

v2: Minor changes and documentation updates as suggested by Waiman and Roman.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Roman Gushchin <guro@fb.com>
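A simplified sketch of the scheme described above. Names and fields (cpu_stat, stat_total, link_updated) are illustrative, not the exact kernel symbols; the point is that the write side is a per-cpu increment plus an O(1) link, and all summing is deferred to the reader:

```c
struct cgroup_cpu_stat {
	u64		usage;		/* accumulated on this CPU */
	u64		last;		/* portion already folded into the global total */
	struct cgroup	*updated_next;	/* NULL if not on the per-cpu updated list */
};

/* hot path: O(1), touches only this CPU's counter */
static void charge_usage(struct cgroup *cgrp, int cpu, u64 delta)
{
	struct cgroup_cpu_stat *cstat = per_cpu_ptr(cgrp->cpu_stat, cpu);

	cstat->usage += delta;
	/* link cgrp (and any unlinked ancestors) on the per-cpu updated
	 * list; stops at the first already-linked ancestor */
	link_updated(cgrp, cpu);
}

/* read path: fold only what changed since the last read */
static u64 read_usage(struct cgroup *cgrp)
{
	u64 total = cgrp->stat_total;
	int cpu;

	for_each_possible_cpu(cpu) {
		struct cgroup_cpu_stat *cstat = per_cpu_ptr(cgrp->cpu_stat, cpu);

		total += cstat->usage - cstat->last;
		cstat->last = cstat->usage;
	}
	cgrp->stat_total = total;
	return total;
}
```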
-
- 07 September 2017, 1 commit
-
-
Committed by Roman Gushchin
Commit fa06235b ("cgroup: reset css on destruction") caused the css_reset callback to be called from the offlining path. Although it solves the problem mentioned in the commit description ("For instance, memory cgroup needs to reset memory.low, otherwise pages charged to a dead cgroup might never get reclaimed."), it is, generally speaking, not correct.

An offline cgroup can still be a resource domain, and we shouldn't grant it more resources than it had before deletion. For instance, if an offline memory cgroup has dirty pages, we should still apply i/o limits during writeback.

The css_reset callback is designed to return the cgroup to its original state, which means resetting all limits and counters. That is something different from offlining, and we shouldn't use it from the offlining path. Instead, we should adjust the necessary settings from the per-controller css_offline callbacks (e.g. reset memory.low).
Link: http://lkml.kernel.org/r/20170727130428.28856-2-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 12 August 2017, 1 commit
-
-
Committed by Dan Carpenter
"descendants" and "depth" are declared as int, so they can't be larger than INT_MAX. Static checkers complain, and it's slightly confusing for humans as well, so let's just remove these conditions.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
-
- 11 August 2017, 1 commit
-
-
Committed by Tejun Heo
Misc trivial changes to prepare for future changes. No functional difference.

* Expose cgroup_get(), cgroup_tryget() and cgroup_parent().

* Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp.

* Rename cgroup_stats_show() to cgroup_stat_show() for consistency with the file name.
Signed-off-by: Tejun Heo <tj@kernel.org>
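Of the three items, task_dfl_cgroup() is essentially a one-line dereference; a sketch of its shape:

```c
static inline struct cgroup *task_dfl_cgroup(struct task_struct *task)
{
	/* the default-hierarchy cgroup is cached directly in the css_set */
	return task_css_set(task)->dfl_cgrp;
}
```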
-
- 03 August 2017, 5 commits
-
-
Committed by Tejun Heo
Each css_set directly points to the default cgroup it belongs to, so there's no reason to walk the cgrp_links list on the default hierarchy.
Signed-off-by: Tejun Heo <tj@kernel.org>
-
Committed by Roman Gushchin
As we already have a pointer to the parent cgroup in cgroup_destroy_locked(), we don't need to calculate it again to pass as an argument to cgroup1_check_for_release().
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: linux-kernel@vger.kernel.org
-
Committed by Roman Gushchin
A cgroup can consume resources even after being deleted by a user. For example, writing back dirty pages should be accounted and limited, even though the corresponding cgroup might contain no processes and have been deleted by a user.

In the current implementation a cgroup can remain in such a "dying" state for an undefined amount of time, for instance if a memory cgroup contains a page mlocked by a process belonging to another cgroup.

Although the lifecycle of a dying cgroup is out of the user's control, it's important to have some insight into what's going on under the hood. In particular, it's handy to have a counter which allows detecting css leaks.

To solve this problem, add a cgroup.stat interface to the base cgroup control files with the following metrics:

nr_descendants		total number of visible descendant cgroups
nr_dying_descendants	total number of dying descendant cgroups
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
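The read side of the new file reduces to printing the two counters; a simplified sketch of what the cgroup.stat show callback boils down to:

```c
static int cgroup_stat_show(struct seq_file *seq, void *v)
{
	struct cgroup *cgroup = seq_css(seq)->cgroup;

	/* counters maintained at online/offline time, see the later patch */
	seq_printf(seq, "nr_descendants %d\n", cgroup->nr_descendants);
	seq_printf(seq, "nr_dying_descendants %d\n", cgroup->nr_dying_descendants);
	return 0;
}
```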
-
Committed by Roman Gushchin
Creating cgroup hierarchies of unreasonable size can affect overall system performance. A user might want to limit the size of the cgroup hierarchy; this is especially important if a user is delegating some cgroup sub-tree.

To address this issue, introduce the ability to control the size of the cgroup hierarchy. The cgroup.max.descendants control file allows setting the maximum allowed number of descendant cgroups. The cgroup.max.depth file controls the maximum depth of the cgroup tree. Both are single-value r/w files with a default value of "max".

The control files exist on each hierarchy level, including the root. When a new cgroup is created, we check the total descendants and depth limits on each level, and the new cgroup is created only if none of them are exceeded. Only alive cgroups are counted; removed (dying) cgroups are ignored.
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
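Creation-time enforcement amounts to walking from the prospective parent up to the root and rejecting the new cgroup if any ancestor's limit would be exceeded. A simplified sketch; the struct field names here mirror the interface names and are illustrative:

```c
static bool cgroup_check_hierarchy_limits(struct cgroup *parent)
{
	struct cgroup *cgrp;
	int level = 1;	/* the cgroup being created adds one level below parent */

	for (cgrp = parent; cgrp; cgrp = cgroup_parent(cgrp), level++) {
		/* only alive descendants count against the limit */
		if (cgrp->nr_descendants >= cgrp->max_descendants)
			return false;
		if (level > cgrp->max_depth)
			return false;
	}
	return true;
}
```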
-
Committed by Roman Gushchin
Keep track of the number of online and dying descendant cgroups. This data will be used later to add the ability to control the cgroup hierarchy (limit the depth and the number of descendant cgroups) and to display hierarchy stats.
Signed-off-by: Roman Gushchin <guro@fb.com>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: kernel-team@fb.com
Cc: cgroups@vger.kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
-
- 29 July 2017, 2 commits
-
-
Committed by Shaohua Li
By default we output the cgroup id in blktrace. This adds an option to display the cgroup path instead. Since getting the cgroup path is a relatively heavy operation, we don't enable it by default.

With the option enabled, blktrace will output something like this:

  dd-1353 [007] d..2 293.015252: 8,0 /test/level D R 24 + 8 [dd]
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Shaohua Li
Now we have the facilities to implement exportfs operations. The idea is that cgroup can export fhandle info to userspace, and userspace then uses the fhandle to find the cgroup name. Another example: userspace can get an fhandle for a cgroup and BPF can use the fhandle to filter info for that cgroup.
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
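From userspace this looks like the standard file-handle API applied to a cgroup directory. A minimal sketch, assuming a cgroup2 mount at the illustrative path below:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	struct file_handle *fh;
	int mnt_id;

	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	fh->handle_bytes = MAX_HANDLE_SZ;

	/* ask the kernel for an exportable handle of the cgroup directory */
	if (name_to_handle_at(AT_FDCWD, "/sys/fs/cgroup/test", fh, &mnt_id, 0) < 0) {
		perror("name_to_handle_at");
		return 1;
	}
	printf("handle: type=%d bytes=%u\n", fh->handle_type, fh->handle_bytes);
	free(fh);
	return 0;
}
```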
-
- 26 July 2017, 2 commits
-
-
Committed by Tejun Heo
Explain cgroup_enable_threaded() and note that the function can never be called on the root cgroup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Waiman Long <longman@redhat.com>
-
Committed by Tejun Heo
cgroup_enable_threaded() checks that the cgroup doesn't have any tasks or children and fails the operation if it does. This test is unnecessary because the first part is already checked by cgroup_can_be_thread_root() and the latter is simply not needed.

The latter check actually causes a behavioral oddity. Consider the following hierarchy where all cgroups are domains.

      A
     / \
    B   C
         \
          D

If B is made threaded, C and D become invalid domains. Due to the no-children restriction, threaded mode can't be enabled on C. For C and D, the only thing the user can do is removal.

There is no reason for this restriction. Remove it.
Acked-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
-
- 23 July 2017, 1 commit
-
-
Committed by Tejun Heo
While refactoring, f7b2814b ("cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()") broke the error return value of the function: the return value from the last operation is always overridden to zero. Fix it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.6+
Signed-off-by: Tejun Heo <tj@kernel.org>
-
- 21 July 2017, 6 commits
-
-
Committed by Waiman Long
Update the debug controller so that it prints out debug info about thread mode:

1) The relationship between proc_cset and threaded_csets is displayed.
2) The status of being a thread root or a threaded cgroup is displayed.

This patch is extracted from Waiman's larger patch.

v2: - Removed [thread root] / [threaded] from the debug.cgroup_css_links file as the same information is available from cgroup.type. Suggested by Waiman.
    - Threaded marking is moved to the previous patch.
Patch-originally-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
-
Committed by Tejun Heo
This patch implements cgroup v2 thread support. The goal of the thread mode is to support hierarchical accounting and control at thread granularity while staying inside the resource domain model, which allows coordination across different resource controllers and handling of anonymous resource consumption.

A cgroup is always created as a domain and can be made threaded by writing to the "cgroup.type" file. When a cgroup becomes threaded, it becomes a member of a threaded subtree which is anchored at the closest ancestor which isn't threaded. The threads of the processes which are in a threaded subtree can be placed anywhere without being restricted by process granularity or the no-internal-process constraint. Note that the threads aren't allowed to escape to a different threaded subtree.

To be used inside a threaded subtree, a controller should explicitly support threaded mode and be able to handle internal competition in whatever way is appropriate for the resource.

The root of a threaded subtree, the nearest ancestor which isn't threaded, is called the threaded domain and serves as the resource domain for the whole subtree. This is the last cgroup where domain controllers are operational and where all the domain-level resource consumption in the subtree is accounted. This allows threaded controllers to operate at thread granularity when requested while staying inside the scope of system-level resource distribution.

As the root cgroup is exempt from the no-internal-process constraint, it can serve both as a threaded domain and as a parent to normal cgroups, so, unlike non-root cgroups, the root cgroup can have both domain and threaded children.

Internally, in a threaded subtree, each css_set has its ->dom_cset pointing to a matching css_set which belongs to the threaded domain. This ensures that the thread-root-level cgroup_subsys_state for all threaded controllers is readily accessible for domain-level operations.

This patch enables threaded mode for the pids and perf_events controllers. Neither has to worry about domain-level resource consumption and it's enough to simply set the flag.

For more details on the interface and behavior of the thread mode, please refer to section 2-2-2 in Documentation/cgroup-v2.txt added by this patch.

v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create(). Spotted by Waiman.
    - Documentation updated as suggested by Waiman.
    - cgroup.type content slightly reformatted.
    - Mark the debug controller threaded.

v4: - Updated to the general idea of marking specific cgroups domain/threaded as suggested by PeterZ.

v3: - Dropped "join" and always make mixed children join the parent's threaded subtree.

v2: - After discussions with Waiman, support for mixed thread mode is added. This should address the issue that Peter pointed out where any nesting should be avoided for thread subtrees while coexisting with other domain cgroups.
    - Enabling / disabling thread mode now piggybacks on the existing control mask update mechanism.
    - Bug fixes and cleanup.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
-
Committed by Tejun Heo
cgroup v2 is in the process of growing thread granularity support. Once thread mode is enabled, the root cgroup of the subtree serves as the dom_cgrp to which the processes of the subtree conceptually belong, and domain-level resource consumption not tied to any specific task is charged there. In the subtree, threads won't be subject to process granularity or the no-internal-task constraint and can be distributed arbitrarily across the subtree.

This patch implements a new task iterator flag CSS_TASK_ITER_THREADED, which, when used on a dom_cgrp, makes the iteration include the tasks on all the associated threaded css_sets. The "cgroup.procs" read path is updated to use it so that reading the file on a proc_cgrp lists all processes. This will also be used by controller implementations which need to walk processes or tasks at the resource-domain level.

Task iteration is implemented nested in css_set iteration. If CSS_TASK_ITER_THREADED is specified, after walking the tasks of each !threaded css_set, all the associated threaded css_sets are visited before moving on to the next !threaded css_set.

v2: ->cur_pcset renamed to ->cur_dcset. Updated for the new enable-threaded-per-cgroup behavior.
Signed-off-by: Tejun Heo <tj@kernel.org>
-
Committed by Tejun Heo
cgroup v2 is in the process of growing thread granularity support. A threaded subtree is composed of a thread root and threaded cgroups which are proper members of the subtree. The root cgroup of the subtree serves as the domain cgroup to which the processes (as opposed to threads / tasks) of the subtree conceptually belong, and domain-level resource consumption not tied to any specific task is charged there. Inside the subtree, threads won't be subject to process granularity or the no-internal-task constraint and can be distributed arbitrarily across the subtree.

This patch introduces cgroup->dom_cgrp along with threaded css_set handling.

* cgroup->dom_cgrp points to self for normal cgroups and thread roots. For proper threaded-subtree members, it points to the dom_cgrp (the thread root).

* css_set->dom_cset points to self for normal cgroups and thread roots. If threaded, it points to the css_set which belongs to cgrp->dom_cgrp. The dom_cgrp serves as the resource domain and keeps the matching csses available. The dom_cset holds those csses and makes them easily accessible.

* All threaded csets are linked on their dom_csets to enable iteration of all threaded tasks.

* cgroup->nr_threaded_children keeps track of the number of threaded children.

This patch adds the above but doesn't actually use them yet. The following patches will build on top.

v4: ->nr_threaded_children added.

v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset. Updated for the new enable-threaded-per-cgroup behavior.

v2: Added cgroup_is_threaded() helper.
Signed-off-by: Tejun Heo <tj@kernel.org>
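The resulting invariants can be summarized in a pair of trivial predicates. A sketch: the first mirrors the cgroup_is_threaded() helper mentioned above, while the second is named here purely for illustration:

```c
static inline bool cgroup_is_threaded(struct cgroup *cgrp)
{
	/* a threaded cgroup points at its threaded domain; a domain points at itself */
	return cgrp->dom_cgrp != cgrp;
}

static inline bool css_set_is_threaded(struct css_set *cset)
{
	/* same convention for css_sets: ->dom_cset is self unless threaded */
	return cset->dom_cset != cset;
}
```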
-
Committed by Tejun Heo
css_task_iter currently always walks all tasks. With the scheduled cgroup v2 thread support, the iterator will need to handle multiple types of iteration. As a preparation, add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag is not specified, it walks all tasks as before. When asserted, the iterator only walks the group leaders.

For now, the only user of the flag is the cgroup v2 "cgroup.procs" file, which no longer needs to skip non-leader tasks in cgroup_procs_next(). Note that cgroup v1 "cgroup.procs" can't use the group-leader walk, as v1 "cgroup.procs" doesn't mean "list all thread group leaders in the cgroup" but "list all thread group ids which have any threads in the cgroup".

While at it, update cgroup_procs_show() to use task_pid_vnr() instead of task_tgid_vnr(). As the iteration guarantees that the function only sees group leaders, this doesn't change the output and will allow sharing the function for thread iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>
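With the flags argument in place, a walk over only the thread-group leaders of a cgroup looks roughly like this (sketch; locking and error handling elided, the printing is purely illustrative):

```c
static void print_leaders(struct cgroup *cgrp)
{
	struct css_task_iter it;
	struct task_struct *task;

	css_task_iter_start(&cgrp->self, CSS_TASK_ITER_PROCS, &it);
	while ((task = css_task_iter_next(&it)))
		pr_info("leader pid %d\n", task_pid_nr(task));
	css_task_iter_end(&it);
}
```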
-
Committed by Tejun Heo
Currently, writes to the "cgroup.procs" and "cgroup.tasks" files are all handled by __cgroup_procs_write() on both v1 and v2. This patch reorganizes the write path so that there are common helper functions that the different write paths use.

While this somewhat increases LOC, the different paths are no longer intertwined and each path has more flexibility to implement different behaviors, which will be necessary for the planned v2 thread support.

v3: - Restructured so that cgroup_procs_write_permission() takes @src_cgrp and @dst_cgrp.

v2: - Rolled in Waiman's task reference count fix.
    - Updated on top of nsdelegate changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
-
- 19 July 2017, 1 commit
-
-
Committed by Tejun Heo
On subsystem registration, css_populate_dir() is not called on the new root css, so the interface files for the subsystem on cgrp_dfl_root aren't created at registration time. This is a residue from the days when cgrp_dfl_root was used only as the parking spot for unused subsystems, which is no longer true as it's used as the root for cgroup2.

This is often fine as later operations tend to create them as part of mount (cgroup1) or subtree_control operations (cgroup2); however, it's not difficult to mount cgroup2 with the controller interface files missing, as Waiman found out. Fix it by invoking css_populate_dir() on the root css on subsys registration.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Waiman Long <longman@redhat.com>
Cc: stable@vger.kernel.org # v4.5+
Signed-off-by: Tejun Heo <tj@kernel.org>
-
- 17 July 2017, 3 commits
-
-
Committed by Tejun Heo
Implement a trivial cgroup_has_tasks() which tests whether cgrp->nr_populated_csets is zero and replace the explicit local populated test in cgroup_subtree_control(). This simplifies the code and cgroup_has_tasks() will be used in more places later.
Signed-off-by: Tejun Heo <tj@kernel.org>
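The helper itself is a one-liner over the counter introduced by the split described in the next entry; a sketch:

```c
static bool cgroup_has_tasks(struct cgroup *cgrp)
{
	/* non-zero iff some populated css_set points at this cgroup */
	return cgrp->nr_populated_csets;
}
```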
-
Committed by Tejun Heo
cgrp->populated_cnt counts both local (the cgroup's populated css_sets) and subtree proper (populated children) so that it's only zero when the whole subtree, including self, is empty.

This patch splits the counter into two so that the local and children populated states are tracked separately. It allows finer-grained tests on the state of the hierarchy, which will be used to replace the css_set-walking local populated test.
Signed-off-by: Tejun Heo <tj@kernel.org>
-
Committed by Tejun Heo
Signed-off-by: Tejun Heo <tj@kernel.org>
-
- 08 July 2017, 1 commit
-
-
Committed by Tejun Heo
Subsystem migration methods shouldn't be called for empty migrations. cgroup_migrate_execute() implements this guarantee by bailing early if there are no source css_sets. This used to be correct before a79a908f ("cgroup: introduce cgroup namespaces"), but no longer is, because css_sets can now stay pinned without tasks in them.

This caused cgroup_migrate_execute() to call into cpuset migration methods with an empty cgroup_taskset. cpuset migration methods correctly assume that cgroup_taskset_first() never returns NULL; however, due to the bug, it can, leading to the following oops.

  Unable to handle kernel paging request for data at address 0x00000960
  Faulting instruction address: 0xc0000000001d6868
  Oops: Kernel access of bad area, sig: 11 [#1]
  ...
  CPU: 14 PID: 16947 Comm: kworker/14:0 Tainted: G W 4.12.0-rc4-next-20170609 #2
  Workqueue: events cpuset_hotplug_workfn
  task: c00000000ca60580 task.stack: c00000000c728000
  NIP: c0000000001d6868 LR: c0000000001d6858 CTR: c0000000001d6810
  REGS: c00000000c72b720 TRAP: 0300 Tainted: G W (4.12.0-rc4-next-20170609)
  MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44722422 XER: 20000000
  CFAR: c000000000008710 DAR: 0000000000000960 DSISR: 40000000 SOFTE: 1
  GPR00: c0000000001d6858 c00000000c72b9a0 c000000001536e00 0000000000000000
  GPR04: c00000000c72b9c0 0000000000000000 c00000000c72bad0 c000000766367678
  GPR08: c000000766366d10 c00000000c72b958 c000000001736e00 0000000000000000
  GPR12: c0000000001d6810 c00000000e749300 c000000000123ef8 c000000775af4180
  GPR16: 0000000000000000 0000000000000000 c00000075480e9c0 c00000075480e9e0
  GPR20: c00000075480e8c0 0000000000000001 0000000000000000 c00000000c72ba20
  GPR24: c00000000c72baa0 c00000000c72bac0 c000000001407248 c00000000c72ba20
  GPR28: c00000000141fc80 c00000000c72bac0 c00000000c6bc790 0000000000000000
  NIP [c0000000001d6868] cpuset_can_attach+0x58/0x1b0
  LR [c0000000001d6858] cpuset_can_attach+0x48/0x1b0
  Call Trace:
   [c00000000c72b9a0] [c0000000001d6858] cpuset_can_attach+0x48/0x1b0 (unreliable)
   [c00000000c72ba00] [c0000000001cbe80] cgroup_migrate_execute+0xb0/0x450
   [c00000000c72ba80] [c0000000001d3754] cgroup_transfer_tasks+0x1c4/0x360
   [c00000000c72bba0] [c0000000001d923c] cpuset_hotplug_workfn+0x86c/0xa20
   [c00000000c72bca0] [c00000000011aa44] process_one_work+0x1e4/0x580
   [c00000000c72bd30] [c00000000011ae78] worker_thread+0x98/0x5c0
   [c00000000c72bdc0] [c000000000124058] kthread+0x168/0x1b0
   [c00000000c72be30] [c00000000000b2e8] ret_from_kernel_thread+0x5c/0x74
  Instruction dump:
  f821ffa1 7c7d1b78 60000000 60000000 38810020 7fa3eb78 3f42ffed 4bff4c25
  60000000 3b5a0448 3d420020 eb610020 <e9230960> 7f43d378 e9290000 f92af200
  ---[ end trace dcaaf98fb36d9e64 ]---

This patch fixes the bug by adding an explicit nr_tasks counter to cgroup_taskset and skipping the migration-method calls if the counter is zero. While at it, remove the now-spurious check on no source css_sets.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-and-tested-by: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: stable@vger.kernel.org # v4.6+
Fixes: a79a908f ("cgroup: introduce cgroup namespaces")
Link: http://lkml.kernel.org/r/1497266622.15415.39.camel@abdul.in.ibm.com
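The shape of the fix is an explicit task count carried in the taskset and an early bail-out before any subsystem callback runs. A simplified sketch; the function name is suffixed to mark it as illustrative rather than the actual kernel code:

```c
static int cgroup_migrate_execute_sketch(struct cgroup_taskset *tset)
{
	/* tset->nr_tasks is bumped only for tasks actually queued for
	 * migration, so pinned-but-empty css_sets no longer reach
	 * the subsystem ->can_attach() callbacks */
	if (!tset->nr_tasks)
		return 0;

	/* ... the original migration sequence continues here ... */
	return 0;
}
```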
-
- 29 June 2017, 2 commits
-
-
Committed by Tejun Heo
Currently, cgroup only supports delegation to !root users and cgroup namespaces don't get any special treatment. This limits the usefulness of cgroup namespaces as they can't be safe delegation boundaries by themselves. A process inside a cgroup can change the resource control knobs of the parent in the namespace root and may move processes in and out of the namespace if cgroups outside its namespace are visible somehow.

This patch adds a new mount option "nsdelegate" which makes cgroup namespaces delegation boundaries. If set, cgroup behaves as if write-permission-based delegation took place at namespace boundaries - writes to the resource control knobs from the namespace root are denied and migrations crossing the namespace boundary aren't allowed from inside the namespace. This allows a cgroup namespace to function as a delegation boundary by itself.

v2: Silently ignore nsdelegate specified on !init mounts.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aravind Anbudurai <aru7@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Eric Biederman <ebiederm@xmission.com>
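From userspace the option is passed as ordinary cgroup2 mount data. A minimal sketch; the mount point is illustrative, and per the note above the option only takes effect on the init-namespace mount:

```c
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* mount cgroup2 with namespace delegation enabled */
	if (mount("none", "/sys/fs/cgroup/unified", "cgroup2", 0, "nsdelegate") < 0) {
		perror("mount");
		return 1;
	}
	return 0;
}
```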
-
Committed by Tejun Heo
Restructure cgroup_procs_write_permission() to make extending the permission logic easier. This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
-
- 15 June 2017, 1 commit
-
-
Committed by Waiman Long
The reference count in the css_set data structure was used as a proxy for the number of tasks attached to that css_set. However, that count is not an accurate measure, especially with thread mode support. So a new variable nr_tasks is added to the css_set to keep track of the actual task count. This new variable is protected by the css_set_lock. Functions that require the actual task count are updated to use the new variable.

tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps. Refreshed on top of cgroup/for-v4.13 which dropped the css_set_populated() -> nr_tasks conversion.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
-
- 18 May 2017, 1 commit
-
-
Committed by Waiman Long
The kill_css() function may be called more than once under the condition that the css was killed but not yet physically removed, followed by the removal of the cgroup that is hosting the css. This patch prevents any harm from being done when that happens.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.5+
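Conceptually the guard is a "dying" flag set on the first invocation and checked on re-entry. A simplified sketch, assuming a css flag along these lines (the exact flag name is an assumption):

```c
static void kill_css(struct cgroup_subsys_state *css)
{
	lockdep_assert_held(&cgroup_mutex);

	/* bail out if this css is already on its way out */
	if (css->flags & CSS_DYING)
		return;
	css->flags |= CSS_DYING;

	/* ... proceed with killing the css exactly as before ... */
}
```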
-
- 02 May 2017, 1 commit
-
-
Committed by Tejun Heo
a590b90d ("cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()") converted most cgroup_get() usages to cgroup_get_live(), leaving cgroup_sk_alloc() the sole user of cgroup_get(). When !CONFIG_SOCK_CGROUP_DATA, this ends up triggering an unused warning for cgroup_get(). Silence the warning by adding __maybe_unused to cgroup_get().
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20170501145340.17e8ef86@canb.auug.org.au
Signed-off-by: Tejun Heo <tj@kernel.org>
-
- 29 April 2017, 2 commits
-
-
Committed by Zefan Li
Commit bfb0b80d ("cgroup: avoid attaching a cgroup root to two different superblocks") is broken. Now we try to fix the race by delaying the initialization of the cgroup root refcnt until a superblock has been allocated.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Tested-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
-
Committed by Tejun Heo
cgroup_get() is expected to be called only on live cgroups and triggers a warning on a dead cgroup; however, cgroup_sk_alloc() may be called while cloning a socket which is left in an empty and removed cgroup, and thus may legitimately duplicate its reference on a dead cgroup. This currently triggers the following warning spuriously.

  WARNING: CPU: 14 PID: 0 at kernel/cgroup.c:490 cgroup_get+0x55/0x60
  ...
  [<ffffffff8107e123>] __warn+0xd3/0xf0
  [<ffffffff8107e20e>] warn_slowpath_null+0x1e/0x20
  [<ffffffff810ff465>] cgroup_get+0x55/0x60
  [<ffffffff81106061>] cgroup_sk_alloc+0x51/0xe0
  [<ffffffff81761beb>] sk_clone_lock+0x2db/0x390
  [<ffffffff817cce06>] inet_csk_clone_lock+0x16/0xc0
  [<ffffffff817e8173>] tcp_create_openreq_child+0x23/0x4b0
  [<ffffffff818601a1>] tcp_v6_syn_recv_sock+0x91/0x670
  [<ffffffff817e8b16>] tcp_check_req+0x3a6/0x4e0
  [<ffffffff81861ba3>] tcp_v6_rcv+0x693/0xa00
  [<ffffffff81837429>] ip6_input_finish+0x59/0x3e0
  [<ffffffff81837cb2>] ip6_input+0x32/0xb0
  [<ffffffff81837387>] ip6_rcv_finish+0x57/0xa0
  [<ffffffff81837ac8>] ipv6_rcv+0x318/0x4d0
  [<ffffffff817778c7>] __netif_receive_skb_core+0x2d7/0x9a0
  [<ffffffff81777fa6>] __netif_receive_skb+0x16/0x70
  [<ffffffff81778023>] netif_receive_skb_internal+0x23/0x80
  [<ffffffff817787d8>] napi_gro_frags+0x208/0x270
  [<ffffffff8168a9ec>] mlx4_en_process_rx_cq+0x74c/0xf40
  [<ffffffff8168b270>] mlx4_en_poll_rx_cq+0x30/0x90
  [<ffffffff81778b30>] net_rx_action+0x210/0x350
  [<ffffffff8188c426>] __do_softirq+0x106/0x2c7
  [<ffffffff81082bad>] irq_exit+0x9d/0xa0
  [<ffffffff8188c0e4>] do_IRQ+0x54/0xd0
  [<ffffffff8188a63f>] common_interrupt+0x7f/0x7f
  <EOI>
  [<ffffffff8173d7e7>] cpuidle_enter+0x17/0x20
  [<ffffffff810bdfd9>] cpu_startup_entry+0x2a9/0x2f0
  [<ffffffff8103edd1>] start_secondary+0xf1/0x100

This patch renames the existing cgroup_get(), which carries the dead-cgroup warning, to cgroup_get_live() after cgroup_kn_lock_live() and introduces a new cgroup_get() which doesn't check whether the cgroup is live or dead. All existing cgroup_get() users except for cgroup_sk_alloc() are converted to use cgroup_get_live().
Fixes: d979a39d ("cgroup: duplicate cgroup reference when cloning sockets")
Cc: stable@vger.kernel.org # v4.5+
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
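After the split, the two getters differ only in the liveness assertion; a sketch:

```c
static void cgroup_get(struct cgroup *cgrp)
{
	/* no liveness check: cgroup_sk_alloc() may legitimately pin a dead cgroup */
	css_get(&cgrp->self);
}

static void cgroup_get_live(struct cgroup *cgrp)
{
	/* every other caller is expected to operate on live cgroups only */
	WARN_ON_ONCE(cgroup_is_dead(cgrp));
	css_get(&cgrp->self);
}
```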
-
- 17 March 2017, 1 commit
-
-
Committed by Tejun Heo
Creation of a kthread goes through a couple of interlocked stages between the kthread itself and its creator. Once the new kthread starts running, it initializes itself and wakes up the creator. The creator can then further configure the kthread and let it start doing its job by waking it up.

In this configuration-by-creator stage, the creator is the only one that can wake it up, but the kthread is already visible to userland. When altering the kthread's attributes from userland is allowed, this is fine; however, for cases where CPU affinity is critical, kthread_bind() is used to first disable affinity changes from userland and then set the affinity. This also prevents the kthread from being migrated into non-root cgroups as that can affect the CPU affinity and many other things.

Unfortunately, the cgroup side of protection is racy. While the PF_NO_SETAFFINITY flag prevents further migrations, userland can win the race before the creator sets the flag with kthread_bind() and put the kthread in a non-root cgroup, which can lead to all sorts of problems including incorrect CPU affinity and starvation.

This bug got triggered by userland which periodically tries to migrate all processes in the root cpuset cgroup to a non-root one. Per-cpu workqueue workers got caught while being created and ended up with incorrect CPU affinity, breaking concurrency management and sometimes stalling workqueue execution.

This patch adds task->no_cgroup_migration which disallows the task from being migrated by userland. kthreadd starts with the flag set, making every child kthread start in the root cgroup with migration disallowed. The flag is cleared after the kthread finishes initialization, by which time PF_NO_SETAFFINITY is set if the kthread should stay in the root cgroup.

It'd be better to wait for the initialization instead of failing, but I couldn't think of a way of implementing that without adding either a new PF flag, or sleeping and retrying from the waiting side. Even if userland depends on changing cgroup membership of a kthread, it either has to be synchronized with kthread_create() or periodically repeat, so it's unlikely that this would break anything.

v2: Switch to a simpler implementation using a new task_struct bit field suggested by Oleg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
Signed-off-by: Tejun Heo <tj@kernel.org>
-
- 10 March 2017, 1 commit
-
-
Committed by Masahiro Yamada
Fix typos and add the following to scripts/spelling.txt:

disble||disable
disbled||disabled

I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c untouched. The macro is not referenced at all, but this commit is touching only comment blocks just in case.
Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 09 March 2017, 1 commit
-
-
Committed by Elena Reshetova
The refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows us to avoid accidental refcounter overflows that might lead to use-after-free situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
-
- 07 March 2017, 1 commit
-
-
Committed by Elena Reshetova
The refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows us to avoid accidental refcounter overflows that might lead to use-after-free situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
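The conversion follows the usual atomic_t-to-refcount_t pattern; a sketch for a css_set-style reference count (the release helper name is illustrative):

```c
#include <linux/refcount.h>

static void get_css_set_sketch(struct css_set *cset)
{
	refcount_inc(&cset->refcount);		/* was atomic_inc() */
}

static void put_css_set_sketch(struct css_set *cset)
{
	/* refcount_dec_and_test() saturates and warns on misuse instead of
	 * silently wrapping, which is the point of the conversion */
	if (refcount_dec_and_test(&cset->refcount))	/* was atomic_dec_and_test() */
		release_css_set(cset);			/* illustrative release helper */
}
```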
-
- 02 March 2017, 1 commit
-
-
Committed by Ingo Molnar
We are going to split <linux/sched/task.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/task.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
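The placeholder itself is about as small as a header gets; a sketch of the new file:

```c
/* include/linux/sched/task.h - initial placeholder (sketch) */
#ifndef _LINUX_SCHED_TASK_H
#define _LINUX_SCHED_TASK_H

/* For now just map to <linux/sched.h>; the fork/exit related
 * declarations are moved here by later patches. */
#include <linux/sched.h>

#endif /* _LINUX_SCHED_TASK_H */
```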
-