- 07 9月, 2017 1 次提交
-
-
由 Roman Gushchin 提交于
Commit fa06235b ("cgroup: reset css on destruction") caused css_reset callback to be called from the offlining path. Although it solves the problem mentioned in the commit description ("For instance, memory cgroup needs to reset memory.low, otherwise pages charged to a dead cgroup might never get reclaimed."), generally speaking, it's not correct. An offline cgroup can still be a resource domain, and we shouldn't grant it more resources than it had before deletion. For instance, if an offline memory cgroup has dirty pages, we should still imply i/o limits during writeback. The css_reset callback is designed to return the cgroup state into the original state, that means reset all limits and counters. It's spomething different from the offlining, and we shouldn't use it from the offlining path. Instead, we should adjust necessary settings from the per-controller css_offline callbacks (e.g. reset memory.low). Link: http://lkml.kernel.org/r/20170727130428.28856-2-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com> Acked-by: NTejun Heo <tj@kernel.org> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 8月, 2017 1 次提交
-
-
由 Dan Carpenter 提交于
"descendants" and "depth" are declared as int, so they can't be larger than INT_MAX. Static checkers complain and it's slightly confusing for humans as well so let's just remove these conditions. Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 11 8月, 2017 1 次提交
-
-
由 Tejun Heo 提交于
Misc trivial changes to prepare for future changes. No functional difference. * Expose cgroup_get(), cgroup_tryget() and cgroup_parent(). * Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp. * Rename cgroup_stats_show() to cgroup_stat_show() for consistency with the file name. Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 03 8月, 2017 5 次提交
-
-
由 Tejun Heo 提交于
Each css_set directly points to the default cgroup it belongs to, so there's no reason to walk the cgrp_links list on the default hierarchy. Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Roman Gushchin 提交于
As we already have a pointer to the parent cgroup in cgroup_destroy_locked(), we don't need to calculate it again to pass as an argument for cgroup1_check_for_release(). Signed-off-by: NRoman Gushchin <guro@fb.com> Suggested-by: NTejun Heo <tj@kernel.org> Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: kernel-team@fb.com Cc: linux-kernel@vger.kernel.org
-
由 Roman Gushchin 提交于
A cgroup can consume resources even after being deleted by a user. For example, writing back dirty pages should be accounted and limited, despite the corresponding cgroup might contain no processes and being deleted by a user. In the current implementation a cgroup can remain in such "dying" state for an undefined amount of time. For instance, if a memory cgroup contains a pge, mlocked by a process belonging to an other cgroup. Although the lifecycle of a dying cgroup is out of user's control, it's important to have some insight of what's going on under the hood. In particular, it's handy to have a counter which will allow to detect css leaks. To solve this problem, add a cgroup.stat interface to the base cgroup control files with the following metrics: nr_descendants total number of visible descendant cgroups nr_dying_descendants total number of dying descendant cgroups Signed-off-by: NRoman Gushchin <guro@fb.com> Suggested-by: NTejun Heo <tj@kernel.org> Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org
-
由 Roman Gushchin 提交于
Creating cgroup hierearchies of unreasonable size can affect overall system performance. A user might want to limit the size of cgroup hierarchy. This is especially important if a user is delegating some cgroup sub-tree. To address this issue, introduce an ability to control the size of cgroup hierarchy. The cgroup.max.descendants control file allows to set the maximum allowed number of descendant cgroups. The cgroup.max.depth file controls the maximum depth of the cgroup tree. Both are single value r/w files, with "max" default value. The control files exist on each hierarchy level (including root). When a new cgroup is created, we check the total descendants and depth limits on each level, and if none of them are exceeded, a new cgroup is created. Only alive cgroups are counted, removed (dying) cgroups are ignored. Signed-off-by: NRoman Gushchin <guro@fb.com> Suggested-by: NTejun Heo <tj@kernel.org> Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org
-
由 Roman Gushchin 提交于
Keep track of the number of online and dying descent cgroups. This data will be used later to add an ability to control cgroup hierarchy (limit the depth and the number of descent cgroups) and display hierarchy stats. Signed-off-by: NRoman Gushchin <guro@fb.com> Suggested-by: NTejun Heo <tj@kernel.org> Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan@huawei.com> Cc: Waiman Long <longman@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: kernel-team@fb.com Cc: cgroups@vger.kernel.org Cc: linux-doc@vger.kernel.org Cc: linux-kernel@vger.kernel.org
-
- 29 7月, 2017 2 次提交
-
-
由 Shaohua Li 提交于
By default we output cgroup id in blktrace. This adds an option to display cgroup path. Since get cgroup path is a relativly heavy operation, we don't enable it by default. with the option enabled, blktrace will output something like this: dd-1353 [007] d..2 293.015252: 8,0 /test/level D R 24 + 8 [dd] Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Shaohua Li 提交于
Now we have the facilities to implement exportfs operations. The idea is cgroup can export the fhandle info to userspace, then userspace uses fhandle to find the cgroup name. Another example is userspace can get fhandle for a cgroup and BPF uses the fhandle to filter info for the cgroup. Acked-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 26 7月, 2017 2 次提交
-
-
由 Tejun Heo 提交于
Explain cgroup_enable_threaded() and note that the function can never be called on the root cgroup. Signed-off-by: NTejun Heo <tj@kernel.org> Suggested-by: NWaiman Long <longman@redhat.com>
-
由 Tejun Heo 提交于
cgroup_enable_threaded() checks that the cgroup doesn't have any tasks or children and fails the operation if so. This test is unnecessary because the first part is already checked by cgroup_can_be_thread_root() and the latter is unnecessary. The latter actually cause a behavioral oddity. Please consider the following hierarchy. All cgroups are domains. A / \ B C \ D If B is made threaded, C and D becomes invalid domains. Due to the no children restriction, threaded mode can't be enabled on C. For C and D, the only thing the user can do is removal. There is no reason for this restriction. Remove it. Acked-by: NWaiman Long <longman@redhat.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 23 7月, 2017 1 次提交
-
-
由 Tejun Heo 提交于
While refactoring, f7b2814b ("cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write()") broke error return value from the function. The return value from the last operation is always overridden to zero. Fix it. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v4.6+ Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 21 7月, 2017 6 次提交
-
-
由 Waiman Long 提交于
Update debug controller so that it prints out debug info about thread mode. 1) The relationship between proc_cset and threaded_csets are displayed. 2) The status of being a thread root or threaded cgroup is displayed. This patch is extracted from Waiman's larger patch. v2: - Removed [thread root] / [threaded] from debug.cgroup_css_links file as the same information is available from cgroup.type. Suggested by Waiman. - Threaded marking is moved to the previous patch. Patch-originally-by: NWaiman Long <longman@redhat.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Tejun Heo 提交于
This patch implements cgroup v2 thread support. The goal of the thread mode is supporting hierarchical accounting and control at thread granularity while staying inside the resource domain model which allows coordination across different resource controllers and handling of anonymous resource consumptions. A cgroup is always created as a domain and can be made threaded by writing to the "cgroup.type" file. When a cgroup becomes threaded, it becomes a member of a threaded subtree which is anchored at the closest ancestor which isn't threaded. The threads of the processes which are in a threaded subtree can be placed anywhere without being restricted by process granularity or no-internal-process constraint. Note that the threads aren't allowed to escape to a different threaded subtree. To be used inside a threaded subtree, a controller should explicitly support threaded mode and be able to handle internal competition in the way which is appropriate for the resource. The root of a threaded subtree, the nearest ancestor which isn't threaded, is called the threaded domain and serves as the resource domain for the whole subtree. This is the last cgroup where domain controllers are operational and where all the domain-level resource consumptions in the subtree are accounted. This allows threaded controllers to operate at thread granularity when requested while staying inside the scope of system-level resource distribution. As the root cgroup is exempt from the no-internal-process constraint, it can serve as both a threaded domain and a parent to normal cgroups, so, unlike non-root cgroups, the root cgroup can have both domain and threaded children. Internally, in a threaded subtree, each css_set has its ->dom_cset pointing to a matching css_set which belongs to the threaded domain. This ensures that thread root level cgroup_subsys_state for all threaded controllers are readily accessible for domain-level operations. This patch enables threaded mode for the pids and perf_events controllers. Neither has to worry about domain-level resource consumptions and it's enough to simply set the flag. For more details on the interface and behavior of the thread mode, please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added by this patch. v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create(). Spotted by Waiman. - Documentation updated as suggested by Waiman. - cgroup.type content slightly reformatted. - Mark the debug controller threaded. v4: - Updated to the general idea of marking specific cgroups domain/threaded as suggested by PeterZ. v3: - Dropped "join" and always make mixed children join the parent's threaded subtree. v2: - After discussions with Waiman, support for mixed thread mode is added. This should address the issue that Peter pointed out where any nesting should be avoided for thread subtrees while coexisting with other domain cgroups. - Enabling / disabling thread mode now piggy backs on the existing control mask update mechanism. - Bug fixes and cleanup. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org>
-
由 Tejun Heo 提交于
cgroup v2 is in the process of growing thread granularity support. Once thread mode is enabled, the root cgroup of the subtree serves as the dom_cgrp to which the processes of the subtree conceptually belong and domain-level resource consumptions not tied to any specific task are charged. In the subtree, threads won't be subject to process granularity or no-internal-task constraint and can be distributed arbitrarily across the subtree. This patch implements a new task iterator flag CSS_TASK_ITER_THREADED, which, when used on a dom_cgrp, makes the iteration include the tasks on all the associated threaded css_sets. "cgroup.procs" read path is updated to use it so that reading the file on a proc_cgrp lists all processes. This will also be used by controller implementations which need to walk processes or tasks at the resource domain level. Task iteration is implemented nested in css_set iteration. If CSS_TASK_ITER_THREADED is specified, after walking tasks of each !threaded css_set, all the associated threaded css_sets are visited before moving onto the next !threaded css_set. v2: ->cur_pcset renamed to ->cur_dcset. Updated for the new enable-threaded-per-cgroup behavior. Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Tejun Heo 提交于
cgroup v2 is in the process of growing thread granularity support. A threaded subtree is composed of a thread root and threaded cgroups which are proper members of the subtree. The root cgroup of the subtree serves as the domain cgroup to which the processes (as opposed to threads / tasks) of the subtree conceptually belong and domain-level resource consumptions not tied to any specific task are charged. Inside the subtree, threads won't be subject to process granularity or no-internal-task constraint and can be distributed arbitrarily across the subtree. This patch introduces cgroup->dom_cgrp along with threaded css_set handling. * cgroup->dom_cgrp points to self for normal and thread roots. For proper thread subtree members, points to the dom_cgrp (the thread root). * css_set->dom_cset points to self if for normal and thread roots. If threaded, points to the css_set which belongs to the cgrp->dom_cgrp. The dom_cgrp serves as the resource domain and keeps the matching csses available. The dom_cset holds those csses and makes them easily accessible. * All threaded csets are linked on their dom_csets to enable iteration of all threaded tasks. * cgroup->nr_threaded_children keeps track of the number of threaded children. This patch adds the above but doesn't actually use them yet. The following patches will build on top. v4: ->nr_threaded_children added. v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset. Updated for the new enable-threaded-per-cgroup behavior. v2: Added cgroup_is_threaded() helper. Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Tejun Heo 提交于
css_task_iter currently always walks all tasks. With the scheduled cgroup v2 thread support, the iterator would need to handle multiple types of iteration. As a preparation, add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS. If the flag is not specified, it walks all tasks as before. When asserted, the iterator only walks the group leaders. For now, the only user of the flag is cgroup v2 "cgroup.procs" file which no longer needs to skip non-leader tasks in cgroup_procs_next(). Note that cgroup v1 "cgroup.procs" can't use the group leader walk as v1 "cgroup.procs" doesn't mean "list all thread group leaders in the cgroup" but "list all thread group id's with any threads in the cgroup". While at it, update cgroup_procs_show() to use task_pid_vnr() instead of task_tgid_vnr(). As the iteration guarantees that the function only sees group leaders, this doesn't change the output and will allow sharing the function for thread iteration. Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Tejun Heo 提交于
Currently, writes "cgroup.procs" and "cgroup.tasks" files are all handled by __cgroup_procs_write() on both v1 and v2. This patch reoragnizes the write path so that there are common helper functions that different write paths use. While this somewhat increases LOC, the different paths are no longer intertwined and each path has more flexibility to implement different behaviors which will be necessary for the planned v2 thread support. v3: - Restructured so that cgroup_procs_write_permission() takes @src_cgrp and @dst_cgrp. v2: - Rolled in Waiman's task reference count fix. - Updated on top of nsdelegate changes. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Waiman Long <longman@redhat.com>
-
- 19 7月, 2017 1 次提交
-
-
由 Tejun Heo 提交于
On subsystem registration, css_populate_dir() is not called on the new root css, so the interface files for the subsystem on cgrp_dfl_root aren't created on registration. This is a residue from the days when cgrp_dfl_root was used only as the parking spot for unused subsystems, which no longer is true as it's used as the root for cgroup2. This is often fine as later operations tend to create them as a part of mount (cgroup1) or subtree_control operations (cgroup2); however, it's not difficult to mount cgroup2 with the controller interface files missing as Waiman found out. Fix it by invoking css_populate_dir() on the root css on subsys registration. Signed-off-by: NTejun Heo <tj@kernel.org> Reported-and-tested-by: NWaiman Long <longman@redhat.com> Cc: stable@vger.kernel.org # v4.5+ Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 17 7月, 2017 3 次提交
-
-
由 Tejun Heo 提交于
Implement trivial cgroup_has_tasks() which tests whether cgrp->nr_populated_csets is zero and replace the explicit local populated test in cgroup_subtree_control(). This simplifies the code and cgroup_has_tasks() will be used in more places later. Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Tejun Heo 提交于
cgrp->populated_cnt counts both local (the cgroup's populated css_sets) and subtree proper (populated children) so that it's only zero when the whole subtree, including self, is empty. This patch splits the counter into two so that local and children populated states are tracked separately. It allows finer-grained tests on the state of the hierarchy which will be used to replace css_set walking local populated test. Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Tejun Heo 提交于
Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 08 7月, 2017 1 次提交
-
-
由 Tejun Heo 提交于
Subsystem migration methods shouldn't be called for empty migrations. cgroup_migrate_execute() implements this guarantee by bailing early if there are no source css_sets. This used to be correct before a79a908f ("cgroup: introduce cgroup namespaces"), but no longer since the commit because css_sets can stay pinned without tasks in them. This caused cgroup_migrate_execute() call into cpuset migration methods with an empty cgroup_taskset. cpuset migration methods correctly assume that cgroup_taskset_first() never returns NULL; however, due to the bug, it can, leading to the following oops. Unable to handle kernel paging request for data at address 0x00000960 Faulting instruction address: 0xc0000000001d6868 Oops: Kernel access of bad area, sig: 11 [#1] ... CPU: 14 PID: 16947 Comm: kworker/14:0 Tainted: G W 4.12.0-rc4-next-20170609 #2 Workqueue: events cpuset_hotplug_workfn task: c00000000ca60580 task.stack: c00000000c728000 NIP: c0000000001d6868 LR: c0000000001d6858 CTR: c0000000001d6810 REGS: c00000000c72b720 TRAP: 0300 Tainted: GW (4.12.0-rc4-next-20170609) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44722422 XER: 20000000 CFAR: c000000000008710 DAR: 0000000000000960 DSISR: 40000000 SOFTE: 1 GPR00: c0000000001d6858 c00000000c72b9a0 c000000001536e00 0000000000000000 GPR04: c00000000c72b9c0 0000000000000000 c00000000c72bad0 c000000766367678 GPR08: c000000766366d10 c00000000c72b958 c000000001736e00 0000000000000000 GPR12: c0000000001d6810 c00000000e749300 c000000000123ef8 c000000775af4180 GPR16: 0000000000000000 0000000000000000 c00000075480e9c0 c00000075480e9e0 GPR20: c00000075480e8c0 0000000000000001 0000000000000000 c00000000c72ba20 GPR24: c00000000c72baa0 c00000000c72bac0 c000000001407248 c00000000c72ba20 GPR28: c00000000141fc80 c00000000c72bac0 c00000000c6bc790 0000000000000000 NIP [c0000000001d6868] cpuset_can_attach+0x58/0x1b0 LR [c0000000001d6858] cpuset_can_attach+0x48/0x1b0 Call Trace: [c00000000c72b9a0] [c0000000001d6858] cpuset_can_attach+0x48/0x1b0 (unreliable) [c00000000c72ba00] [c0000000001cbe80] cgroup_migrate_execute+0xb0/0x450 [c00000000c72ba80] [c0000000001d3754] cgroup_transfer_tasks+0x1c4/0x360 [c00000000c72bba0] [c0000000001d923c] cpuset_hotplug_workfn+0x86c/0xa20 [c00000000c72bca0] [c00000000011aa44] process_one_work+0x1e4/0x580 [c00000000c72bd30] [c00000000011ae78] worker_thread+0x98/0x5c0 [c00000000c72bdc0] [c000000000124058] kthread+0x168/0x1b0 [c00000000c72be30] [c00000000000b2e8] ret_from_kernel_thread+0x5c/0x74 Instruction dump: f821ffa1 7c7d1b78 60000000 60000000 38810020 7fa3eb78 3f42ffed 4bff4c25 60000000 3b5a0448 3d420020 eb610020 <e9230960> 7f43d378 e9290000 f92af200 ---[ end trace dcaaf98fb36d9e64 ]--- This patch fixes the bug by adding an explicit nr_tasks counter to cgroup_taskset and skipping calling the migration methods if the counter is zero. While at it, remove the now spurious check on no source css_sets. Signed-off-by: NTejun Heo <tj@kernel.org> Reported-and-tested-by: NAbdul Haleem <abdhalee@linux.vnet.ibm.com> Cc: Roman Gushchin <guro@fb.com> Cc: stable@vger.kernel.org # v4.6+ Fixes: a79a908f ("cgroup: introduce cgroup namespaces") Link: http://lkml.kernel.org/r/1497266622.15415.39.camel@abdul.in.ibm.com
-
- 29 6月, 2017 2 次提交
-
-
由 Tejun Heo 提交于
Currently, cgroup only supports delegation to !root users and cgroup namespaces don't get any special treatments. This limits the usefulness of cgroup namespaces as they by themselves can't be safe delegation boundaries. A process inside a cgroup can change the resource control knobs of the parent in the namespace root and may move processes in and out of the namespace if cgroups outside its namespace are visible somehow. This patch adds a new mount option "nsdelegate" which makes cgroup namespaces delegation boundaries. If set, cgroup behaves as if write permission based delegation took place at namespace boundaries - writes to the resource control knobs from the namespace root are denied and migration crossing the namespace boundary aren't allowed from inside the namespace. This allows cgroup namespace to function as a delegation boundary by itself. v2: Silently ignore nsdelegate specified on !init mounts. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Aravind Anbudurai <aru7@fb.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Eric Biederman <ebiederm@xmission.com>
-
由 Tejun Heo 提交于
Restructure cgroup_procs_write_permission() to make extending permission logic easier. This patch doesn't cause any functional changes. Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 15 6月, 2017 1 次提交
-
-
由 Waiman Long 提交于
The reference count in the css_set data structure was used as a proxy of the number of tasks attached to that css_set. However, that count is actually not an accurate measure especially with thread mode support. So a new variable nr_tasks is added to the css_set to keep track of the actual task count. This new variable is protected by the css_set_lock. Functions that require the actual task count are updated to use the new variable. tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps. Refreshed on top of cgroup/for-v4.13 which dropped on css_set_populated() -> nr_tasks conversion. Signed-off-by: NWaiman Long <longman@redhat.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 18 5月, 2017 1 次提交
-
-
由 Waiman Long 提交于
The kill_css() function may be called more than once under the condition that the css was killed but not physically removed yet followed by the removal of the cgroup that is hosting the css. This patch prevents any harmm from being done when that happens. Signed-off-by: NWaiman Long <longman@redhat.com> Signed-off-by: NTejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org # v4.5+
-
- 02 5月, 2017 1 次提交
-
-
由 Tejun Heo 提交于
a590b90d ("cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc()") converted most cgroup_get() usages to cgroup_get_live() leaving cgroup_sk_alloc() the sole user of cgroup_get(). When !CONFIG_SOCK_CGROUP_DATA, this ends up triggering unused warning for cgroup_get(). Silence the warning by adding __maybe_unused to cgroup_get(). Reported-by: NStephen Rothwell <sfr@canb.auug.org.au> Link: http://lkml.kernel.org/r/20170501145340.17e8ef86@canb.auug.org.auSigned-off-by: NTejun Heo <tj@kernel.org>
-
- 29 4月, 2017 2 次提交
-
-
由 Zefan Li 提交于
Commit bfb0b80d ("cgroup: avoid attaching a cgroup root to two different superblocks") is broken. Now we try to fix the race by delaying the initialization of cgroup root refcnt until a superblock has been allocated. Reported-by: NDmitry Vyukov <dvyukov@google.com> Reported-by: NAndrei Vagin <avagin@virtuozzo.com> Tested-by: NAndrei Vagin <avagin@virtuozzo.com> Signed-off-by: NZefan Li <lizefan@huawei.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Tejun Heo 提交于
cgroup_get() expected to be called only on live cgroups and triggers warning on a dead cgroup; however, cgroup_sk_alloc() may be called while cloning a socket which is left in an empty and removed cgroup and thus may legitimately duplicate its reference on a dead cgroup. This currently triggers the following warning spuriously. WARNING: CPU: 14 PID: 0 at kernel/cgroup.c:490 cgroup_get+0x55/0x60 ... [<ffffffff8107e123>] __warn+0xd3/0xf0 [<ffffffff8107e20e>] warn_slowpath_null+0x1e/0x20 [<ffffffff810ff465>] cgroup_get+0x55/0x60 [<ffffffff81106061>] cgroup_sk_alloc+0x51/0xe0 [<ffffffff81761beb>] sk_clone_lock+0x2db/0x390 [<ffffffff817cce06>] inet_csk_clone_lock+0x16/0xc0 [<ffffffff817e8173>] tcp_create_openreq_child+0x23/0x4b0 [<ffffffff818601a1>] tcp_v6_syn_recv_sock+0x91/0x670 [<ffffffff817e8b16>] tcp_check_req+0x3a6/0x4e0 [<ffffffff81861ba3>] tcp_v6_rcv+0x693/0xa00 [<ffffffff81837429>] ip6_input_finish+0x59/0x3e0 [<ffffffff81837cb2>] ip6_input+0x32/0xb0 [<ffffffff81837387>] ip6_rcv_finish+0x57/0xa0 [<ffffffff81837ac8>] ipv6_rcv+0x318/0x4d0 [<ffffffff817778c7>] __netif_receive_skb_core+0x2d7/0x9a0 [<ffffffff81777fa6>] __netif_receive_skb+0x16/0x70 [<ffffffff81778023>] netif_receive_skb_internal+0x23/0x80 [<ffffffff817787d8>] napi_gro_frags+0x208/0x270 [<ffffffff8168a9ec>] mlx4_en_process_rx_cq+0x74c/0xf40 [<ffffffff8168b270>] mlx4_en_poll_rx_cq+0x30/0x90 [<ffffffff81778b30>] net_rx_action+0x210/0x350 [<ffffffff8188c426>] __do_softirq+0x106/0x2c7 [<ffffffff81082bad>] irq_exit+0x9d/0xa0 [<ffffffff8188c0e4>] do_IRQ+0x54/0xd0 [<ffffffff8188a63f>] common_interrupt+0x7f/0x7f <EOI> [<ffffffff8173d7e7>] cpuidle_enter+0x17/0x20 [<ffffffff810bdfd9>] cpu_startup_entry+0x2a9/0x2f0 [<ffffffff8103edd1>] start_secondary+0xf1/0x100 This patch renames the existing cgroup_get() with the dead cgroup warning to cgroup_get_live() after cgroup_kn_lock_live() and introduces the new cgroup_get() which doesn't check whether the cgroup is live or dead. All existing cgroup_get() users except for cgroup_sk_alloc() are converted to use cgroup_get_live(). Fixes: d979a39d ("cgroup: duplicate cgroup reference when cloning sockets") Cc: stable@vger.kernel.org # v4.5+ Cc: Johannes Weiner <hannes@cmpxchg.org> Reported-by: NChris Mason <clm@fb.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 17 3月, 2017 1 次提交
-
-
由 Tejun Heo 提交于
Creation of a kthread goes through a couple interlocked stages between the kthread itself and its creator. Once the new kthread starts running, it initializes itself and wakes up the creator. The creator then can further configure the kthread and then let it start doing its job by waking it up. In this configuration-by-creator stage, the creator is the only one that can wake it up but the kthread is visible to userland. When altering the kthread's attributes from userland is allowed, this is fine; however, for cases where CPU affinity is critical, kthread_bind() is used to first disable affinity changes from userland and then set the affinity. This also prevents the kthread from being migrated into non-root cgroups as that can affect the CPU affinity and many other things. Unfortunately, the cgroup side of protection is racy. While the PF_NO_SETAFFINITY flag prevents further migrations, userland can win the race before the creator sets the flag with kthread_bind() and put the kthread in a non-root cgroup, which can lead to all sorts of problems including incorrect CPU affinity and starvation. This bug got triggered by userland which periodically tries to migrate all processes in the root cpuset cgroup to a non-root one. Per-cpu workqueue workers got caught while being created and ended up with incorrected CPU affinity breaking concurrency management and sometimes stalling workqueue execution. This patch adds task->no_cgroup_migration which disallows the task to be migrated by userland. kthreadd starts with the flag set making every child kthread start in the root cgroup with migration disallowed. The flag is cleared after the kthread finishes initialization by which time PF_NO_SETAFFINITY is set if the kthread should stay in the root cgroup. It'd be better to wait for the initialization instead of failing but I couldn't think of a way of implementing that without adding either a new PF flag, or sleeping and retrying from waiting side. Even if userland depends on changing cgroup membership of a kthread, it either has to be synchronized with kthread_create() or periodically repeat, so it's unlikely that this would break anything. v2: Switch to a simpler implementation using a new task_struct bit field suggested by Oleg. Signed-off-by: NTejun Heo <tj@kernel.org> Suggested-by: NOleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Reported-and-debugged-by: NChris Mason <clm@fb.com> Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3) Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 10 3月, 2017 1 次提交
-
-
由 Masahiro Yamada 提交于
Fix typos and add the following to the scripts/spelling.txt: disble||disable disbled||disabled I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c untouched. The macro is not referenced at all, but this commit is touching only comment blocks just in case. Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 3月, 2017 1 次提交
-
-
由 Elena Reshetova 提交于
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: NElena Reshetova <elena.reshetova@intel.com> Signed-off-by: NHans Liljestrand <ishkamiel@gmail.com> Signed-off-by: NKees Cook <keescook@chromium.org> Signed-off-by: NDavid Windsor <dwindsor@gmail.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 07 3月, 2017 1 次提交
-
-
由 Elena Reshetova 提交于
refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: NElena Reshetova <elena.reshetova@intel.com> Signed-off-by: NHans Liljestrand <ishkamiel@gmail.com> Signed-off-by: NKees Cook <keescook@chromium.org> Signed-off-by: NDavid Windsor <dwindsor@gmail.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 02 3月, 2017 1 次提交
-
-
由 Ingo Molnar 提交于
We are going to split <linux/sched/task.h> out of <linux/sched.h>, which will have to be picked up from other headers and a couple of .c files. Create a trivial placeholder <linux/sched/task.h> file that just maps to <linux/sched.h> to make this patch obviously correct and bisectable. Include the new header in the files that are going to need it. Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 03 2月, 2017 1 次提交
-
-
由 Tejun Heo 提交于
Along with the write access to the cgroup.procs or tasks file, cgroup has required the writer's euid, unless root, to match [s]uid of the target process or task. On cgroup v1, this is necessary because there's nothing preventing a delegatee from pulling in tasks or processes from all over the system. If a user has a cgroup subdirectory delegated to it, the user would have write access to the cgroup.procs or tasks file. If there are no further checks than file write access check, the user would be able to pull processes from all over the system into its subhierarchy which is clearly not the intended behavior. The matching [s]uid requirement partially prevents this problem by allowing a delegatee to pull in the processes that belongs to it. This isn't a sufficient protection however, because a user would still be able to jump processes across two disjoint sub-hierarchies that has been delegated to them. cgroup v2 resolves the issue by requiring the writer to have access to the common ancestor of the cgroup.procs file of the source and target cgroups. This confines each delegatee to their own sub-hierarchy proper and bases all permission decisions on the cgroup filesystem rather than having to pull in explicit uid matching. cgroup v2 has still been applying the matching [s]uid requirement just for historical reasons. On cgroup2, the requirement doesn't serve any purpose while unnecessarily complicating the permission model. Let's drop it. Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 31 1月, 2017 1 次提交
-
-
由 Tejun Heo 提交于
* cgrp_dfl_implicit_ss_mask is ulong instead of u16 unlike other ss_masks. Make it a u16. * Move have_canfork_callback together with other callback ss_masks. Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 16 1月, 2017 2 次提交
-
-
由 Tejun Heo 提交于
Currently, subsys->*attach() callbacks are called for all subsystems which are attached to the hierarchy on which the migration is taking place. With cgroup_migrate_prepare_dst() filtering out identity migrations, v1 hierarchies can avoid spurious ->*attach() callback invocations where the source and destination csses are identical; however, this isn't enough on v2 as only a subset of the attached controllers can be affected on controller enable/disable. While spurious ->*attach() invocations aren't critically broken, they're unnecessary overhead and can lead to temporary overcharges on certain controllers. Fix it by tracking which subsystems are affected by a migration and invoking ->*attach() callbacks only on those subsystems. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NZefan Li <lizefan@huawei.com>
-
由 Tejun Heo 提交于
cgroup migration is performed in four steps - css_set preloading, addition of target tasks, actual migration, and clean up. A list named preloaded_csets is used to track the preloading. This is a bit too restricted and the code is already depending on the subtlety that all source css_sets appear before destination ones. Let's create struct cgroup_mgctx which keeps track of everything during migration. Currently, it has separate preload lists for source and destination csets and also embeds cgroup_taskset which is used during the actual migration. This moves struct cgroup_taskset definition to cgroup-internal.h. This patch doesn't cause any functional changes. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NZefan Li <lizefan@huawei.com>
-