- 16 Oct, 2015: 2 commits

Committed by Tejun Heo
Currently, cgroup->nr_populated counts whether the cgroup has any css_sets linked to it and the number of children which have non-zero ->nr_populated. This works because a css_set's refcnt converges with the number of tasks linked to it, and thus there's no css_set linked to a cgroup if it doesn't have any live tasks.

To help track resource usage of zombie tasks, putting the ref of a css_set will be separated from disassociating the task from the css_set, which means that a cgroup may have css_sets linked to it even when it doesn't have any live tasks.

This patch updates cgroup->nr_populated so that, for the cgroup itself, it counts the number of css_sets which have tasks associated with them, so that empty css_sets don't skew the populated test.

Signed-off-by: Tejun Heo <tj@kernel.org>
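As a rough illustration of the counting rule described above, the following userspace C model keeps one counter per cgroup. The counter is the sum of the cgroup's populated css_sets and its populated children. A change is propagated to the parent only when a cgroup flips between empty and populated. All struct and function names here (update_populated(), css_set_add_task(), nr_tasks) are hypothetical simplifications, not the kernel's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model; not the kernel's real structures. */
struct cgroup {
    struct cgroup *parent;
    int nr_populated;  /* populated css_sets of this cgroup + populated children */
};

struct css_set {
    struct cgroup *cgrp;
    int nr_tasks;      /* live tasks; may reach 0 while references remain */
};

/*
 * Propagate a populated/empty transition upwards. A parent's count only
 * changes when the child itself flips between 0 and non-zero.
 */
static void update_populated(struct cgroup *cgrp, int populated)
{
    do {
        int trigger;  /* did this cgroup flip between empty and populated? */

        if (populated)
            trigger = (cgrp->nr_populated++ == 0);
        else
            trigger = (--cgrp->nr_populated == 0);
        if (!trigger)
            break;
        cgrp = cgrp->parent;
    } while (cgrp);
}

/* Only a css_set's first and last task affect nr_populated. */
static void css_set_add_task(struct css_set *cset)
{
    if (cset->nr_tasks++ == 0)
        update_populated(cset->cgrp, 1);
}

static void css_set_del_task(struct css_set *cset)
{
    if (--cset->nr_tasks == 0)
        update_populated(cset->cgrp, 0);
}
```

The update is keyed on a css_set's task count rather than its reference count. As a result, a css_set that is still referenced (for example, by a zombie) but has no tasks no longer keeps its cgroup looking populated.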

Committed by Tejun Heo
cgroup_task_migrate() no longer uses @old_cgrp. Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>

- 26 Sep, 2015: 1 commit

Committed by Tejun Heo
49d1dc4b ("cgroup: implement static_key based cgroup_subsys_enabled() and cgroup_subsys_on_dfl()") converted the cgroup enabled test to use static_key; however, cgroup_disable() is called before the static_key subsystem itself is initialized, which leads to the following warning when the "cgroup_disable=" parameter is specified.

  WARNING: CPU: 0 PID: 0 at kernel/jump_label.c:99 static_key_slow_dec+0x44/0x60()
  static_key_slow_dec used before call to jump_label_init
  ...
  Call Trace:
   [<ffffffff813b18c2>] dump_stack+0x44/0x62
   [<ffffffff8108dd52>] warn_slowpath_common+0x82/0xc0
   [<ffffffff8108ddec>] warn_slowpath_fmt+0x5c/0x80
   [<ffffffff8119c054>] static_key_slow_dec+0x44/0x60
   [<ffffffff81d826b6>] cgroup_disable+0xaf/0xd6
   [<ffffffff81d5f9de>] unknown_bootoption+0x8c/0x194
   [<ffffffff810b0c03>] parse_args+0x273/0x4a0
   [<ffffffff81d5fd67>] start_kernel+0x205/0x4b8
  ...

Fix it by making cgroup_disable() record the subsystems to disable in cgroup_disable_mask and moving the actual application to cgroup_init(), which runs late enough and is where the enabled state is first used.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Andrey Wagin <avagin@gmail.com>
Link: http://lkml.kernel.org/g/CANaxB-yFuS4SA2znSvcKrO9L_CbHciHYW+o9bN8sZJ8eR9FxYA@mail.gmail.com
Fixes: 49d1dc4b
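The fix follows a common "record early, apply later" pattern. The minimal sketch below is plain C. It assumes a hypothetical four-entry subsystem table and uses a plain int array in place of static keys. The boot-parameter handler only sets bits in cgroup_disable_mask, and a later init step turns those bits into disabled state. Apart from cgroup_disable_mask, all names are illustrative.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical subsystem table; the kernel builds its own from cgroup_subsys.h. */
static const char *const subsys_names[] = { "cpuset", "cpu", "memory", "blkio" };
#define NR_SUBSYS (sizeof(subsys_names) / sizeof(subsys_names[0]))

/* Bit i set => subsystem i was named in "cgroup_disable=". */
static unsigned long cgroup_disable_mask;

/* Stand-in for the per-subsystem static keys: 1 = enabled. */
static int subsys_enabled[NR_SUBSYS] = { 1, 1, 1, 1 };

/* Early boot-parameter handler: only records names, touches no keys. */
static void parse_cgroup_disable(const char *arg)
{
    char buf[128];

    strncpy(buf, arg, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';
    for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ","))
        for (size_t i = 0; i < NR_SUBSYS; i++)
            if (strcmp(tok, subsys_names[i]) == 0)
                cgroup_disable_mask |= 1UL << i;
}

/* Later init step, once the key machinery is usable: apply the mask. */
static void apply_cgroup_disable(void)
{
    for (size_t i = 0; i < NR_SUBSYS; i++)
        if (cgroup_disable_mask & (1UL << i))
            subsys_enabled[i] = 0;
}
```

For example, parse_cgroup_disable("memory,blkio") only records bits 2 and 3. Nothing is actually disabled until apply_cgroup_disable() runs.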

- 23 Sep, 2015: 4 commits

Committed by Tejun Heo
cgroup_update_dfl_csses() is responsible for migrating processes when controllers are enabled or disabled on the default hierarchy. As the css association changes for all the processes in the affected cgroups, this involves migrating multiple processes.

Up until now, it was implemented by migrating process-by-process until the source css_sets are empty; however, this means that if a process fails to migrate after some succeeded before it, the recovery is very tricky. This was considered okay as subsystems weren't allowed to reject process migration on the default hierarchy; unfortunately, enforcing this policy turned out to be problematic for certain types of resources (realtime slices, for now).

As such, the default hierarchy is going to allow restricted failures during migration. To support that, this patch makes cgroup_update_dfl_csses() migrate all target processes atomically rather than one-by-one. The preceding patches made subsystems ready for multi-process migration and factored out taskset operations, making this almost trivial. All tasks of the target processes are put in the same taskset and the migration operations are performed once, which either fails or succeeds for all.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

Committed by Tejun Heo
Currently, cgroup_migrate() implements a large part of the migration logic inline, including building the target taskset and actually migrating the tasks. This patch separates out the following taskset operations.

  CGROUP_TASKSET_INIT()    : taskset initializer
  cgroup_taskset_add()     : add a task to a taskset
  cgroup_taskset_migrate() : migrate a taskset to the destination cgroup

This will be used to implement atomic multi-process migration in cgroup_update_dfl_csses(). This is pure reorganization which doesn't introduce any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
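The all-or-nothing behaviour that these taskset operations enable can be sketched as follows. This is a simplified userspace model that assumes per-subsystem can_attach/cancel_attach-style hooks. The types, TASKSET_INIT, taskset_add() and taskset_migrate() are hypothetical stand-ins for CGROUP_TASKSET_INIT(), cgroup_taskset_add() and cgroup_taskset_migrate(). They do not reproduce the real signatures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_TASKS 16

/* Hypothetical, simplified stand-ins; not the kernel's types. */
struct task { int pid; int cgroup_id; };

struct taskset {
    struct task *tasks[MAX_TASKS];
    int nr;
};

/* Plays the role of CGROUP_TASKSET_INIT(): an empty taskset. */
#define TASKSET_INIT { .nr = 0 }

/* Per-subsystem hooks: can_attach may veto; cancel_attach undoes its work. */
struct subsys_ops {
    int  (*can_attach)(struct taskset *ts, int dst);
    void (*cancel_attach)(struct taskset *ts, int dst);
};

/* Plays the role of cgroup_taskset_add(). */
static int taskset_add(struct taskset *ts, struct task *t)
{
    if (ts->nr >= MAX_TASKS)
        return -ENOSPC;
    ts->tasks[ts->nr++] = t;
    return 0;
}

/* Plays the role of cgroup_taskset_migrate(): all tasks move, or none do. */
static int taskset_migrate(struct taskset *ts, int dst,
                           const struct subsys_ops *ss, int nr_ss)
{
    int i, ret = 0;

    for (i = 0; i < nr_ss; i++) {
        if (ss[i].can_attach) {
            ret = ss[i].can_attach(ts, dst);
            if (ret)
                goto revert;
        }
    }
    /* Commit point: every subsystem agreed, so move the whole set. */
    for (int t = 0; t < ts->nr; t++)
        ts->tasks[t]->cgroup_id = dst;
    return 0;

revert:
    /* Undo only the subsystems whose can_attach already succeeded. */
    while (--i >= 0)
        if (ss[i].cancel_attach)
            ss[i].cancel_attach(ts, dst);
    return ret;
}

/* Example hooks (hypothetical policies). */
static int cancels;
static int accept_all(struct taskset *ts, int dst) { (void)ts; (void)dst; return 0; }
static void count_cancel(struct taskset *ts, int dst) { (void)ts; (void)dst; cancels++; }
static int limit_two(struct taskset *ts, int dst) { (void)dst; return ts->nr > 2 ? -ENOSPC : 0; }
```

If any hook rejects the set, the hooks that already accepted it are cancelled in reverse order and no task changes cgroup. A single failure therefore leaves every process in the batch where it was.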

Committed by Tejun Heo
cgroup_migrate() has the destination cgroup as the first parameter while cgroup_task_migrate() has the destination cset as the last. Another migration function is scheduled to be added, which would make the discrepancy stand out further. Let's reorder cgroup_migrate()'s parameters so that the destination cgroup is the last.

This doesn't cause any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

Committed by Tejun Heo
It wasn't explicitly documented but, when a process is being migrated, cpuset and memcg depend on cgroup_taskset_first() returning the threadgroup leader; however, this approach is somewhat hacky and would no longer work for the planned multi-process migration.

This patch introduces an explicit cgroup_taskset_for_each_leader(), which iterates over only the threadgroup leaders, and uses it to replace the cgroup_taskset_first() usages that access the leader. This prepares both memcg and cpuset for multi-process migration. This patch also updates the documentation for cgroup_taskset_for_each() to clarify the iteration rules and removes comments mentioning task ordering in tasksets.

v2: A previous patch which added a threadgroup leader test was dropped. Patch updated accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
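A leader-only iterator can be written as a filter on top of a plain for-each loop. The sketch below is a userspace approximation that treats a task as its group's leader when pid == tgid. The macro names and structs are hypothetical. The trailing "if (...) ; else" lets the macro be followed by an ordinary loop body, and it does not depend on where the leader sits in the set.

```c
#include <assert.h>

/* Hypothetical simplified task: a thread is its group's leader when pid == tgid. */
struct task { int pid; int tgid; };

struct taskset { struct task *tasks[16]; int nr; };

/* Visit every task in the set; callers must not assume any ordering. */
#define taskset_for_each(task, ts, i) \
    for ((i) = 0; (i) < (ts)->nr && ((task) = (ts)->tasks[(i)], 1); (i)++)

/* Visit only threadgroup leaders, wherever they sit in the set. */
#define taskset_for_each_leader(leader, ts, i)     \
    taskset_for_each(leader, ts, i)                \
        if ((leader)->pid != (leader)->tgid) ;     \
        else

static int count_leaders(struct taskset *ts)
{
    struct task *t;
    int i, n = 0;

    taskset_for_each_leader(t, ts, i)
        n++;
    return n;
}
```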

- 19 Sep, 2015: 7 commits

Committed by Tejun Heo
cgroup core handles creations and removals of cgroup interface files as described by cftypes. There are cases where the handle for a given file instance is necessary, for example, to generate a file modified event. Currently, this is handled by explicitly matching the callback method pointer and storing the file handle manually in cgroup_add_file(). While this simple approach works for cgroup core files, it can't work for controller interface files.

This patch generalizes cgroup interface file handle handling. struct cgroup_file is defined, and each cftype can optionally tell cgroup core to store the file handle by setting ->file_offset. A file handle remains accessible as long as the containing css is accessible.

Both "cgroup.procs" and "cgroup.events" are converted to use the new generic mechanism instead of hooking directly into cgroup_add_file(). Also, cgroup_file_notify(), which takes a struct cgroup_file and generates a file modified event on it, is added and replaces explicit kernfs_notify() invocations.

This generalizes cgroup file handle handling and allows controllers to generate file modified notifications.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
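The ->file_offset idea can be illustrated with offsetof(). The controller says where inside its own state a handle slot lives, and the core fills that slot when it creates the file. This is a simplified sketch in which struct my_css, add_file(), file_notify() and the integer kn_id are hypothetical stand-ins. The real handle wraps a kernfs node, and the real notifier is cgroup_file_notify().

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct cgroup_file: an opaque node id. */
struct cgroup_file { int kn_id; int notify_count; };

/* Hypothetical controller state that embeds a file handle slot. */
struct my_css {
    int some_state;
    struct cgroup_file events_file;
};

struct cftype {
    const char *name;
    size_t file_offset;   /* 0 = don't store the handle */
};

static int next_kn_id = 1;

/* Core side: "create" the file and, if asked, store its handle in the css. */
static void add_file(void *css, const struct cftype *cft)
{
    int kn_id = next_kn_id++;   /* stand-in for creating a kernfs node */

    if (cft->file_offset) {
        struct cgroup_file *cfile = (void *)((char *)css + cft->file_offset);
        cfile->kn_id = kn_id;
    }
}

/* Controller side: signal "file modified" through the stored handle. */
static void file_notify(struct cgroup_file *cfile)
{
    if (cfile->kn_id)
        cfile->notify_count++;  /* stand-in for kernfs_notify() */
}
```

A controller would declare its cftype with .file_offset = offsetof(struct my_css, events_file). An offset of 0 is treated as "don't store", which works in this model because the slot is never the first member.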

Committed by Tejun Heo
The file creation / removal path has always been a bit icky, and the planned notification update requires css during file creation. Restructure as follows.

* cgroup_addrm_files() now takes both @css and @cgrp and is only called directly by other file handling functions.

* cgroup_populate/clear_dir() are replaced with css_populate/clear_dir() taking @css and @cgrp_override. @cgrp_override is used only when files need to be created on / removed from a cgroup which isn't attached to @css, which happens during subsystem rebinds. Subsystem loops are moved to the callers.

* cgroup_add_file() now takes both @css and @cgrp. @css isn't used yet but will be used by the planned notification update.

This patch doesn't cause any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

Committed by Tejun Heo
* Use local variables @scgrp and @dcgrp for @src_root->cgrp and @dst_root->cgrp respectively.

* Use initializers to set @src_root and @css in the inner bind loop.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

Committed by Tejun Heo
After a file creation failure, cgroup_addrm_files() didn't remove the files which had already been created. When cgroup_populate_dir() is the caller, this is fine as the caller performs cleanup; however, for other callers, this may leave unactivated dangling files behind. As kernfs directory removals are recursive, this doesn't lead to a permanent memory leak, but it can, for example, make future attempts to create those files fail.

There's no point in keeping around this sort of subtlety, and it gets in the way of planned updates to file handling. This patch makes cgroup_addrm_files() clean up after itself on failures.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
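The clean-up-on-failure rule amounts to undoing exactly the files that the current call created. Below is a minimal sketch using a hypothetical in-memory directory model. struct dir, add_file(), rm_file() and addrm_files() are illustrative only and are not the kernel's functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MAX_FILES 8

/* Hypothetical directory model: which file slots exist, and where to fail. */
struct dir {
    bool exists[MAX_FILES];
    int fail_at;              /* index whose creation fails; -1 = never */
};

static int add_file(struct dir *d, int idx)
{
    if (idx == d->fail_at)
        return -ENOMEM;       /* simulated creation failure */
    d->exists[idx] = true;
    return 0;
}

static void rm_file(struct dir *d, int idx)
{
    d->exists[idx] = false;
}

/*
 * Add (or remove) files [0, n). If an add fails, remove the files this call
 * already created so no half-populated set is left behind.
 */
static int addrm_files(struct dir *d, int n, bool is_add)
{
    assert(n <= MAX_FILES);
    for (int i = 0; i < n; i++) {
        int ret;

        if (!is_add) {
            rm_file(d, i);
            continue;
        }
        ret = add_file(d, i);
        if (ret) {
            while (--i >= 0)
                rm_file(d, i);
            return ret;
        }
    }
    return 0;
}
```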

Committed by Tejun Heo
Move it upwards so that it's right below cgroup_clear_dir() and the forward declaration is unnecessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

Committed by Tejun Heo
cftype->mode allows controllers to give arbitrary permissions to interface knobs. Except for "cgroup.event_control", the existing uses are spurious.

* Some explicitly specify S_IRUGO | S_IWUSR even though that's the default.

* "cpuset.memory_pressure" specifies S_IRUGO while also setting a write callback which returns -EACCES. All it needs to do is not set a write callback.

"cgroup.event_control" uses cftype->mode to make the file world-writable. It's a misdesigned interface, and we don't want controllers to be tweaking interface file permissions in general. This patch removes cftype->mode and all its spurious uses and implements CFTYPE_WORLD_WRITABLE for "cgroup.event_control", which is marked as compatibility-only.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

Committed by Tejun Heo
memcg already uses "memory.events" for event reporting and other controllers may need event reporting too. Let's standardize on "$SUBSYS.events" interface file for reporting events which don't happen too frequently and thus can share event notification.

"cgroup.populated" is replaced with "populated" field in "cgroup.events" and documentation is updated accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
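From userland, the "populated" field is one "key value" line in "cgroup.events". The small parser below assumes the file contents have already been read into a string. events_lookup() is a hypothetical helper, not a library function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "key value" line in the contents of a flat-keyed file such as
 * cgroup.events (e.g. "populated 1\n"). Returns 0 and stores the value on
 * success, -1 if the key is absent or malformed.
 */
static int events_lookup(const char *contents, const char *key, long *val)
{
    size_t klen = strlen(key);
    const char *line = contents;

    while (line && *line) {
        /* Match the whole key, not just a prefix of a longer key. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ' ')
            return sscanf(line + klen + 1, "%ld", val) == 1 ? 0 : -1;
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

A watcher would typically re-read the file after each file modified event. How that event is consumed (for example, with poll(2) or inotify) is left out here.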

- 18 Sep, 2015: 3 commits

Committed by Tejun Heo
cgroup_on_dfl() tests whether the cgroup's root is the default hierarchy; however, an individual controller is only interested in whether the controller is attached to the default hierarchy and never tests a cgroup which doesn't belong to the hierarchy that the controller is attached to.

This patch replaces cgroup_on_dfl() tests in controllers with faster static_key based cgroup_subsys_on_dfl(). This leaves cgroup core as the only user of cgroup_on_dfl() and the function is moved from the header file to cgroup.c.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>

Committed by Tejun Heo
Replace cgroup_subsys->disabled tests in controllers with cgroup_subsys_enabled(). cgroup_subsys_enabled() requires a literal subsys name as its parameter and thus can't be used by cgroup core, which iterates through controllers. For cgroup core, introduce and use cgroup_ssid_enabled(), which uses the slower static_key_enabled() test and can be indexed by subsys ID.

This leaves cgroup_subsys->disabled unused, so it is removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
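The literal-name restriction comes from token pasting: the macro glues its argument into a variable name, which only works when the name is spelled out in the source. The userspace sketch below uses plain bools in place of static keys, and every name in it is hypothetical. It shows both the pasted form and an ID-indexed table that core code can loop over.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subsystem IDs, standing in for the generated ID enum. */
enum { cpu_cgrp_id, memory_cgrp_id, CGROUP_SUBSYS_COUNT };

/* One flag per subsystem; plain bools stand in for static keys. */
static bool cpu_cgrp_subsys_enabled_key = true;
static bool memory_cgrp_subsys_enabled_key = false;

/* Controller-side test: ss is pasted into an identifier, so it must be literal. */
#define subsys_enabled(ss) (ss##_cgrp_subsys_enabled_key)

/* Core-side test: a table of pointers to the same flags, indexable by ID. */
static bool *const enabled_keys[CGROUP_SUBSYS_COUNT] = {
    [cpu_cgrp_id]    = &cpu_cgrp_subsys_enabled_key,
    [memory_cgrp_id] = &memory_cgrp_subsys_enabled_key,
};

static bool ssid_enabled(int ssid)
{
    return *enabled_keys[ssid];
}
```

subsys_enabled(memory) compiles to a direct variable reference. ssid_enabled(i) works inside a loop over IDs, at the cost of an extra indirection.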

Committed by Tejun Heo
Whether a subsys is enabled and attached to the default hierarchy seldom changes and may be tested in the hot paths. This patch implements static_key based cgroup_subsys_enabled() and cgroup_subsys_on_dfl() tests.

The following patches will update the users and remove duplicate mechanisms.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

- 17 Sep, 2015: 2 commits

Committed by Tejun Heo
Note: This commit was originally committed as b5ba75b5 but got reverted by f9f9e7b7 due to the performance regression from the percpu_rwsem write down/up operations added to cgroup task migration path. percpu_rwsem changes which alleviate the performance issue are pending for v4.4-rc1 merge window. Re-apply.

Now that threadgroup locking is made global, code paths around it can be simplified.

* lock-verify-unlock-retry dancing removed from __cgroup_procs_write().

* Race protection against de_thread() removed from cgroup_update_dfl_csses().

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com

Committed by Tejun Heo
Note: This commit was originally committed as d59cfc09 but got reverted by 0c986253 due to the performance regression from the percpu_rwsem write down/up operations added to the cgroup task migration path. percpu_rwsem changes which alleviate the performance issue are pending for the v4.4-rc1 merge window. Re-apply.

The cgroup side of threadgroup locking uses signal_struct->group_rwsem to synchronize against threadgroup changes. This per-process rwsem adds a small overhead to thread creation, exit and exec paths. It also forces cgroup code paths to do a lock-verify-unlock-retry dance in a couple of places, and it makes it impossible to atomically perform operations across multiple processes.

This patch replaces signal_struct->group_rwsem with a global percpu_rwsem, cgroup_threadgroup_rwsem, which is cheaper on the reader side and contained in cgroups proper. This patch converts one-to-one.

This does make the writer side heavier and lowers the granularity. However, cgroup process migration is a fairly cold path, and we do want to optimize thread operations over it. Also, cgroup migration operations don't take enough time for the lower granularity to matter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>

- 16 Sep, 2015: 2 commits

Committed by Tejun Heo
This reverts commit d59cfc09.

d59cfc09 ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem") and b5ba75b5 ("cgroup: simplify threadgroup locking") changed how cgroup synchronizes against task fork and exits so that it uses global percpu_rwsem instead of per-process rwsem; unfortunately, the write [un]lock paths of percpu_rwsem always involve synchronize_rcu_expedited() which turned out to be too expensive.

Improvements for percpu_rwsem are scheduled to be merged in the coming v4.4-rc1 merge window which alleviates this issue. For now, revert the two commits to restore per-process rwsem. They will be re-applied for the v4.4-rc1 merge window.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org # v4.2+

Committed by Tejun Heo
This reverts commit b5ba75b5.

d59cfc09 ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem") and b5ba75b5 ("cgroup: simplify threadgroup locking") changed how cgroup synchronizes against task fork and exits so that it uses global percpu_rwsem instead of per-process rwsem; unfortunately, the write [un]lock paths of percpu_rwsem always involve synchronize_rcu_expedited() which turned out to be too expensive.

Improvements for percpu_rwsem are scheduled to be merged in the coming v4.4-rc1 merge window which alleviates this issue. For now, revert the two commits to restore per-process rwsem. They will be re-applied for the v4.4-rc1 merge window.

Signed-off-by: Tejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org # v4.2+

- 09 Sep, 2015: 1 commit

Committed by Kees Cook
When seq_show_option (commit a068acf2: "fs: create and use seq_show_option for escaping") was merged, it did not correctly account for cgroup's addition of legacy_name (commit 3e1d2eed: "cgroup: introduce cgroup_subsys->legacy_name"). This fixes the reported name.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

- 05 Sep, 2015: 1 commit

Committed by Kees Cook
Many file systems that implement the show_options hook fail to correctly escape their output which could lead to unescaped characters (e.g. new lines) leaking into /proc/mounts and /proc/[pid]/mountinfo files. This could lead to confusion, spoofed entries (resulting in things like systemd issuing false d-bus "mount" notifications), and who knows what else. This looks like it would only be the root user stepping on themselves, but it's possible weird things could happen in containers or in other situations with delegated mount privileges.

Here's an example using overlay with setuid fusermount trusting the contents of /proc/mounts (via the /etc/mtab symlink). Imagine the use of "sudo" is something more sneaky:

  $ BASE="ovl"
  $ MNT="$BASE/mnt"
  $ LOW="$BASE/lower"
  $ UP="$BASE/upper"
  $ WORK="$BASE/work/ 0 0
  none /proc fuse.pwn user_id=1000"
  $ mkdir -p "$LOW" "$UP" "$WORK"
  $ sudo mount -t overlay -o "lowerdir=$LOW,upperdir=$UP,workdir=$WORK" none /mnt
  $ cat /proc/mounts
  none /root/ovl/mnt overlay rw,relatime,lowerdir=ovl/lower,upperdir=ovl/upper,workdir=ovl/work/ 0 0
  none /proc fuse.pwn user_id=1000 0 0
  $ fusermount -u /proc
  $ cat /proc/mounts
  cat: /proc/mounts: No such file or directory

This fixes the problem by adding new seq_show_option and seq_show_option_n helpers, and updating the vulnerable show_option handlers to use them as needed. Some, like SELinux, need to be open coded due to unusual existing escape mechanisms.

[akpm@linux-foundation.org: add lost chunk, per Kees]
[keescook@chromium.org: seq_show_option should be using const parameters]
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Jan Kara <jack@suse.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Cc: J. R. Okajima <hooanon05g@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
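The escaping idea can be sketched in userspace C. Characters that could break the field or line structure of /proc/mounts are written as a backslash followed by three octal digits. The two character sets below are intended to mirror the ones seq_show_option() passes to seq_escape() for the option name and value; treat the exact sets as an assumption. escape_octal() and show_option() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy src into dst, writing each character that appears in esc as a
 * backslash plus three octal digits (' ' -> "\040", '\n' -> "\012").
 * Returns the output length, or -1 if dst is too small.
 */
static int escape_octal(char *dst, size_t size, const char *src, const char *esc)
{
    size_t n = 0;

    if (size == 0)
        return -1;
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;

        if (strchr(esc, c)) {
            if (n + 4 >= size)
                return -1;
            snprintf(dst + n, size - n, "\\%03o", (unsigned int)c);
            n += 4;
        } else {
            if (n + 1 >= size)
                return -1;
            dst[n++] = (char)c;
        }
    }
    dst[n] = '\0';
    return (int)n;
}

/* Emit ",name[=value]" with separate escape sets for name and value. */
static int show_option(char *dst, size_t size, const char *name, const char *value)
{
    int a, b;

    if (size < 2)
        return -1;
    dst[0] = ',';
    a = escape_octal(dst + 1, size - 1, name, ",= \t\n\\");
    if (a < 0)
        return -1;
    if (!value)
        return 1 + a;
    if ((size_t)a + 2 >= size)
        return -1;
    dst[1 + a] = '=';
    b = escape_octal(dst + 2 + a, size - 2 - a, value, ", \t\n\\");
    return b < 0 ? -1 : 2 + a + b;
}
```

With these rules, the workdir value from the example becomes ovl/work/\0400\0400\012none. The injected newline can therefore no longer start a fake /proc/mounts entry.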

- 19 Aug, 2015: 2 commits

Committed by Tejun Heo
This allows cgroup subsystems to use a different name on the unified hierarchy. cgroup_subsys->name is used on the unified hierarchy, ->legacy_name elsewhere. If ->legacy_name is not explicitly set, it's automatically set to ->name and the userland visible behavior remains unchanged.

v2: Make parse_cgroupfs_options() only consider ->legacy_name as mount options are used only on legacy hierarchies. Suggested by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org

Committed by Tejun Heo
It doesn't make sense to print subsystems on mount option or /proc/PID/cgroup for the default hierarchy.

* cgroup.controllers file at the root of the default hierarchy lists the currently attached controllers.

* The default hierarchy is catch-all for unmounted subsystems.

* The default hierarchy doesn't accept any mount options.

Suppress subsystem printing on mount options and /proc/PID/cgroup for the default hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org

- 06 Aug, 2015: 1 commit

Committed by Tejun Heo
While cgroup subsystems can't be modules, blkcg supports dynamically loadable policies which interact with cgroup core. Export cgrp_dfl_root so that cgroup_on_dfl() can be used in those modules.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

- 03 Aug, 2015: 1 commit

Committed by Vladimir Davydov
It does not make much sense to call idr_preload with the same gfp mask as the following idr_alloc, but this is what we do in cgroup_idr_alloc. This patch fixes the idr_preload usage by making cgroup_idr_alloc call idr_alloc without __GFP_WAIT. Since it is now safe to call cgroup_idr_alloc with GFP_KERNEL, the patch also fixes all its callers appropriately.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

- 23 Jul, 2015: 1 commit

Committed by Paul E. McKenney
This commit renames rcu_lockdep_assert() to RCU_LOCKDEP_WARN() for consistency with the WARN() series of macros. This also requires inverting the sense of the conditional, which this commit also does.

Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>

- 15 Jul, 2015: 1 commit

Committed by Aleksa Sarai
Add a new cgroup subsystem callback, can_fork, that conditionally states whether the fork is accepted or rejected by a cgroup policy. In addition, add a cancel_fork callback so that if an error occurs later in the forking process, any state modified by can_fork can be reverted.

Allow for a private opaque pointer to be passed from cgroup_can_fork to cgroup_post_fork, allowing for the fork state to be stored by each subsystem separately.

Also add a tagging system for cgroup_subsys.h to allow for CGROUP_<TAG> enumerations to be defined and used. In addition, explicitly add a CGROUP_CANFORK_COUNT macro to make arrays easier to define.

This is in preparation for implementing the pids cgroup subsystem.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
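The can_fork / cancel_fork contract works in three steps. Each subsystem is asked in turn and may stash per-subsystem private data. If a later subsystem refuses, the ones that already agreed are unwound. A separate cancel pass covers a failure later in fork. This is a simplified model: the fork_ops struct, can_fork_all(), cancel_fork_all() and the pids-style counter are hypothetical and do not reproduce the kernel's signatures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NR_SS 3  /* hypothetical number of subsystems with fork hooks */

/* Hypothetical per-subsystem fork hooks; priv is that subsystem's opaque slot. */
struct fork_ops {
    int  (*can_fork)(void **priv);
    void (*cancel_fork)(void *priv);
};

/* Ask every subsystem; on the first refusal, cancel the ones already asked. */
static int can_fork_all(const struct fork_ops *ops, int n, void *ss_priv[])
{
    int i, ret = 0;

    for (i = 0; i < n; i++) {
        if (!ops[i].can_fork)
            continue;
        ret = ops[i].can_fork(&ss_priv[i]);
        if (ret)
            goto out_revert;
    }
    return 0;

out_revert:
    while (--i >= 0)
        if (ops[i].cancel_fork)
            ops[i].cancel_fork(ss_priv[i]);
    return ret;
}

/* A later step of fork failed: let every subsystem undo its can_fork work. */
static void cancel_fork_all(const struct fork_ops *ops, int n, void *ss_priv[])
{
    for (int i = 0; i < n; i++)
        if (ops[i].cancel_fork)
            ops[i].cancel_fork(ss_priv[i]);
}

/* Example policies (hypothetical). */
static int nr_pids, pids_max = 2;

static int pids_can_fork(void **priv)
{
    if (nr_pids >= pids_max)
        return -EAGAIN;
    nr_pids++;
    *priv = &nr_pids;   /* remember what to uncharge */
    return 0;
}

static void pids_cancel_fork(void *priv) { (*(int *)priv)--; }

static int deny_fork(void **priv) { (void)priv; return -EPERM; }
```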

- 01 Jul, 2015: 1 commit

Committed by Eric W. Biederman
This allows for better documentation in the code and it allows for a simpler and fully correct version of fs_fully_visible to be written.

The mount points converted and their filesystems are:

  /sys/hypervisor/s390/       s390_hypfs
  /sys/kernel/config/         configfs
  /sys/kernel/debug/          debugfs
  /sys/firmware/efi/efivars/  efivarfs
  /sys/fs/fuse/connections/   fusectl
  /sys/fs/pstore/             pstore
  /sys/kernel/tracing/        tracefs
  /sys/fs/cgroup/             cgroup
  /sys/kernel/security/       securityfs
  /sys/fs/selinux/            selinuxfs
  /sys/fs/smackfs/            smackfs

Cc: stable@vger.kernel.org
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

- 19 Jun, 2015: 2 commits

Committed by Tejun Heo
On traditional hierarchies, if a task has write access to the "tasks" or "cgroup.procs" file of a cgroup and its euid agrees with the target, it can move the target to the cgroup; however, consider the following scenario. The owner of each cgroup is in the parentheses.

  R (root) - 0 (root) - 00 (user1) - 000 (user1)
  |                                \ 001 (user1)
  \ 1 (root) - 10 (user1)

The subtrees of 00 and 10 are delegated to user1. While both subtrees may belong to the same user, it is clear that the two subtrees are to be isolated, because they're under completely separate resource limits imposed by 0 and 1, respectively. Note that 0 and 1 aren't strictly necessary but are added to ease illustrating the issue.

If user1 is allowed to move processes between the two subtrees, user1 can circumvent the intention of the hierarchy. That intention is to keep a given group of processes under a subtree with certain resource restrictions while delegating management of the subtree.

This happens because the migration permission check doesn't consider the hierarchical nature of cgroups. To fix the issue, this patch adds an extra permission requirement when userland tries to migrate a process in the default hierarchy: the issuing task must have write access to the "cgroup.procs" file of the common ancestor of the source and destination cgroups, in addition to the destination's. Conceptually, the issuer must be able to move the target process from the source cgroup to the common ancestor of the source and destination cgroups and then to the destination. As long as delegation is done in a proper top-down way, this guarantees that a delegatee can't smuggle processes across disjoint delegation domains.

The next patch will add documentation on the delegation model on the default hierarchy.

v2: Fixed missing !ret test. Spotted by Li Zefan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Li Zefan <lizefan@huawei.com>
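The extra requirement can be sketched as "find the common ancestor, then check write access there as well as at the destination". In this simplified model, write access is reduced to matching an owner string, and each node carries its depth. struct cg, common_ancestor() and may_migrate() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical cgroup node: parent link, depth from the root, owning user. */
struct cg {
    const char *name;
    struct cg *parent;
    int level;
    const char *owner;
};

/* Lift the deeper node to the same depth, then walk both up until they meet. */
static struct cg *common_ancestor(struct cg *a, struct cg *b)
{
    while (a->level > b->level)
        a = a->parent;
    while (b->level > a->level)
        b = b->parent;
    while (a != b) {
        a = a->parent;
        b = b->parent;
    }
    return a;
}

/* Simplified stand-in for write access to a cgroup's "cgroup.procs". */
static bool may_write_procs(const struct cg *cg, const char *user)
{
    return strcmp(cg->owner, user) == 0;
}

/* Allowed only if the issuer can write both dst and the common ancestor. */
static bool may_migrate(struct cg *src, struct cg *dst, const char *user)
{
    return may_write_procs(dst, user) &&
           may_write_procs(common_ancestor(src, dst), user);
}
```

Applied to the tree above, user1 may move a process from 000 to 001, because the common ancestor 00 belongs to user1. user1 may not move a process from 000 to 10, because their common ancestor is R, which is owned by root.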

Committed by Tejun Heo
Separate out the task / process migration permission check from __cgroup_procs_write() into cgroup_procs_write_permission().

* The permission check is moved right above the actual migration and is no longer performed while holding rcu_read_lock(). cgroup_procs_write_permission() uses get_task_cred() / put_cred() instead of __task_cred(). Also, !root trying to migrate kthreadd or PF_NO_SETAFFINITY tasks will now fail with -EINVAL rather than -EACCES, which should be fine.

* The same permission check is now performed even when moving self by specifying 0 as pid. This always succeeds, so there's no functional difference. We'll add more permission checks later, and the benefits of keeping both cases consistent outweigh the minute overhead of doing perm checks in the pid 0 case.

Signed-off-by: Tejun Heo <tj@kernel.org>

- 10 Jun, 2015: 1 commit

Committed by Aleksa Sarai
Fix the fact that @ssid is uninitialised in the case where CGROUP_SUBSYS_COUNT = 0 by setting ssid to 0.

Fixes: cb4a3167 ("cgroup: use bitmask to filter for_each_subsys")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

- 08 Jun, 2015: 2 commits

Committed by Aleksa Sarai
Replace the explicit checking against ss_masks inside a for_each_subsys block with for_each_subsys_which(..., ss_mask), to take advantage of the more readable (and more efficient) macro.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>

Committed by Aleksa Sarai
Add a new macro, for_each_subsys_which, that allows all enabled cgroup subsystems to be filtered by a bitmask, such that mask & (1 << ssid) determines whether the subsystem is to be processed in the loop body (where ssid is the unique id of the subsystem).

Also replace need_forkexit_callback with two separate bitmasks, one for each callback, to make (ss->{fork,exit}) checks unnecessary.

tj: add a short comment for "if (!CGROUP_SUBSYS_COUNT)".

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
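A bitmask-filtered iterator like for_each_subsys_which can be approximated with a plain loop plus the "if (...) ; else" trick, so that the macro accepts a normal loop body. This sketch assumes a fixed, hypothetical CGROUP_SUBSYS_COUNT of 4. It tests bits one by one rather than using the kernel's bit-search helpers, and have_fork_callback is an illustrative mask.

```c
#include <assert.h>

#define CGROUP_SUBSYS_COUNT 4  /* hypothetical number of built-in subsystems */

/*
 * Iterate over subsystem IDs whose bit is set in ss_mask. The trailing
 * "if (...) ; else" lets the macro be followed by an ordinary loop body.
 */
#define for_each_subsys_which(ssid, ss_mask)                        \
    for ((ssid) = 0; (ssid) < CGROUP_SUBSYS_COUNT; (ssid)++)        \
        if (!((ss_mask) & (1UL << (ssid)))) ;                       \
        else

/* Hypothetical mask: subsystems 0 and 2 have a fork callback. */
static unsigned long have_fork_callback = 0x5UL;

static int count_fork_callbacks(void)
{
    int ssid, n = 0;

    for_each_subsys_which(ssid, have_fork_callback)
        n++;
    return n;
}
```

In this form the loop assigns ssid = 0 before testing the bound, so ssid ends up initialised even when the subsystem count is 0.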

- 27 May, 2015: 3 commits

Committed by Tejun Heo
Now that threadgroup locking is made global, code paths around it can be simplified.

* lock-verify-unlock-retry dancing removed from __cgroup_procs_write().

* Race protection against de_thread() removed from cgroup_update_dfl_csses().

Signed-off-by: Tejun Heo <tj@kernel.org>

Committed by Tejun Heo
The cgroup side of threadgroup locking uses signal_struct->group_rwsem to synchronize against threadgroup changes. This per-process rwsem adds a small overhead to thread creation, exit and exec paths. It also forces cgroup code paths to do a lock-verify-unlock-retry dance in a couple of places, and it makes it impossible to atomically perform operations across multiple processes.

This patch replaces signal_struct->group_rwsem with a global percpu_rwsem, cgroup_threadgroup_rwsem, which is cheaper on the reader side and contained in cgroups proper. This patch converts one-to-one.

This does make the writer side heavier and lowers the granularity. However, cgroup process migration is a fairly cold path, and we do want to optimize thread operations over it. Also, cgroup migration operations don't take enough time for the lower granularity to matter.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>

Committed by Tejun Heo
threadgroup_change_begin/end() mark the beginning and end of threadgroup-modifying operations. Code paths which require a threadgroup to stay stable across blocking operations can then synchronize against those sections using threadgroup_lock/unlock().

It's currently implemented as a general mechanism in sched.h using a per-signal_struct rwsem; however, this never grew non-cgroup use cases and becomes a noop if !CONFIG_CGROUPS. It turns out that cgroups is going to be better served with a different synchronization scheme, and it is a bit silly to keep cgroups-specific details as a general mechanism.

What's general here is identifying the places where threadgroups are modified. This patch restructures threadgroup locking so that threadgroup_change_begin/end() become a place where subsystems which need to synchronize against threadgroup changes can hook into. cgroup_threadgroup_change_begin/end(), which operate on the per-signal_struct rwsem, are created, and threadgroup_lock/unlock() are moved to cgroup.c and made static.

This is pure reorganization which doesn't cause any functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>

- 19 May, 2015: 1 commit

Committed by Aleksa Sarai
Switch the type of all internal cgroup masks to (unsigned long), which is the correct type for bitmasks. This is in preparation for the for_each_subsys_which patch.

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

- 23 Apr, 2015: 1 commit

Committed by Chen Hanxiao
s/effctive/effective
s/hierarhcy/hierarchy
s/shoulid/should

Signed-off-by: Chen Hanxiao <chenhanxiao@cn.fujitsu.com>
Signed-off-by: Tejun Heo <tj@kernel.org>