diff --git a/Documentation/cgroups/unified-hierarchy.txt b/Documentation/cgroups/unified-hierarchy.txt new file mode 100644 index 0000000000000000000000000000000000000000..324b182e6000823eda8fa22924173ee957c32647 --- /dev/null +++ b/Documentation/cgroups/unified-hierarchy.txt @@ -0,0 +1,359 @@ + +Cgroup unified hierarchy + +April, 2014 Tejun Heo + +This document describes the changes made by unified hierarchy and +their rationales. It will eventually be merged into the main cgroup +documentation. + +CONTENTS + +1. Background +2. Basic Operation + 2-1. Mounting + 2-2. cgroup.subtree_control + 2-3. cgroup.controllers +3. Structural Constraints + 3-1. Top-down + 3-2. No internal tasks +4. Other Changes + 4-1. [Un]populated Notification + 4-2. Other Core Changes + 4-3. Per-Controller Changes + 4-3-1. blkio + 4-3-2. cpuset + 4-3-3. memory +5. Planned Changes + 5-1. CAP for resource control + + +1. Background + +cgroup allows an arbitrary number of hierarchies and each hierarchy +can host any number of controllers. While this seems to provide a +high level of flexibility, it isn't quite useful in practice. + +For example, as there is only one instance of each controller, utility +type controllers such as freezer which can be useful in all +hierarchies can only be used in one. The issue is exacerbated by the +fact that controllers can't be moved around once hierarchies are +populated. Another issue is that all controllers bound to a hierarchy +are forced to have exactly the same view of the hierarchy. It isn't +possible to vary the granularity depending on the specific controller. + +In practice, these issues heavily limit which controllers can be put +on the same hierarchy and most configurations resort to putting each +controller on its own hierarchy. Only closely related ones, such as +the cpu and cpuacct controllers, make sense to put on the same +hierarchy. This often means that userland ends up managing multiple +similar hierarchies repeating the same steps on each hierarchy +whenever a hierarchy management operation is necessary. + +Unfortunately, support for multiple hierarchies comes at a steep cost. +Internal implementation in cgroup core proper is dazzlingly +complicated but more importantly the support for multiple hierarchies +restricts how cgroup is used in general and what controllers can do. + +There's no limit on how many hierarchies there may be, which means +that a task's cgroup membership can't be described in finite length. +The key may contain any varying number of entries and is unlimited in +length, which makes it highly awkward to handle and leads to addition +of controllers which exist only to identify membership, which in turn +exacerbates the original problem. + +Also, as a controller can't have any expectation regarding what shape +of hierarchies other controllers would be on, each controller has to +assume that all other controllers are operating on completely +orthogonal hierarchies. This makes it impossible, or at least very +cumbersome, for controllers to cooperate with each other. + +In most use cases, putting controllers on hierarchies which are +completely orthogonal to each other isn't necessary. What usually is +called for is the ability to have differing levels of granularity +depending on the specific controller. In other words, hierarchy may +be collapsed from leaf towards root when viewed from specific +controllers. For example, a given configuration might not care about +how memory is distributed beyond a certain level while still wanting +to control how CPU cycles are distributed. + +Unified hierarchy is the next version of cgroup interface. It aims to +address the aforementioned issues by having more structure while +retaining enough flexibility for most use cases. Various other +general and controller-specific interface issues are also addressed in +the process. + + +2. Basic Operation + +2-1. Mounting + +Currently, unified hierarchy can be mounted with the following mount +command. Note that this is still under development and scheduled to +change soon. + + mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT + +All controllers which are not bound to other hierarchies are +automatically bound to unified hierarchy and show up at the root of +it. Controllers which are enabled only in the root of unified +hierarchy can be bound to other hierarchies at any time. This allows +mixing unified hierarchy with the traditional multiple hierarchies in +a fully backward compatible way. + + +2-2. cgroup.subtree_control + +All cgroups on unified hierarchy have a "cgroup.subtree_control" file +which governs which controllers are enabled on the children of the +cgroup. Let's assume a hierarchy like the following. + + root - A - B - C + \ D + +root's "cgroup.subtree_control" file determines which controllers are +enabled on A. A's on B. B's on C and D. This coincides with the +fact that controllers on the immediate sub-level are used to +distribute the resources of the parent. In fact, it's natural to +assume that resource control knobs of a child belong to its parent. +Enabling a controller in a "cgroup.subtree_control" file declares that +distribution of the respective resources of the cgroup will be +controlled. Note that this means that controller enable states are +shared among siblings. + +When read, the file contains a space-separated list of currently +enabled controllers. A write to the file should contain a +space-separated list of controllers with '+' or '-' prefixed (without +the quotes). Controllers prefixed with '+' are enabled and '-' +disabled. If a controller is listed multiple times, the last entry +wins. The specific operations are executed atomically - either all +succeed or fail. + + +2-3. cgroup.controllers + +Read-only "cgroup.controllers" file contains a space-separated list of +controllers which can be enabled in the cgroup's +"cgroup.subtree_control" file. + +In the root cgroup, this lists controllers which are not bound to +other hierarchies and the content changes as controllers are bound to +and unbound from other hierarchies. + +In non-root cgroups, the content of this file equals that of the +parent's "cgroup.subtree_control" file as only controllers enabled +from the parent can be used in its children. + + +3. Structural Constraints + +3-1. Top-down + +As it doesn't make sense to nest control of an uncontrolled resource, +all non-root "cgroup.subtree_control" files can only contain +controllers which are enabled in the parent's "cgroup.subtree_control" +file. A controller can be enabled only if the parent has the +controller enabled and a controller can't be disabled if one or more +children have it enabled. + + +3-2. No internal tasks + +One long-standing issue that cgroup faces is the competition between +tasks belonging to the parent cgroup and its children cgroups. This +is inherently nasty as two different types of entities compete and +there is no agreed-upon obvious way to handle it. Different +controllers are doing different things. + +The cpu controller considers tasks and cgroups as equivalents and maps +nice levels to cgroup weights. This works for some cases but falls +flat when children should be allocated specific ratios of CPU cycles +and the number of internal tasks fluctuates - the ratios constantly +change as the number of competing entities fluctuates. There also are +other issues. The mapping from nice level to weight isn't obvious or +universal, and there are various other knobs which simply aren't +available for tasks. + +The blkio controller implicitly creates a hidden leaf node for each +cgroup to host the tasks. The hidden leaf has its own copies of all +the knobs with "leaf_" prefixed. While this allows equivalent control +over internal tasks, it's with serious drawbacks. It always adds an +extra layer of nesting which may not be necessary, makes the interface +messy and significantly complicates the implementation. + +The memory controller currently doesn't have a way to control what +happens between internal tasks and child cgroups and the behavior is +not clearly defined. There have been attempts to add ad-hoc behaviors +and knobs to tailor the behavior to specific workloads. Continuing +this direction will lead to problems which will be extremely difficult +to resolve in the long term. + +Multiple controllers struggle with internal tasks and came up with +different ways to deal with it; unfortunately, all the approaches in +use now are severely flawed and, furthermore, the widely different +behaviors make cgroup as whole highly inconsistent. + +It is clear that this is something which needs to be addressed from +cgroup core proper in a uniform way so that controllers don't need to +worry about it and cgroup as a whole shows a consistent and logical +behavior. To achieve that, unified hierarchy enforces the following +structural constraint: + + Except for the root, only cgroups which don't contain any task may + have controllers enabled in their "cgroup.subtree_control" files. + +Combined with other properties, this guarantees that, when a +controller is looking at the part of the hierarchy which has it +enabled, tasks are always only on the leaves. This rules out +situations where child cgroups compete against internal tasks of the +parent. + +There are two things to note. Firstly, the root cgroup is exempt from +the restriction. Root contains tasks and anonymous resource +consumption which can't be associated with any other cgroup and +requires special treatment from most controllers. How resource +consumption in the root cgroup is governed is up to each controller. + +Secondly, the restriction doesn't take effect if there is no enabled +controller in the cgroup's "cgroup.subtree_control" file. This is +important as otherwise it wouldn't be possible to create children of a +populated cgroup. To control resource distribution of a cgroup, the +cgroup must create children and transfer all its tasks to the children +before enabling controllers in its "cgroup.subtree_control" file. + + +4. Other Changes + +4-1. [Un]populated Notification + +cgroup users often need a way to determine when a cgroup's +subhierarchy becomes empty so that it can be cleaned up. cgroup +currently provides release_agent for it; unfortunately, this mechanism +is riddled with issues. + +- It delivers events by forking and execing a userland binary + specified as the release_agent. This is a long deprecated method of + notification delivery. It's extremely heavy, slow and cumbersome to + integrate with larger infrastructure. + +- There is single monitoring point at the root. There's no way to + delegate management of a subtree. + +- The event isn't recursive. It triggers when a cgroup doesn't have + any tasks or child cgroups. Events for internal nodes trigger only + after all children are removed. This again makes it impossible to + delegate management of a subtree. + +- Events are filtered from the kernel side. A "notify_on_release" + file is used to subscribe to or suppress release events. This is + unnecessarily complicated and probably done this way because event + delivery itself was expensive. + +Unified hierarchy implements an interface file "cgroup.populated" +which can be used to monitor whether the cgroup's subhierarchy has +tasks in it or not. Its value is 0 if there is no task in the cgroup +and its descendants; otherwise, 1. poll and [id]notify events are +triggered when the value changes. + +This is significantly lighter and simpler and trivially allows +delegating management of subhierarchy - subhierarchy monitoring can +block further propagation simply by putting itself or another process +in the subhierarchy and monitor events that it's interested in from +there without interfering with monitoring higher in the tree. + +In unified hierarchy, the release_agent mechanism is no longer +supported and the interface files "release_agent" and +"notify_on_release" do not exist. + + +4-2. Other Core Changes + +- None of the mount options is allowed. + +- remount is disallowed. + +- rename(2) is disallowed. + +- The "tasks" file is removed. Everything should at process + granularity. Use the "cgroup.procs" file instead. + +- The "cgroup.procs" file is not sorted. pids will be unique unless + they got recycled in-between reads. + +- The "cgroup.clone_children" file is removed. + + +4-3. Per-Controller Changes + +4-3-1. blkio + +- blk-throttle becomes properly hierarchical. + + +4-3-2. cpuset + +- Tasks are kept in empty cpusets after hotplug and take on the masks + of the nearest non-empty ancestor, instead of being moved to it. + +- A task can be moved into an empty cpuset, and again it takes on the + masks of the nearest non-empty ancestor. + + +4-3-3. memory + +- use_hierarchy is on by default and the cgroup file for the flag is + not created. + + +5. Planned Changes + +5-1. CAP for resource control + +Unified hierarchy will require one of the capabilities(7), which is +yet to be decided, for all resource control related knobs. Process +organization operations - creation of sub-cgroups and migration of +processes in sub-hierarchies may be delegated by changing the +ownership and/or permissions on the cgroup directory and +"cgroup.procs" interface file; however, all operations which affect +resource control - writes to a "cgroup.subtree_control" file or any +controller-specific knobs - will require an explicit CAP privilege. + +This, in part, is to prevent the cgroup interface from being +inadvertently promoted to programmable API used by non-privileged +binaries. cgroup exposes various aspects of the system in ways which +aren't properly abstracted for direct consumption by regular programs. +This is an administration interface much closer to sysctl knobs than +system calls. Even the basic access model, being filesystem path +based, isn't suitable for direct consumption. There's no way to +access "my cgroup" in a race-free way or make multiple operations +atomic against migration to another cgroup. + +Another aspect is that, for better or for worse, the cgroup interface +goes through far less scrutiny than regular interfaces for +unprivileged userland. The upside is that cgroup is able to expose +useful features which may not be suitable for general consumption in a +reasonable time frame. It provides a relatively short path between +internal details and userland-visible interface. Of course, this +shortcut comes with high risk. We go through what we go through for +general kernel APIs for good reasons. It may end up leaking internal +details in a way which can exert significant pain by locking the +kernel into a contract that can't be maintained in a reasonable +manner. + +Also, due to the specific nature, cgroup and its controllers don't +tend to attract attention from a wide scope of developers. cgroup's +short history is already fraught with severely mis-designed +interfaces, unnecessary commitments to and exposing of internal +details, broken and dangerous implementations of various features. + +Keeping cgroup as an administration interface is both advantageous for +its role and imperative given its nature. Some of the cgroup features +may make sense for unprivileged access. If deemed justified, those +must be further abstracted and implemented as a different interface, +be it a system call or process-private filesystem, and survive through +the scrutiny that any interface for general consumption is required to +go through. + +Requiring CAP is not a complete solution but should serve as a +significant deterrent against spraying cgroup usages in non-privileged +programs.