1. 17 Dec, 2021 1 commit
  2. 14 Sep, 2021 1 commit
    • bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode · 8520e224
      Committed by Daniel Borkmann
      Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
      Back in the day, commit bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
      embedded per-socket cgroup information into sock->sk_cgrp_data and in order
      to save 8 bytes in struct sock made both mutually exclusive, that is, when
      cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
      falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).
      
      The assumption made was "there is no reason to mix the two and this is in line
      with how legacy and v2 compatibility is handled" as stated in bd1060a1.
      However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
      this assumption no longer holds, and the possibility of the v1/v2 mixed mode
      with the v2 root fallback being hit becomes a real security issue.
      
      Many of the cgroup v2 BPF programs are also used for policy enforcement, just
      to pick _one_ example, that is, to programmatically deny socket related system
      calls like connect(2) or bind(2). A v2 root fallback would implicitly cause
      a policy bypass for the affected Pods.
      
      In production environments, we have recently seen this case due to various
      circumstances: i) a different 3rd party agent and/or ii) a container runtime
      such as [0] in the user's environment configuring legacy cgroup v1 net_cls
      tags, which implicitly triggered the mentioned root fallback. Another case is
      Kubernetes projects like kind [1] which create Kubernetes nodes in a container
      and also add cgroup namespaces to the mix, meaning programs which are attached
      to the cgroup v2 root of the cgroup namespace get attached to a non-root
      cgroup v2 path from init namespace point of view. And the latter's root is
      out of reach for agents on a kind Kubernetes node to configure. Meaning, any
      entity on the node setting cgroup v1 net_cls tag will trigger the bypass
      despite cgroup v2 BPF programs attached to the namespace root.
      
      Generally, this mutual exclusiveness does not hold anymore in today's user
      environments and makes cgroup v2 usage from BPF side fragile and unreliable.
      This fix adds proper struct cgroup pointer for the cgroup v2 case to struct
      sock_cgroup_data in order to address these issues; this implicitly also fixes
      the tradeoffs being made back then with regards to races and refcount leaks
      as stated in bd1060a1, and removes the fallback, so that cgroup v2 BPF
      programs always operate as expected.
      
        [0] https://github.com/nestybox/sysbox/
        [1] https://kind.sigs.k8s.io/
      
      Fixes: bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net
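The fallback described in 8520e224 can be illustrated with a small user-space sketch. This is not the kernel code: the type and function names (sock_cgroup_old, sock_cgroup_new, old_cgroup_ptr, new_cgroup_ptr, old_set_classid) are hypothetical stand-ins. It only models the idea that the old layout shared one slot between the v2 pointer and the v1 tags, while the fix gives the v2 pointer its own field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct cgroup. */
struct cgroup { const char *path; };

/* Models &cgrp_dfl_root.cgrp, the v2 root used as the fallback. */
static struct cgroup v2_root = { "/" };

/* Old layout (model): one slot holds EITHER the v2 pointer OR the v1 tag. */
struct sock_cgroup_old {
	bool is_v1_data;               /* true once a v1 tag has been written */
	union {
		struct cgroup *cgroup; /* cgroup v2 membership */
		uint32_t classid;      /* net_cls tag (v1) */
	};
};

/* New layout (model): the v2 pointer has its own field next to the tag. */
struct sock_cgroup_new {
	struct cgroup *cgroup;
	uint32_t classid;
};

/* Writing a v1 tag destroys the v2 pointer in the old layout. */
static void old_set_classid(struct sock_cgroup_old *d, uint32_t classid)
{
	d->is_v1_data = true;
	d->classid = classid;
}

/* Old lookup: falls back to the v2 root as soon as v1 tagging is in use. */
static struct cgroup *old_cgroup_ptr(const struct sock_cgroup_old *d)
{
	return d->is_v1_data ? &v2_root : d->cgroup;
}

/* New lookup: the v2 cgroup is always available, no fallback needed. */
static struct cgroup *new_cgroup_ptr(const struct sock_cgroup_new *d)
{
	return d->cgroup;
}
```

In this model, a BPF program attached to a Pod's non-root v2 cgroup would stop matching the socket after old_set_classid(), which corresponds to the policy bypass the commit describes.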
  3. 09 Jun, 2021 1 commit
  4. 25 May, 2021 1 commit
  5. 10 May, 2021 1 commit
    • cgroup: introduce cgroup.kill · 661ee628
      Committed by Christian Brauner
      Introduce the cgroup.kill file. It does what it says on the tin and
      allows a caller to kill a cgroup by writing "1" into cgroup.kill.
      The file is available in non-root cgroups.
      
      Killing cgroups is a process directed operation, i.e. the whole
      thread-group is affected. Consequently trying to write to cgroup.kill in
      threaded cgroups will be rejected and EOPNOTSUPP returned. This behavior
      aligns with cgroup.procs where reads in threaded-cgroups are rejected
      with EOPNOTSUPP.
      
      The cgroup.kill file is write-only since killing a cgroup is an event,
      not a state, which makes it different from e.g. freezer where a cgroup
      transitions between the two states.
      
      As with all new cgroup features cgroup.kill is recursive by default.
      
      Killing a cgroup is protected against concurrent migrations through the
      cgroup mutex. To protect against forkbombs and to mitigate the effect of
      racing forks a new CGRP_KILL css set lock protected flag is introduced
      that is set prior to killing a cgroup and unset after the cgroup has
      been killed. We can then check in cgroup_post_fork() where we hold the
      css set lock already whether the cgroup is currently being killed. If so
      we send the child a SIGKILL signal immediately taking it down as soon as
      it returns to userspace. To make the killing of the child semantically
      clean it is killed after all cgroup attachment operations have been
      finalized.
      
      There are various use-cases of this interface:
      - Containers usually have a conservative layout where each container
        usually has a delegated cgroup. For such layouts there is a 1:1
        mapping between container and cgroup. If the container in addition
        uses a separate pid namespace then killing a container usually becomes
        a simple kill -9 <container-init-pid> from an ancestor pid namespace.
        However, there are quite a few scenarios where that isn't true. For
        example, there are containers that share the cgroup with other
        processes on purpose that are supposed to be bound to the lifetime of
        the container but are not in the same pidns of the container, and
        containers that are in a delegated cgroup but share the pid namespace
        with the host or other containers.
      - Service managers such as systemd use cgroups to group and organize
        processes belonging to a service. They usually rely on a recursive
        algorithm now to kill a service. With cgroup.kill this becomes a
        simple write to cgroup.kill.
      - Userspace OOM implementations can make good use of this feature to
        efficiently take down whole cgroups quickly.
      - The kill program can gain a new
        kill --cgroup /sys/fs/cgroup/delegated
        flag to take down cgroups.
      
      A few observations about the semantics:
      - If parent and child are in the same cgroup and CLONE_INTO_CGROUP is
        not specified we are not taking cgroup mutex meaning the cgroup can be
        killed while a process in that cgroup is forking.
        If the kill request happens right before cgroup_can_fork() and before
        the parent grabs its siglock the parent is guaranteed to see the
        pending SIGKILL. In addition we perform another check in
        cgroup_post_fork() whether the cgroup is being killed and if so take
        down the child (see above). This is robust enough and protects against
        forkbombs. If userspace really really wants to have stricter
        protection the simple solution would be to grab the write side of the
        cgroup threadgroup rwsem which will force all ongoing forks to
        complete before killing starts. We concluded that this is not
        necessary as the semantics for concurrent forking should simply align
        with freezer where a similar check as cgroup_post_fork() is performed.
      
        For all other cases CLONE_INTO_CGROUP is required. In this case we
        will grab the cgroup mutex so the cgroup can't be killed while we
        fork. Once we're done with the fork and have dropped cgroup mutex we
        are visible and will be found by any subsequent kill request.
      - We obviously don't kill kthreads. This means a cgroup that has a
        kthread will not become empty after killing and consequently no
        unpopulated event will be generated. The assumption is that kthreads
        should be in the root cgroup only anyway so this is not an issue.
      - We skip killing tasks that already have pending fatal signals.
      - Freezer doesn't care about tasks in different pid namespaces, i.e. if
        you have two tasks in different pid namespaces the cgroup would still
        be frozen. The cgroup.kill mechanism consequently behaves the same
        way, i.e. we kill all processes and ignore in which pid namespace they
        exist.
      - If the caller is located in a cgroup that is killed the caller will
        obviously be killed as well.
      
      Link: https://lore.kernel.org/r/20210503143922.3093755-1-brauner@kernel.org
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: cgroups@vger.kernel.org
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Serge Hallyn <serge@hallyn.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  6. 16 Dec, 2020 1 commit
  7. 10 Jul, 2020 1 commit
  8. 08 Jul, 2020 1 commit
  9. 03 Apr, 2020 1 commit
    • mm: memcontrol: recursive memory.low protection · 8a931f80
      Committed by Johannes Weiner
      Right now, the effective protection of any given cgroup is capped by its
      own explicit memory.low setting, regardless of what the parent says.  The
      reasons for this are mostly historical and ease of implementation: to make
      delegation of memory.low safe, effective protection is the min() of all
      memory.low up the tree.
      
      Unfortunately, this limitation makes it impossible to protect an entire
      subtree from another without forcing the user to make explicit protection
      allocations all the way to the leaf cgroups - something that is highly
      undesirable in real life scenarios.
      
      Consider memory in a data center host.  At the cgroup top level, we have a
      distinction between system management software and the actual workload the
      system is executing.  Both branches are further subdivided into individual
      services, job components etc.
      
      We want to protect the workload as a whole from the system management
      software, but that doesn't mean we want to protect and prioritize
      individual workloads with respect to each other.  Their memory demand can vary over
      time, and we'd want the VM to simply cache the hottest data within the
      workload subtree.  Yet, the current memory.low limitations force us to
      allocate a fixed amount of protection to each workload component in order
      to get protection from system management software in general.  This
      results in very inefficient resource distribution.
      
      Another concern with mandating downward allocation is that, as the
      complexity of the cgroup tree grows, it gets harder for the lower levels
      to be informed about decisions made at the host-level.  Consider a
      container inside a namespace that in turn creates its own nested tree of
      cgroups to run multiple workloads.  It'd be extremely difficult to
      configure memory.low parameters in those leaf cgroups that on one hand
      balance pressure among siblings as the container desires, while also
      reflecting the host-level protection from e.g.  rpm upgrades, that lie
      beyond one or more delegation and namespacing points in the tree.
      
      It's highly unusual from a cgroup interface POV that nested levels have to
      be aware of and reflect decisions made at higher levels for them to be
      effective.
      
      To enable such use cases and scale configurability for complex trees, this
      patch implements a resource inheritance model for memory that is similar
      to how the CPU and the IO controller implement work-conserving resource
      allocations: a share of a resource allocated to a subtree always applies to
      the entire subtree recursively, while allowing, but not mandating,
      children to further specify distribution rules.
      
      That means that if protection is explicitly allocated among siblings,
      those configured shares are being followed during page reclaim just like
      they are now.  However, if the memory.low set at a higher level is not
      fully claimed by the children in that subtree, the "floating" remainder is
      applied to each cgroup in the tree in proportion to its size.  Since
      reclaim pressure is applied in proportion to size as well, each child in
      that tree gets the same boost, and the effect is neutral among siblings -
      with respect to each other, they behave as if no memory control was
      enabled at all, and the VM simply balances the memory demands optimally
      within the subtree.  But collectively those cgroups enjoy a boost over the
      cgroups in neighboring trees.
      
      E.g.  a leaf cgroup with a memory.low setting of 0 no longer means that
      it's not getting a share of the hierarchically assigned resource, just
      that it doesn't claim a fixed amount of it to protect from its siblings.
      
      This allows us to recursively protect one subtree (workload) from another
      (system management), while letting subgroups compete freely among each
      other - without having to assign fixed shares to each leaf, and without
      nested groups having to echo higher-level settings.
      
      The floating protection composes naturally with fixed protection.
      Consider the following example tree:
      
                    A             A:  low = 2G
                   / \            A1: low = 1G
                  A1 A2           A2: low = 0G
      
      As outside pressure is applied to this tree, A1 will enjoy a fixed
      protection from A2 of 1G, but the remaining, unclaimed 1G from A is split
      evenly among A1 and A2, coming out to 1.5G and 0.5G.
      
      There is a slight risk of regressing theoretical setups where the
      top-level cgroups don't know about the true budgeting and set bogusly high
      "bypass" values that are meaningfully allocated down the tree.  Such
      setups would rely on unclaimed protection to be discarded, and
      distributing it would change the intended behavior.  Be safe and hide the
      new behavior behind a mount option, 'memory_recursiveprot'.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Chris Down <chris@chrisdown.name>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
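The A/A1/A2 example in 8a931f80 can be reproduced with a simplified model. The sketch below is an assumption-laden illustration, not the kernel's effective-protection calculation: it gives each child its explicit claim (capped by its usage) and then splits the unclaimed remainder in proportion to each child's size. The struct and function names are hypothetical, values are in MiB, and both children are assumed to use 2G so that the split is even, as in the commit's example. Claims that exceed the parent's protection are left untouched here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a child cgroup; all values in MiB. */
struct child {
	uint64_t low;        /* explicit memory.low setting */
	uint64_t usage;      /* current size of the cgroup */
	uint64_t effective;  /* output: modelled effective protection */
};

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Simplified model: fixed claims first, then the "floating" remainder of
 * the parent's protection is distributed in proportion to usage.
 */
static void distribute_low(uint64_t parent_low, struct child *kids, size_t n)
{
	uint64_t claimed = 0, total_usage = 0;

	for (size_t i = 0; i < n; i++) {
		kids[i].effective = min_u64(kids[i].low, kids[i].usage);
		claimed += kids[i].effective;
		total_usage += kids[i].usage;
	}

	/* Nothing floats if the claims use up the parent's protection. */
	if (claimed >= parent_low || total_usage == 0)
		return;

	uint64_t floating = parent_low - claimed;

	for (size_t i = 0; i < n; i++)
		kids[i].effective += floating * kids[i].usage / total_usage;
}
```

With A: low = 2048, A1: low = 1024 and A2: low = 0, and both children at 2048 MiB usage, the model yields 1536 MiB for A1 and 512 MiB for A2, matching the 1.5G and 0.5G stated in the commit message.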
  10. 13 Feb, 2020 1 commit
    • clone3: allow spawning processes into cgroups · ef2c41cf
      Committed by Christian Brauner
      This adds support for creating a process in a different cgroup than its
      parent. Callers can limit and account processes and threads right from
      the moment they are spawned:
      - A service manager can directly spawn new services into dedicated
        cgroups.
      - A process can be directly created in a frozen cgroup and will be
        frozen as well.
      - The initial accounting jitter experienced by process supervisors and
        daemons is eliminated with this.
      - Threaded applications or even thread implementations can choose to
        create a specific cgroup layout where each thread is spawned
        directly into a dedicated cgroup.
      
      This feature is limited to the unified hierarchy. Callers need to pass
      a directory file descriptor for the target cgroup. The caller can
      choose to pass an O_PATH file descriptor. All usual migration
      restrictions apply, i.e. there can be no processes in inner nodes. In
      general, creating a process directly in a target cgroup adheres to all
      migration restrictions.
      
      One of the biggest advantages of this feature is that CLONE_INTO_CGROUP does
      not need to grab the write side of the cgroup_threadgroup_rwsem.
      This global lock makes moving tasks/threads around super expensive. With
      clone3() this lock is avoided.
      
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: cgroups@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
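A minimal sketch of how a caller might use the interface from ef2c41cf: pass a directory file descriptor for the target cgroup in the clone3() arguments together with CLONE_INTO_CGROUP. The helper name spawn_into_cgroup() and the local struct clone3_args are hypothetical; the struct is assumed to mirror the layout of struct clone_args from <linux/sched.h> including the cgroup field, and the fallback defines are only there so the sketch builds against older headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef CLONE_INTO_CGROUP
#define CLONE_INTO_CGROUP 0x200000000ULL
#endif
#ifndef SYS_clone3
#define SYS_clone3 435
#endif

/* Local mirror of struct clone_args (assumed layout, 11 x 64-bit fields). */
struct clone3_args {
	uint64_t flags;
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
	uint64_t set_tid;
	uint64_t set_tid_size;
	uint64_t cgroup;      /* fd of the target cgroup directory */
};

/*
 * Fork-like helper: returns the child's pid in the parent, 0 in the child,
 * or -1 with errno set. The child starts its life inside the cgroup that
 * cgroup_fd refers to.
 */
static pid_t spawn_into_cgroup(int cgroup_fd)
{
	struct clone3_args args;

	memset(&args, 0, sizeof(args));
	args.flags = CLONE_INTO_CGROUP;
	args.exit_signal = SIGCHLD;
	args.cgroup = (uint64_t)cgroup_fd;

	return (pid_t)syscall(SYS_clone3, &args, sizeof(args));
}
```

A caller would typically obtain cgroup_fd with open(path, O_PATH | O_DIRECTORY) on a unified-hierarchy directory, then treat the return value like that of fork(). Kernels without clone3() or without this flag fail the call instead of silently ignoring the request.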
  11. 13 Nov, 2019 1 commit
    • cgroup: use cgrp->kn->id as the cgroup ID · 74321038
      Committed by Tejun Heo
      cgroup ID is currently allocated using a dedicated per-hierarchy idr
      and used internally and exposed through tracepoints and bpf.  This is
      confusing because there are tracepoints and other interfaces which use
      the cgroupfs ino as IDs.
      
      The preceding changes exposed kn->id as the ino: a 64bit ino on
      supported archs, or ino+gen (low 32bits as ino, high 32bits as gen) elsewhere.  There's no
      reason for cgroup to use different IDs.  The kernfs IDs are unique and
      userland can easily discover them and map them back to paths using
      standard file operations.
      
      This patch replaces cgroup IDs with kernfs IDs.
      
      * cgroup_id() is added and all cgroup ID users are converted to use it.
      
      * kernfs_node creation is moved to earlier during cgroup init so that
        cgroup_id() is available during init.
      
      * While at it, s/cgroup/cgrp/ in psi helpers for consistency.
      
      * Fallback ID value is changed to 1 to be consistent with root cgroup
        ID.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
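Since 74321038 makes the cgroup ID the kernfs ID, userland can discover it with standard file operations. A minimal sketch, assuming an architecture where the ID is exposed as a full 64bit ino: the ID is then the inode number of the cgroup directory. The helper name cgroup_id_from_path() is hypothetical. On archs that use ino+gen, st_ino would only carry the low 32 bits, so this shortcut does not recover the full ID there.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/stat.h>

/*
 * Store the inode number of cgroup_dir in *id.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int cgroup_id_from_path(const char *cgroup_dir, uint64_t *id)
{
	struct stat st;

	if (stat(cgroup_dir, &st) != 0)
		return -1;
	if (!S_ISDIR(st.st_mode)) {
		errno = ENOTDIR;
		return -1;
	}

	*id = (uint64_t)st.st_ino;
	return 0;
}
```

The result can be compared against the IDs reported by tracepoints or BPF programs to map an event back to a cgroup path.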
  12. 07 Nov, 2019 1 commit
    • cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency · 1bb5ec2e
      Committed by Tejun Heo
      cgroup->bstat_pending is used to determine the base stat delta to
      propagate to the parent.  While correct, this is different from how
      percpu delta is determined for no good reason and the inconsistency
      makes the code more difficult to understand.
      
      This patch makes parent propagation delta calculation use the same
      method as percpu to global propagation.
      
      * cgroup_base_stat_accumulate() is renamed to cgroup_base_stat_add()
        and cgroup_base_stat_sub() is added.
      
      * percpu propagation calculation is updated to use the above helpers.
      
      * cgroup->bstat_pending is replaced with cgroup->last_bstat and
        updated to use the same calculation as percpu propagation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
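The delta calculation that 1bb5ec2e switches to can be sketched as follows. This is a user-space model with hypothetical type and field names (base_stat, cgrp, propagate_to_parent); it only shows the pattern of remembering the last propagated snapshot and pushing the difference to the parent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the base stat counters. */
struct base_stat {
	uint64_t utime;
	uint64_t stime;
	uint64_t sum_exec_runtime;
};

static void base_stat_add(struct base_stat *dst, const struct base_stat *src)
{
	dst->utime += src->utime;
	dst->stime += src->stime;
	dst->sum_exec_runtime += src->sum_exec_runtime;
}

static void base_stat_sub(struct base_stat *dst, const struct base_stat *src)
{
	dst->utime -= src->utime;
	dst->stime -= src->stime;
	dst->sum_exec_runtime -= src->sum_exec_runtime;
}

struct cgrp {
	struct cgrp *parent;
	struct base_stat bstat;       /* cumulative counters */
	struct base_stat last_bstat;  /* snapshot already sent to the parent */
};

/* Propagate only what changed since the previous propagation. */
static void propagate_to_parent(struct cgrp *c)
{
	struct base_stat delta = c->bstat;

	if (!c->parent)
		return;

	base_stat_sub(&delta, &c->last_bstat);      /* delta = bstat - last */
	base_stat_add(&c->parent->bstat, &delta);   /* parent += delta */
	base_stat_add(&c->last_bstat, &delta);      /* last = bstat */
}
```

Calling propagate_to_parent() twice without new activity adds nothing the second time, which is the property the snapshot approach provides.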
  13. 15 Jul, 2019 1 commit
  14. 15 Jun, 2019 1 commit
  15. 10 Jun, 2019 1 commit
  16. 07 Jun, 2019 1 commit
  17. 02 Jun, 2019 1 commit
    • mm, memcg: consider subtrees in memory.events · 9852ae3f
      Committed by Chris Down
      memory.stat and other files already consider subtrees in their output, and
      we should too in order to not present an inconsistent interface.
      
      The current situation is fairly confusing, because people interacting with
      cgroups expect hierarchical behaviour in the vein of memory.stat,
      cgroup.events, and other files.  For example, this causes confusion when
      debugging reclaim events under low, as currently these always read "0" at
      non-leaf memcg nodes, which frequently causes people to misdiagnose breach
      behaviour.  The same confusion applies to other counters in this file when
      debugging issues.
      
      Aggregation is done at write time instead of at read-time since these
      counters aren't hot (unlike memory.stat which is per-page, so it does it
      at read time), and it makes sense to bundle this with the file
      notifications.
      
      After this patch, events are propagated up the hierarchy:
      
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 0
          oom 0
          oom_kill 0
          [root@ktst ~]# systemd-run -p MemoryMax=1 true
          Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 7
          oom 1
          oom_kill 1
      
      As this is a change in behaviour, this can be reverted to the old
      behaviour by mounting with the `memory_localevents' flag set.  However, we
      use the new behaviour by default as there's a lack of evidence that there
      are any current users of memory.events that would find this change
      undesirable.
      
      akpm: this is a behaviour change, so Cc:stable.  This is so that
      forthcoming distros which use cgroup v2 are more likely to pick up the
      revised behaviour.
      
      Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
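The write-time aggregation described in 9852ae3f can be modelled in a few lines. This is a sketch with hypothetical names (memcg, memcg_record_event, the EV_* enum), not the kernel implementation: an event is counted in the cgroup where it happens and in every ancestor, unless the memory_localevents behaviour is requested, in which case only the local counter is bumped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One counter per line of memory.events. */
enum memcg_event { EV_LOW, EV_HIGH, EV_MAX, EV_OOM, EV_OOM_KILL, EV_COUNT };

/* Hypothetical model of a memory cgroup. */
struct memcg {
	struct memcg *parent;
	unsigned long events[EV_COUNT];
};

/*
 * Record an event at write time. With local_events == false the event is
 * propagated up the hierarchy; with true it stays in the originating cgroup.
 */
static void memcg_record_event(struct memcg *cg, enum memcg_event ev,
			       bool local_events)
{
	for (; cg; cg = cg->parent) {
		cg->events[ev]++;
		if (local_events)
			break;
	}
}
```

In the commit's example, the events hit in the transient unit show up in system.slice because each one walks up through the parents in this way.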
  18. 01 Jun, 2019 1 commit
  19. 20 Apr, 2019 2 commits
    • cgroup: cgroup v2 freezer · 76f969e8
      Committed by Roman Gushchin
      Cgroup v1 implements the freezer controller, which provides an ability
      to stop the workload in a cgroup and temporarily free up some
      resources (cpu, io, network bandwidth and, potentially, memory)
      for some other tasks. Cgroup v2 lacks this functionality.
      
      This patch implements freezer for cgroup v2.
      
      Cgroup v2 freezer tries to put tasks into a state similar to jobctl
      stop. This means that tasks can be killed, ptraced (using
      PTRACE_SEIZE*), and interrupted. It is possible to attach to
      a frozen task, get some information (e.g. read registers) and detach.
      It's also possible to migrate a frozen task to another cgroup.
      
      This distinguishes the cgroup v2 freezer from the cgroup v1 freezer, which
      mostly tried to imitate the system-wide freezer. While uninterruptible
      sleep is fine when all tasks are going to be frozen (hibernation case),
      it's not an acceptable state for some subset of the system.
      
      The cgroup v2 freezer does not support freezing kthreads.
      If a non-root cgroup contains a kthread, the cgroup can still be frozen,
      but the kthread will remain running, the cgroup will be shown
      as non-frozen, and the notification will not be delivered.
      
      * PTRACE_ATTACH is not working because non-fatal signal delivery
      is blocked in frozen state.
      
      There are some interface differences between cgroup v1 and cgroup v2
      freezer too, which are required to conform to the cgroup v2 interface
      design principles:
      1) There is no separate controller, which has to be turned on:
      the functionality is always available and is represented by
      cgroup.freeze and cgroup.events cgroup control files.
      2) The desired state is defined by the cgroup.freeze control file.
      Any hierarchical configuration is allowed.
      3) The interface is asynchronous. The actual state is available
      using cgroup.events control file ("frozen" field). There are no
      dedicated transitional states.
      4) It's allowed to make any changes with the cgroup hierarchy
      (create new cgroups, remove old cgroups, move tasks between cgroups)
      no matter if some cgroups are frozen.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
      Cc: kernel-team@fb.com
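Because the interface in 76f969e8 is asynchronous, a caller writes the desired state to cgroup.freeze and then reads the actual state from the "frozen" field of cgroup.events. A minimal sketch of the parsing half follows; the helper name parse_frozen() is hypothetical, and it assumes the file content has already been read into a NUL-terminated buffer of "key value" lines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Extract the "frozen" field from the text of cgroup.events.
 * Returns 1 if frozen, 0 if not, -1 if the field is missing.
 */
static int parse_frozen(const char *events)
{
	const char *line = events;

	while (line && *line) {
		int value;

		/* Only matches a line that starts with the key "frozen". */
		if (sscanf(line, "frozen %d", &value) == 1)
			return value != 0;

		line = strchr(line, '\n');
		if (line)
			line++;     /* step past the newline */
	}
	return -1;
}
```

A caller would write "1" to cgroup.freeze and then re-read cgroup.events (for example after a poll() notification on the file) until parse_frozen() returns 1.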
    • cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock · 4dcabece
      Committed by Roman Gushchin
      The number of descendant cgroups and the number of dying
      descendant cgroups are currently synchronized using the cgroup_mutex.
      
      The number of descendant cgroups will be required by the cgroup v2
      freezer, which will use it to determine if a cgroup is frozen
      (depending on total number of descendants and number of frozen
      descendants). It's not always acceptable to grab the cgroup_mutex,
      especially from quite hot paths (e.g. exit()).
      
      To avoid this, let's additionally synchronize these counters using
      the css_set_lock.
      
      So, it's safe to read these counters with either cgroup_mutex or
      css_set_lock locked, and for changing both locks should be acquired.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
  20. 06 Mar, 2019 1 commit
  21. 31 Jan, 2019 1 commit
    • cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting · 51bee5ab
      Committed by Oleg Nesterov
      The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
      needs pids_free() to uncharge the pid.
      
      However, ->free() is called from __put_task_struct()->cgroup_free() and this
      is too late. Even the trivial program which does
      
      	for (;;) {
      		int pid = fork();
      		assert(pid >= 0);
      		if (pid)
      			wait(NULL);
      		else
      			exit(0);
      	}
      
      can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
      implies an RCU gp after the task/pid goes away and before the final put().
      
      Test-case:
      
      	mkdir -p /tmp/CG
      	mount -t cgroup2 none /tmp/CG
      	echo '+pids' > /tmp/CG/cgroup.subtree_control
      
      	mkdir /tmp/CG/PID
      	echo 2 > /tmp/CG/PID/pids.max
      
      	perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
      	echo $! > /tmp/CG/PID/cgroup.procs
      
      Without this patch the forking process fails soon after migration.
      
      Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
      into the new helper, cgroup_release(), called by release_task() which actually
      frees the pid(s).
      Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  22. 09 Nov, 2018 1 commit
    • cpuset: Expose cpuset.cpus.subpartitions with cgroup_debug · 5cf8114d
      Committed by Waiman Long
      For debugging purpose, it will be useful to expose the content of the
      subparts_cpus as a read-only file to see if the code works correctly.
      However, subparts_cpus will not be used at all in most use cases. So
      adding a new cpuset file that clutters the cgroup directory may not be
      desirable.  This is now being done by using the hidden "cgroup_debug"
      kernel command line option to expose a new "cpuset.cpus.subpartitions"
      file.
      
      That option was originally used by the debug controller to expose
      itself when configured into the kernel. This is now extended to set an
      internal flag used by cgroup_addrm_files(). A new CFTYPE_DEBUG flag
      can now be used to specify that a cgroup file should only be created
      when the "cgroup_debug" option is specified.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
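The gating that 5cf8114d describes can be sketched as a filter over a file table. This is a simplified model: the struct layout, the bit value chosen for CFTYPE_DEBUG, the cgroup_debug_enabled variable and the helper count_visible_files() are all assumptions made for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CFTYPE_DEBUG (1u << 0)   /* hypothetical bit value */

/* Hypothetical, reduced model of a cgroup file description. */
struct cftype {
	const char *name;
	unsigned int flags;
};

/* Models the internal flag set by the "cgroup_debug" command line option. */
static bool cgroup_debug_enabled;

/* Count the files that would be created for a cgroup directory. */
static size_t count_visible_files(const struct cftype *files, size_t n)
{
	size_t visible = 0;

	for (size_t i = 0; i < n; i++) {
		/* Debug-only files are skipped unless the option was given. */
		if ((files[i].flags & CFTYPE_DEBUG) && !cgroup_debug_enabled)
			continue;
		visible++;
	}
	return visible;
}
```

With a table holding cpuset.cpus and a debug-flagged cpuset.cpus.subpartitions, only the first is created by default, and both are created once the option is set.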
  23. 27 Oct, 2018 1 commit
  24. 05 Oct, 2018 1 commit
    • cgroup: Fix dom_cgrp propagation when enabling threaded mode · 479adb89
      Committed by Tejun Heo
      A cgroup which is already a threaded domain may be converted into a
      threaded cgroup if the prerequisite conditions are met.  When this
      happens, all threaded descendants should also have their ->dom_cgrp
      updated to the new threaded domain cgroup.  Unfortunately, this
      propagation was missing leading to the following failure.
      
        # cd /sys/fs/cgroup/unified
        # cat cgroup.subtree_control    # show that no controllers are enabled
      
        # mkdir -p mycgrp/a/b/c
        # echo threaded > mycgrp/a/b/cgroup.type
      
        At this point, the hierarchy looks as follows:
      
            mycgrp [d]
                a [dt]
                    b [t]
                        c [inv]
      
        Now let's make node "a" threaded (and thus "mycgrp" is made "domain threaded"):
      
        # echo threaded > mycgrp/a/cgroup.type
      
        By this point, we now have a hierarchy that looks as follows:
      
            mycgrp [dt]
                a [t]
                    b [t]
                        c [inv]
      
        But, when we try to convert the node "c" from "domain invalid" to
        "threaded", we get ENOTSUP on the write():
      
        # echo threaded > mycgrp/a/b/c/cgroup.type
        sh: echo: write error: Operation not supported
      
      This patch fixes the problem by
      
      * Moving the opencoded ->dom_cgrp save and restoration in
        cgroup_enable_threaded() into cgroup_{save|restore}_control() so
        that multiple cgroups can be handled.
      
      * Updating all threaded descendants' ->dom_cgrp to point to the new
        dom_cgrp when enabling threaded mode.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Reported-by: Amin Jamali <ajamali@pivotal.io>
      Reported-by: Joao De Almeida Pereira <jpereira@pivotal.io>
      Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
      Fixes: 454000ad ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
      Cc: stable@vger.kernel.org # v4.14+
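The second bullet of the fix in 479adb89 can be modelled as a small tree walk. The sketch below uses hypothetical names (cgrp, cgrp_init, enable_threaded, set_dom_recursive) and leaves out the prerequisite checks and the save/restore logic; it only shows that enabling threaded mode must update ->dom_cgrp for the cgroup itself and for every threaded descendant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced model of a cgroup in the hierarchy. */
struct cgrp {
	struct cgrp *parent;
	struct cgrp *first_child;
	struct cgrp *next_sibling;
	struct cgrp *dom_cgrp;   /* domain this cgroup belongs to */
	bool threaded;
};

static void cgrp_init(struct cgrp *c, struct cgrp *parent)
{
	c->parent = parent;
	c->first_child = NULL;
	c->next_sibling = NULL;
	c->dom_cgrp = c;         /* a domain cgroup is its own domain */
	c->threaded = false;
	if (parent) {
		c->next_sibling = parent->first_child;
		parent->first_child = c;
	}
}

/* Point c and all of its threaded descendants at the domain dom. */
static void set_dom_recursive(struct cgrp *c, struct cgrp *dom)
{
	c->dom_cgrp = dom;
	for (struct cgrp *k = c->first_child; k; k = k->next_sibling)
		if (k->threaded)
			set_dom_recursive(k, dom);
}

/* Model of enabling threaded mode, including the propagation step. */
static void enable_threaded(struct cgrp *c)
{
	if (!c->parent)
		return;          /* the root cannot become threaded */
	c->threaded = true;
	set_dom_recursive(c, c->parent->dom_cgrp);
}
```

Replaying the commit's scenario in this model (make "b" threaded, then "a", then "c") leaves "a", "b" and "c" all pointing at "mycgrp", whereas without the propagation "b" would keep pointing at "a".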
  25. 09 Jul, 2018 1 commit
    • blkcg: add generic throttling mechanism · d09d8df3
      Committed by Josef Bacik
      Since IO can be issued from literally anywhere it's almost impossible to
      do throttling without having some sort of adverse effect somewhere else
      in the system because of locking or other dependencies.  The best way to
      solve this is to do the throttling when we know we aren't holding any
      other kernel resources.  Do this by tracking throttling on a per-blkg
      basis, and if we require throttling flag the task that it needs to check
      before it returns to user space and possibly sleep there.
      
      This is to address the case where a process is doing work that is
      generating IO that can't be throttled, whether that is directly with a
      lot of REQ_META IO, or indirectly by allocating so much memory that it
      is swamping the disk with REQ_SWAP.  We can't use task_work_add as we
      don't want to induce a memory allocation in the IO path, so simply
      saving the request queue in the task and flagging it to do the
      notify_resume thing achieves the same result without the overhead of a
      memory allocation.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  26. 27 Apr, 2018 4 commits
    • cgroup: Add cgroup_subsys->css_rstat_flush() · 8f53470b
      Committed by Tejun Heo
      This patch adds cgroup_subsys->css_rstat_flush().  If a subsystem has
      this callback, its csses are linked on cgrp->css_rstat_list and rstat
      will call the function whenever the associated cgroup is flushed.
      Flush is also performed when such csses are released so that residual
      counts aren't lost.
      
      Combined with the rstat API previous patches factored out, this allows
      controllers to plug into rstat to manage their statistics in a
      scalable way.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      8f53470b
    • T
      cgroup: Distinguish base resource stat implementation from rstat · d4ff749b
      Tejun Heo committed
      Base resource stat accounts universal (not specific to any
      controller) resource consumptions on top of rstat.  Currently, its
      implementation is intermixed with rstat implementation making the code
      confusing to follow.

      This patch clarifies the distinction by doing the following.
      
      * Encapsulate base resource stat counters, currently only cputime, in
        struct cgroup_base_stat.
      
      * Move prev_cputime into struct cgroup and initialize it with cgroup.
      
      * Rename the related functions so that they start with cgroup_base_stat.
      
      * Prefix the related variables and field names with b.
      
      This patch doesn't make any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      d4ff749b
    • T
      cgroup: Rename stat to rstat · c58632b3
      Tejun Heo committed
      stat is too generic a name and ends up causing subtle confusions.
      It'll be made generic so that controllers can plug into it, which will
      make the problem worse.  Let's rename it to something more specific -
      cgroup_rstat for cgroup recursive stat.
      
      This patch does the following renames.  No other changes.
      
      * cpu_stat	-> rstat_cpu
      * stat		-> rstat
      * ?cstat	-> ?rstatc
      
      Note that the renames are selective.  The unrenamed are the ones which
      implement basic resource statistics on top of rstat.  This will be
      further cleaned up in the following patches.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      c58632b3
    • T
      cgroup: Limit event generation frequency · b12e3583
      Tejun Heo committed
      ".events" files generate file modified event to notify userland of
      possible new events.  Some of the events can be quite bursty
      (e.g. memory high event) and generating a notification each time is
      costly and pointless.

      This patch implements an event rate limit mechanism.  If a new
      notification is requested before 10ms has passed since the previous
      notification, the new notification is delayed till then.
      
      As this only delays from the second notification on in a given close
      cluster of notifications, userland reactions to notifications
      shouldn't be delayed at all in most cases while avoiding notification
      storms.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b12e3583
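The rate limit described in commit b12e3583 can be illustrated with a small user-space model. This is only a sketch of the behavior stated in the message (first notification immediate, later ones in a burst delayed until 10ms after the previous delivery); the class `RateLimitedNotifier` and its methods are hypothetical names, and the kernel implementation is not reproduced here. Times are integer milliseconds to keep the arithmetic exact.

```python
class RateLimitedNotifier:
    """Toy model: at most one notification is delivered per interval."""

    def __init__(self, min_interval_ms=10):
        self.min_interval_ms = min_interval_ms  # 10ms, as in the commit message
        self.last_fire = None    # time of the last delivered notification
        self.pending_at = None   # due time of a delayed notification, if any
        self.delivered = []      # delivery times, kept for inspection

    def request(self, now):
        """A new event happened at time `now` (milliseconds)."""
        self.poll(now)
        if self.last_fire is None or now - self.last_fire >= self.min_interval_ms:
            self._fire(now)                      # outside the window: deliver now
        elif self.pending_at is None:
            # Inside the window: delay until the window closes.  Further
            # requests in the same window are coalesced into this one.
            self.pending_at = self.last_fire + self.min_interval_ms

    def poll(self, now):
        """Deliver a delayed notification once its due time has passed."""
        if self.pending_at is not None and now >= self.pending_at:
            self._fire(self.pending_at)

    def _fire(self, when):
        self.delivered.append(when)
        self.last_fire = when
        self.pending_at = None
```

In this model a burst of three requests at 0, 2 and 4 ms produces one immediate delivery and one delayed delivery at 10 ms, which matches the "only delays from the second notification on" property noted in the message.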
  27. 20 Mar, 2018 1 commit
  28. 15 Mar, 2018 1 commit
  29. 02 Jan, 2018 1 commit
  30. 02 Nov, 2017 1 commit
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman committed
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it,
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information.
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In the initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
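The tagging step described in commit b2441318 (add the SPDX tag "in the format that the file expected", with different comment types for headers and .c files) can be sketched as follows. This is not the script Thomas and Greg used; `spdx_line` and `add_spdx` are hypothetical helpers. The comment styles for `.c` and `.h` follow the kernel's license rules documentation; treating every other file as `#`-commented is a simplifying assumption of this sketch.

```python
import os


def spdx_line(path, license_id="GPL-2.0"):
    """Return the SPDX tag line in the comment style suited to `path`."""
    ext = os.path.splitext(path)[1]
    if ext == ".h":
        return f"/* SPDX-License-Identifier: {license_id} */"
    if ext == ".c":
        return f"// SPDX-License-Identifier: {license_id}"
    # Assumption: anything else (Makefile, Kconfig, scripts) takes '#'.
    return f"# SPDX-License-Identifier: {license_id}"


def add_spdx(path, text, license_id="GPL-2.0"):
    """Insert the tag as the first line, or the second after a shebang."""
    if "SPDX-License-Identifier:" in text:
        return text  # already tagged, leave the file alone
    lines = text.splitlines(keepends=True)
    insert_at = 1 if lines and lines[0].startswith("#!") else 0
    tag = spdx_line(path, license_id) + "\n"
    return "".join(lines[:insert_at] + [tag] + lines[insert_at:])
```

A driver for this sketch would read each (path, license) row from the reviewed .csv files and rewrite the file with `add_spdx`; that loop is omitted because the .csv layout is not described in the message.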
  31. 27 Oct, 2017 1 commit
    • T
      cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat · d41bf8c9
      Tejun Heo committed
      The basic cpu stat is currently shown with "cpu." prefix in
      cgroup.stat, and the same information is duplicated in cpu.stat when
      cpu controller is enabled.  This is ugly and not very scalable as we
      want to expand the coverage of stat information which is always
      available.
      
      This patch makes cgroup core always create "cpu.stat" file and show
      the basic cpu stat there and calls the cpu controller to show the
      extra stats when enabled.  This ensures that the same information
      isn't presented in multiple places and makes future expansion of basic
      stats easier.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      d41bf8c9
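From user space, the "cpu.stat" file that commit d41bf8c9 makes always available is a flat list of "key value" lines. A minimal parser might look like the sketch below. The key names `usage_usec`, `user_usec` and `system_usec` are the ones listed in the cgroup v2 documentation rather than in the commit message, the sample values are made up, and `parse_cpu_stat` is a hypothetical helper.

```python
def parse_cpu_stat(text):
    """Parse flat-keyed 'key value' lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:          # skip blank or malformed lines
            stats[parts[0]] = int(parts[1])
    return stats


# Made-up sample in the documented format (values are in microseconds).
SAMPLE = """usage_usec 1500
user_usec 1000
system_usec 500
"""

stats = parse_cpu_stat(SAMPLE)
```

On a real system the text would come from a cgroup's `cpu.stat` file under the cgroup2 mount point (commonly `/sys/fs/cgroup`, which is an assumption about the local setup). When the cpu controller is enabled the same file carries extra keys, which this parser picks up without changes.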
  32. 25 Sep, 2017 1 commit
    • T
      cgroup: Implement cgroup2 basic CPU usage accounting · 041cd640
      Tejun Heo committed
      In cgroup1, while cpuacct isn't actually controlling any resources, it
      is a separate controller due to combination of two factors -
      1. enabling cpu controller has significant side effects, and 2. we
      have to pick one of the hierarchies to account CPU usages on.  cpuacct
      controller is effectively used to designate a hierarchy to track CPU
      usages on.
      
      cgroup2's unified hierarchy removes the second reason and we can
      account basic CPU usages by default.  While we can use cpuacct for
      this purpose, both its interface and implementation leave a lot to be
      desired - it collects and exposes two sources of truth which don't
      agree with each other and some of the exposed statistics don't make
      much sense.  Also, it propagates all the way up the hierarchy on each
      accounting event which is unnecessary.
      
      This patch adds basic resource accounting mechanism to cgroup2's
      unified hierarchy and accounts CPU usages using it.
      
      * All accountings are done per-cpu and don't propagate immediately.
        It just bumps the per-cgroup per-cpu counters and links to the
        parent's updated list if not already on it.
      
      * On a read, the per-cpu counters are collected into the global ones
        and then propagated upwards.  Only the per-cpu counters which have
        changed since the last read are propagated.
      
      * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
        prefix.  Total usage is collected from scheduling events.  User/sys
        breakdown is sourced from tick sampling and adjusted to the usage
        using cputime_adjust().
      
      This keeps the accounting side hot path O(1) and per-cpu and the read
      side O(nr_updated_since_last_read).
      
      v2: Minor changes and documentation updates as suggested by Waiman and
          Roman.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      041cd640
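The accounting scheme in commit 041cd640 (cheap per-cpu bumps on the hot path, propagation deferred to read time) can be modeled in a few lines. This is a simplified sketch, not the kernel code: the class `Cgroup`, the `pending` field and the use of Python sets for the "updated list" are assumptions made for illustration, and unlike the kernel a cgroup flushed on its own is not unlinked from its parent's list.

```python
class Cgroup:
    def __init__(self, name, parent=None, nr_cpus=2):
        self.name = name
        self.parent = parent
        self.percpu = [0] * nr_cpus   # per-cpu deltas not yet folded in
        self.pending = 0              # deltas handed up by flushed children
        self.total = 0                # hierarchical total as of the last flush
        # One "updated list" per cpu: children with unflushed changes.
        self.updated_children = [set() for _ in range(nr_cpus)]

    def charge(self, cpu, delta):
        """Hot path: bump the per-cpu counter and link upward if needed."""
        self.percpu[cpu] += delta
        node = self
        # Stop as soon as a node is already linked; its ancestors are too.
        while node.parent is not None and node not in node.parent.updated_children[cpu]:
            node.parent.updated_children[cpu].add(node)
            node = node.parent

    def flush(self):
        """Read path: fold in updated descendants, then propagate upward."""
        for children in self.updated_children:
            for child in children:
                child.flush()
            children.clear()
        delta = sum(self.percpu) + self.pending
        self.percpu = [0] * len(self.percpu)
        self.pending = 0
        self.total += delta
        if self.parent is not None:
            self.parent.pending += delta
```

Only cgroups that were linked since the last read are visited by `flush`, which is the property the message summarizes as a read side of O(nr_updated_since_last_read).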
  33. 18 Aug, 2017 1 commit
  34. 03 Aug, 2017 2 commits
    • R
      cgroup: implement hierarchy limits · 1a926e0b
      Roman Gushchin committed
      Creating cgroup hierarchies of unreasonable size can affect
      overall system performance. A user might want to limit the
      size of a cgroup hierarchy. This is especially important if a user
      is delegating some cgroup sub-tree.
      
      To address this issue, introduce an ability to control
      the size of cgroup hierarchy.
      
      The cgroup.max.descendants control file sets the maximum
      allowed number of descendant cgroups.
      The cgroup.max.depth file controls the maximum depth of the cgroup
      tree. Both are single value r/w files, with "max" as the default value.
      
      The control files exist on each hierarchy level (including root).
      When a new cgroup is created, we check the total descendants
      and depth limits on each level, and if none of them are exceeded,
      a new cgroup is created.
      
      Only alive cgroups are counted, removed (dying) cgroups are
      ignored.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      1a926e0b
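The creation-time check described in commit 1a926e0b (walk every level, refuse if any descendant or depth limit would be exceeded) can be sketched as below. The names `Cgroup`, `within_limits` and `create_child` are hypothetical, `math.inf` stands in for the "max" default, and returning `None` on failure is a choice of this sketch rather than the kernel's error path.

```python
import math


class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.nr_descendants = 0           # live descendants only
        self.max_descendants = math.inf   # models cgroup.max.descendants = "max"
        self.max_depth = math.inf         # models cgroup.max.depth = "max"


def within_limits(parent):
    """Check both limits at `parent` and at every ancestor above it."""
    level = 1          # depth of the new cgroup relative to `node`
    node = parent
    while node is not None:
        if node.nr_descendants >= node.max_descendants:
            return False
        if level > node.max_depth:
            return False
        node = node.parent
        level += 1
    return True


def create_child(parent, name):
    """Create a child if no limit is exceeded; return None otherwise."""
    if not within_limits(parent):
        return None
    child = Cgroup(name, parent)
    node = parent
    while node is not None:   # every ancestor gains one descendant
        node.nr_descendants += 1
        node = node.parent
    return child
```

With `max_depth = 1` on a cgroup, its direct children can be created but grandchildren cannot, and with `max_descendants = 2` the third creation anywhere in its subtree is refused.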
    • R
      cgroup: keep track of number of descent cgroups · 0679dee0
      Roman Gushchin committed
      Keep track of the number of online and dying descendant cgroups.

      This data will be used later to add an ability to control cgroup
      hierarchy (limit the depth and the number of descendant cgroups)
      and display hierarchy stats.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      0679dee0
  35. 21 Jul, 2017 1 commit
    • T
      cgroup: implement cgroup v2 thread support · 8cfd8147
      Tejun Heo committed
      This patch implements cgroup v2 thread support.  The goal of the
      thread mode is supporting hierarchical accounting and control at
      thread granularity while staying inside the resource domain model
      which allows coordination across different resource controllers and
      handling of anonymous resource consumptions.
      
      A cgroup is always created as a domain and can be made threaded by
      writing to the "cgroup.type" file.  When a cgroup becomes threaded, it
      becomes a member of a threaded subtree which is anchored at the
      closest ancestor which isn't threaded.
      
      The threads of the processes which are in a threaded subtree can be
      placed anywhere without being restricted by process granularity or
      no-internal-process constraint.  Note that the threads aren't allowed
      to escape to a different threaded subtree.  To be used inside a
      threaded subtree, a controller should explicitly support threaded mode
      and be able to handle internal competition in the way which is
      appropriate for the resource.
      
      The root of a threaded subtree, the nearest ancestor which isn't
      threaded, is called the threaded domain and serves as the resource
      domain for the whole subtree.  This is the last cgroup where domain
      controllers are operational and where all the domain-level resource
      consumptions in the subtree are accounted.  This allows threaded
      controllers to operate at thread granularity when requested while
      staying inside the scope of system-level resource distribution.
      
      As the root cgroup is exempt from the no-internal-process constraint,
      it can serve as both a threaded domain and a parent to normal cgroups,
      so, unlike non-root cgroups, the root cgroup can have both domain and
      threaded children.
      
      Internally, in a threaded subtree, each css_set has its ->dom_cset
      pointing to a matching css_set which belongs to the threaded domain.
      This ensures that thread root level cgroup_subsys_state for all
      threaded controllers are readily accessible for domain-level
      operations.
      
      This patch enables threaded mode for the pids and perf_events
      controllers.  Neither has to worry about domain-level resource
      consumptions and it's enough to simply set the flag.
      
      For more details on the interface and behavior of the thread mode,
      please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
      by this patch.
      
      v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
            Spotted by Waiman.
          - Documentation updated as suggested by Waiman.
          - cgroup.type content slightly reformatted.
          - Mark the debug controller threaded.
      
      v4: - Updated to the general idea of marking specific cgroups
            domain/threaded as suggested by PeterZ.
      
      v3: - Dropped "join" and always make mixed children join the parent's
            threaded subtree.
      
      v2: - After discussions with Waiman, support for mixed thread mode is
            added.  This should address the issue that Peter pointed out
            where any nesting should be avoided for thread subtrees while
            coexisting with other domain cgroups.
          - Enabling / disabling thread mode now piggy backs on the existing
            control mask update mechanism.
          - Bug fixes and cleanup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      8cfd8147
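The relationship between a threaded cgroup and its threaded domain in commit 8cfd8147 (the nearest ancestor which isn't threaded) can be shown with a toy tree. This sketch models only that lookup; `Cgroup`, `make_threaded` and `dom_cgrp` are illustrative names, and the validity checks the kernel applies when "cgroup.type" is written are reduced to a single assumption that the root cannot become threaded.

```python
class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.threaded = False   # every cgroup is created as a domain

    def make_threaded(self):
        """Model of writing "threaded" to cgroup.type."""
        if self.parent is None:
            # Assumption of this sketch: the root always stays a domain.
            raise ValueError("the root cgroup cannot be made threaded")
        self.threaded = True

    def dom_cgrp(self):
        """Return the resource domain: the nearest non-threaded cgroup."""
        node = self
        while node.threaded:
            node = node.parent
        return node


root = Cgroup("root")
app = Cgroup("app", root)
workers = Cgroup("workers", app)
workers.make_threaded()
io = Cgroup("io", workers)
io.make_threaded()
```

Here both `workers` and `io` resolve to `app`, the cgroup where domain-level resource consumption for the whole threaded subtree would be accounted, while a domain cgroup resolves to itself.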