1. 12 Oct 2022, 1 commit
  2. 27 Sep 2022, 1 commit
  3. 09 Sep 2022, 1 commit
  4. 02 Sep 2022, 1 commit
  5. 27 Aug 2022, 1 commit
    • cgroup: Homogenize cgroup_get_from_id() return value · fa7e439c
      Committed by Michal Koutný
      The cgroup ID is a user-provided datum, hence extend the function's
      return domain to include a possible error reason (similar to
      cgroup_get_from_fd()).
      
      This change also fixes commit d4ccaf58 ("bpf: Introduce cgroup
      iter"), which would otherwise use a NULL return where proper error
      handling is needed (see the sketch after this entry).
      
      Additionally, none of the callers of cgroup_get_from_id()
      (fc_appid_store, bpf_iter_attach_cgroup, mem_cgroup_get_from_ino) is
      built without CONFIG_CGROUPS, on which they depend via
      CONFIG_BLK_CGROUP, directly, and via CONFIG_MEMCG respectively, so
      drop the stub definition that is only needed with !CONFIG_CGROUPS.
      
      Fixes: d4ccaf58 ("bpf: Introduce cgroup iter")
      Signed-off-by: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      fa7e439c
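      A minimal, hedged sketch of the new calling convention described
      above (assuming cgroup_get_from_id() now returns ERR_PTR() values
      instead of NULL, as the subject line implies; the concrete call sites
      upstream differ):

        struct cgroup *cgrp;

        cgrp = cgroup_get_from_id(id);
        if (IS_ERR(cgrp))
                return PTR_ERR(cgrp);   /* e.g. -ENOENT for a stale id */
        /* ... use cgrp ... */
        cgroup_put(cgrp);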
  6. 16 Aug 2022, 2 commits
  7. 08 Jun 2022, 1 commit
  8. 12 Mar 2022, 1 commit
  9. 01 Mar 2022, 2 commits
  10. 14 Sep 2021, 1 commit
    • bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode · 8520e224
      Committed by Daniel Borkmann
      Fix cgroup v1 interference when non-root cgroup v2 BPF programs are used.
      Back in the day, commit bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
      embedded per-socket cgroup information into sock->sk_cgrp_data and in order
      to save 8 bytes in struct sock made both mutually exclusive, that is, when
      cgroup v1 socket tagging (e.g. net_cls/net_prio) is used, then cgroup v2
      falls back to the root cgroup in sock_cgroup_ptr() (&cgrp_dfl_root.cgrp).
      
      The assumption made was "there is no reason to mix the two and this is in line
      with how legacy and v2 compatibility is handled" as stated in bd1060a1.
      However, with Kubernetes more widely supporting cgroups v2 as well nowadays,
      this assumption no longer holds, and the possibility of the v1/v2 mixed mode
      with the v2 root fallback being hit becomes a real security issue.
      
      Many of the cgroup v2 BPF programs are also used for policy
      enforcement, for example to programmatically deny socket-related
      system calls like connect(2) or bind(2). A v2 root fallback would
      implicitly cause a policy bypass for the affected Pods.
      
      In production environments, we have recently seen this case due to
      various circumstances: i) a different third-party agent and/or ii) a
      container runtime such as [0] in the user's environment configuring
      legacy cgroup v1 net_cls tags, which triggered the root fallback
      mentioned above. Another case is Kubernetes projects like kind [1],
      which create Kubernetes nodes in a container and also add cgroup
      namespaces to the mix, meaning programs attached to the cgroup v2 root
      of the cgroup namespace get attached to a non-root cgroup v2 path from
      the init namespace's point of view. That root is out of reach for
      agents on a kind Kubernetes node to configure, so any entity on the
      node that sets a cgroup v1 net_cls tag will trigger the bypass despite
      cgroup v2 BPF programs attached to the namespace root.
      
      Generally, this mutual exclusiveness does not hold anymore in today's user
      environments and makes cgroup v2 usage from BPF side fragile and unreliable.
      This fix adds a proper struct cgroup pointer for the cgroup v2 case to
      struct sock_cgroup_data in order to address these issues (a hedged
      sketch of the layout follows this entry); this implicitly also fixes
      the tradeoffs made back then with regard to races and refcount leaks
      as stated in bd1060a1, and removes the fallback, so that cgroup v2
      BPF programs always operate as expected.
      
        [0] https://github.com/nestybox/sysbox/
        [1] https://kind.sigs.k8s.io/
      
      Fixes: bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Link: https://lore.kernel.org/bpf/20210913230759.2313-1-daniel@iogearbox.net
      8520e224
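      A hedged sketch of the reworked layout, close in spirit to the change
      above (exact field layout and config guards upstream may differ): the
      v2 pointer now lives alongside the v1 tagging data instead of being
      folded into one word, so sock_cgroup_ptr() needs no root fallback:

        struct sock_cgroup_data {
                struct cgroup   *cgroup;  /* cgroup v2, always valid */
        #ifdef CONFIG_CGROUP_NET_CLASSID
                u32             classid;  /* cgroup v1 net_cls tag */
        #endif
        #ifdef CONFIG_CGROUP_NET_PRIO
                u16             prioidx;  /* cgroup v1 net_prio index */
        #endif
        };

        static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
        {
                return skcd->cgroup;    /* no &cgrp_dfl_root.cgrp fallback */
        }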
  11. 10 Jun 2021, 1 commit
  12. 09 Jun 2021, 1 commit
  13. 25 May 2021, 1 commit
  14. 11 May 2021, 1 commit
  15. 17 Feb 2021, 1 commit
  16. 19 Aug 2020, 1 commit
    • cgroup: Use generic ns_common::count · f387882d
      Committed by Kirill Tkhai
      Switch over cgroup namespaces to use the newly introduced common lifetime
      counter.
      
      Currently every namespace type has its own lifetime counter which is stored
      in the specific namespace struct. The lifetime counters are used
      identically for all namespaces types. Namespaces may of course have
      additional unrelated counters and these are not altered.
      
      This introduces a common lifetime counter into struct ns_common. The
      ns_common struct encompasses information that all namespaces share,
      and that should include the lifetime counter since it's common to all
      of them (a sketch follows this entry).
      
      It also allows us to unify the type of the counters across all namespaces.
      Most of them use refcount_t, but one uses atomic_t and at least one
      uses kref. The last one in particular doesn't make much sense, since
      it has been just a wrapper around refcount_t since 2016, and it
      actually complicates cleanup operations by requiring container_of()
      to cast the correct namespace struct out of struct ns_common.
      
      Having the lifetime counter for the namespaces in one place reduces
      maintenance cost. Not just because after switching all namespaces over we
      will have removed more code than we added but also because the logic is
      more easily understandable and we indicate to the user that the basic
      lifetime requirements for all namespaces are currently identical.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Link: https://lore.kernel.org/r/159644980994.604812.383801057081594972.stgit@localhost.localdomain
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      f387882d
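      A hedged sketch of the scheme described above (field layout per my
      reading of the series; details upstream may differ): the counter moves
      into the shared struct, and per-type helpers operate on it through the
      embedded ns_common:

        struct ns_common {
                atomic_long_t stashed;
                const struct proc_ns_operations *ops;
                unsigned int inum;
                refcount_t count;       /* the new shared lifetime counter */
        };

        static inline void get_cgroup_ns(struct cgroup_namespace *ns)
        {
                if (ns)
                        refcount_inc(&ns->ns.count);    /* shared counter */
        }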
  17. 08 Jul 2020, 1 commit
  18. 13 Feb 2020, 3 commits
    • clone3: allow spawning processes into cgroups · ef2c41cf
      Committed by Christian Brauner
      This adds support for creating a process in a different cgroup than its
      parent. Callers can limit and account processes and threads right from
      the moment they are spawned:
      - A service manager can directly spawn new services into dedicated
        cgroups.
      - A process can be directly created in a frozen cgroup and will be
        frozen as well.
      - The initial accounting jitter experienced by process supervisors and
        daemons is eliminated with this.
      - Threaded applications or even thread implementations can choose to
        create a specific cgroup layout where each thread is spawned
        directly into a dedicated cgroup.
      
      This feature is limited to the unified hierarchy. Callers need to pass
      a directory file descriptor for the target cgroup. The caller can
      choose to pass an O_PATH file descriptor. All usual migration
      restrictions apply, i.e. there can be no processes in inner nodes. In
      general, creating a process directly in a target cgroup adheres to all
      migration restrictions.
      
      One of the biggest advantages of this feature is that CLONE_INTO_CGROUP
      does not need to grab the write side of the global
      cgroup_threadgroup_rwsem, which makes moving tasks/threads around very
      expensive. With clone3() this lock is avoided entirely (a userspace
      usage sketch follows this entry).
      
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: cgroups@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      ef2c41cf
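      A hedged userspace sketch of the interface described above (assuming
      kernel headers new enough to define clone3() and CLONE_INTO_CGROUP,
      i.e. v5.7+, and an assumed, pre-created v2 cgroup /sys/fs/cgroup/mycg):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <linux/sched.h>        /* struct clone_args, CLONE_INTO_CGROUP */
        #include <signal.h>
        #include <sys/syscall.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main(void)
        {
                /* Directory fd of the target cgroup; O_PATH is enough. */
                int cgfd = open("/sys/fs/cgroup/mycg", O_DIRECTORY | O_PATH);
                if (cgfd < 0)
                        return 1;

                struct clone_args args = {
                        .flags       = CLONE_INTO_CGROUP,
                        .exit_signal = SIGCHLD,
                        .cgroup      = (unsigned long long)cgfd,
                };

                pid_t pid = syscall(SYS_clone3, &args, sizeof(args));
                if (pid < 0)
                        return 1;
                if (pid == 0)
                        _exit(0);       /* child starts life inside mycg */
                waitpid(pid, NULL, 0);
                return 0;
        }

      Note that no write to cgroup.procs happens at any point: the child is
      a member of the target cgroup from its first instruction.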
    • cgroup: Clean up css_set task traversal · f43caa2a
      Committed by Michal Koutný
      css_task_iter stores a pointer to the head of each iterable list; this
      dates back to commit 0f0a2b4f ("cgroup: reorganize css_task_iter"),
      from before we stored cur_cset.  Let us use the list heads in cur_cset
      directly and streamline css_task_iter_advance_css_set() a bit.  No
      functional change is intended.
      Signed-off-by: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      f43caa2a
    • cgroup: Iterate tasks that did not finish do_exit() · 9c974c77
      Committed by Michal Koutný
      PF_EXITING is set earlier than the actual removal from the css_set
      when a task is exiting. This can confuse cgroup.procs readers, who see
      no PF_EXITING tasks; rmdir, however, checks against css_set
      membership, so it can transiently fail with EBUSY.
      
      Fix this by listing tasks that weren't yet unlinked from the css_set
      active lists.
      It may happen that other users of the task iterator (without
      CSS_TASK_ITER_PROCS) spot a PF_EXITING task before cgroup_exit(). This
      is equal to the state before commit c03cd773 ("cgroup: Include dying
      leaders with live threads in PROCS iterations") but it may be
      revisited later.
      Reported-by: Suren Baghdasaryan <surenb@google.com>
      Fixes: c03cd773 ("cgroup: Include dying leaders with live threads in PROCS iterations")
      Signed-off-by: Michal Koutný <mkoutny@suse.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      9c974c77
  19. 13 Nov 2019, 2 commits
    • cgroup: use cgrp->kn->id as the cgroup ID · 74321038
      Committed by Tejun Heo
      cgroup ID is currently allocated using a dedicated per-hierarchy idr
      and used internally and exposed through tracepoints and bpf.  This is
      confusing because there are tracepoints and other interfaces which use
      the cgroupfs ino as IDs.
      
      The preceding changes expose kn->id as the inode number: a full 64-bit
      ino on supported archs, or ino+gen (low 32 bits as ino, high 32 bits
      as gen) otherwise.  There's no reason for cgroup to use different IDs.
      The kernfs IDs are unique and userland can easily discover them and
      map them back to paths using standard file operations.
      
      This patch replaces cgroup IDs with kernfs IDs.
      
      * cgroup_id() is added and all cgroup ID users are converted to use it.
      
      * kernfs_node creation is moved to earlier during cgroup init so that
        cgroup_id() is available during init.
      
      * While at it, s/cgroup/cgrp/ in psi helpers for consistency.
      
      * Fallback ID value is changed to 1 to be consistent with root cgroup
        ID.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      74321038
    • kernfs: convert kernfs_node->id from union kernfs_node_id to u64 · 67c0496e
      Committed by Tejun Heo
      kernfs_node->id is currently a union kernfs_node_id which represents
      either a 32bit (ino, gen) pair or u64 value.  I can't see much value
      in the usage of the union - all that's needed is a 64bit ID which the
      current code is already limited to.  Using a union makes the code
      unnecessarily complicated and prevents using 64bit ino without adding
      practical benefits.
      
      This patch drops union kernfs_node_id and makes kernfs_node->id a u64.
      The ino is stored in the lower 32 bits and the gen in the upper 32.
      Accessors - kernfs[_id]_ino() and kernfs[_id]_gen() - are added to
      retrieve the ino and gen (a sketch follows this entry).  This makes ID
      handling less cumbersome and will allow using 64-bit inos on supported
      archs.
      
      This patch doesn't make any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      67c0496e
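      A hedged sketch of the accessors and the packing described above
      (close to the upstream helpers, but treat the details as my reading of
      the change): ino in the low 32 bits, gen in the high 32, unless ino_t
      is 64 bits wide, in which case the whole id is the ino:

        static inline ino_t kernfs_id_ino(u64 id)
        {
                if (sizeof(ino_t) >= sizeof(u64))
                        return id;              /* 64-bit ino archs */
                return (u32)id;                 /* low 32 bits */
        }

        static inline u32 kernfs_id_gen(u64 id)
        {
                if (sizeof(ino_t) >= sizeof(u64))
                        return 1;               /* gen unused */
                return id >> 32;                /* high 32 bits */
        }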
  20. 25 Oct 2019, 1 commit
    • cgroup: remove cgroup_enable_task_cg_lists() optimization · 5153faac
      Committed by Tejun Heo
      cgroup_enable_task_cg_lists() is used to lazily initialize task
      cgroup associations on first use in order to reduce fork / exit
      overheads on systems which don't use cgroup.  Unfortunately, the
      locking around it has never been actually correct, and its value is
      dubious given that the vast majority of systems use cgroup right away
      from boot.
      
      This patch removes the optimization.  For now, replace the cg_list
      based branches with WARN_ON_ONCE()'s to be on the safe side.  We can
      simplify the logic further in the future.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      5153faac
  21. 25 Jul 2019, 1 commit
  22. 10 Jul 2019, 1 commit
  23. 01 Jun 2019, 3 commits
    • cgroup: add cgroup_parse_float() · a5e112e6
      Committed by Tejun Heo
      cgroup already uses floating point for percent[ile] numbers and there
      are several controllers which want to take them as input.  Add a
      generic parse helper to handle inputs.
      
      Update the interface convention documentation about the use of
      percentage numbers, and while at it, also clarify the default time
      unit.  (A usage sketch follows this entry.)
      Signed-off-by: Tejun Heo <tj@kernel.org>
      a5e112e6
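      A hedged usage sketch, assuming the helper's signature is
      int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v)
      and that it stores the parsed value scaled by 10^dec_shift (both are
      my reading of the change, not confirmed by this log):

        s64 weight;

        /* Assumed behavior: "12.34" with dec_shift == 2 yields 1234. */
        if (cgroup_parse_float(buf, 2, &weight))
                return -EINVAL;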
    • cgroup: Include dying leaders with live threads in PROCS iterations · c03cd773
      Committed by Tejun Heo
      CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
      this means that a process with a dying leader and live threads will be
      skipped.  IOW, cgroup.procs might be empty while cgroup.threads isn't,
      which is confusing to say the least.
      
      Fix it by making cset track dying tasks and include dying leaders with
      live threads in PROCS iteration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      c03cd773
    • cgroup: Implement css_task_iter_skip() · b636fd38
      Committed by Tejun Heo
      When a task is moved out of a cset, task iterators pointing to the
      task are advanced using the normal css_task_iter_advance() call.  This
      is fine but we'll be tracking dying tasks on csets and thus moving
      tasks from cset->tasks to (to be added) cset->dying_tasks.  When we
      remove a task from cset->tasks, if we advance the iterators, they may
      move over to the next cset before we had the chance to add the task
      back on the dying list, which can allow the task to escape iteration.
      
      This patch separates skipping out from advancing.  Skipping only moves
      the affected iterators to the next pointer rather than fully advancing
      them; the subsequent advance will recognize that the cursor has
      already been moved forward and do the rest of the advancing.  This
      ensures that when a task moves from one list to another in its cset,
      as long as it moves in the right direction, it's always visible to
      iteration.
      
      This doesn't cause any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      b636fd38
  24. 30 May 2019, 1 commit
    • cgroup: Use css_tryget() instead of css_tryget_online() in task_get_css() · 18fa84a2
      Committed by Tejun Heo
      A PF_EXITING task can stay associated with an offline css.  If such a
      task calls task_get_css(), it can get stuck indefinitely.  This can be
      triggered by BSD process accounting, which writes to a file with
      PF_EXITING set while racing against memcg disable, as in the backtrace
      at the end.
      
      After this change, task_get_css() may return a css which was already
      offline when the function was called.  None of the existing users are
      affected by this change (a sketch of the fixed retry loop follows this
      entry).
      
        INFO: rcu_sched self-detected stall on CPU
        INFO: rcu_sched detected stalls on CPUs/tasks:
        ...
        NMI backtrace for cpu 0
        ...
        Call Trace:
         <IRQ>
         dump_stack+0x46/0x68
         nmi_cpu_backtrace.cold.2+0x13/0x57
         nmi_trigger_cpumask_backtrace+0xba/0xca
         rcu_dump_cpu_stacks+0x9e/0xce
         rcu_check_callbacks.cold.74+0x2af/0x433
         update_process_times+0x28/0x60
         tick_sched_timer+0x34/0x70
         __hrtimer_run_queues+0xee/0x250
         hrtimer_interrupt+0xf4/0x210
         smp_apic_timer_interrupt+0x56/0x110
         apic_timer_interrupt+0xf/0x20
         </IRQ>
        RIP: 0010:balance_dirty_pages_ratelimited+0x28f/0x3d0
        ...
         btrfs_file_write_iter+0x31b/0x563
         __vfs_write+0xfa/0x140
         __kernel_write+0x4f/0x100
         do_acct_process+0x495/0x580
         acct_process+0xb9/0xdb
         do_exit+0x748/0xa00
         do_group_exit+0x3a/0xa0
         get_signal+0x254/0x560
         do_signal+0x23/0x5c0
         exit_to_usermode_loop+0x5d/0xa0
         prepare_exit_to_usermode+0x53/0x80
         retint_user+0x8/0x8
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v4.2+
      Fixes: ec438699 ("cgroup, block: implement task_get_css() and use it in bio_associate_current()")
      18fa84a2
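      A hedged sketch of the fixed retry loop (close to the helper in
      include/linux/cgroup.h as I read it): css_tryget() succeeds on an
      offline-but-not-released css, so a PF_EXITING task pinned to an
      offline css can no longer spin here forever:

        static inline struct cgroup_subsys_state *
        task_get_css(struct task_struct *task, int subsys_id)
        {
                struct cgroup_subsys_state *css;

                rcu_read_lock();
                while (true) {
                        css = task_css(task, subsys_id);
                        /* was css_tryget_online(), which can fail forever */
                        if (likely(css_tryget(css)))
                                break;
                        cpu_relax();
                }
                rcu_read_unlock();
                return css;
        }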
  25. 29 May 2019, 1 commit
    • bpf: decouple the lifetime of cgroup_bpf from cgroup itself · 4bfc0bb2
      Committed by Roman Gushchin
      Currently the lifetime of bpf programs attached to a cgroup is bound
      to the lifetime of the cgroup itself. This means that if a user
      forgets to detach a bpf program (or intentionally avoids doing so)
      before removing the cgroup, it will stay attached up to the release of
      the cgroup. Since the cgroup can stay in the dying state (the state
      between being rmdir()'ed and being released) for a very long time,
      this leads to a waste of memory. It also blocks the possibility of
      implementing memcg-based memory accounting for bpf objects, because a
      circular reference dependency would occur: charged memory pages pin
      the corresponding memory cgroup, and if the memory cgroup pins the
      attached bpf program, nothing will ever be released.
      
      A dying cgroup cannot contain any processes, so the only chance for
      an attached bpf program to be executed is a live socket associated
      with the cgroup. So in order to release all bpf data early, let's
      count associated sockets using a new percpu refcounter. On cgroup
      removal the counter is transitioned to the atomic mode, and as soon
      as it reaches 0, all bpf programs are detached.
      
      Because cgroup_bpf_release() can block, it can't be called from
      the percpu ref counter callback directly, so an asynchronous
      work item is scheduled instead (a sketch follows this entry).
      
      The reference counter is not socket specific and can be used for any
      other types of programs that can be executed from a cgroup-bpf hook
      outside of process context, should such a need arise in the future.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Cc: jolsa@redhat.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      4bfc0bb2
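      A hedged sketch of the scheme (field and helper names follow my
      reading of the change): the percpu ref's release callback runs in a
      context that cannot block, so it only schedules the real cleanup:

        static void cgroup_bpf_release_fn(struct percpu_ref *ref)
        {
                struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

                INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
                queue_work(system_wq, &cgrp->bpf.release_work);
        }

        static int cgroup_bpf_inherit(struct cgroup *cgrp)
        {
                /* Starts in fast percpu mode; rmdir kills the ref, which
                 * switches it to atomic mode and fires the callback at 0. */
                return percpu_ref_init(&cgrp->bpf.refcnt, cgroup_bpf_release_fn,
                                       0, GFP_KERNEL);
        }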
  26. 06 May 2019, 1 commit
    • cgroup: get rid of cgroup_freezer_frozen_exit() · 96b9c592
      Committed by Roman Gushchin
      A task should never enter the exit path with the task->frozen bit set.
      Any frozen task must enter the signal handling loop, and the only
      way to escape is through cgroup_leave_frozen(true), which
      unconditionally drops the task->frozen bit. This means that
      cgroup_freezer_frozen_exit() has zero chance of being called and
      has to be removed.
      
      Let's put a WARN_ON_ONCE() instead of the cgroup_freezer_frozen_exit()
      call to catch any potential leak of the task's frozen bit.
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      96b9c592
  27. 20 Apr 2019, 1 commit
    • cgroup: cgroup v2 freezer · 76f969e8
      Committed by Roman Gushchin
      Cgroup v1 implements the freezer controller, which provides the
      ability to stop the workload in a cgroup and temporarily free up some
      resources (cpu, io, network bandwidth and, potentially, memory)
      for some other tasks. Cgroup v2 lacks this functionality.
      
      This patch implements freezer for cgroup v2.
      
      Cgroup v2 freezer tries to put tasks into a state similar to jobctl
      stop. This means that tasks can be killed, ptraced (using
      PTRACE_SEIZE*), and interrupted. It is possible to attach to
      a frozen task, get some information (e.g. read registers) and detach.
      It's also possible to migrate a frozen task to another cgroup.
      
      This distinguishes the cgroup v2 freezer from the cgroup v1 freezer,
      which mostly tried to imitate the system-wide freezer. While
      uninterruptible sleep is fine when all tasks are going to be frozen
      (the hibernation case), it is not an acceptable state for only a
      subset of the system.
      
      The cgroup v2 freezer does not support freezing kthreads.
      If a non-root cgroup contains a kthread, the cgroup can still be
      frozen, but the kthread will remain running, the cgroup will be shown
      as non-frozen, and the notification will not be delivered.
      
      * PTRACE_ATTACH does not work because non-fatal signal delivery
      is blocked in the frozen state.
      
      There are also some interface differences between the cgroup v1 and
      cgroup v2 freezers, required to conform to the cgroup v2 interface
      design principles:
      1) There is no separate controller, which has to be turned on:
      the functionality is always available and is represented by
      cgroup.freeze and cgroup.events cgroup control files.
      2) The desired state is defined by the cgroup.freeze control file.
      Any hierarchical configuration is allowed.
      3) The interface is asynchronous. The actual state is available
      using cgroup.events control file ("frozen" field). There are no
      dedicated transitional states.
      4) It's allowed to make any changes to the cgroup hierarchy
      (create new cgroups, remove old cgroups, move tasks between cgroups)
      regardless of whether some cgroups are frozen. (A userspace usage
      sketch follows this entry.)
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
      Cc: kernel-team@fb.com
      76f969e8
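      A minimal userspace sketch of the asynchronous interface described
      above, assuming a pre-created cgroup /sys/fs/cgroup/mycg and
      sufficient privileges: write "1" to cgroup.freeze, then poll
      cgroup.events for the "frozen 1" field:

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("/sys/fs/cgroup/mycg/cgroup.freeze", O_WRONLY);
                if (fd < 0 || write(fd, "1", 1) != 1)
                        return 1;       /* request the frozen state */
                close(fd);

                char buf[256];
                for (;;) {
                        fd = open("/sys/fs/cgroup/mycg/cgroup.events", O_RDONLY);
                        ssize_t n = read(fd, buf, sizeof(buf) - 1);
                        close(fd);
                        if (n <= 0)
                                return 1;
                        buf[n] = '\0';
                        if (strstr(buf, "frozen 1"))
                                break;          /* actual state reached */
                        usleep(100 * 1000);     /* interface is async */
                }
                puts("cgroup is frozen");
                return 0;
        }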
  28. 31 Jan 2019, 1 commit
    • cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting · 51bee5ab
      Committed by Oleg Nesterov
      The only user of the cgroup_subsys->free() callback is
      pids_cgrp_subsys, which needs pids_free() to uncharge the pid.
      
      However, ->free() is called from __put_task_struct()->cgroup_free() and this
      is too late. Even the trivial program which does
      
      	for (;;) {
      		int pid = fork();
      		assert(pid >= 0);
      		if (pid)
      			wait(NULL);
      		else
      			exit(0);
      	}
      
      can hit its pids limit, because release_task()->call_rcu(delayed_put_task_struct)
      implies an RCU grace period after the task/pid goes away and before
      the final put().
      
      Test-case:
      
      	mkdir -p /tmp/CG
      	mount -t cgroup2 none /tmp/CG
      	echo '+pids' > /tmp/CG/cgroup.subtree_control
      
      	mkdir /tmp/CG/PID
      	echo 2 > /tmp/CG/PID/pids.max
      
      	perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
      	echo $! > /tmp/CG/PID/cgroup.procs
      
      Without this patch the forking process fails soon after migration.
      
      Rename cgroup_subsys->free() to cgroup_subsys->release() and move the
      callsite into the new helper, cgroup_release(), called by
      release_task(), which actually frees the pid(s). (A sketch follows
      this entry.)
      Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      51bee5ab
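      A hedged sketch of the new hook placement (mirroring the helper as I
      read the change): release_task(), which actually frees the pid, now
      calls cgroup_release(), which invokes the renamed callback, so
      pids_release() uncharges at the right time:

        void cgroup_release(struct task_struct *task)
        {
                struct cgroup_subsys *ss;
                int ssid;

                /* Invoke ->release() for every subsystem providing one;
                 * for pids this uncharges before the RCU grace period. */
                do_each_subsys_mask(ss, ssid, have_release_callback) {
                        ss->release(task);
                } while_each_subsys_mask();
        }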
  29. 08 Dec 2018, 1 commit
  30. 02 Nov 2018, 1 commit
  31. 27 Oct 2018, 1 commit
  32. 25 Sep 2018, 1 commit
  33. 22 Sep 2018, 1 commit