1. 21 7月, 2017 2 次提交
    • T
      cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS · bc2fb7ed
      Tejun Heo 提交于
      css_task_iter currently always walks all tasks.  With the scheduled
      cgroup v2 thread support, the iterator would need to handle multiple
      types of iteration.  As a preparation, add @flags to
      css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
      is not specified, it walks all tasks as before.  When asserted, the
      iterator only walks the group leaders.
      
      For now, the only user of the flag is cgroup v2 "cgroup.procs" file
      which no longer needs to skip non-leader tasks in cgroup_procs_next().
      Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
      v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
      cgroup" but "list all thread group id's with any threads in the
      cgroup".
      
      While at it, update cgroup_procs_show() to use task_pid_vnr() instead
      of task_tgid_vnr().  As the iteration guarantees that the function
      only sees group leaders, this doesn't change the output and will allow
      sharing the function for thread iteration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      bc2fb7ed
    • T
      cgroup: reorganize cgroup.procs / task write path · 715c809d
      Tejun Heo 提交于
      Currently, writes "cgroup.procs" and "cgroup.tasks" files are all
      handled by __cgroup_procs_write() on both v1 and v2.  This patch
      reoragnizes the write path so that there are common helper functions
      that different write paths use.
      
      While this somewhat increases LOC, the different paths are no longer
      intertwined and each path has more flexibility to implement different
      behaviors which will be necessary for the planned v2 thread support.
      
      v3: - Restructured so that cgroup_procs_write_permission() takes
            @src_cgrp and @dst_cgrp.
      
      v2: - Rolled in Waiman's task reference count fix.
          - Updated on top of nsdelegate changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      715c809d
  2. 17 7月, 2017 3 次提交
  3. 29 6月, 2017 2 次提交
    • T
      cgroup: implement "nsdelegate" mount option · 5136f636
      Tejun Heo 提交于
      Currently, cgroup only supports delegation to !root users and cgroup
      namespaces don't get any special treatments.  This limits the
      usefulness of cgroup namespaces as they by themselves can't be safe
      delegation boundaries.  A process inside a cgroup can change the
      resource control knobs of the parent in the namespace root and may
      move processes in and out of the namespace if cgroups outside its
      namespace are visible somehow.
      
      This patch adds a new mount option "nsdelegate" which makes cgroup
      namespaces delegation boundaries.  If set, cgroup behaves as if write
      permission based delegation took place at namespace boundaries -
      writes to the resource control knobs from the namespace root are
      denied and migration crossing the namespace boundary aren't allowed
      from inside the namespace.
      
      This allows cgroup namespace to function as a delegation boundary by
      itself.
      
      v2: Silently ignore nsdelegate specified on !init mounts.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Aravind Anbudurai <aru7@fb.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      5136f636
    • T
      cgroup: restructure cgroup_procs_write_permission() · 824ecbe0
      Tejun Heo 提交于
      Restructure cgroup_procs_write_permission() to make extending
      permission logic easier.
      
      This patch doesn't cause any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      824ecbe0
  4. 15 6月, 2017 1 次提交
    • W
      cgroup: Keep accurate count of tasks in each css_set · 73a7242a
      Waiman Long 提交于
      The reference count in the css_set data structure was used as a
      proxy of the number of tasks attached to that css_set. However, that
      count is actually not an accurate measure especially with thread mode
      support. So a new variable nr_tasks is added to the css_set to keep
      track of the actual task count. This new variable is protected by
      the css_set_lock. Functions that require the actual task count are
      updated to use the new variable.
      
      tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps.
          Refreshed on top of cgroup/for-v4.13 which dropped on
          css_set_populated() -> nr_tasks conversion.
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      73a7242a
  5. 18 5月, 2017 1 次提交
  6. 02 5月, 2017 1 次提交
  7. 29 4月, 2017 2 次提交
    • Z
      cgroup: avoid attaching a cgroup root to two different superblocks, take 2 · 9732adc5
      Zefan Li 提交于
      Commit bfb0b80d ("cgroup: avoid attaching a cgroup root to two
      different superblocks") is broken.  Now we try to fix the race by
      delaying the initialization of cgroup root refcnt until a superblock
      has been allocated.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Reported-by: NAndrei Vagin <avagin@virtuozzo.com>
      Tested-by: NAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: NZefan Li <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      9732adc5
    • T
      cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc() · a590b90d
      Tejun Heo 提交于
      cgroup_get() expected to be called only on live cgroups and triggers
      warning on a dead cgroup; however, cgroup_sk_alloc() may be called
      while cloning a socket which is left in an empty and removed cgroup
      and thus may legitimately duplicate its reference on a dead cgroup.
      This currently triggers the following warning spuriously.
      
       WARNING: CPU: 14 PID: 0 at kernel/cgroup.c:490 cgroup_get+0x55/0x60
       ...
        [<ffffffff8107e123>] __warn+0xd3/0xf0
        [<ffffffff8107e20e>] warn_slowpath_null+0x1e/0x20
        [<ffffffff810ff465>] cgroup_get+0x55/0x60
        [<ffffffff81106061>] cgroup_sk_alloc+0x51/0xe0
        [<ffffffff81761beb>] sk_clone_lock+0x2db/0x390
        [<ffffffff817cce06>] inet_csk_clone_lock+0x16/0xc0
        [<ffffffff817e8173>] tcp_create_openreq_child+0x23/0x4b0
        [<ffffffff818601a1>] tcp_v6_syn_recv_sock+0x91/0x670
        [<ffffffff817e8b16>] tcp_check_req+0x3a6/0x4e0
        [<ffffffff81861ba3>] tcp_v6_rcv+0x693/0xa00
        [<ffffffff81837429>] ip6_input_finish+0x59/0x3e0
        [<ffffffff81837cb2>] ip6_input+0x32/0xb0
        [<ffffffff81837387>] ip6_rcv_finish+0x57/0xa0
        [<ffffffff81837ac8>] ipv6_rcv+0x318/0x4d0
        [<ffffffff817778c7>] __netif_receive_skb_core+0x2d7/0x9a0
        [<ffffffff81777fa6>] __netif_receive_skb+0x16/0x70
        [<ffffffff81778023>] netif_receive_skb_internal+0x23/0x80
        [<ffffffff817787d8>] napi_gro_frags+0x208/0x270
        [<ffffffff8168a9ec>] mlx4_en_process_rx_cq+0x74c/0xf40
        [<ffffffff8168b270>] mlx4_en_poll_rx_cq+0x30/0x90
        [<ffffffff81778b30>] net_rx_action+0x210/0x350
        [<ffffffff8188c426>] __do_softirq+0x106/0x2c7
        [<ffffffff81082bad>] irq_exit+0x9d/0xa0 [<ffffffff8188c0e4>] do_IRQ+0x54/0xd0
        [<ffffffff8188a63f>] common_interrupt+0x7f/0x7f <EOI>
        [<ffffffff8173d7e7>] cpuidle_enter+0x17/0x20
        [<ffffffff810bdfd9>] cpu_startup_entry+0x2a9/0x2f0
        [<ffffffff8103edd1>] start_secondary+0xf1/0x100
      
      This patch renames the existing cgroup_get() with the dead cgroup
      warning to cgroup_get_live() after cgroup_kn_lock_live() and
      introduces the new cgroup_get() which doesn't check whether the cgroup
      is live or dead.
      
      All existing cgroup_get() users except for cgroup_sk_alloc() are
      converted to use cgroup_get_live().
      
      Fixes: d979a39d ("cgroup: duplicate cgroup reference when cloning sockets")
      Cc: stable@vger.kernel.org # v4.5+
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: NChris Mason <clm@fb.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      a590b90d
  8. 17 3月, 2017 1 次提交
    • T
      cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups · 77f88796
      Tejun Heo 提交于
      Creation of a kthread goes through a couple interlocked stages between
      the kthread itself and its creator.  Once the new kthread starts
      running, it initializes itself and wakes up the creator.  The creator
      then can further configure the kthread and then let it start doing its
      job by waking it up.
      
      In this configuration-by-creator stage, the creator is the only one
      that can wake it up but the kthread is visible to userland.  When
      altering the kthread's attributes from userland is allowed, this is
      fine; however, for cases where CPU affinity is critical,
      kthread_bind() is used to first disable affinity changes from userland
      and then set the affinity.  This also prevents the kthread from being
      migrated into non-root cgroups as that can affect the CPU affinity and
      many other things.
      
      Unfortunately, the cgroup side of protection is racy.  While the
      PF_NO_SETAFFINITY flag prevents further migrations, userland can win
      the race before the creator sets the flag with kthread_bind() and put
      the kthread in a non-root cgroup, which can lead to all sorts of
      problems including incorrect CPU affinity and starvation.
      
      This bug got triggered by userland which periodically tries to migrate
      all processes in the root cpuset cgroup to a non-root one.  Per-cpu
      workqueue workers got caught while being created and ended up with
      incorrected CPU affinity breaking concurrency management and sometimes
      stalling workqueue execution.
      
      This patch adds task->no_cgroup_migration which disallows the task to
      be migrated by userland.  kthreadd starts with the flag set making
      every child kthread start in the root cgroup with migration
      disallowed.  The flag is cleared after the kthread finishes
      initialization by which time PF_NO_SETAFFINITY is set if the kthread
      should stay in the root cgroup.
      
      It'd be better to wait for the initialization instead of failing but I
      couldn't think of a way of implementing that without adding either a
      new PF flag, or sleeping and retrying from waiting side.  Even if
      userland depends on changing cgroup membership of a kthread, it either
      has to be synchronized with kthread_create() or periodically repeat,
      so it's unlikely that this would break anything.
      
      v2: Switch to a simpler implementation using a new task_struct bit
          field suggested by Oleg.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-and-debugged-by: NChris Mason <clm@fb.com>
      Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
      Signed-off-by: NTejun Heo <tj@kernel.org>
      77f88796
  9. 10 3月, 2017 1 次提交
  10. 09 3月, 2017 1 次提交
  11. 07 3月, 2017 1 次提交
  12. 02 3月, 2017 1 次提交
  13. 03 2月, 2017 1 次提交
    • T
      cgroup: drop the matching uid requirement on migration for cgroup v2 · 576dd464
      Tejun Heo 提交于
      Along with the write access to the cgroup.procs or tasks file, cgroup
      has required the writer's euid, unless root, to match [s]uid of the
      target process or task.  On cgroup v1, this is necessary because
      there's nothing preventing a delegatee from pulling in tasks or
      processes from all over the system.
      
      If a user has a cgroup subdirectory delegated to it, the user would
      have write access to the cgroup.procs or tasks file.  If there are no
      further checks than file write access check, the user would be able to
      pull processes from all over the system into its subhierarchy which is
      clearly not the intended behavior.  The matching [s]uid requirement
      partially prevents this problem by allowing a delegatee to pull in the
      processes that belongs to it.  This isn't a sufficient protection
      however, because a user would still be able to jump processes across
      two disjoint sub-hierarchies that has been delegated to them.
      
      cgroup v2 resolves the issue by requiring the writer to have access to
      the common ancestor of the cgroup.procs file of the source and target
      cgroups.  This confines each delegatee to their own sub-hierarchy
      proper and bases all permission decisions on the cgroup filesystem
      rather than having to pull in explicit uid matching.
      
      cgroup v2 has still been applying the matching [s]uid requirement just
      for historical reasons.  On cgroup2, the requirement doesn't serve any
      purpose while unnecessarily complicating the permission model.  Let's
      drop it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      576dd464
  14. 31 1月, 2017 1 次提交
    • T
      cgroup: misc cleanups · b807421a
      Tejun Heo 提交于
      * cgrp_dfl_implicit_ss_mask is ulong instead of u16 unlike other
        ss_masks.  Make it a u16.
      
      * Move have_canfork_callback together with other callback ss_masks.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b807421a
  15. 16 1月, 2017 3 次提交
    • T
      cgroup: call subsys->*attach() only for subsystems which are actually affected by migration · bfc2cf6f
      Tejun Heo 提交于
      Currently, subsys->*attach() callbacks are called for all subsystems
      which are attached to the hierarchy on which the migration is taking
      place.
      
      With cgroup_migrate_prepare_dst() filtering out identity migrations,
      v1 hierarchies can avoid spurious ->*attach() callback invocations
      where the source and destination csses are identical; however, this
      isn't enough on v2 as only a subset of the attached controllers can be
      affected on controller enable/disable.
      
      While spurious ->*attach() invocations aren't critically broken,
      they're unnecessary overhead and can lead to temporary overcharges on
      certain controllers.  Fix it by tracking which subsystems are affected
      by a migration and invoking ->*attach() callbacks only on those
      subsystems.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      bfc2cf6f
    • T
      cgroup: track migration context in cgroup_mgctx · e595cd70
      Tejun Heo 提交于
      cgroup migration is performed in four steps - css_set preloading,
      addition of target tasks, actual migration, and clean up.  A list
      named preloaded_csets is used to track the preloading.  This is a bit
      too restricted and the code is already depending on the subtlety that
      all source css_sets appear before destination ones.
      
      Let's create struct cgroup_mgctx which keeps track of everything
      during migration.  Currently, it has separate preload lists for source
      and destination csets and also embeds cgroup_taskset which is used
      during the actual migration.  This moves struct cgroup_taskset
      definition to cgroup-internal.h.
      
      This patch doesn't cause any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      e595cd70
    • T
      cgroup: cosmetic update to cgroup_taskset_add() · d8ebf519
      Tejun Heo 提交于
      cgroup_taskset_add() was using list_add_tail() when for source csets
      but list_move_tail() for destination.  As the operations are gated by
      list_empty() test, list_move_tail() is equivalent to list_add_tail()
      here.  Use list_add_tail() too for destination csets too.
      
      This doesn't cause any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      d8ebf519
  16. 28 12月, 2016 12 次提交
  17. 26 11月, 2016 1 次提交
    • D
      cgroup: add support for eBPF programs · 30070984
      Daniel Mack 提交于
      This patch adds two sets of eBPF program pointers to struct cgroup.
      One for such that are directly pinned to a cgroup, and one for such
      that are effective for it.
      
      To illustrate the logic behind that, assume the following example
      cgroup hierarchy.
      
        A - B - C
              \ D - E
      
      If only B has a program attached, it will be effective for B, C, D
      and E. If D then attaches a program itself, that will be effective for
      both D and E, and the program in B will only affect B and C. Only one
      program of a given type is effective for a cgroup.
      
      Attaching and detaching programs will be done through the bpf(2)
      syscall. For now, ingress and egress inet socket filtering are the
      only supported use-cases.
      Signed-off-by: NDaniel Mack <daniel@zonque.org>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30070984
  18. 29 9月, 2016 1 次提交
  19. 24 9月, 2016 1 次提交
    • T
      cgroup: fix invalid controller enable rejections with cgroup namespace · 9157056d
      Tejun Heo 提交于
      On the v2 hierarchy, "cgroup.subtree_control" rejects controller
      enables if the cgroup has processes in it.  The enforcement of this
      logic assumes that the cgroup wouldn't have any css_sets associated
      with it if there are no tasks in the cgroup, which is no longer true
      since a79a908f ("cgroup: introduce cgroup namespaces").
      
      When a cgroup namespace is created, it pins the css_set of the
      creating task to use it as the root css_set of the namespace.  This
      extra reference stays as long as the namespace is around and makes
      "cgroup.subtree_control" think that the namespace root cgroup is not
      empty even when it is and thus reject controller enables.
      
      Fix it by making cgroup_subtree_control() walk and test emptiness of
      each css_set instead of testing whether the list_head is empty.
      
      While at it, update the comment of cgroup_task_count() to indicate
      that the returned value may be higher than the number of tasks, which
      has always been true due to temporary references and doesn't break
      anything.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NEvgeny Vereshchagin <evvers@ya.ru>
      Cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Cc: Aditya Kali <adityakali@google.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: stable@vger.kernel.org # v4.6+
      Fixes: a79a908f ("cgroup: introduce cgroup namespaces")
      Link: https://github.com/systemd/systemd/pull/3589#issuecomment-249089541
      9157056d
  20. 23 9月, 2016 2 次提交
  21. 20 9月, 2016 1 次提交