1. 27 7月, 2016 2 次提交
  2. 20 7月, 2016 1 次提交
  3. 15 7月, 2016 3 次提交
    • E
      cgroupns: Only allow creation of hierarchies in the initial cgroup namespace · 726a4994
      Eric W. Biederman 提交于
      Unprivileged users can't use hierarchies if they create them as they do not
      have privilieges to the root directory.
      
      Which means the only thing a hiearchy created by an unprivileged user
      is good for is expanding the number of cgroup links in every css_set,
      which is a DOS attack.
      
      We could allow hierarchies to be created in namespaces in the initial
      user namespace.  Unfortunately there is only a single namespace for
      the names of heirarchies, so that is likely to create more confusion
      than not.
      
      So do the simple thing and restrict hiearchy creation to the initial
      cgroup namespace.
      
      Cc: stable@vger.kernel.org
      Fixes: a79a908f ("cgroup: introduce cgroup namespaces")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      726a4994
    • E
      cgroupns: Close race between cgroup_post_fork and copy_cgroup_ns · eedd0f4c
      Eric W. Biederman 提交于
      In most code paths involving cgroup migration cgroup_threadgroup_rwsem
      is taken.  There are two exceptions:
      
      - remove_tasks_in_empty_cpuset calls cgroup_transfer_tasks
      - vhost_attach_cgroups_work calls cgroup_attach_task_all
      
      With cgroup_threadgroup_rwsem held it is guaranteed that cgroup_post_fork
      and copy_cgroup_ns will reference the same css_set from the process calling
      fork.
      
      Without such an interlock there process after fork could reference one
      css_set from it's new cgroup namespace and another css_set from
      task->cgroups, which semantically is nonsensical.
      
      Cc: stable@vger.kernel.org
      Fixes: a79a908f ("cgroup: introduce cgroup namespaces")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      eedd0f4c
    • E
      cgroupns: Fix the locking in copy_cgroup_ns · 7bd88308
      Eric W. Biederman 提交于
      If "clone(CLONE_NEWCGROUP...)" is called it results in a nice lockdep
      valid splat.
      
      In __cgroup_proc_write the lock ordering is:
           cgroup_mutex -- through cgroup_kn_lock_live
           cgroup_threadgroup_rwsem
      
      In copy_process the guts of clone the lock ordering is:
           cgroup_threadgroup_rwsem -- through threadgroup_change_begin
           cgroup_mutex -- through copy_namespaces -- copy_cgroup_ns
      
      lockdep reports some a different call chains for the first ordering of
      cgroup_mutex and cgroup_threadgroup_rwsem but it is harder to trace.
      This is most definitely deadlock potential under the right
      circumstances.
      
      Fix this by by skipping the cgroup_mutex and making the locking in
      copy_cgroup_ns mirror the locking in cgroup_post_fork which also runs
      during fork under the cgroup_threadgroup_rwsem.
      
      Cc: stable@vger.kernel.org
      Fixes: a79a908f ("cgroup: introduce cgroup namespaces")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      7bd88308
  4. 02 7月, 2016 1 次提交
  5. 24 6月, 2016 1 次提交
    • D
      cgroup: Disable IRQs while holding css_set_lock · 82d6489d
      Daniel Bristot de Oliveira 提交于
      While testing the deadline scheduler + cgroup setup I hit this
      warning.
      
      [  132.612935] ------------[ cut here ]------------
      [  132.612951] WARNING: CPU: 5 PID: 0 at kernel/softirq.c:150 __local_bh_enable_ip+0x6b/0x80
      [  132.612952] Modules linked in: (a ton of modules...)
      [  132.612981] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 4.7.0-rc2 #2
      [  132.612981] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
      [  132.612982]  0000000000000086 45c8bb5effdd088b ffff88013fd43da0 ffffffff813d229e
      [  132.612984]  0000000000000000 0000000000000000 ffff88013fd43de0 ffffffff810a652b
      [  132.612985]  00000096811387b5 0000000000000200 ffff8800bab29d80 ffff880034c54c00
      [  132.612986] Call Trace:
      [  132.612987]  <IRQ>  [<ffffffff813d229e>] dump_stack+0x63/0x85
      [  132.612994]  [<ffffffff810a652b>] __warn+0xcb/0xf0
      [  132.612997]  [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170
      [  132.612999]  [<ffffffff810a665d>] warn_slowpath_null+0x1d/0x20
      [  132.613000]  [<ffffffff810aba5b>] __local_bh_enable_ip+0x6b/0x80
      [  132.613008]  [<ffffffff817d6c8a>] _raw_write_unlock_bh+0x1a/0x20
      [  132.613010]  [<ffffffff817d6c9e>] _raw_spin_unlock_bh+0xe/0x10
      [  132.613015]  [<ffffffff811388ac>] put_css_set+0x5c/0x60
      [  132.613016]  [<ffffffff8113dc7f>] cgroup_free+0x7f/0xa0
      [  132.613017]  [<ffffffff810a3912>] __put_task_struct+0x42/0x140
      [  132.613018]  [<ffffffff810e776a>] dl_task_timer+0xca/0x250
      [  132.613027]  [<ffffffff810e76a0>] ? push_dl_task.part.32+0x170/0x170
      [  132.613030]  [<ffffffff8111371e>] __hrtimer_run_queues+0xee/0x270
      [  132.613031]  [<ffffffff81113ec8>] hrtimer_interrupt+0xa8/0x190
      [  132.613034]  [<ffffffff81051a58>] local_apic_timer_interrupt+0x38/0x60
      [  132.613035]  [<ffffffff817d9b0d>] smp_apic_timer_interrupt+0x3d/0x50
      [  132.613037]  [<ffffffff817d7c5c>] apic_timer_interrupt+0x8c/0xa0
      [  132.613038]  <EOI>  [<ffffffff81063466>] ? native_safe_halt+0x6/0x10
      [  132.613043]  [<ffffffff81037a4e>] default_idle+0x1e/0xd0
      [  132.613044]  [<ffffffff810381cf>] arch_cpu_idle+0xf/0x20
      [  132.613046]  [<ffffffff810e8fda>] default_idle_call+0x2a/0x40
      [  132.613047]  [<ffffffff810e92d7>] cpu_startup_entry+0x2e7/0x340
      [  132.613048]  [<ffffffff81050235>] start_secondary+0x155/0x190
      [  132.613049] ---[ end trace f91934d162ce9977 ]---
      
      The warn is the spin_(lock|unlock)_bh(&css_set_lock) in the interrupt
      context. Converting the spin_lock_bh to spin_lock_irq(save) to avoid
      this problem - and other problems of sharing a spinlock with an
      interrupt.
      
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: cgroups@vger.kernel.org
      Cc: stable@vger.kernel.org # 4.5+
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: N"Luis Claudio R. Goncalves" <lgoncalv@redhat.com>
      Signed-off-by: NDaniel Bristot de Oliveira <bristot@redhat.com>
      Acked-by: NZefan Li <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      82d6489d
  6. 22 6月, 2016 1 次提交
    • T
      cgroup: allow NULL return from ss->css_alloc() · e7e15b87
      Tejun Heo 提交于
      cgroup core expected css_alloc to return an ERR_PTR value on failure
      and caused NULL deref if it returned NULL.  It's an easy mistake to
      make from an alloc function and there's no ambiguity in what's being
      indicated.  Update css_create() so that it interprets NULL return from
      css_alloc as -ENOMEM.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      e7e15b87
  7. 18 6月, 2016 2 次提交
  8. 17 6月, 2016 1 次提交
    • T
      cgroup: set css->id to -1 during init · 8fa3b8d6
      Tejun Heo 提交于
      If percpu_ref initialization fails during css_create(), the free path
      can end up trying to free css->id of zero.  As ID 0 is unused, it
      doesn't cause a critical breakage but it does trigger a warning
      message.  Fix it by setting css->id to -1 from init_and_link_css().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Wenwei Tao <ww.tao0320@gmail.com>
      Fixes: 01e58659 ("cgroup: release css->id after css_free")
      Cc: stable@vger.kernel.org # v4.0+
      Signed-off-by: NTejun Heo <tj@kernel.org>
      8fa3b8d6
  9. 27 5月, 2016 1 次提交
    • W
      cgroup: remove redundant cleanup in css_create · b00c52da
      Wenwei Tao 提交于
      When create css failed, before call css_free_rcu_fn, we remove the css
      id and exit the percpu_ref, but we will do these again in
      css_free_work_fn, so they are redundant.  Especially the css id, that
      would cause problem if we remove it twice, since it may be assigned to
      another css after the first remove.
      
      tj: This was broken by two commits updating the free path without
          synchronizing the creation failure path.  This can be easily
          triggered by trying to create more than 64k memory cgroups.
      Signed-off-by: NWenwei Tao <ww.tao0320@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Fixes: 9a1049da ("percpu-refcount: require percpu_ref to be exited explicitly")
      Fixes: 01e58659 ("cgroup: release css->id after css_free")
      Cc: stable@vger.kernel.org # v3.17+
      b00c52da
  10. 12 5月, 2016 1 次提交
  11. 10 5月, 2016 1 次提交
    • S
      cgroup, kernfs: make mountinfo show properly scoped path for cgroup namespaces · 4f41fc59
      Serge E. Hallyn 提交于
      Patch summary:
      
      When showing a cgroupfs entry in mountinfo, show the path of the mount
      root dentry relative to the reader's cgroup namespace root.
      
      Short explanation (courtesy of mkerrisk):
      
      If we create a new cgroup namespace, then we want both /proc/self/cgroup
      and /proc/self/mountinfo to show cgroup paths that are correctly
      virtualized with respect to the cgroup mount point.  Previous to this
      patch, /proc/self/cgroup shows the right info, but /proc/self/mountinfo
      does not.
      
      Long version:
      
      When a uid 0 task which is in freezer cgroup /a/b, unshares a new cgroup
      namespace, and then mounts a new instance of the freezer cgroup, the new
      mount will be rooted at /a/b.  The root dentry field of the mountinfo
      entry will show '/a/b'.
      
       cat > /tmp/do1 << EOF
       mount -t cgroup -o freezer freezer /mnt
       grep freezer /proc/self/mountinfo
       EOF
      
       unshare -Gm  bash /tmp/do1
       > 330 160 0:34 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime - cgroup cgroup rw,freezer
       > 355 133 0:34 /a/b /mnt rw,relatime - cgroup freezer rw,freezer
      
      The task's freezer cgroup entry in /proc/self/cgroup will simply show
      '/':
      
       grep freezer /proc/self/cgroup
       9:freezer:/
      
      If instead the same task simply bind mounts the /a/b cgroup directory,
      the resulting mountinfo entry will again show /a/b for the dentry root.
      However in this case the task will find its own cgroup at /mnt/a/b,
      not at /mnt:
      
       mount --bind /sys/fs/cgroup/freezer/a/b /mnt
       130 25 0:34 /a/b /mnt rw,nosuid,nodev,noexec,relatime shared:21 - cgroup cgroup rw,freezer
      
      In other words, there is no way for the task to know, based on what is
      in mountinfo, which cgroup directory is its own.
      
      Example (by mkerrisk):
      
      First, a little script to save some typing and verbiage:
      
      echo -e "\t/proc/self/cgroup:\t$(cat /proc/self/cgroup | grep freezer)"
      cat /proc/self/mountinfo | grep freezer |
              awk '{print "\tmountinfo:\t\t" $4 "\t" $5}'
      
      Create cgroup, place this shell into the cgroup, and look at the state
      of the /proc files:
      
      2653
      2653                         # Our shell
      14254                        # cat(1)
              /proc/self/cgroup:      10:freezer:/a/b
              mountinfo:              /       /sys/fs/cgroup/freezer
      
      Create a shell in new cgroup and mount namespaces. The act of creating
      a new cgroup namespace causes the process's current cgroups directories
      to become its cgroup root directories. (Here, I'm using my own version
      of the "unshare" utility, which takes the same options as the util-linux
      version):
      
      Look at the state of the /proc files:
      
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /       /sys/fs/cgroup/freezer
      
      The third entry in /proc/self/cgroup (the pathname of the cgroup inside
      the hierarchy) is correctly virtualized w.r.t. the cgroup namespace, which
      is rooted at /a/b in the outer namespace.
      
      However, the info in /proc/self/mountinfo is not for this cgroup
      namespace, since we are seeing a duplicate of the mount from the
      old mount namespace, and the info there does not correspond to the
      new cgroup namespace. However, trying to create a new mount still
      doesn't show us the right information in mountinfo:
      
                                            # propagating to other mountns
              /proc/self/cgroup:      7:freezer:/
              mountinfo:              /a/b    /mnt/freezer
      
      The act of creating a new cgroup namespace caused the process's
      current freezer directory, "/a/b", to become its cgroup freezer root
      directory. In other words, the pathname directory of the directory
      within the newly mounted cgroup filesystem should be "/",
      but mountinfo wrongly shows us "/a/b". The consequence of this is
      that the process in the cgroup namespace cannot correctly construct
      the pathname of its cgroup root directory from the information in
      /proc/PID/mountinfo.
      
      With this patch, the dentry root field in mountinfo is shown relative
      to the reader's cgroup namespace.  So the same steps as above:
      
              /proc/self/cgroup:      10:freezer:/a/b
              mountinfo:              /       /sys/fs/cgroup/freezer
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /../..  /sys/fs/cgroup/freezer
              /proc/self/cgroup:      10:freezer:/
              mountinfo:              /       /mnt/freezer
      
      cgroup.clone_children  freezer.parent_freezing  freezer.state      tasks
      cgroup.procs           freezer.self_freezing    notify_on_release
      3164
      2653                   # First shell that placed in this cgroup
      3164                   # Shell started by 'unshare'
      14197                  # cat(1)
      Signed-off-by: NSerge Hallyn <serge.hallyn@ubuntu.com>
      Tested-by: NMichael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: NMichael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      4f41fc59
  12. 26 4月, 2016 1 次提交
    • T
      cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback · 5cf1cacb
      Tejun Heo 提交于
      Since e93ad19d ("cpuset: make mm migration asynchronous"), cpuset
      kicks off asynchronous NUMA node migration if necessary during task
      migration and flushes it from cpuset_post_attach_flush() which is
      called at the end of __cgroup_procs_write().  This is to avoid
      performing migration with cgroup_threadgroup_rwsem write-locked which
      can lead to deadlock through dependency on kworker creation.
      
      memcg has a similar issue with charge moving, so let's convert it to
      an official callback rather than the current one-off cpuset specific
      function.  This patch adds cgroup_subsys->post_attach callback and
      makes cpuset register cpuset_post_attach_flush() as its ->post_attach.
      
      The conversion is mostly one-to-one except that the new callback is
      called under cgroup_mutex.  This is to guarantee that no other
      migration operations are started before ->post_attach callbacks are
      finished.  cgroup_mutex is one of the outermost mutex in the system
      and has never been and shouldn't be a problem.  We can add specialized
      synchronization around __cgroup_procs_write() but I don't think
      there's any noticeable benefit.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
      5cf1cacb
  13. 17 3月, 2016 2 次提交
    • A
      cgroup: avoid false positive gcc-6 warning · cfe02a8a
      Arnd Bergmann 提交于
      When all subsystems are disabled, gcc notices that cgroup_subsys_enabled_key
      is a zero-length array and that any access to it must be out of bounds:
      
      In file included from ../include/linux/cgroup.h:19:0,
                       from ../kernel/cgroup.c:31:
      ../kernel/cgroup.c: In function 'cgroup_add_cftypes':
      ../kernel/cgroup.c:261:53: error: array subscript is above array bounds [-Werror=array-bounds]
        return static_key_enabled(cgroup_subsys_enabled_key[ssid]);
                                  ~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~
      ../include/linux/jump_label.h:271:40: note: in definition of macro 'static_key_enabled'
        static_key_count((struct static_key *)x) > 0;    \
                                              ^
      
      We should never call the function in this particular case, so this is
      not a bug. In order to silence the warning, this adds an explicit check
      for the CGROUP_SUBSYS_COUNT==0 case.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      cfe02a8a
    • T
      cgroup: ignore css_sets associated with dead cgroups during migration · 2b021cbf
      Tejun Heo 提交于
      Before 2e91fa7f ("cgroup: keep zombies associated with their
      original cgroups"), all dead tasks were associated with init_css_set.
      If a zombie task is requested for migration, while migration prep
      operations would still be performed on init_css_set, the actual
      migration would ignore zombie tasks.  As init_css_set is always valid,
      this worked fine.
      
      However, after 2e91fa7f, zombie tasks stay with the css_set it was
      associated with at the time of death.  Let's say a task T associated
      with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2.  After T
      becomes a zombie, it would still remain associated with A and B.  If A
      only contains zombie tasks, it can be removed.  On removal, A gets
      marked offline but stays pinned until all zombies are drained.  At
      this point, if migration is initiated on T to a cgroup C on hierarchy
      H-2, migration path would try to prepare T's css_set for migration and
      trigger the following.
      
       WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
       CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
       ...
       Call Trace:
        [<ffffffff8127e63c>] dump_stack+0x4e/0x82
        [<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
        [<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
        [<ffffffff810c33e1>] cgroup_get+0x121/0x160
        [<ffffffff810c349b>] link_css_set+0x7b/0x90
        [<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
        [<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
        [<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
        [<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
        [<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
        [<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
        [<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
        [<ffffffff81151673>] __vfs_write+0x23/0xe0
        [<ffffffff81152494>] vfs_write+0xa4/0x1a0
        [<ffffffff811532d4>] SyS_write+0x44/0xa0
        [<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      It doesn't make sense to prepare migration for css_sets pointing to
      dead cgroups as they are guaranteed to contain only zombies which are
      ignored later during migration.  This patch makes cgroup destruction
      path mark all affected css_sets as dead and updates the migration path
      to ignore them during preparation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Fixes: 2e91fa7f ("cgroup: keep zombies associated with their original cgroups")
      Cc: stable@vger.kernel.org # v4.4+
      2b021cbf
  14. 09 3月, 2016 5 次提交
    • T
      cgroup: implement cgroup_subsys->implicit_on_dfl · f6d635ad
      Tejun Heo 提交于
      Some controllers, perf_event for now and possibly freezer in the
      future, don't really make sense to control explicitly through
      "cgroup.subtree_control".  For example, the primary role of perf_event
      is identifying the cgroups of tasks; however, because the controller
      also keeps a small amount of state per cgroup, it can't be replaced
      with simple cgroup membership tests.
      
      This patch implements cgroup_subsys->implicit_on_dfl flag.  When set,
      the controller is implicitly enabled on all cgroups on the v2
      hierarchy so that utility type controllers such as perf_event can be
      enabled and function transparently.
      
      An implicit controller doesn't show up in "cgroup.controllers" or
      "cgroup.subtree_control", is exempt from no internal process rule and
      can be stolen from the default hierarchy even if there are non-root
      csses.
      
      v2: Reimplemented on top of the recent updates to css handling and
          subsystem rebinding.  Rebinding implicit subsystems is now a
          simple matter of exempting it from the busy subsystem check.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      f6d635ad
    • T
      cgroup: use css_set->mg_dst_cgrp for the migration target cgroup · e4857982
      Tejun Heo 提交于
      Migration can be multi-target on the default hierarchy when a
      controller is enabled - processes belonging to each child cgroup have
      to be moved to the child cgroup itself to refresh css association.
      
      This isn't a problem for cgroup_migrate_add_src() as each source
      css_set still maps to single source and target cgroups; however,
      cgroup_migrate_prepare_dst() is called once after all source css_sets
      are added and thus might not have a single destination cgroup.  This
      is currently worked around by specifying NULL for @dst_cgrp and using
      the source's default cgroup as destination as the only multi-target
      migration in use is self-targetting.  While this works, it's subtle
      and clunky.
      
      As all taget cgroups are already specified while preparing the source
      css_sets, this clunkiness can easily be removed by recording the
      target cgroup in each source css_set.  This patch adds
      css_set->mg_dst_cgrp which is recorded on cgroup_migrate_src() and
      used by cgroup_migrate_prepare_dst().  This also makes migration code
      ready for arbitrary multi-target migration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      e4857982
    • T
      cgroup: make cgroup[_taskset]_migrate() take cgroup_root instead of cgroup · 37ff9f8f
      Tejun Heo 提交于
      On the default hierarchy, a migration can be multi-source and/or
      multi-destination.  cgroup_taskest_migrate() used to incorrectly
      assume single destination cgroup but the bug has been fixed by
      1f7dd3e5 ("cgroup: fix handling of multi-destination migration
      from subtree_control enabling").
      
      Since the commit, @dst_cgrp to cgroup[_taskset]_migrate() is only used
      to determine which subsystems are affected or which cgroup_root the
      migration is taking place in.  As such, @dst_cgrp is misleading.  This
      patch replaces @dst_cgrp with @root.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      37ff9f8f
    • T
      cgroup: move migration destination verification out of cgroup_migrate_prepare_dst() · 6c694c88
      Tejun Heo 提交于
      cgroup_migrate_prepare_dst() verifies whether the destination cgroup
      is allowable; however, the test doesn't really belong there.  It's too
      deep and common in the stack and as a result the test itself is gated
      by another test.
      
      Separate the test out into cgroup_may_migrate_to() and update
      cgroup_attach_task() and cgroup_transfer_tasks() to perform the test
      directly.  This doesn't cause any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      6c694c88
    • T
      cgroup: fix incorrect destination cgroup in cgroup_update_dfl_csses() · 58cdb1ce
      Tejun Heo 提交于
      cgroup_update_dfl_csses() should move each task in the subtree to
      self; however, it was incorrectly calling cgroup_migrate_add_src()
      with the root of the subtree as @dst_cgrp.  Fortunately,
      cgroup_migrate_add_src() currently uses @dst_cgrp only to determine
      the hierarchy and the bug doesn't cause any actual breakages.  Fix it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      58cdb1ce
  15. 03 3月, 2016 17 次提交
    • T
      cgroup: update css iteration in cgroup_update_dfl_csses() · 54962604
      Tejun Heo 提交于
      The existing sequences of operations ensure that the offlining csses
      are drained before cgroup_update_dfl_csses(), so even though
      cgroup_update_dfl_csses() uses css_for_each_descendant_pre() to walk
      the target cgroups, it doesn't end up operating on dead cgroups.
      Also, the function explicitly excludes the subtree root from
      operation.
      
      This is fragile and inconsistent with the rest of css update
      operations.  This patch updates cgroup_update_dfl_csses() to use
      cgroup_for_each_live_descendant_pre() instead and include the subtree
      root.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      54962604
    • T
      cgroup: allocate 2x cgrp_cset_links when setting up a new root · 04313591
      Tejun Heo 提交于
      During prep, cgroup_setup_root() allocates cgrp_cset_links matching
      the number of existing css_sets to later link the new root.  This is
      fine for now as the only operation which can happen inbetween is
      rebind_subsystems() and rebinding of empty subsystems doesn't create
      new css_sets.
      
      However, while not yet allowed, with the recent reimplementation,
      rebind_subsystems() can rebind subsystems with descendant csses and
      thus can create new css_sets.  This patch makes cgroup_setup_root()
      allocate 2x of the existing css_sets so that later use of live
      subsystem rebinding doesn't blow up.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      04313591
    • T
      cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask · 5ced2518
      Tejun Heo 提交于
      cgroup_calc_subtree_ss_mask() currently takes @cgrp and
      @subtree_control.  @cgrp is used for two purposes - to decide whether
      it's for default hierarchy and the mask of available subsystems.  The
      former doesn't matter as the results are the same regardless.  The
      latter can be specified directly through a subsystem mask.
      
      This patch makes cgroup_calc_subtree_ss_mask() perform the same
      calculations for both default and legacy hierarchies and take
      @this_ss_mask for available subsystems.  @cgrp is no longer used and
      dropped.  This is to allow using the function in contexts where
      available controllers can't be decided from the cgroup.
      
      v2: cgroup_refres_subtree_ss_mask() is removed by a previous patch.
          Updated accordingly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      5ced2518
    • T
      cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends · 334c3679
      Tejun Heo 提交于
      rebind_subsystem() open codes quite a bit of css and interface file
      manipulations.  It tries to be fail-safe but doesn't quite achieve it.
      It can be greatly simplified by using the new css management helpers.
      This patch reimplements rebind_subsytsems() using
      cgroup_apply_control() and friends.
      
      * The half-baked rollback on file creation failure is dropped.  It is
        an extremely cold path, failure isn't critical, and, aside from
        kernel bugs, the only reason it can fail is memory allocation
        failure which pretty much doesn't happen for small allocations.
      
      * As cgroup_apply_control_disable() is now used to clean up root
        cgroup on rebind, make sure that it doesn't end up killing root
        csses.
      
      * All callers of rebind_subsystems() are updated to use
        cgroup_lock_and_drain_offline() as the apply_control functions
        require drained subtree.
      
      * This leaves cgroup_refresh_subtree_ss_mask() without any user.
        Removed.
      
      * css_populate_dir() and css_clear_dir() no longer needs
        @cgrp_override parameter.  Dropped.
      
      * While at it, add WARN_ON() to rebind_subsystem() calls which are
        expected to always succeed just in case.
      
      While the rules visible to userland aren't changed, this
      reimplementation not only simplifies rebind_subsystems() but also
      allows it to disable and enable csses recursively.  This can be used
      to implement more flexible rebinding.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      334c3679
    • T
      cgroup: use cgroup_apply_enable_control() in cgroup creation path · 03970d3c
      Tejun Heo 提交于
      cgroup_create() manually updates control masks and creates child csses
      which cgroup_mkdir() then manually populates.  Both can be simplified
      by using cgroup_apply_enable_control() and friends.  The only catch is
      that it calls css_populate_dir() with NULL cgroup->kn during
      cgroup_create().  This is worked around by making the function noop on
      NULL kn.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      03970d3c
    • T
      cgroup: combine cgroup_mutex locking and offline css draining · 945ba199
      Tejun Heo 提交于
      cgroup_drain_offline() is used to wait for csses being offlined to
      uninstall itself from cgroup->subsys[] array so that new csses can be
      installed.  The function's only user, cgroup_subtree_control_write(),
      calls it after performing some checks and restarts the whole process
      via restart_syscall() if draining has to release cgroup_mutex to wait.
      
      This can be simplified by draining before other synchronized
      operations so that there's nothing to restart.  This patch converts
      cgroup_drain_offline() to cgroup_lock_and_drain_offline() which
      performs both locking and draining and updates cgroup_kn_lock_live()
      use it instead of cgroup_mutex() if requested.  This combined locking
      and draining operations are easier to use and less error-prone.
      
      While at it, add WARNs in control_apply functions which triggers if
      the subtree isn't properly drained.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      945ba199
    • T
      cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write() · f7b2814b
      Tejun Heo 提交于
      Factor out cgroup_{apply|finalize}_control() so that control mask
      update can be done in several simple steps.  This patch doesn't
      introduce behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      f7b2814b
    • T
      cgroup: introduce cgroup_{save|propagate|restore}_control() · 15a27c36
      Tejun Heo 提交于
      While controllers are being enabled and disabled in
      cgroup_subtree_control_write(), the original subsystem masks are
      stashed in local variables so that they can be restored if the
      operation fails in the middle.
      
      This patch adds dedicated fields to struct cgroup to be used instead
      of the local variables and implements functions to stash the current
      values, propagate the changes and restore them recursively.  Combined
      with the previous changes, this makes subsystem management operations
      fully recursive and modularlized.  This will be used to expand cgroup
      core functionalities.
      
      While at it, remove now unused @css_enable and @css_disable from
      cgroup_subtree_control_write().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      15a27c36
    • T
      cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive · ce3f1d9d
      Tejun Heo 提交于
      The three factored out css management operations -
      cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() -
      only depend on the current state of the target cgroups and idempotent
      and thus can be easily made to operate on the subtree instead of the
      immediate children.
      
      This patch introduces the iterators which walk live subtree and
      converts the three functions to operate on the subtree including self
      instead of the children.  While this leads to spurious walking and be
      slightly more expensive, it will allow them to be used for wider scope
      of operations.
      
      Note that cgroup_drain_offline() now tests for whether a css is dying
      before trying to drain it.  This is to avoid trying to drain live
      csses as there can be mix of live and dying csses in a subtree unlike
      children of the same parent.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      ce3f1d9d
    • T
      cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write() · bdb53bd7
      Tejun Heo 提交于
      Factor out css enabling and showing into cgroup_apply_control_enable().
      
      * Nest subsystem walk inside child walk.  The child walk will later be
        converted to subtree walk which is a bit more expensive.
      
      * Instead of operating on the differential masks @css_enable, simply
        enable or show csses according to the current cgroup_control() and
        cgroup_ss_mask().  This leads to the same result and is simpler and
        more robust.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      bdb53bd7
    • T
      cgroup: factor out cgroup_apply_control_disable() from cgroup_subtree_control_write() · 12b3bb6a
      Tejun Heo 提交于
      Factor out css disabling and hiding into cgroup_apply_control_disable().
      
      * Nest subsystem walk inside child walk.  The child walk will later be
        converted to subtree walk which is a bit more expensive.
      
      * Instead of operating on the differential masks @css_enable and
        @css_disable, simply disable or hide csses according to the current
        cgroup_control() and cgroup_ss_mask().  This leads to the same
        result and is simpler and more robust.
      
      * This allows error handling path to share the same code.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      12b3bb6a
    • T
      cgroup: factor out cgroup_drain_offline() from cgroup_subtree_control_write() · 1b9b96a1
      Tejun Heo 提交于
      Factor out async css offline draining into cgroup_drain_offline().
      
      * Nest subsystem walk inside child walk.  The child walk will later be
        converted to subtree walk which is a bit more expensive.
      
      * Relocate the draining above subsystem mask preparation, which
        doesn't create any behavior differences but helps further
        refactoring.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      1b9b96a1
    • T
      cgroup: introduce cgroup_control() and cgroup_ss_mask() · 5531dc91
      Tejun Heo 提交于
      When a controller is enabled and visible on a non-root cgroup is
      determined by subtree_control and subtree_ss_mask of the parent
      cgroup.  For a root cgroup, by the type of the hierarchy and which
      controllers are attached to it.  Deciding the above on each usage is
      fragile and unnecessarily complicates the users.
      
      This patch introduces cgroup_control() and cgroup_ss_mask() which
      calculate and return the [visibly] enabled subsyste mask for the
      specified cgroup and conver the existing usages.
      
      * cgroup_e_css() is restructured for simplicity.
      
      * cgroup_calc_subtree_ss_mask() and cgroup_subtree_control_write() no
        longer need to distinguish root and non-root cases.
      
      * With cgroup_control(), cgroup_controllers_show() can now handle both
        root and non-root cases.  cgroup_root_controllers_show() is removed.
      
      v2: cgroup_control() updated to yield the correct result on v1
          hierarchies too.  cgroup_subtree_control_write() converted.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      5531dc91
    • T
      cgroup: factor out cgroup_create() out of cgroup_mkdir() · a5bca215
      Tejun Heo 提交于
      We're in the process of refactoring cgroup and css management paths to
      separate them out to eventually allow cgroups which aren't visible
      through cgroup fs.  This patch factors out cgroup_create() out of
      cgroup_mkdir().  cgroup_create() contains all internal object creation
      and initialization.  cgroup_mkdir() uses cgroup_create() to create the
      internal cgroup and adds interface directory and file creation.
      
      This patch doesn't cause any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      a5bca215
    • T
      cgroup: reorder operations in cgroup_mkdir() · 195e9b6c
      Tejun Heo 提交于
      Currently, operations to initialize internal objects and create
      interface directory and files are intermixed in cgroup_mkdir().  We're
      in the process of refactoring cgroup and css management paths to
      separate them out to eventually allow cgroups which aren't visible
      through cgroup fs.
      
      This patch reorders operations inside cgroup_mkdir() so that interface
      directory and file handling comes after internal object
      initialization.  This will enable further refactoring.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      195e9b6c
    • T
      cgroup: explicitly track whether a cgroup_subsys_state is visible to userland · 88cb04b9
      Tejun Heo 提交于
      Currently, whether a css (cgroup_subsys_state) has its interface files
      created is not tracked and assumed to change together with the owning
      cgroup's lifecycle.  cgroup directory and interface creation is being
      separated out from internal object creation to help refactoring and
      eventually allow cgroups which are not visible through cgroupfs.
      
      This patch adds CSS_VISIBLE to track whether a css has its interface
      files created and perform management operations only when necessary
      which helps decoupling interface file handling from internal object
      lifecycle.  After this patch, all css interface file management
      functions can be called regardless of the current state and will
      achieve the expected result.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      88cb04b9
    • T
      cgroup: separate out interface file creation from css creation · 6cd0f5bb
      Tejun Heo 提交于
      Currently, interface files are created when a css is created depending
      on whether @visible is set.  This patch separates out the two into
      separate steps to help code refactoring and eventually allow cgroups
      which aren't visible through cgroup fs.
      
      Move css_populate_dir() out of create_css() and drop @visible.  While
      at it, rename the function to css_create() for consistency.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      6cd0f5bb