  1. 14 May, 2014 25 commits
    • cgroup: move cgroup->sibling unlinking to cgroup_put() · 4e4e2847
      Tejun Heo committed
      Move cgroup->sibling unlinking from cgroup_destroy_css_killed() to
      cgroup_put().  This is later but still before the RCU grace period, so
      it doesn't break css_next_child() although there now is a larger
      window in which a dead cgroup is visible during css iteration.  As css
      iteration always could have included offline csses, this doesn't
      affect correctness; however, it does make css_next_child() fall back
      to reiterating mode more often.  This also makes cgroup_put() directly
      take cgroup_mutex, which limits where it can be called from.  These
      are not immediately problematic and will be dealt with later.
      
      This change enables simplification of the cgroup destruction path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: move check_for_release(parent) call to the end of cgroup_destroy_locked() · 9e4173e1
      Tejun Heo committed
      Currently, check_for_release() on the parent of a destroyed cgroup is
      invoked from cgroup_destroy_css_killed().  This is because this is
      where the destroyed cgroup can be removed from the parent's children
      list.  check_for_release() tests the emptiness of the list directly,
      so invoking it before removing the cgroup from the list makes it think
      that the parent still has children even when it no longer does.
      
      This patch updates check_for_release() to use
      cgroup_has_live_children() instead of directly testing ->children
      emptiness and moves check_for_release(parent) earlier to the end of
      cgroup_destroy_locked().  As cgroup_has_live_children() ignores
      cgroups marked DEAD, check_for_release() functions correctly as long
      as it's called after asserting DEAD.
      
      This makes release notification slightly more timely and, more
      importantly, enables further simplification of the cgroup destruction
      path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
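
The check described in 9e4173e1 can be illustrated with a minimal userland sketch. It assumes a simplified `struct cgroup_sketch` with a singly linked child list and a `dead` flag; these names are hypothetical and are not the kernel's data structures, and the real release check also considers tasks, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a cgroup node. */
struct cgroup_sketch {
	bool dead;                      /* set once destruction has started */
	struct cgroup_sketch *children; /* head of the child list */
	struct cgroup_sketch *sibling;  /* next entry on the parent's list */
};

/* True if at least one child is not marked dead. */
static bool has_live_children(const struct cgroup_sketch *cgrp)
{
	assert(cgrp != NULL);
	for (const struct cgroup_sketch *c = cgrp->children; c; c = c->sibling)
		if (!c->dead)
			return true;
	return false;
}

/*
 * Children-only part of a release check: dead children that are still
 * linked on the list no longer keep the parent from being releasable.
 */
static bool releasable(const struct cgroup_sketch *cgrp)
{
	return !has_live_children(cgrp);
}
```

Because the test looks at the `dead` flag rather than list emptiness, it gives the right answer as soon as the child is marked dead, even though the child is unlinked later.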
    • cgroup: separate out cgroup_has_live_children() from cgroup_destroy_locked() · cbc125ef
      Tejun Heo committed
      We're expecting another user.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: rename cgroup->dummy_css to ->self and move it to the top · 9d800df1
      Tejun Heo committed
      cgroup->dummy_css is used as the placeholder css when performing css
      oriented operations on the cgroup.  We're gonna shift more cgroup
      management to this css.  Let's rename it to ->self and move it to the
      top.
      
      This is pure rename and field relocation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: use restart_syscall() for mount retries · a015edd2
      Tejun Heo committed
      cgroup_mount() uses dumb delay-and-retry logic to wait for cgroup_root
      which is being destroyed.  The retry currently loops inside
      cgroup_mount() proper.  This patch makes it return with
      restart_syscall() instead so that retry travels out to userland
      boundary.
      
      This slightly simplifies the logic and more importantly makes the
      retry logic behave better when the wait for some reason becomes
      lengthy or infinite by allowing the operation to be suspended or
      terminated from userland.
      
      v2: The original patch forgot to free memory allocated for @opts.
          Fixed.  Caught by Li Zefan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: remove cgroup_tree_mutex · 8353da1f
      Tejun Heo committed
      cgroup_tree_mutex was introduced to work around the circular
      dependency between cgroup_mutex and kernfs active protection - some
      kernfs file and directory operations needed cgroup_mutex, putting
      cgroup_mutex under active protection, but cgroup also needs to be able
      to access cgroup hierarchies and cftypes to determine which
      kernfs_nodes need to be removed.  cgroup_tree_mutex nested above both
      cgroup_mutex and kernfs active protection and was used to protect the
      hierarchy and cftypes.  While this worked, it added a lot of double
      locking and was generally cumbersome.
      
      kernfs provides a mechanism to opt out of active protection and cgroup
      was already using it for removal and subtree_control.  There's no
      reason to mix both methods of avoiding circular locking dependency and
      the preceding cgroup_kn_lock_live() changes applied it to all relevant
      cgroup kernfs operations making it unnecessary to nest cgroup_mutex
      under kernfs active protection.  The previous patch reversed the
      original lock ordering and put cgroup_mutex above kernfs active
      protection.
      
      After these changes, all cgroup_tree_mutex usages are now accompanied
      by cgroup_mutex making the former completely redundant.  This patch
      removes cgroup_tree_mutex and all its usages.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: nest kernfs active protection under cgroup_mutex · 01f6474c
      Tejun Heo committed
      After the recent cgroup_kn_lock_live() changes, cgroup_mutex is no
      longer nested below kernfs active protection.  The two don't have any
      relationship now.
      
      This patch nests kernfs active protection under cgroup_mutex.  All
      cftype operations now require both cgroup_tree_mutex and cgroup_mutex,
      temporary cgroup_mutex releases over kernfs operations are removed,
      and cgroup_add/rm_cftypes() grab both mutexes.
      
      This makes cgroup_tree_mutex redundant, which will be removed by the
      next patch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: use cgroup_kn_lock_live() in other cgroup kernfs methods · e76ecaee
      Tejun Heo committed
      Make __cgroup_procs_write() and cgroup_release_agent_write() use
      cgroup_kn_lock_live() and cgroup_kn_unlock() instead of
      cgroup_lock_live_group().  This puts the operations under both
      cgroup_tree_mutex and cgroup_mutex protection without circular
      dependency from kernfs active protection.  Also, this means that
      cgroup_mutex is no longer nested below kernfs active protection.
      There is no longer any place where the two locks interact.
      
      This leaves cgroup_lock_live_group() without any user.  Removed.
      
      This will help simplify cgroup locking.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: factor out cgroup_kn_lock_live() and cgroup_kn_unlock() · a9746d8d
      Tejun Heo committed
      cgroup_mkdir(), cgroup_rmdir() and cgroup_subtree_control_write()
      share the logic to break active protection so that they can grab
      cgroup_tree_mutex which nests above active protection and/or remove
      self.  Factor out this logic into cgroup_kn_lock_live() and
      cgroup_kn_unlock().
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: move cgroup->kn->priv clearing to cgroup_rmdir() · cfc79d5b
      Tejun Heo committed
      The ->priv field of a cgroup directory kernfs_node points back to the
      cgroup.  This field is RCU cleared in cgroup_destroy_locked() for
      non-kernfs accesses from css_tryget_from_dir() and
      cgroupstats_build().
      
      As these are only applicable to cgroups which finished creation
      successfully and fully initialized cgroups are always removed by
      cgroup_rmdir(), this can be safely moved to the end of cgroup_rmdir().
      
      This will help simplify cgroup locking and shouldn't introduce any
      behavior difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: grab cgroup_mutex earlier in cgroup_subtree_control_write() · ddab2b6e
      Tejun Heo committed
      Move cgroup_lock_live_group() invocation upwards to right below
      cgroup_tree_mutex in cgroup_subtree_control_write().  This is to help
      the planned locking simplification.
      
      This doesn't make any userland-visible behavioral changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: collapse cgroup_create() into cgroup_mkdir() · b3bfd983
      Tejun Heo committed
      cgroup_mkdir() is the sole user of cgroup_create().  Let's collapse
      the latter into the former.  This will help simplify locking.
      While at it, remove now stale comment about inode locking.
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: reorganize cgroup_create() · ba0f4d76
      Tejun Heo committed
      Reorganize cgroup_create() so that all paths share unlock out path.
      
      * All err_* labels are renamed to out_* as they're now shared by both
        success and failure paths.
      
      * @err renamed to @ret for a similar reason and so that
        it's more consistent with other functions.
      
      * cgroup memory allocation moved after locking so that freeing failed
        cgroup happens before unlocking.  While this moves more code inside
        critical section, memory allocations inside cgroup locking are
        already pretty common and this is unlikely to make any noticeable
        difference.
      
      * While at it, replace a stray @parent->root dereference with @root.
      
      This reorganization will help simplify locking.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
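
The shape that ba0f4d76 describes, with allocation moved after locking and one unlock path shared by success and failure, can be sketched in userland C. This is only an illustration under stated assumptions: `sketch_lock()`, `sketch_unlock()`, `struct node_sketch` and `create_node()` are hypothetical names, and a depth counter stands in for the real mutex.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical lock: a depth counter stands in for a mutex. */
static int lock_depth;
static void sketch_lock(void)   { lock_depth++; }
static void sketch_unlock(void) { lock_depth--; }

struct node_sketch { int id; };

/*
 * Create a node.  Allocation happens after locking, so the failure path
 * can free the node before the single unlock shared with the success path.
 */
static int create_node(int id, struct node_sketch **out)
{
	struct node_sketch *n;
	int ret;

	assert(out != NULL);
	sketch_lock();

	n = malloc(sizeof(*n));
	if (!n) {
		ret = -ENOMEM;
		goto out_unlock;
	}

	if (id <= 0) {          /* stand-in for a later setup step failing */
		ret = -EINVAL;
		goto out_free;
	}

	n->id = id;
	*out = n;
	ret = 0;
	goto out_unlock;        /* success shares the unlock path */

out_free:
	free(n);
out_unlock:
	sketch_unlock();
	return ret;
}
```

The labels are named `out_*` rather than `err_*` because the success path reaches `out_unlock` too, and the status variable is `ret` for the same reason.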
    • cgroup: remove cgroup->control_kn · b7fc5ad2
      Tejun Heo committed
      Now that cgroup_subtree_control_write() has access to the associated
      kernfs_open_file and thus the kernfs_node, there's no need to cache it
      in cgroup->control_kn on creation.  Remove cgroup->control_kn and use
      @of->kn directly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: convert "tasks" and "cgroup.procs" handlers to use cftype->write() · acbef755
      Tejun Heo committed
      cgroup_tasks_write() and cgroup_procs_write() are currently using
      cftype->write_u64().  This patch converts them to use cftype->write()
      instead.  This allows access to the associated kernfs_open_file which
      will be necessary to implement the planned kernfs active protection
      manipulation for these files.
      
      This shifts buffer parsing to attach_task_by_pid() and makes it return
      @nbytes on success.  Let's rename it to __cgroup_procs_write() to
      clearly indicate that this is a write handler implementation.
      
      This patch doesn't introduce any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: replace cftype->trigger() with cftype->write() · 6770c64e
      Tejun Heo committed
      cftype->trigger() is pointless.  It's trivial to ignore the input
      buffer from a regular ->write() operation.  Convert all ->trigger()
      users to ->write() and remove ->trigger().
      
      This patch doesn't introduce any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
    • cgroup: replace cftype->write_string() with cftype->write() · 451af504
      Tejun Heo committed
      Convert all cftype->write_string() users to the new cftype->write()
      which maps directly to kernfs write operation and has full access to
      kernfs and cgroup contexts.  The conversions are mostly mechanical.
      
      * @css and @cft are accessed using of_css() and of_cft() accessors
        respectively instead of being specified as arguments.
      
      * Should return @nbytes on success instead of 0.
      
      * @buf is not trimmed automatically.  Trim if necessary.  Note that
        blkcg and netprio don't need this as the parsers already handle
        whitespaces.
      
      cftype->write_string() has no users left after the conversions and is
      removed.
      
      While at it, remove unnecessary local variable @p in
      cgroup_subtree_control_write() and stale comment about
      CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.
      
      This patch doesn't introduce any visible behavior changes.
      
      v2: netprio was missing from conversion.  Converted.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Aristeu Rozanski <arozansk@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
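
Two of the conversion rules listed in 451af504, returning @nbytes on success and trimming @buf by hand, can be shown with a userland sketch. Everything here is an assumption for illustration: `struct file_ctx_sketch`, `strip()` and `value_write()` are hypothetical, and the real handler receives a kernfs_open_file rather than this context struct.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical per-file context; the kernel passes a kernfs_open_file. */
struct file_ctx_sketch {
	char value[32];
};

/* Trim leading and trailing whitespace in place, returning the new start. */
static char *strip(char *s)
{
	size_t len;

	while (isspace((unsigned char)*s))
		s++;
	len = strlen(s);
	while (len > 0 && isspace((unsigned char)s[len - 1]))
		s[--len] = '\0';
	return s;
}

/* New-style handler: trims @buf itself and returns @nbytes on success. */
static ssize_t value_write(struct file_ctx_sketch *ctx, char *buf, size_t nbytes)
{
	char *tok;

	assert(ctx != NULL && buf != NULL);
	tok = strip(buf);
	if (strlen(tok) >= sizeof(ctx->value))
		return -EINVAL;
	strcpy(ctx->value, tok);
	return (ssize_t)nbytes;
}
```

Returning the full byte count rather than 0 tells the caller that the whole write was consumed, including any whitespace the handler chose to ignore.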
    • cgroup: implement cftype->write() · b4168640
      Tejun Heo committed
      During the recent conversion to kernfs, cftype's seq_file operations
      were updated so that they are directly mapped to kernfs operations and
      thus can fully access the associated kernfs and cgroup contexts;
      however, write path hasn't seen similar updates and none of the
      existing write operations has access to, for example, the associated
      kernfs_open_file.
      
      Let's introduce a new operation cftype->write() which maps directly to
      the kernfs write operation and has access to all the arguments and
      contexts.  This will replace ->write_string() and ->trigger() and ease
      manipulation of kernfs active protection from cgroup file operations.
      
      Two accessors - of_cft() and of_css() - are introduced to enable
      accessing the associated cgroup context from cftype->write() which
      only takes kernfs_open_file for the context information.  The
      accessors for seq_file operations - seq_cft() and seq_css() - are
      rewritten to wrap the of_ accessors.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: rename css_tryget*() to css_tryget_online*() · ec903c0c
      Tejun Heo committed
      Unlike the more usual refcnting, what css_tryget() provides is the
      distinction between online and offline csses instead of protection
      against upping a refcnt which already reached zero.  cgroup is
      planning to provide actual tryget which fails if the refcnt already
      reached zero.  Let's rename the existing trygets so that they clearly
      indicate that they're about onliness.

      I thought about keeping the existing names as-is and introducing new
      names for the planned actual tryget; however, given that each
      controller participates in the synchronization of the online state, it
      seems worthwhile to make it explicit that these functions are about
      on/offline state.
      
      Rename css_tryget() to css_tryget_online() and css_tryget_from_dir()
      to css_tryget_online_from_dir().  This is pure rename.
      
      v2: cgroup_freezer grew new usages of css_tryget().  Update
          accordingly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
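
The distinction that ec903c0c draws between an "actual" tryget and an online-only tryget can be sketched with C11 atomics. This is a userland illustration, not the kernel's implementation: `struct css_sketch` and both functions are hypothetical, and the online check and the increment are not atomic with respect to each other here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object with a separate online flag. */
struct css_sketch {
	atomic_int refcnt;
	atomic_bool online;
};

/* "Actual" tryget: fails only if the count has already reached zero. */
static bool sketch_tryget(struct css_sketch *css)
{
	int old;

	assert(css != NULL);
	old = atomic_load(&css->refcnt);
	while (old > 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&css->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Online-only tryget: additionally refuses objects that have gone offline. */
static bool sketch_tryget_online(struct css_sketch *css)
{
	assert(css != NULL);
	if (!atomic_load(&css->online))
		return false;
	return sketch_tryget(css);
}
```

An offline object with a nonzero count still passes `sketch_tryget()` but fails `sketch_tryget_online()`, which is the difference the rename makes explicit.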
    • cgroup: use release_agent_path_lock in cgroup_release_agent_show() · 46cfeb04
      Tejun Heo committed
      release_path is now protected by release_agent_path_lock to allow
      accessing it without grabbing cgroup_mutex; however,
      cgroup_release_agent_show() was still grabbing cgroup_mutex.  Let's
      convert it to release_agent_path_lock so that we don't have to worry
      about this one for the planned locking updates.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: use restart_syscall() for retries after offline waits in cgroup_subtree_control_write() · 7d331fa9
      Tejun Heo committed
      After waiting for a child to finish offline,
      cgroup_subtree_control_write() jumps up to retry from after the input
      parsing and active protection breaking.  This retry makes the
      scheduled locking update - removal of cgroup_tree_mutex - more
      difficult.  Let's simplify it by returning with restart_syscall() for
      retries.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: update and fix parsing of "cgroup.subtree_control" · d37167ab
      Tejun Heo committed
      I mistakenly assumed that strsep() was equivalent to strtok_r() in
      skipping over consecutive delimiters.  strsep() just splits at the
      first occurrence of one of the delimiters, which makes the parsing
      very inflexible and makes allowing multiple whitespace chars as
      delimiters kinda moot.  Let's just be consistently strict and require
      a list of tokens separated by spaces.  This is what
      Documentation/cgroups/unified-hierarchy.txt describes too.
      
      Also, parsing may access beyond the end of the string if the string
      ends with spaces or is zero-length.  Make sure it skips zero-length
      tokens.  Note that this also ensures that the parser doesn't puke on
      multiple consecutive spaces.
      
      v2: Add zero-length token skipping.
      
      v3: Added missing space after "==".  Spotted by Li.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
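
The strsep() behaviour that d37167ab works around can be sketched in userland C. The controller table `names[]`, the function `parse_control()` and the bit layout are hypothetical; the leading `+` or `-` on each token follows the "cgroup.subtree_control" convention. The point of the sketch is the explicit skip of zero-length tokens, which strsep() produces for leading, trailing and consecutive spaces.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for strsep() on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical controller table; bit i of a mask refers to names[i]. */
static const char *const names[] = { "cpu", "memory", "io" };
#define NR_NAMES (sizeof(names) / sizeof(names[0]))

/*
 * Parse a space-separated list such as "+memory -io" into two masks.
 * strsep() yields an empty token for every extra delimiter, so
 * zero-length tokens are skipped explicitly.
 */
static int parse_control(char *buf, unsigned int *enable, unsigned int *disable)
{
	char *tok;

	assert(buf != NULL && enable != NULL && disable != NULL);
	*enable = *disable = 0;

	while ((tok = strsep(&buf, " ")) != NULL) {
		size_t i;

		if (tok[0] == '\0')
			continue;
		if (tok[0] != '+' && tok[0] != '-')
			return -EINVAL;
		for (i = 0; i < NR_NAMES; i++)
			if (strcmp(tok + 1, names[i]) == 0)
				break;
		if (i == NR_NAMES)
			return -EINVAL;
		if (tok[0] == '+') {
			*enable |= 1u << i;
			*disable &= ~(1u << i);
		} else {
			*disable |= 1u << i;
			*enable &= ~(1u << i);
		}
	}
	return 0;
}
```

Without the `tok[0] == '\0'` check, an empty token would reach the `tok + 1` lookup and read past the terminating NUL, which matches the out-of-bounds access the commit message mentions.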
    • cgroup: css_release() shouldn't clear cgroup->subsys[] · 0ab7a60d
      Tejun Heo committed
      c1a71504 ("cgroup: don't recycle cgroup id until all csses' have
      been destroyed") made cgroup ID persist until a cgroup is released and
      added cgroup->subsys[] clearing to css_release() so that css_from_id()
      doesn't return a css which has already been released which happens
      before cgroup release; however, the right change here was updating
      offline_css() to clear cgroup->subsys[] which was done by e3297803
      ("cgroup: cgroup->subsys[] should be cleared after the css is
      offlined") instead of clearing it from css_release().
      
      We're now clearing cgroup->subsys[] twice.  This is okay for
      traditional hierarchies as a css's lifetime is the same as its
      cgroup's; however, this confuses unified hierarchy and turning on and
      off a controller repeatedly using "cgroup.subtree_control" can lead to
      an oops like the following which happens because cgroup->subsys[] is
      incorrectly cleared asynchronously by css_release().
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
       IP: [<ffffffff81130c11>] kill_css+0x21/0x1c0
       PGD 1170d067 PUD f0ab067 PMD 0
       Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       Modules linked in:
       CPU: 2 PID: 459 Comm: bash Not tainted 3.15.0-rc2-work+ #5
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
       task: ffff880009296710 ti: ffff88000e198000 task.ti: ffff88000e198000
       RIP: 0010:[<ffffffff81130c11>]  [<ffffffff81130c11>] kill_css+0x21/0x1c0
       RSP: 0018:ffff88000e199dc8  EFLAGS: 00010202
       RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
       RDX: 0000000000000001 RSI: ffffffff8238a968 RDI: ffff880009296f98
       RBP: ffff88000e199de0 R08: 0000000000000001 R09: 02b0000000000000
       R10: 0000000000000000 R11: ffff880009296fc0 R12: 0000000000000001
       R13: ffff88000db6fc58 R14: 0000000000000001 R15: ffff8800139dcc00
       FS:  00007ff9160c5740(0000) GS:ffff88001fb00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000008 CR3: 0000000013947000 CR4: 00000000000006e0
       Stack:
        ffff88000e199de0 ffffffff82389160 0000000000000001 ffff88000e199e80
        ffffffff8113537f 0000000000000007 ffff88000e74af00 ffff88000e199e48
        ffff880009296710 ffff88000db6fc00 ffffffff8239c100 0000000000000002
       Call Trace:
        [<ffffffff8113537f>] cgroup_subtree_control_write+0x85f/0xa00
        [<ffffffff8112fd18>] cgroup_file_write+0x38/0x1d0
        [<ffffffff8126fc97>] kernfs_fop_write+0xe7/0x170
        [<ffffffff811f2ae6>] vfs_write+0xb6/0x1c0
        [<ffffffff811f35ad>] SyS_write+0x4d/0xc0
        [<ffffffff81d0acd2>] system_call_fastpath+0x16/0x1b
       Code: 5c 41 5d 41 5e 41 5f 5d c3 90 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb 48 83 ec 08 8b 05 37 ad 29 01 85 c0 0f 85 df 00 00 00 <48> 8b 43 08 48 8b 3b be 01 00 00 00 8b 48 5c d3 e6 e8 49 ff ff
       RIP  [<ffffffff81130c11>] kill_css+0x21/0x1c0
        RSP <ffff88000e199dc8>
       CR2: 0000000000000008
       ---[ end trace e7aae1f877c4e1b4 ]---
      
      Remove the unnecessary cgroup->subsys[] clearing from css_release().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: cgroup_idr_lock should be bh · 54504e97
      Tejun Heo committed
      cgroup_idr_remove() can be invoked from bh leading to lockdep
      detecting possible AA deadlock (IN_BH/ON_BH).  Make the lock bh-safe.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: fix offlining child waiting in cgroup_subtree_control_write() · 0cee8b77
      Tejun Heo committed
      cgroup_subtree_control_write() waits for offline to complete
      child-by-child before enabling a controller; however, it has a couple
      of bugs.
      
      * It doesn't initialize the wait_queue_t.  This can lead to infinite
        hang on the following schedule() among other things.
      
      * It forgets to pin the child before releasing cgroup_tree_mutex and
        performing schedule().  The child may already be gone by the time it
        wakes up and invokes finish_wait().  Pin the child being waited on.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  2. 13 May, 2014 1 commit
    • cgroup: introduce task_css_is_root() · 5024ae29
      Tejun Heo committed
      Determining the css of a task usually requires RCU read lock as that's
      the only thing which keeps the returned css accessible till its
      reference is acquired; however, testing whether a task belongs to the
      root can be performed without dereferencing the returned css by
      comparing the returned pointer against the root one in init_css_set[]
      which never changes.
      
      Implement task_css_is_root() which can be invoked in any context.
      This will be used by the scheduled cgroup_freezer change.
      
      v2: cgroup no longer supports modular controllers.  No need to export
          init_css_set.  Pointed out by Li.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  3. 06 May, 2014 1 commit
  4. 05 May, 2014 6 commits
    • cgroup, memcg: implement css->id and convert css_from_id() to use it · 15a4c835
      Tejun Heo committed
      Until now, cgroup->id has been used to identify all the associated
      csses and css_from_id() takes cgroup ID and returns the matching css
      by looking up the cgroup and then dereferencing the css associated
      with it; however, now that the lifetimes of cgroup and css are
      separate, this is incorrect and breaks on the unified hierarchy when a
      controller is disabled and enabled back again before the previous
      instance is released.
      
      This patch adds css->id which is a subsystem-unique ID and converts
      css_from_id() to look up by the new css->id instead.  memcg is the
      only user of css_from_id() and also converted to use css->id instead.
      
      For traditional hierarchies, this shouldn't make any functional
      difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: update init_css() into init_and_link_css() · ddfcadab
      Tejun Heo committed
      init_css() takes the cgroup the new css belongs to as an argument and
      initializes the new css's ->cgroup and ->parent pointers but doesn't
      acquire the matching reference counts.  After the previous patch,
      create_css() puts init_css() and reference acquisition right next to
      each other.  Let's move reference acquisition into init_css() and
      rename the function to init_and_link_css().  This makes sense and is
      easier to follow.  This makes the root csses hold a reference on
      cgrp_dfl_root.cgrp, which is harmless.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: use RCU free in create_css() failure path · a2bed820
      Tejun Heo committed
      Currently, when create_css() fails in the middle, the half-initialized
      css is freed by invoking cgroup_subsys->css_free() directly.  This
      patch updates the function so that it invokes RCU free path instead.
      As the RCU free path puts the parent css and owning cgroup, their
      references are now acquired right after a new css is successfully
      allocated.
      
      This doesn't make any visible difference now but is to enable
      implementing css->id and RCU protected lookup by such IDs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: protect cgroup_root->cgroup_idr with a spinlock · 6fa4918d
      Tejun Heo committed
      Currently, cgroup_root->cgroup_idr is protected by cgroup_mutex, which
      ends up requiring cgroup_put() to be invoked under sleepable context.
      This is okay for now but is an unusual requirement and we'll soon add
      css->id which will have the same problem but won't be able to simply
      grab cgroup_mutex as removal will have to happen from css_release()
      which can't sleep.
      
      Introduce cgroup_idr_lock and idr_alloc/replace/remove() wrappers
      which protect the idr operations with the lock and use them for
      cgroup_root->cgroup_idr.  cgroup_put() no longer needs to grab
      cgroup_mutex and css_from_id() is updated to always require RCU read
      lock instead of either RCU read lock or cgroup_mutex, which doesn't
      affect the existing users.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup, memcg: allocate cgroup ID from 1 · 7d699ddb
      Tejun Heo committed
      Currently, cgroup->id is allocated from 0, which is always assigned to
      the root cgroup; unfortunately, memcg wants to use ID 0 to indicate
      invalid IDs and ends up incrementing all IDs by one.
      
      It's reasonable to reserve 0 for special purposes.  This patch updates
      cgroup core so that ID 0 is not used and the root cgroups get ID 1.
      The ID incrementing is removed from memcg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: make flags and subsys_masks unsigned int · 69dfa00c
      Tejun Heo committed
      There's no reason to use atomic bitops for cgroup_subsys_state->flags,
      cgroup_root->flags and various subsys_masks.  This patch updates those
      to use bitwise and/or operations instead and converts them from
      unsigned long to unsigned int.
      
      This makes the fields occupy (marginally) smaller space and makes it
      clear that they don't require atomicity.
      
      This patch doesn't cause any behavior difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
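
The switch in 69dfa00c from atomic bitops on an unsigned long to plain bitwise operations on an unsigned int looks roughly like the following userland sketch. The flag names, `struct state_sketch` and the helpers are hypothetical, and the sketch assumes that all updates are already serialized by some outer lock, which is what makes non-atomic read-modify-write acceptable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits, for illustration only. */
enum {
	SKETCH_ONLINE   = 1u << 0,
	SKETCH_RELEASED = 1u << 1,
};

struct state_sketch {
	unsigned int flags; /* plain word; updates assumed serialized by a lock */
};

/* Set the given bits with an ordinary OR. */
static void flag_set(struct state_sketch *s, unsigned int f)
{
	assert(s);
	s->flags |= f;
}

/* Clear the given bits with an ordinary AND of the complement. */
static void flag_clear(struct state_sketch *s, unsigned int f)
{
	assert(s);
	s->flags &= ~f;
}

/* Test whether any of the given bits are set. */
static bool flag_test(const struct state_sketch *s, unsigned int f)
{
	assert(s);
	return (s->flags & f) != 0;
}
```

Using mask values instead of bit numbers also lets one call set or clear several flags at once, which the bit-number based atomic helpers do not allow.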
  5. 26 April, 2014 5 commits
    • cgroup: Use more current logging style · ed3d261b
      Joe Perches committed
      Use pr_fmt and remove embedded prefixes.
      Realign modified multi-line statements to open parenthesis.
      Convert embedded function name to "%s: ", __func__
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: replace pr_warning with preferred pr_warn · a2a1f9ea
      Jianyu Zhan committed
      As suggested by scripts/checkpatch.pl, substitute all pr_warning()
      with pr_warn().
      
      No functional change.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: remove orphaned cgroup_pidlist_seq_operations · f8719ccf
      Jianyu Zhan committed
      6612f05b ("cgroup: unify pidlist and other file handling")
      removed the only user of cgroup_pidlist_seq_operations:
      cgroup_pidlist_open().
      
      This patch removes it.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: clean up obsolete comment for parse_cgroupfs_options() · 2f0edc04
      Jianyu Zhan committed
      1d5be6b2 ("cgroup: move module ref handling into
      rebind_subsystems()") made parse_cgroupfs_options() no longer take
      refcounts on subsystems.

      And with the unified hierarchy, parse_cgroupfs_options() no longer
      needs to be called with cgroup_mutex held to protect cgroup_subsys[].
      
      So this patch removes BUG_ON() and the comment.  As the comment
      doesn't contain useful information afterwards, the whole comment is
      removed.
      Signed-off-by: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: implement cgroup.populated for the default hierarchy · 842b597e
      Tejun Heo committed
      cgroup users often need a way to determine when a cgroup's
      subhierarchy becomes empty so that it can be cleaned up.  cgroup
      currently provides release_agent for it; unfortunately, this mechanism
      is riddled with issues.
      
      * It delivers events by forking and execing a userland binary
        specified as the release_agent.  This is a long deprecated method of
        notification delivery.  It's extremely heavy, slow and cumbersome to
        integrate with larger infrastructure.
      
      * There is a single monitoring point at the root.  There's no way to
        delegate management of a subtree.
      
      * The event isn't recursive.  It triggers when a cgroup doesn't have
        any tasks or child cgroups.  Events for internal nodes trigger only
        after all children are removed.  This again makes it impossible to
        delegate management of a subtree.
      
      * Events are filtered from the kernel side.  The "notify_on_release"
        file is used to subscribe to or suppress release events.  This is
        unnecessarily complicated and probably done this way because event
        delivery itself was expensive.
      
      This patch implements interface file "cgroup.populated" which can be
      used to monitor whether the cgroup's subhierarchy has tasks in it or
      not.  Its value is 0 if there is no task in the cgroup and its
      descendants; otherwise, 1.  A kernfs_notify() notification is
      triggered when the value changes, which can be monitored through
      poll and [di]notify.
      
      This is a lot lighter and simpler and trivially allows delegating
      management of a subhierarchy - subhierarchy monitoring can block
      further propagation simply by putting itself or another process in
      the root of the subhierarchy and monitoring the events that it's
      interested in from there without interfering with monitoring higher
      in the tree.
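
      The recursive semantics described above can be illustrated with a
      small userspace model.  This is only a sketch, not the kernel
      implementation: the Cgroup class, its populated_cnt counter and the
      events list (a stand-in for kernfs_notify() deliveries) are all
      hypothetical names chosen for illustration.

```python
class Cgroup:
    """Toy model of a cgroup node; hypothetical, not kernel code."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.ntasks = 0          # tasks directly in this cgroup
        self.populated_cnt = 0   # own-tasks flag plus populated children
        self.events = []         # stand-in for kernfs_notify() deliveries

    @property
    def populated(self):
        """1 if this cgroup or any descendant has tasks, else 0."""
        return 1 if self.populated_cnt > 0 else 0

    def _propagate(self, became_populated):
        # Walk towards the root and stop as soon as a node's visible
        # value does not change, so ancestors see no spurious events.
        node = self
        while node is not None:
            before = node.populated
            node.populated_cnt += 1 if became_populated else -1
            if node.populated == before:
                break
            node.events.append(node.populated)
            node = node.parent

    def add_task(self):
        self.ntasks += 1
        if self.ntasks == 1:
            self._propagate(True)

    def remove_task(self):
        if self.ntasks == 0:
            raise ValueError("no task to remove")
        self.ntasks -= 1
        if self.ntasks == 0:
            self._propagate(False)
```

      In this model an event is recorded at a node only when its own 0/1
      value flips, which mirrors the "notification is triggered when the
      value changes" behavior described above.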
      
      v2: Patch description updated as per Serge.
      
      v3: "cgroup.subtree_populated" renamed to "cgroup.populated".  The
          subtree_ prefix was a bit confusing because
          "cgroup.subtree_control" uses it to denote the tree rooted at the
          cgroup sans the cgroup itself while the populated state includes
          the cgroup itself.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Lennart Poettering <lennart@poettering.net>
  6. 23 Apr, 2014 2 commits
    • T
      cgroup: implement dynamic subtree controller enable/disable on the default hierarchy · f8f22e53
      Tejun Heo committed
      cgroup is switching away from multiple hierarchies and will use one
      unified default hierarchy where controllers can be dynamically enabled
      and disabled per subtree.  The default hierarchy will serve as the
      unified hierarchy to which all controllers are attached and a css on
      the default hierarchy would need to also serve the tasks of descendant
      cgroups which don't have the controller enabled - ie. the tree may be
      collapsed from leaf towards root when viewed from specific
      controllers.  This has been implemented through effective css in the
      previous patches.
      
      This patch finally implements dynamic subtree controller
      enable/disable on the default hierarchy via a new knob -
      "cgroup.subtree_control" which controls which controllers are enabled
      on the child cgroups.  Let's assume a hierarchy like the following.
      
        root - A - B - C
                     \ D
      
      root's "cgroup.subtree_control" determines which controllers are
      enabled on A.  A's on B.  B's on C and D.  This coincides with the
      fact that controllers on the immediate sub-level are used to
      distribute the resources of the parent.  In fact, it's natural to
      assume that resource control knobs of a child belong to its parent.
      Enabling a controller in "cgroup.subtree_control" declares that
      distribution of the respective resources of the cgroup will be
      controlled.  Note that this means that controller enable states are
      shared among siblings.
      
      The default hierarchy has an extra restriction - only cgroups which
      don't contain any task may have controllers enabled in
      "cgroup.subtree_control".  Combined with the other properties of the
      default hierarchy, this guarantees that, from the view point of
      controllers, tasks are only on the leaf cgroups.  In other words, only
      leaf csses may contain tasks.  This rules out situations where child
      cgroups compete against internal tasks of the parent, which is a
      competition between two different types of entities without any clear
      way to determine resource distribution between the two.  Different
      controllers handle it differently and all the implemented behaviors
      are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
      structural constraint imposed from cgroup core removes the burden
      from controller implementations and enables showing one consistent
      behavior across all controllers.
      
      When a controller is enabled or disabled, css associations for the
      controller in the subtrees of each child should be updated.  After
      enabling, the whole subtree of a child should point to the new css of
      the child.  After disabling, the whole subtree of a child should point
      to the cgroup's css.  This is implemented by first updating cgroup
      states such that cgroup_e_css() result points to the appropriate css
      and then invoking cgroup_update_dfl_csses() which migrates all tasks
      in the affected subtrees to the self cgroup on the default hierarchy.
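
      The "whole subtree points to the nearest css" rule can be sketched
      as a lookup that walks towards the root.  This is a toy model that
      assumes the effective css of a cgroup belongs to the nearest cgroup
      (itself or an ancestor) whose parent has the controller in
      "cgroup.subtree_control"; the Cgroup class and
      effective_css_owner() are hypothetical names, not the kernel's
      cgroup_e_css().

```python
class Cgroup:
    """Toy cgroup node; hypothetical, not kernel code."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.subtree_control = set()  # controllers enabled on children


def effective_css_owner(cgrp, controller):
    """Return the cgroup whose css serves cgrp for the given controller.

    Walk up while the parent does not enable the controller on its
    children; the root always serves itself.
    """
    node = cgrp
    while node.parent is not None and controller not in node.parent.subtree_control:
        node = node.parent
    return node
```

      With the root - A - B - C/D hierarchy used above, enabling a
      controller in B's "cgroup.subtree_control" makes C and D resolve to
      themselves, and disabling it makes both resolve to B again.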
      
      * When read, "cgroup.subtree_control" lists all the currently enabled
        controllers on the children of the cgroup.
      
      * A white-space separated list of controller names prefixed with
        either '+' or '-' can be written to "cgroup.subtree_control".  The
        ones prefixed with '+' are enabled and the ones prefixed with '-'
        are disabled.
      
      * A controller can be enabled iff the parent's
        "cgroup.subtree_control" enables it and disabled iff no child's
        "cgroup.subtree_control" has it enabled.
      
      * If a cgroup has tasks, no controller can be enabled via
        "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
        some controllers enabled, tasks can't be migrated into the cgroup.
      
      * All controllers which aren't bound on other hierarchies are
        automatically associated with the root cgroup of the default
        hierarchy.  All the controllers which are bound to the default
        hierarchy are listed in the read-only file "cgroup.controllers" in
        the root directory.
      
      * "cgroup.controllers" in all non-root cgroups is a read-only file whose
        content is equal to that of "cgroup.subtree_control" of the parent.
        This indicates which controllers can be used in the cgroup's
        "cgroup.subtree_control".
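
      The rules listed above can be exercised with a small userspace
      model.  It is a sketch under stated assumptions, not the kernel
      code: the Cgroup class, ControlError and write_subtree_control()
      are hypothetical names, the whole write is validated before
      anything is applied (a choice made for this sketch), and the
      has-tasks rule is applied to every cgroup exactly as worded above.

```python
class ControlError(Exception):
    """Raised when a write to the modelled subtree_control is rejected."""


class Cgroup:
    """Toy cgroup node; hypothetical, not kernel code."""

    def __init__(self, name, parent=None, root_controllers=()):
        self.name = name
        self.parent = parent
        self.children = []
        self.ntasks = 0
        self.subtree_control = set()
        self._root_controllers = set(root_controllers)
        if parent is not None:
            parent.children.append(self)

    @property
    def controllers(self):
        """Model of "cgroup.controllers": what may be enabled here."""
        if self.parent is None:
            return set(self._root_controllers)
        return set(self.parent.subtree_control)

    def write_subtree_control(self, buf):
        enable, disable = set(), set()
        for tok in buf.split():
            if len(tok) < 2 or tok[0] not in "+-":
                raise ControlError("bad token: %r" % tok)
            name = tok[1:]
            if tok[0] == "+":
                enable.add(name)
                disable.discard(name)
            else:
                disable.add(name)
                enable.discard(name)

        # Enabling requires that the parent enables the controller here.
        for name in enable:
            if name not in self.controllers:
                raise ControlError("%s not available in %s" % (name, self.name))
        # A cgroup that has tasks cannot enable controllers.
        if enable and self.ntasks:
            raise ControlError("%s has tasks" % self.name)
        # Disabling requires that no child still has it enabled.
        for name in disable:
            if any(name in child.subtree_control for child in self.children):
                raise ControlError("%s still enabled in a child" % name)

        self.subtree_control |= enable
        self.subtree_control -= disable

    def read_subtree_control(self):
        return " ".join(sorted(self.subtree_control))
```

      For example, writing "+memory" to A fails until root's
      "cgroup.subtree_control" contains memory, and root cannot drop
      memory again while A still has it enabled.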
      
      This is still experimental and there are some holes, one of which is
      that ->can_attach() failure during cgroup_update_dfl_csses() may leave
      the cgroups in an undefined state.  The issues will be addressed by
      future patches.
      
      v2: Non-root cgroups now also have "cgroup.controllers".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • T
      cgroup: prepare migration path for unified hierarchy · f817de98
      Tejun Heo committed
      Unified hierarchy implementation would require re-migrating tasks onto
      the same cgroup on the default hierarchy to reflect updated effective
      csses.  Update cgroup_migrate_prepare_dst() so that it accepts NULL as
      the destination cgrp.  When NULL is specified, the destination is
      considered to be the cgroup on the default hierarchy associated with
      each css_set.
      
      After this change, the identity check in cgroup_migrate_add_src()
      isn't sufficient for noop detection as the associated csses may change
      without any cgroup association changing.  The only way to tell whether
      a migration is a noop is to test whether the source and
      destination csets are identical.  The noop check in
      cgroup_migrate_add_src() is removed and a cset identity test is added
      to cgroup_migrate_prepare_dst().  If it's detected that source and
      destination csets are identical, the cset is removed from
      @preloaded_csets and all the migration nodes are cleared which makes
      cgroup_migrate() ignore the cset.
      
      Also, make the function append the destination css_sets to
      @preloaded_list so that destination css_sets always come after source
      css_sets.
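
      The noop filtering and list ordering described above can be
      sketched as follows.  This is a toy model, not the kernel
      function: prepare_dst() and find_dst are hypothetical names,
      css_sets are represented by arbitrary hashable values, and a
      destination that is already present as a source is assumed not to
      be appended a second time.

```python
def prepare_dst(preloaded, find_dst):
    """Toy stand-in for the destination-preparation step.

    preloaded: list of source css_sets (any hashable values).
    find_dst:  callable mapping a source css_set to its destination.
    Returns (migrations, new_preloaded) where migrations maps each
    source that really moves to its destination, and new_preloaded
    lists the remaining sources followed by the destinations.
    """
    migrations = {}
    sources = []
    destinations = []
    for src in preloaded:
        dst = find_dst(src)
        if dst == src:
            # Identical source and destination: a noop, so drop it.
            continue
        migrations[src] = dst
        sources.append(src)
        if dst not in destinations:
            destinations.append(dst)
    # Destinations always come after sources; skip ones already listed.
    tail = [d for d in destinations if d not in sources]
    return migrations, sources + tail
```

      A source whose destination compares equal to itself never shows up
      in the result, which is the model's equivalent of cgroup_migrate()
      ignoring that cset.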
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>