1. Nov 29, 2013 (7 commits)
    • cgroup: load and release pidlists from seq_file start and stop respectively · 4bac00d1
      Committed by Tejun Heo
      Currently, pidlists are reference counted from file open and release
      methods.  This means that holding onto an open file may waste memory
      and reads may return data which is very stale.  Both aren't critical
      because pidlists are keyed and shared per namespace and, well, the
      user isn't supposed to have a large delay between open and reads.
      
      cgroup is planned to be converted to use kernfs and it'd be best if we
      can stick to just the seq_file operations - start, next, stop and
      show.  This can be achieved by loading pidlist on demand from start
      and release with time delay from stop, so that consecutive reads don't
      end up reloading the pidlist on each iteration.  This would remove the
      need for hooking into open and release while also avoiding issues with
      holding onto pidlist for too long.
      
      The previous patches implemented delayed release and restructured
      pidlist handling so that pidlists can be loaded and released from
      seq_file start / stop.  This patch actually moves pidlist load to
      start and release to stop.
      
      This means that pidlist is pinned only between start and stop and may
      go away between two consecutive read calls if the two calls are apart
      by more than CGROUP_PIDLIST_DESTROY_DELAY.  cgroup_pidlist_start()
      thus can't re-use the stored cgroup_pidlist_open_file->pidlist
      directly.  During start, it's only used as a hint indicating whether
      this is the first start after open or not and pidlist is always looked
      up or created.
      
      pidlist_mutex locking and reference counting are moved out of
      pidlist_array_load() so that pidlist_array_load() can perform lookup
      and creation atomically.  While this enlarges the area covered by
      pidlist_mutex, given how the lock is used, it's highly unlikely to be
      noticeable.
      
      v2: Refreshed on top of the updated "cgroup: introduce struct
          cgroup_pidlist_open_file".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      4bac00d1
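      A minimal sketch of the resulting read path, assuming the field names
      described in this series (of->cgrp, of->pidlist, destroy_dwork) and a
      hypothetical pidlist_find_or_create() helper; not the literal kernel
      code:

        static void *cgroup_pidlist_start(struct seq_file *s, loff_t *pos)
        {
            struct cgroup_pidlist_open_file *of = s->private;
            struct cgroup_pidlist *l;

            /* held until ->stop(); seq_file guarantees stop follows start */
            mutex_lock(&of->cgrp->pidlist_mutex);

            /*
             * of->pidlist is only a hint that an earlier read loaded a
             * list; it may have been destroyed after
             * CGROUP_PIDLIST_DESTROY_DELAY, so always look up again.
             */
            l = pidlist_find_or_create(of);     /* hypothetical helper */
            of->pidlist = l;

            if (!l || *pos >= l->length)
                return NULL;
            return l->list + *pos;
        }

        static void cgroup_pidlist_stop(struct seq_file *s, void *v)
        {
            struct cgroup_pidlist_open_file *of = s->private;

            /* re-arm delayed destruction so back-to-back reads reuse it */
            if (of->pidlist)
                mod_delayed_work(cgroup_pidlist_destroy_wq,
                                 &of->pidlist->destroy_dwork,
                                 CGROUP_PIDLIST_DESTROY_DELAY);
            mutex_unlock(&of->cgrp->pidlist_mutex);
        }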
    • cgroup: remove cgroup_pidlist->rwsem · 069df3b7
      Committed by Tejun Heo
      cgroup_pidlist locking is needlessly complicated.  It has outer
      cgroup->pidlist_mutex to protect the list of pidlists associated with
      a cgroup and then each pidlist has rwsem to synchronize updates and
      reads.  Given that the only read access is from seq_file operations
      which are always invoked back-to-back, the rwsem is a giant overkill.
      All it does is add unnecessary complexity.
      
      This patch removes cgroup_pidlist->rwsem and protects all accesses to
      pidlists belonging to a cgroup with cgroup->pidlist_mutex.
      pidlist->rwsem locking is removed if it's nested inside
      cgroup->pidlist_mutex; otherwise, it's replaced with
      cgroup->pidlist_mutex locking.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      069df3b7
    • cgroup: refactor cgroup_pidlist_find() · e6b81710
      Committed by Tejun Heo
      Rename cgroup_pidlist_find() to cgroup_pidlist_find_create() and
      separate out finding proper to cgroup_pidlist_find().  Also, move
      locking to the caller.
      
      This patch is preparation for pidlist restructure and doesn't
      introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      e6b81710
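      A hedged sketch of the split: the lookup proper runs with the caller
      already holding cgroup->pidlist_mutex, and the _create variant adds the
      allocation fallback.  The key layout is simplified for illustration
      (per the series' text, the real key also includes the pid namespace):

        /* lookup proper; caller must hold cgrp->pidlist_mutex */
        static struct cgroup_pidlist *cgroup_pidlist_find(struct cgroup *cgrp,
                                                          enum cgroup_filetype type)
        {
            struct cgroup_pidlist *l;

            lockdep_assert_held(&cgrp->pidlist_mutex);

            list_for_each_entry(l, &cgrp->pidlists, links)
                if (l->key.type == type)
                    return l;
            return NULL;
        }

        /* lookup or create, still under the caller's pidlist_mutex */
        static struct cgroup_pidlist *cgroup_pidlist_find_create(struct cgroup *cgrp,
                                                                 enum cgroup_filetype type)
        {
            struct cgroup_pidlist *l = cgroup_pidlist_find(cgrp, type);

            if (l)
                return l;

            l = kzalloc(sizeof(*l), GFP_KERNEL);
            if (l) {
                l->key.type = type;
                list_add(&l->links, &cgrp->pidlists);
            }
            return l;
        }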
    • cgroup: introduce struct cgroup_pidlist_open_file · 62236858
      Committed by Tejun Heo
      For pidlist files, seq_file->private pointed to the loaded
      cgroup_pidlist; however, pidlist loading is planned to be moved to
      cgroup_pidlist_start() for kernfs conversion and seq_file->private
      needs to carry more information from open to allow that.
      
      This patch introduces struct cgroup_pidlist_open_file which contains
      type, cgrp and pidlist and updates pidlist seq_file->private to point
      to it using seq_open_private() and seq_release_private().  Note that
      this eventually will be replaced by kernfs_open_file.
      
      While this patch makes more information available to seq_file
      operations, they don't use it yet and this patch doesn't introduce any
      behavior changes except for allocation of the extra private struct.
      
      v2: use __seq_open_private() instead of seq_open_private() for brevity
          as suggested by Li.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      62236858
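      A hedged sketch of how the new struct plugs into seq_file:
      __seq_open_private() allocates it and stores it in seq_file->private,
      and seq_release_private() frees it on close.  The seq_operations name
      and the two lookup helpers are assumptions for illustration:

        struct cgroup_pidlist_open_file {
            enum cgroup_filetype    type;
            struct cgroup           *cgrp;
            struct cgroup_pidlist   *pidlist;   /* loaded lazily */
        };

        static int cgroup_pidlist_open(struct inode *inode, struct file *file)
        {
            struct cgroup_pidlist_open_file *of;

            /* allocates and zeroes 'of' and wires it to seq_file->private */
            of = __seq_open_private(file, &cgroup_pidlist_seq_operations,
                                    sizeof(*of));
            if (!of)
                return -ENOMEM;

            of->type = pidlist_type_of(file);   /* hypothetical helper */
            of->cgrp = cgroup_of_file(file);    /* hypothetical helper */
            return 0;
        }

        static const struct file_operations cgroup_pidlist_operations = {
            .open    = cgroup_pidlist_open,
            .read    = seq_read,
            .llseek  = seq_lseek,
            .release = seq_release_private,     /* frees the private struct */
        };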
    • cgroup: implement delayed destruction for cgroup_pidlist · b1a21367
      Committed by Tejun Heo
      Currently, pidlists are reference counted from file open and release
      methods.  This means that holding onto an open file may waste memory
      and reads may return data which is very stale.  Both aren't critical
      because pidlists are keyed and shared per namespace and, well, the
      user isn't supposed to have a large delay between open and reads.
      
      cgroup is planned to be converted to use kernfs and it'd be best if we
      can stick to just the seq_file operations - start, next, stop and
      show.  This can be achieved by loading pidlist on demand from start
      and release with time delay from stop, so that consecutive reads don't
      end up reloading the pidlist on each iteration.  This would remove the
      need for hooking into open and release while also avoiding issues with
      holding onto pidlist for too long.
      
      This patch implements delayed release of pidlist.  As pidlists could
      be lingering on cgroup removal waiting for the timer to expire, cgroup
      free path needs to queue the destruction work item immediately and
      flush.  As those work items are self-destroying, each work item can't
      be flushed directly.  A new workqueue - cgroup_pidlist_destroy_wq - is
      added to serve as flush domain.
      
      Note that this patch just adds delayed release on top of the current
      implementation and doesn't change where pidlist is loaded and
      released.  Following patches will make those changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      b1a21367
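      A hedged sketch of the mechanics described above; destroy_dwork, owner
      and the bodies are simplified to show the shape of the delayed release
      and the dedicated flush domain, not the exact kernel implementation:

        static struct workqueue_struct *cgroup_pidlist_destroy_wq;

        /* self-destroying work item: unlinks and frees its pidlist */
        static void cgroup_pidlist_destroy_work_fn(struct work_struct *work)
        {
            struct delayed_work *dwork = to_delayed_work(work);
            struct cgroup_pidlist *l = container_of(dwork,
                        struct cgroup_pidlist, destroy_dwork);

            mutex_lock(&l->owner->pidlist_mutex);
            list_del(&l->links);
            mutex_unlock(&l->owner->pidlist_mutex);

            kfree(l->list);
            kfree(l);
        }

        /* readers "drop" the pidlist by postponing destruction */
        static void cgroup_pidlist_put(struct cgroup_pidlist *l)
        {
            mod_delayed_work(cgroup_pidlist_destroy_wq, &l->destroy_dwork,
                             CGROUP_PIDLIST_DESTROY_DELAY);
        }

        /*
         * cgroup free path: self-destroying items can't be flushed one by
         * one, so queue any lingering ones immediately and flush the
         * dedicated workqueue as a whole.
         */
        static void cgroup_pidlists_free(struct cgroup *cgrp)
        {
            struct cgroup_pidlist *l, *tmp;

            mutex_lock(&cgrp->pidlist_mutex);
            list_for_each_entry_safe(l, tmp, &cgrp->pidlists, links)
                mod_delayed_work(cgroup_pidlist_destroy_wq,
                                 &l->destroy_dwork, 0);
            mutex_unlock(&cgrp->pidlist_mutex);

            flush_workqueue(cgroup_pidlist_destroy_wq);
        }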
    • cgroup: remove cftype->release() · b9f3ceca
      Committed by Tejun Heo
      Now that pidlist files don't use cftype->release(), it doesn't have
      any user left.  Remove it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      b9f3ceca
    • cgroup: don't skip seq_open on write only opens on pidlist files · ac1e69aa
      Committed by Tejun Heo
      Currently, cgroup_pidlist_open() skips seq_open() and pidlist loading
      if the file is opened write-only, which is a sensible optimization as
      pidlist loading can be costly and there often are occasions where
      tasks or cgroup.procs is opened write-only.  However, pidlist init and
      release are planned to be moved to cgroup_pidlist_start/stop()
      respectively which would make this optimization unnecessary.
      
      This patch removes the optimization and always fully initializes
      pidlist files regardless of open mode.  This will help moving pidlist
      handling to start/stop by unifying rw paths and removes the need for
      specifying cftype->release() in addition to .release in
      cgroup_pidlist_operations as file->f_op is now always overridden.  As
      pidlist files were the only user of cftype->release(), the next patch
      will remove the method.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      ac1e69aa
  2. Nov 28, 2013 (1 commit)
    • cgroup: fix cgroup_subsys_state leak for seq_files · e605b365
      Committed by Tejun Heo
      If a cgroup file implements either read_map() or read_seq_string(),
      such file is served using seq_file by overriding file->f_op to
      cgroup_seqfile_operations, which also overrides the release method to
      single_release() from cgroup_file_release().
      
      Because cgroup_file_open() didn't use to acquire any resources, this
      used to be fine, but since f7d58818 ("cgroup: pin
      cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
      pins the css (cgroup_subsys_state) which is put by
      cgroup_file_release().  The patch forgot to update the release path
      for seq_files and each open/release cycle leaks a css reference.
      
      Fix it by updating cgroup_file_release() to also handle seq_files and
      using it for seq_file release path too.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v3.12
      e605b365
  3. Nov 23, 2013 (4 commits)
    • cgroup: unexport cgroup_css() and remove __file_cft() · b36824c7
      Committed by Tejun Heo
      Now that cgroup_event is made memcg specific, the temporarily exported
      functions are no longer necessary.  Unexport cgroup_css() and remove
      __file_cft() which doesn't have any user left.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      b36824c7
    • cgroup, memcg: move cgroup->event_list[_lock] and event callbacks into memcg · fba94807
      Committed by Tejun Heo
      cgroup_event is being moved from cgroup core to memcg and the
      implementation is already moved by the previous patch.  This patch
      moves the data fields and callbacks.
      
      * cgroup->event_list[_lock] are moved to mem_cgroup.
      
      * cftype->[un]register_event() are moved to cgroup_event.  This makes
        it impossible for individual cftype definitions to specify their
        event callbacks.  This is worked around by simply hard-coding
        filename to event callback mapping in cgroup_write_event_control().
        This is awkward and inflexible, which is actually desirable given
        that we don't want to grow more usages of this feature.
      
      * eventfd_ctx declaration is removed from cgroup.h, which makes
        vmpressure.h miss eventfd_ctx declaration.  Include eventfd.h from
        vmpressure.h.
      
      v2: Use file name from dentry instead of cftype.  This will allow
          removing all cftype handling in the function.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      fba94807
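      A hedged sketch of the hard-coded mapping: the event-control write
      looks at the target file's name (taken from its dentry) and wires the
      matching memcg handlers directly.  The struct, the handler names and
      their signatures are simplified placeholders modelled on memcg's
      handlers, not the exact code:

        struct cgroup_event_sketch {
            int  (*register_event)(struct mem_cgroup *memcg,
                                   struct eventfd_ctx *eventfd,
                                   const char *args);
            void (*unregister_event)(struct mem_cgroup *memcg,
                                     struct eventfd_ctx *eventfd);
        };

        static int wire_event_callbacks(struct cgroup_event_sketch *event,
                                        const char *name)
        {
            if (!strcmp(name, "memory.usage_in_bytes")) {
                event->register_event = mem_cgroup_usage_register_event;
                event->unregister_event = mem_cgroup_usage_unregister_event;
            } else if (!strcmp(name, "memory.oom_control")) {
                event->register_event = mem_cgroup_oom_register_event;
                event->unregister_event = mem_cgroup_oom_unregister_event;
            } else {
                /* arbitrary cgroup files can no longer be event sources */
                return -EINVAL;
            }
            return 0;
        }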
    • cgroup, memcg: move cgroup_event implementation to memcg · 79bd9814
      Committed by Tejun Heo
      cgroup_event is way over-designed and tries to build a generic
      flexible event mechanism into cgroup - fully customizable event
      specification for each user of the interface.  This is utterly
      unnecessary and overboard especially in the light of the planned
      unified hierarchy as there's gonna be a single agent.  Simply generating
      events at fixed points, or if that's too restrictive, a configurable
      cadence or a single set of configurable points should be enough.
      
      Thankfully, memcg is the only user and gets to keep it.  Replacing it
      with something simpler on sane_behavior is strongly recommended.
      
      This patch moves cgroup_event and "cgroup.event_control"
      implementation to mm/memcontrol.c.  Clearing of events on cgroup
      destruction is moved from cgroup_destroy_locked() to
      mem_cgroup_css_offline(), which shouldn't make any noticeable
      difference.
      
      cgroup_css() and __file_cft() are exported to enable the move;
      however, this will soon be reverted once the event code is updated to
      be memcg specific.
      
      Note that "cgroup.event_control" will now exist only on the hierarchy
      with memcg attached to it.  While this change is visible to userland,
      it is unlikely to be noticeable as the file has never been meaningful
      outside memcg.
      
      Aside from the above change, this is pure code relocation.
      
      v2: Per Li Zefan's comments, init/Kconfig updated accordingly and
          poll.h inclusion moved from cgroup.c to memcontrol.c.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      79bd9814
    • cgroup: use a dedicated workqueue for cgroup destruction · e5fca243
      Committed by Tejun Heo
      Since be445626 ("cgroup: remove synchronize_rcu() from
      cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
      freeing is performed from a work item from that point on and a later
      commit, ea15f8cc ("cgroup: split cgroup destruction into two
      steps"), moves css offlining to workqueue too.
      
      As cgroup destruction isn't depended upon for memory reclaim, the
      destruction work items were put on the system_wq; unfortunately, some
      controllers may block in the destruction path for a considerable duration
      while holding cgroup_mutex.  As a large part of the destruction path is
      synchronized through cgroup_mutex, when combined with high rate of
      cgroup removals, this has potential to fill up system_wq's max_active
      of 256.
      
      Also, it turns out that memcg's css destruction path ends up queueing
      and waiting for work items on system_wq through work_on_cpu().  If
      such operation happens while system_wq is fully occupied by cgroup
      destruction work items, work_on_cpu() can't make forward progress
      because system_wq is full and other destruction work items on
      system_wq can't make forward progress because the work item waiting
      for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
      
      This can be fixed by queueing destruction work items on a separate
      workqueue.  This patch creates a dedicated workqueue -
      cgroup_destroy_wq - for this purpose.  As these work items shouldn't
      have inter-dependencies and are mostly serialized by cgroup_mutex anyway,
      a high concurrency level doesn't buy anything, and the workqueue's
      @max_active is set to 1 so that destruction work items are executed
      one by one on each CPU.
      
      Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
      cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
      separate core_initcall().  In the future, we probably want to reorder
      so that workqueue init happens before cgroup_init().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
      Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
      Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
      Cc: stable@vger.kernel.org # v3.9+
      e5fca243
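      A hedged sketch of the fix: a dedicated workqueue with @max_active = 1,
      allocated from a core_initcall() because cgroup_init() runs before
      init_workqueues().  The destroy_work field name follows this series;
      the queueing helper is illustrative:

        static struct workqueue_struct *cgroup_destroy_wq;

        /*
         * Allocated from a core_initcall() rather than from cgroup_init():
         * workqueues aren't available yet when cgroup_init() runs.
         */
        static int __init cgroup_wq_init(void)
        {
            /* @max_active = 1: destruction items run one at a time per CPU;
             * they are mostly serialized by cgroup_mutex anyway.           */
            cgroup_destroy_wq = alloc_workqueue("cgroup_destroy", 0, 1);
            BUG_ON(!cgroup_destroy_wq);
            return 0;
        }
        core_initcall(cgroup_wq_init);

        /* destruction paths then queue here instead of on system_wq, e.g.: */
        static void cgroup_queue_destroy_work(struct cgroup_subsys_state *css)
        {
            queue_work(cgroup_destroy_wq, &css->destroy_work);
        }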
  4. Nov 16, 2013 (1 commit)
  5. Oct 14, 2013 (1 commit)
  6. Sep 24, 2013 (1 commit)
  7. Sep 10, 2013 (1 commit)
    • cgroup: fix cgroup post-order descendant walk of empty subtree · 58b79a91
      Committed by Tejun Heo
      bd8815a6 ("cgroup: make css_for_each_descendant() and friends
      include the origin css in the iteration") updated cgroup descendant
      iterators to include the origin css; unfortunately, it forgot to drop
      the special case handling in css_next_descendant_post() for an empty
      subtree, leading to a failure to visit the origin css when it has no
      children.
      
      Fix it by dropping the special case handling and always returning the
      leftmost descendant on the first iteration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      58b79a91
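      A hedged sketch of the fixed walk over a first-child/next-sibling tree:
      the first iteration always descends to the leftmost leaf (which is the
      origin itself when the subtree is empty), and the origin is visited
      last.  The types are illustrative, not the kernel's css iterator:

        struct node {
            struct node *parent, *first_child, *next_sibling;
        };

        static struct node *leftmost_descendant(struct node *pos)
        {
            while (pos->first_child)
                pos = pos->first_child;
            return pos;
        }

        static struct node *next_descendant_post(struct node *pos,
                                                 struct node *root)
        {
            if (!pos)                       /* first iteration */
                return leftmost_descendant(root);
            if (pos == root)                /* origin is visited last */
                return NULL;
            if (pos->next_sibling)          /* visit the sibling subtree */
                return leftmost_descendant(pos->next_sibling);
            return pos->parent;             /* children done, visit parent */
        }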
  8. Sep 8, 2013 (1 commit)
  9. Aug 29, 2013 (1 commit)
    • cgroup: fix rmdir EBUSY regression in 3.11 · bb78a92f
      Committed by Hugh Dickins
      On 3.11-rc we are seeing cgroup directories left behind when they should
      have been removed.  Here's a trivial reproducer:
      
      cd /sys/fs/cgroup/memory
      mkdir parent parent/child; rmdir parent/child parent
      rmdir: failed to remove `parent': Device or resource busy
      
      It's because cgroup_destroy_locked() (step 1 of destruction) leaves
      cgroup on parent's children list, letting cgroup_offline_fn() (step 2 of
      destruction) remove it; but step 2 is run by work queue, which may not
      yet have removed the children when parent destruction checks the list.
      
      Fix that by checking through a non-empty list of children: if every one
      of them has already been marked CGRP_DEAD, then it's safe to proceed:
      those children are invisible to userspace, and should not obstruct rmdir.
      
      (I didn't see any reason to keep the cgrp->children checks under the
      unrelated css_set_lock, so moved them out.)
      
      tj: Flattened nested ifs a bit and updated comment so that it's
          correct on both for-3.11-fixes and for-3.12.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      bb78a92f
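      A hedged sketch of the check described above; field names match that
      era's cgroup struct, locking context is omitted:

        /*
         * A non-empty ->children list only blocks rmdir if some child
         * hasn't yet been marked CGRP_DEAD by step 1 of its own
         * destruction; dead children are invisible to userspace.
         */
        static bool cgroup_has_live_children(struct cgroup *cgrp)
        {
            struct cgroup *child;

            list_for_each_entry(child, &cgrp->children, sibling)
                if (!cgroup_is_dead(child))
                    return true;
            return false;
        }

      cgroup_destroy_locked() would then return -EBUSY only when this helper
      reports a live child, instead of whenever the list is non-empty.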
  10. Aug 28, 2013 (1 commit)
  11. Aug 27, 2013 (5 commits)
  12. Aug 19, 2013 (3 commits)
    • cgroup: fix cgroup_write_event_control() · 6e6eab0e
      Committed by Tejun Heo
      81eeaf04 ("cgroup: make cftype->[un]register_event() deal with
      cgroup_subsys_state instead of cgroup") updated the cftype event
      methods to take @css (cgroup_subsys_state) instead of @cgroup;
      however, it incorrectly used @css passed to
      cgroup_write_event_control(), which is the dummy_css for the cgroup as
      the file is a cgroup core file.  This leads to an oops on event
      registration.
      
      Fix it by using the css matching the event target file.  Note that
      cgroup_write_event_control() now disallows cgroup core files from
      being event sources.  This is for simplicity and doesn't matter as
      cgroup_event will be moved and made specific to memcg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      6e6eab0e
    • cgroup: fix subsystem file accesses on the root cgroup · 0bfb4aa6
      Committed by Tejun Heo
      105347ba ("cgroup: make cgroup_file_open() rcu_read_lock() around
      cgroup_css() and add cfent->css") added cfent->css to cache the
      associated cgroup_subsys_state across file operations.
      
      A cfent is associated with a single css throughout its lifetime and the
      original commit initialized the cache pointer during cgroup_add_file()
      and verified that it matches the actual one in cgroup_file_open().
      While this works fine for !root cgroups, it's broken for root cgroups
      as files in a root cgroup are created before the css's are associated
      with the cgroup and thus cgroup_css() call in cgroup_add_file()
      returns NULL associating all cfents in the root cgroup with NULL css.
      This makes cgroup_file_open() trigger WARN and fail with -ENODEV for
      all !core subsystem files in the root cgroups.
      
      There's no reason to initialize cfent->css separately from
      cgroup_add_file().  As the association never changes,
      cgroup_file_open() can set it unconditionally every time and
      containing the logic in cgroup_file_open() makes more sense anyway as
      the only reason it's necessary is file->private_data being already
      occupied.
      
      Fix it by setting cfent->css unconditionally from cgroup_file_open().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      0bfb4aa6
    • cgroup: change cgroup_from_id() to css_from_id() · 1cb650b9
      Committed by Li Zefan
      Now we want cgroup core to always provide the css to use to the
      subsystems, so change this API to css_from_id().
      
      Uninline css_from_id(), because it's getting bigger and cgroup_css()
      has been unexported.
      
      While at it, remove the #ifdef, and shuffle the order of the args.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      1cb650b9
  13. Aug 16, 2013 (1 commit)
    • cgroup: use css_get() in cgroup_create() to check CSS_ROOT · 930913a3
      Committed by Li Zhong
      It seems that the root css doesn't have a refcnt allocated (not needed?),
      and would cause the booting error attached.
      
      This patch tries to use css_get() to not increase the refcnt if parent
      is root.
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff810b37cc>] cgroup_mkdir+0x37c/0x740
        PGD 0
        Oops: 0002 [#1]
        Modules linked in:
        CPU: 0 PID: 1 Comm: systemd Not tainted 3.11.0-rc5-next-20130815+ #1
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
        task: ffff88007f868000 ti: ffff88007f864000 task.ti: ffff88007f864000
        RIP: 0010:[<ffffffff810b37cc>]  [<ffffffff810b37cc>] cgroup_mkdir+0x37c/0x740
        RSP: 0018:ffff88007f865df8  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffffffff81a46ee0 RCX: 0000000000000001
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81a415c0
        RBP: ffff88007f865ec8 R08: 0000000000000001 R09: 0000000000000000
        R10: ffff88007ce6d060 R11: 0000000000000000 R12: ffff88007ce6d000
        R13: ffff88007ce6d060 R14: ffffffff81a46d80 R15: ffff88007c6e8018
        FS:  00007f13dbf6f840(0000) GS:ffffffff81a23000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 000000007b7e5000 CR4: 00000000000006b0
        Stack:
         ffffffff810b380d 0000000000000002 ffff88007f865e18 ffffffff81167069
         ffff88007f865ed8 ffffffff8116a3f5 ffff880037454400 ffff88007c6e8018
         ffff88007c6e8028 ffff88007c6e8328 ffff88007c6e8000 ffff88007ce6d000
        Call Trace:
         [<ffffffff810b380d>] ? cgroup_mkdir+0x3bd/0x740
         [<ffffffff81167069>] ? lookup_hash+0x19/0x20
         [<ffffffff8116a3f5>] ? kern_path_create+0x95/0x170
         [<ffffffff8116ce3e>] vfs_mkdir+0x9e/0xf0
         [<ffffffff8116d7a0>] SyS_mkdirat+0x60/0xe0
         [<ffffffff8116d839>] SyS_mkdir+0x19/0x20
         [<ffffffff814c960d>] tracesys+0xcf/0xd4
        Code: ad 70 ff ff ff 48 89 9d 60 ff ff ff 4d 89 d5 4c 8b bd 68 ff ff ff 4c 8b 65 88 eb 50 0f 1f 00 48 8b 43 18 a8 03 0f 85 6c 03 00 00 <ff> 00 e8 1d 0a fb ff 85 c0 74 0d 80 3d f0 45 a1 00 00 0f 84 4c
        RIP  [<ffffffff810b37cc>] cgroup_mkdir+0x37c/0x740
         RSP <ffff88007f865df8>
        CR2: 0000000000000000
        ---[ end trace a4b14b49bc46fd60 ]---
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      930913a3
  14. Aug 14, 2013 (7 commits)
    • cgroup: RCU protect each cgroup_subsys_state release · 0c21ead1
      Committed by Tejun Heo
      With the planned unified hierarchy, individual css's will be created
      and destroyed dynamically across the lifetime of a cgroup.  To enable
      such usages, css destruction is being decoupled from cgroup
      destruction.  Most of the destruction path has been decoupled but the
      actual free of css still depends on cgroup free path.
      
      When all css refs are drained, css_release() kicks off
      css_free_work_fn() which puts the cgroup.  When the cgroup refcnt
      reaches zero, cgroup_diput() is invoked which in turn schedules RCU
      free of the cgroup.  After a grace period, all css's are freed along
      with the cgroup itself.
      
      This patch moves the RCU grace period and css freeing from cgroup
      release path to css release path.  css_release(), instead of kicking
      off css_free_work_fn() directly, schedules RCU callback
      css_free_rcu_fn() which in turn kicks off css_free_work_fn() after a
      RCU grace period.  css_free_work_fn() is updated to free the css
      directly.
      
      The five-way punting - percpu ref kill confirmation, a work item,
      percpu ref release, RCU grace period, and again a work item - is quite
      hairy but the work items are there only to provide process context and
      the actual sequence is kill confirm -> release -> RCU free, which
      isn't simple but not too crazy.
      
      This removes cgroup_css() usage after offline_css() allowing clearing
      cgroup->subsys[] from offline_css(), which makes it consistent with
      online_css() and brings it closer to proper lifetime management for
      individual css's.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      0c21ead1
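      A hedged sketch of the resulting release chain: percpu ref release
      schedules an RCU callback, which after a grace period bounces to a work
      item that frees the css and drops the cgroup reference.  Field and
      helper names follow the series' description and are simplified:

        static void css_free_work_fn(struct work_struct *work)
        {
            struct cgroup_subsys_state *css =
                container_of(work, struct cgroup_subsys_state, destroy_work);
            struct cgroup *cgrp = css->cgroup;

            if (css->parent)
                css_put(css->parent);

            css->ss->css_free(css);     /* free the css itself ... */
            cgroup_dput(cgrp);          /* ... then drop the cgroup ref
                                           (pre-kernfs helper, assumed) */
        }

        static void css_free_rcu_fn(struct rcu_head *rcu_head)
        {
            struct cgroup_subsys_state *css =
                container_of(rcu_head, struct cgroup_subsys_state, rcu_head);

            /* the work item only provides process context for the free path */
            INIT_WORK(&css->destroy_work, css_free_work_fn);
            schedule_work(&css->destroy_work);
        }

        static void css_release(struct percpu_ref *ref)
        {
            struct cgroup_subsys_state *css =
                container_of(ref, struct cgroup_subsys_state, refcnt);

            /* all refs are gone: wait one RCU grace period, then free */
            call_rcu(&css->rcu_head, css_free_rcu_fn);
        }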
    • cgroup: move subsys file removal to kill_css() · 3c14f8b4
      Committed by Tejun Heo
      With the planned unified hierarchy, individual css's will be created
      and destroyed dynamically across the lifetime of a cgroup.  To enable
      such usages, css destruction is being decoupled from cgroup
      destruction.  This patch moves subsys file removal from
      cgroup_destroy_locked() to kill_css().
      
      While this changes the order of destruction operations, the changes
      shouldn't be noticeable to cgroup subsystems or userland.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      3c14f8b4
    • cgroup: factor out kill_css() · edae0c33
      Committed by Tejun Heo
      Factor out css ref killing from cgroup_destroy_locked() into
      kill_css().  We're gonna add more to the path and the factored out
      function will eventually be called from other places too.
      
      While at it, replace open coded percpu_ref_get() with css_get() for
      consistency.  This shouldn't cause any functional difference as the
      function is not used for root cgroups.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      edae0c33
    • cgroup: decouple cgroup_subsys_state destruction from cgroup destruction · 09a503ea
      Committed by Tejun Heo
      Currently, css (cgroup_subsys_state) lifetime is tied to that of the
      associated cgroup.  css's are created when the associated cgroup is
      created and destroyed when it gets destroyed.  Also, individual css's
      aren't RCU protected but the whole cgroup is.  With the planned
      unified hierarchy, css's will need to be dynamically created and
      destroyed within the lifetime of a cgroup.
      
      To enable such usages, this patch decouples css destruction from
      cgroup destruction - offline_css() invocation and the final css_put()
      are moved from cgroup_destroy_css_killed() to css_killed_work_fn().
      Now each css is individually offlined and put as its reference count
      is killed instead of waiting for all css's attached to the cgroup to
      finish refcnt killing and then proceeding to offlining and putting
      them together.
      
      While this changes the order of destruction operations, the changes
      shouldn't be noticeable to cgroup subsystems or userland.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      09a503ea
    • cgroup: replace cgroup->css_kill_cnt with ->nr_css · f20104de
      Committed by Tejun Heo
      Currently, css (cgroup_subsys_state) lifetime is tied to that of the
      associated cgroup.  With the planned unified hierarchy, css's will be
      dynamically created and destroyed within the lifetime of a cgroup.  To
      enable such usages, css's will be individually RCU protected instead
      of being tied to the cgroup.
      
      cgroup->css_kill_cnt is used during cgroup destruction to wait for css
      reference count disable; however, this model doesn't work once css's
      lifetimes are managed separately from cgroup's.  This patch replaces
      it with cgroup->nr_css which is a cgroup_mutex protected integer
      counting the number of attached css's.  The count is incremented from
      online_css() and decremented after refcnt kill is confirmed.  If the
      count reaches zero and the cgroup is marked dead, the second stage of
      cgroup destruction is kicked off.  If a cgroup doesn't have any css
      attached at the time of rmdir, cgroup_destroy_locked() now invokes the
      second stage directly as no css kill confirmation would happen.
      
      cgroup_offline_fn() - the second step of cgroup destruction - is
      renamed to cgroup_destroy_css_killed() and now expects to be called
      with cgroup_mutex held.
      
      While this patch changes how css destruction is punted to work items,
      it shouldn't change any visible behavior.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      f20104de
    • cgroup: bounce cgroup_subsys_state ref kill confirmation to a work item · 223dbc38
      Committed by Tejun Heo
      css (cgroup_subsys_state) offlining, which requires process context,
      will be moved to ref kill confirmation.  In preparation, bounce
      css_killed handling through css->destroy_work.
      
      css_ref_killed_fn() is renamed to css_killed_ref_fn() so that it's
      consistent with the new css_killed_work_fn().
      
      This patch adds an additional work item bouncing but doesn't change
      the actual logic.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      223dbc38
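      A hedged sketch of the bounce: the percpu-ref "killed" confirmation runs
      in atomic context, so it only queues a work item; the work item is where
      offlining (which needs process context) will later be moved.  Bodies are
      simplified:

        static void css_killed_work_fn(struct work_struct *work)
        {
            struct cgroup_subsys_state *css =
                container_of(work, struct cgroup_subsys_state, destroy_work);

            /* later patches move offline_css() and the final css_put() here */
        }

        /* confirmation callback: atomic context, so bounce to a work item */
        static void css_killed_ref_fn(struct percpu_ref *ref)
        {
            struct cgroup_subsys_state *css =
                container_of(ref, struct cgroup_subsys_state, refcnt);

            INIT_WORK(&css->destroy_work, css_killed_work_fn);
            schedule_work(&css->destroy_work);
        }

        /* initiation, kill_css()-style: confirm once the refs drain */
        static void kill_css_sketch(struct cgroup_subsys_state *css)
        {
            percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
        }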
    • cgroup: move cgroup->subsys[] assignment to online_css() · ae7f164a
      Committed by Tejun Heo
      Currently, css (cgroup_subsys_state) lifetime is tied to that of the
      associated cgroup.  With the planned unified hierarchy, css's will be
      dynamically created and destroyed within the lifetime of a cgroup.  To
      enable such usages, css's will be individually RCU protected instead
      of being tied to the cgroup.
      
      In preparation, this patch moves cgroup->subsys[] assignment from
      init_css() to online_css().  As this means that a newly initialized
      css should be remembered separately and that cgroup_css() returns NULL
      between init and online, cgroup_create() is updated so that it stores
      newly created css's in a local array css_ar[] and
      cgroup_init/load_subsys() are updated to use local variable @css
      instead of using cgroup_css().  This change also slightly simplifies
      error path of cgroup_create().
      
      While this patch changes when cgroup->subsys[] is initialized, this
      change isn't visible to subsystems or userland.
      
      v2: This patch wasn't updated accordingly after the previous "cgroup:
          reorganize css init / exit paths" was updated leading to missing a
          css_ar[] conversion in cgroup_create() and thus boot failure.  Fix
          it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      ae7f164a
  15. Aug 13, 2013 (5 commits)
    • cgroup: reorganize css init / exit paths · 623f926b
      Committed by Tejun Heo
      css (cgroup_subsys_state) lifetime management is about to be
      restructured.  In preparation, make the following mostly trivial
      changes.
      
      * init_cgroup_css() is renamed to init_css() so that it's consistent
        with other css handling functions.
      
      * alloc_css_id(), online_css() and offline_css() updated to take @css
        instead of cgroups and subsys IDs.
      
      This patch doesn't make any functional changes.
      
      v2: v1 merged two for_each_root_subsys() loops in cgroup_create() but
          Li Zefan pointed out that it breaks error path.  Dropped.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      623f926b
    • cgroup: add __rcu modifier to cgroup->subsys[] · 73e80ed8
      Committed by Tejun Heo
      For the planned unified hierarchy, each css (cgroup_subsys_state) will
      be RCU protected so that it can be created and destroyed individually
      while allowing RCU accesses.  Previous changes ensured that all
      cgroup->subsys[] accesses use the cgroup_css() accessor.  This patch
      adds __rcu modifier to cgroup->subsys[], add matching RCU dereference
      in cgroup_css() and convert all assignments to either
      rcu_assign_pointer() or RCU_INIT_POINTER().
      
      This change prepares for the actual RCUfication of css's and doesn't
      introduce any visible behavior change.  The conversion is verified
      with sparse and all accesses are properly RCU annotated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      73e80ed8
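      A hedged sketch of the annotation pattern described above, with
      _sketch names to make clear it is illustrative rather than the actual
      cgroup code:

        struct cgroup_sketch {
            /* writes must use rcu_assign_pointer()/RCU_INIT_POINTER() */
            struct cgroup_subsys_state __rcu *subsys[CGROUP_SUBSYS_COUNT];
        };

        /* reads must hold rcu_read_lock() or cgroup_mutex (sparse/lockdep
         * checked via rcu_dereference_check())                            */
        static struct cgroup_subsys_state *
        cgroup_css_sketch(struct cgroup_sketch *cgrp, int subsys_id)
        {
            return rcu_dereference_check(cgrp->subsys[subsys_id],
                                         lockdep_is_held(&cgroup_mutex));
        }

        static void online_css_sketch(struct cgroup_sketch *cgrp, int subsys_id,
                                      struct cgroup_subsys_state *css)
        {
            /* publish with the memory barrier RCU readers rely on */
            rcu_assign_pointer(cgrp->subsys[subsys_id], css);
        }

        static void offline_css_sketch(struct cgroup_sketch *cgrp, int subsys_id)
        {
            /* no barrier needed when clearing or initializing */
            RCU_INIT_POINTER(cgrp->subsys[subsys_id], NULL);
        }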
    • cgroup: make cgroup_file_open() rcu_read_lock() around cgroup_css() and add cfent->css · 105347ba
      Committed by Tejun Heo
      For the planned unified hierarchy, each css (cgroup_subsys_state) will
      be RCU protected so that it can be created and destroyed individually
      while allowing RCU accesses, and cgroup_css() will soon require either
      holding cgroup_mutex or RCU read lock.
      
      This patch updates cgroup_file_open() such that it acquires the
      associated css under rcu_read_lock().  While cgroup_file_css() usages
      in other file operations are safe due to the reference from open,
      cgroup_css() wouldn't know that and will still trigger warnings.  It'd
      be cleanest to store the acquired css in file->private_data for
      further file operations but that's already used by seqfile.  This
      patch instead adds cfent->css to cache the associated css.  Note that
      while this field is initialized during cfe init, it should only be
      considered valid while the file is open.
      
      This patch doesn't change visible behavior.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      105347ba
    • cgroup: cgroup_css_from_dir() now should be called with RCU read locked · b77d7b60
      Committed by Tejun Heo
      cgroup->subsys[] will become RCU protected and thus all cgroup_css()
      usages should either be under RCU read lock or cgroup_mutex.  This
      patch updates cgroup_css_from_dir() which returns the matching
      cgroup_subsys_state given a directory file and subsys_id so that it
      requires RCU read lock and updates its sole user
      perf_cgroup_connect().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      b77d7b60
    • cgroup: add cgroup_subsys_state->parent · 0ae78e0b
      Committed by Tejun Heo
      With the planned unified hierarchy, css's (cgroup_subsys_state) will
      be RCU protected and allowed to be attached and detached dynamically
      over the course of a cgroup's lifetime.  This means that css's will
      stay accessible after being detached from its cgroup - the matching
      pointer in cgroup->subsys[] cleared - for ref draining and RCU grace
      period.
      
      cgroup core still wants to guarantee that the parent css is never
      destroyed before its children and css_parent() always returns the
      parent regardless of the state of the child css as long as it's
      accessible.
      
      This patch makes css's hold onto their parents and adds css->parent so
      that the parent css is never destroyed before its children and can be
      determined without consulting the cgroups.
      
      cgroup->dummy_css is also updated to point to the parent dummy_css;
      however, it doesn't need to worry about object lifetime as the parent
      cgroup is already pinned by the child.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      0ae78e0b
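      A hedged sketch of the init-time pinning described above; the exact
      init signature and the cgroup_css() argument form are simplified, and
      the _sketch names mark the code as illustrative:

        static void init_css_sketch(struct cgroup_subsys_state *css,
                                    struct cgroup_subsys *ss,
                                    struct cgroup *cgrp)
        {
            css->cgroup = cgrp;
            css->ss = ss;

            if (cgrp->parent) {
                css->parent = cgroup_css(cgrp->parent, ss->subsys_id);
                css_get(css->parent);   /* parent css now outlives this child */
            } else {
                css->parent = NULL;     /* root css has no parent */
            }
        }

        /* css_parent() no longer needs to consult the cgroup hierarchy */
        static inline struct cgroup_subsys_state *
        css_parent_sketch(struct cgroup_subsys_state *css)
        {
            return css->parent;
        }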