1. 20 11月, 2012 16 次提交
    • T
      cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free() · 92fb9748
      Tejun Heo 提交于
      Rename cgroup_subsys css lifetime related callbacks to better describe
      what their roles are.  Also, update documentation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      92fb9748
    • T
      cgroup: allow ->post_create() to fail · b1929db4
      Tejun Heo 提交于
      There could be cases where controllers want to do initialization
      operations which may fail from ->post_create().  This patch makes
      ->post_create() return -errno to indicate failure and online_css()
      relay such failures.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Glauber Costa <glommer@parallels.com>
      b1929db4
    • T
      cgroup: update cgroup_create() failure path · 4b8b47eb
      Tejun Heo 提交于
      cgroup_create() was ignoring failure of cgroupfs files.  Update it
      such that, if file creation fails, it rolls back by calling
      cgroup_destroy_locked() and returns failure.
      
      Note that error out goto labels are renamed.  The labels are a bit
      confusing but will become better w/ later cgroup operation renames.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      4b8b47eb
    • T
      cgroup: use mutex_trylock() when grabbing i_mutex of a new cgroup directory · b8a2df6a
      Tejun Heo 提交于
      All cgroup directory i_mutexes nest outside cgroup_mutex; however, new
      directory creation is a special case.  A new cgroup directory is
      created while holding cgroup_mutex.  Populating the new directory
      requires both the new directory's i_mutex and cgroup_mutex.  Because
      all directory i_mutexes nest outside cgroup_mutex, grabbing both
      requires releasing cgroup_mutex first, which isn't a good idea as the
      new cgroup isn't yet ready to be manipulated by other cgroup
      opreations.
      
      This is worked around by grabbing the new directory's i_mutex while
      holding cgroup_mutex before making it visible.  As there's no other
      user at that point, grabbing the i_mutex under cgroup_mutex can't lead
      to deadlock.
      
      cgroup_create_file() was using I_MUTEX_CHILD to tell lockdep not to
      worry about the reverse locking order; however, this creates pseudo
      locking dependency cgroup_mutex -> I_MUTEX_CHILD, which isn't true -
      all directory i_mutexes are still nested outside cgroup_mutex.  This
      pseudo locking dependency can lead to spurious lockdep warnings.
      
      Use mutex_trylock() instead.  This will always succeed and lockdep
      doesn't create any locking dependency for it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      b8a2df6a
    • T
      cgroup: simplify cgroup_load_subsys() failure path · d19e19de
      Tejun Heo 提交于
      Now that cgroup_unload_subsys() can tell whether the root css is
      online or not, we can safely call cgroup_unload_subsys() after idr
      init failure in cgroup_load_subsys().
      
      Replace the manual unrolling and invoke cgroup_unload_subsys() on
      failure.  This drops cgroup_mutex inbetween but should be safe as the
      subsystem will fail try_module_get() and thus can't be mounted
      inbetween.  As this means that cgroup_unload_subsys() can be called
      before css_sets are rehashed, remove BUG_ON() on %NULL
      css_set->subsys[] from cgroup_unload_subsys().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      d19e19de
    • T
      cgroup: introduce CSS_ONLINE flag and on/offline_css() helpers · a31f2d3f
      Tejun Heo 提交于
      New helpers on/offline_css() respectively wrap ->post_create() and
      ->pre_destroy() invocations.  online_css() sets CSS_ONLINE after
      ->post_create() is complete and offline_css() invokes ->pre_destroy()
      iff CSS_ONLINE is set and clears it while also handling the temporary
      dropping of cgroup_mutex.
      
      This patch doesn't introduce any behavior change at the moment but
      will be used to improve cgroup_create() failure path and allow
      ->post_create() to fail.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      a31f2d3f
    • T
      cgroup: separate out cgroup_destroy_locked() · 42809dd4
      Tejun Heo 提交于
      Separate out cgroup_destroy_locked() from cgroup_destroy().  This will
      be later used in cgroup_create() failure path.
      
      While at it, add lockdep asserts on i_mutex and cgroup_mutex, and move
      @d and @parent assignments to their declarations.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      42809dd4
    • T
      cgroup: fix harmless bugs in cgroup_load_subsys() fail path and cgroup_unload_subsys() · 02ae7486
      Tejun Heo 提交于
      * If idr init fails, cgroup_load_subsys() cleared dummytop->subsys[]
        before calilng ->destroy() making CSS inaccessible to the callback,
        and didn't unlink ss->sibling.  As no modular controller uses
        ->use_id, this doesn't cause any actual problems.
      
      * cgroup_unload_subsys() was forgetting to free idr, call
        ->pre_destroy() and clear ->active.  As there currently is no
        modular controller which uses ->use_id, ->pre_destroy() or ->active,
        this doesn't cause any actual problems.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      02ae7486
    • T
      cgroup: lock cgroup_mutex in cgroup_init_subsys() · 648bb56d
      Tejun Heo 提交于
      Make cgroup_init_subsys() grab cgroup_mutex while initializing a
      subsystem so that all helpers and callbacks are called under the
      context they expect.  This isn't strictly necessary as
      cgroup_init_subsys() doesn't race with anybody but will allow adding
      lockdep assertions.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      648bb56d
    • T
      cgroup: trivial cleanup for cgroup_init/load_subsys() · b48c6a80
      Tejun Heo 提交于
      Consistently use @css and @dummytop in these two functions instead of
      referring to them indirectly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      b48c6a80
    • T
      cgroup: make CSS_* flags bit masks instead of bit positions · 38b53aba
      Tejun Heo 提交于
      Currently, CSS_* flags are defined as bit positions and manipulated
      using atomic bitops.  There's no reason to use atomic bitops for them
      and bit positions are clunkier to deal with than bit masks.  Make
      CSS_* bit masks instead and use the usual C bitwise operators to
      access them.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      38b53aba
    • T
      cgroup: cgroup->dentry isn't a RCU pointer · febfcef6
      Tejun Heo 提交于
      cgroup->dentry is marked and used as a RCU pointer; however, it isn't
      one - the final dentry put doesn't go through call_rcu().  cgroup and
      dentry share the same RCU freeing rule via synchronize_rcu() in
      cgroup_diput() (kfree_rcu() used on cgrp is unnecessary).  If cgrp is
      accessible under RCU read lock, so is its dentry and dereferencing
      cgrp->dentry doesn't need any further RCU protection or annotation.
      
      While not being accurate, before the previous patch, the RCU accessors
      served a purpose as memory barriers - cgroup->dentry used to be
      assigned after the cgroup was made visible to cgroup_path(), so the
      assignment and dereferencing in cgroup_path() needed the memory
      barrier pair.  Now that list_add_tail_rcu() happens after
      cgroup->dentry is assigned, this no longer is necessary.
      
      Remove the now unnecessary and misleading RCU annotations from
      cgroup->dentry.  To make up for the removal of rcu_dereference_check()
      in cgroup_path(), add an explicit rcu_lockdep_assert(), which asserts
      the dereference rule of @cgrp, not cgrp->dentry.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      febfcef6
    • T
      cgroup: create directory before linking while creating a new cgroup · 4e139afc
      Tejun Heo 提交于
      While creating a new cgroup, cgroup_create() links the newly allocated
      cgroup into various places before trying to create its directory.
      Because cgroup life-cycle is tied to the vfs objects, this makes it
      impossible to use cgroup_rmdir() for rolling back creation - the
      removal logic depends on having full vfs objects.
      
      This patch moves directory creation above linking and collect linking
      operations to one place.  This allows directory creation failure to
      share error exit path with css allocation failures and any failure
      sites afterwards (to be added later) can use cgroup_rmdir() logic to
      undo creation.
      
      Note that this also makes the memory barriers around cgroup->dentry,
      which currently is misleadingly using RCU operations, unnecessary.
      This will be handled in the next patch.
      
      While at it, locking BUG_ON() on i_mutex is converted to
      lockdep_assert_held().
      
      v2: Patch originally removed %NULL dentry check in cgroup_path();
          however, Li pointed out that this patch doesn't make it
          unnecessary as ->create() may call cgroup_path().  Drop the
          change for now.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      4e139afc
    • T
      cgroup: open-code cgroup_create_dir() · 28fd6f30
      Tejun Heo 提交于
      The operation order of cgroup creation is about to change and
      cgroup_create_dir() is more of a hindrance than a proper abstraction.
      Open-code it by moving the parent nlink adjustment next to self nlink
      adjustment in cgroup_create_file() and the rest to cgroup_create().
      
      This patch doesn't introduce any behavior change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      28fd6f30
    • T
      cgroup: initialize cgrp->allcg_node in init_cgroup_housekeeping() · 2243076a
      Tejun Heo 提交于
      Not strictly necessary but it's annoying to have uninitialized
      list_head around.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      2243076a
    • T
      cgroup: remove incorrect dget/dput() pair in cgroup_create_dir() · 17543163
      Tejun Heo 提交于
      cgroup_create_dir() does weird dancing with dentry refcnt.  On
      success, it gets and then puts it achieving nothing.  On failure, it
      puts but there isn't no matching get anywhere leading to the following
      oops if cgroup_create_file() fails for whatever reason.
      
        ------------[ cut here ]------------
        kernel BUG at /work/os/work/fs/dcache.c:552!
        invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in:
        CPU 2
        Pid: 697, comm: mkdir Not tainted 3.7.0-rc4-work+ #3 Bochs Bochs
        RIP: 0010:[<ffffffff811d9c0c>]  [<ffffffff811d9c0c>] dput+0x1dc/0x1e0
        RSP: 0018:ffff88001a3ebef8  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff88000e5b1ef8 RCX: 0000000000000403
        RDX: 0000000000000303 RSI: 2000000000000000 RDI: ffff88000e5b1f58
        RBP: ffff88001a3ebf18 R08: ffffffff82c76960 R09: 0000000000000001
        R10: ffff880015022080 R11: ffd9bed70f48a041 R12: 00000000ffffffea
        R13: 0000000000000001 R14: ffff88000e5b1f58 R15: 00007fff57656d60
        FS:  00007ff05fcb3800(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000004046f0 CR3: 000000001315f000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process mkdir (pid: 697, threadinfo ffff88001a3ea000, task ffff880015022080)
        Stack:
         ffff88001a3ebf48 00000000ffffffea 0000000000000001 0000000000000000
         ffff88001a3ebf38 ffffffff811cc889 0000000000000001 ffff88000e5b1ef8
         ffff88001a3ebf68 ffffffff811d1fc9 ffff8800198d7f18 ffff880019106ef8
        Call Trace:
         [<ffffffff811cc889>] done_path_create+0x19/0x50
         [<ffffffff811d1fc9>] sys_mkdirat+0x59/0x80
         [<ffffffff811d2009>] sys_mkdir+0x19/0x20
         [<ffffffff81be1e02>] system_call_fastpath+0x16/0x1b
        Code: 00 48 8d 90 18 01 00 00 48 89 93 c0 00 00 00 4c 89 a0 18 01 00 00 48 8b 83 a0 00 00 00 83 80 28 01 00 00 01 e8 e6 6f a0 00 eb 92 <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 fe 41
        RIP  [<ffffffff811d9c0c>] dput+0x1dc/0x1e0
         RSP <ffff88001a3ebef8>
        ---[ end trace 1277bcfd9561ddb0 ]---
      
      Fix it by dropping the unnecessary dget/dput() pair.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      17543163
  2. 10 11月, 2012 3 次提交
    • T
      cgroup: implement generic child / descendant walk macros · 574bd9f7
      Tejun Heo 提交于
      Currently, cgroup doesn't provide any generic helper for walking a
      given cgroup's children or descendants.  This patch adds the following
      three macros.
      
      * cgroup_for_each_child() - walk immediate children of a cgroup.
      
      * cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
        in pre-order tree traversal.
      
      * cgroup_for_each_descendant_post() - visit all descendants of a
        cgroup in post-order tree traversal.
      
      All three only require the user to hold RCU read lock during
      traversal.  Verifying that each iterated cgroup is online is the
      responsibility of the user.  When used with proper synchronization,
      cgroup_for_each_descendant_pre() can be used to propagate state
      updates to descendants in reliable way.  See comments for details.
      
      v2: s/config/state/ in commit message and comments per Michal.  More
          documentation on synchronization rules.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujisu.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      574bd9f7
    • T
      cgroup: use rculist ops for cgroup->children · eb6fd504
      Tejun Heo 提交于
      Use RCU safe list operations for cgroup->children.  This will be used
      to implement cgroup children / descendant walking which can be used by
      controllers.
      
      Note that cgroup_create() now puts a new cgroup at the end of the
      ->children list instead of head.  This isn't strictly necessary but is
      done so that the iteration order is more conventional.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      eb6fd504
    • T
      cgroup: add cgroup_subsys->post_create() · a8638030
      Tejun Heo 提交于
      Currently, there's no way for a controller to find out whether a new
      cgroup finished all ->create() allocatinos successfully and is
      considered "live" by cgroup.
      
      This becomes a problem later when we add generic descendants walking
      to cgroup which can be used by controllers as controllers don't have a
      synchronization point where it can synchronize against new cgroups
      appearing in such walks.
      
      This patch adds ->post_create().  It's called after all ->create()
      succeeded and the cgroup is linked into the generic cgroup hierarchy.
      This plays the counterpart of ->pre_destroy().
      
      When used in combination with the to-be-added generic descendant
      iterators, ->post_create() can be used to implement reliable state
      inheritance.  It will be explained with the descendant iterators.
      
      v2: Added a paragraph about its future use w/ descendant iterators per
          Michal.
      
      v3: Forgot to add ->post_create() invocation to cgroup_load_subsys().
          Fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Glauber Costa <glommer@parallels.com>
      a8638030
  3. 08 11月, 2012 1 次提交
  4. 06 11月, 2012 6 次提交
    • T
      cgroup: make ->pre_destroy() return void · bcf6de1b
      Tejun Heo 提交于
      All ->pre_destory() implementations return 0 now, which is the only
      allowed return value.  Make it return void.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      bcf6de1b
    • T
      cgroup: remove CGRP_WAIT_ON_RMDIR, cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir() · b25ed609
      Tejun Heo 提交于
      CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup
      destruction rollback somewhat working.  cgroup_rmdir() used to drain
      CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and
      helpers were used to allow the task performing rmdir to wait for the
      next relevant event.
      
      Unfortunately, the wait is visible to controllers too and the
      mechanism got exposed to memcg by 88703267 ("cgroup avoid permanent
      sleep at rmdir").
      
      Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is
      unnecessary.  Remove it and all the mechanisms supporting it.  Note
      that memcontrol.c changes are essentially revert of 88703267
      ("cgroup avoid permanent sleep at rmdir").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      b25ed609
    • T
      cgroup: deactivate CSS's and mark cgroup dead before invoking ->pre_destroy() · 1a90dd50
      Tejun Heo 提交于
      Because ->pre_destroy() could fail and can't be called under
      cgroup_mutex, cgroup destruction did something very ugly.
      
        1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.
      
        2. Release cgroup_mutex and call ->pre_destroy().
      
        3. Re-grab cgroup_mutex and verify it can still be destroyed; fail
           otherwise.
      
        4. Continue destroying.
      
      In addition to being ugly, it has been always broken in various ways.
      For example, memcg ->pre_destroy() expects the cgroup to be inactive
      after it's done but tasks can be attached and detached between #2 and
      #3 and the conditions that memcg verified in ->pre_destroy() might no
      longer hold by the time control reaches #3.
      
      Now that ->pre_destroy() is no longer allowed to fail.  We can switch
      to the following.
      
        1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.
      
        2. Deactivate CSS's and mark the cgroup removed thus preventing any
           further operations which can invalidate the verification from #1.
      
        3. Release cgroup_mutex and call ->pre_destroy().
      
        4. Re-grab cgroup_mutex and continue destroying.
      
      After this change, controllers can safely assume that ->pre_destroy()
      will only be called only once for a given cgroup and, once
      ->pre_destroy() is called, the cgroup will stay dormant till it's
      destroyed.
      
      This removes the only reason ->pre_destroy() can fail - new task being
      attached or child cgroup being created inbetween.  Error out path is
      removed and ->pre_destroy() invocation is open coded in
      cgroup_rmdir().
      
      v2: cgroup_call_pre_destroy() removal moved to this patch per Michal.
          Commit message updated per Glauber.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Glauber Costa <glommer@parallels.com>
      1a90dd50
    • T
      cgroup: use cgroup_lock_live_group(parent) in cgroup_create() · 976c06bc
      Tejun Heo 提交于
      This patch makes cgroup_create() fail if @parent is marked removed.
      This is to prepare for further updates to cgroup_rmdir() path.
      
      Note that this change isn't strictly necessary.  cgroup can only be
      created via mkdir and the removed marking and dentry removal happen
      without releasing cgroup_mutex, so cgroup_create() can never race with
      cgroup_rmdir().  Even after the scheduled updates to cgroup_rmdir(),
      cgroup_mkdir() and cgroup_rmdir() are synchronized by i_mutex
      rendering the added liveliness check unnecessary.
      
      Do it anyway such that locking is contained inside cgroup proper and
      we don't get nasty surprises if we ever grow another caller of
      cgroup_create().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      976c06bc
    • T
      cgroup: kill CSS_REMOVED · e9316080
      Tejun Heo 提交于
      CSS_REMOVED is one of the several contortions which were necessary to
      support css reference draining on cgroup removal.  All css->refcnts
      which need draining should be deactivated and verified to equal zero
      atomically w.r.t. css_tryget().  If any one isn't zero, all refcnts
      needed to be re-activated and css_tryget() shouldn't fail in the
      process.
      
      This was achieved by letting css_tryget() busy-loop until either the
      refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set
      (committing to removal).
      
      Now that css refcnt draining is no longer used, there's no need for
      atomic rollback mechanism.  css_tryget() simply can look at the
      reference count and fail if it's deactivated - it's never getting
      re-activated.
      
      This patch removes CSS_REMOVED and updates __css_tryget() to fail if
      the refcnt is deactivated.  As deactivation and removal are a single
      step now, they no longer need to be protected against css_tryget()
      happening from irq context.  Remove local_irq_disable/enable() from
      cgroup_rmdir().
      
      Note that this removes css_is_removed() whose only user is VM_BUG_ON()
      in memcontrol.c.  We can replace it with a check on the refcnt but
      given that the only use case is a debug assert, I think it's better to
      simply unexport it.
      
      v2: Comment updated and explanation on local_irq_disable/enable()
          added per Michal Hocko.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      e9316080
    • T
      cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs · ed957793
      Tejun Heo 提交于
      2ef37d3f ("memcg: Simplify mem_cgroup_force_empty_list error
      handling") removed the last user of __DEPRECATED_clear_css_refs.  This
      patch removes __DEPRECATED_clear_css_refs and mechanisms to support
      it.
      
      * Conditionals dependent on __DEPRECATED_clear_css_refs removed.
      
      * cgroup_clear_css_refs() can no longer fail.  All that needs to be
        done are deactivating refcnts, setting CSS_REMOVED and putting the
        base reference on each css.  Remove cgroup_clear_css_refs() and the
        failure path, and open-code the loops into cgroup_rmdir().
      
      This patch keeps the two for_each_subsys() loops separate while open
      coding them.  They can be merged now but there are scheduled changes
      which need them to be separate, so keep them separate to reduce the
      amount of churn.
      
      local_irq_save/restore() from cgroup_clear_css_refs() are replaced
      with local_irq_disable/enable() for simplicity.  This is safe as
      cgroup_rmdir() is always called with IRQ enabled.  Note that this IRQ
      switching is necessary to ensure that css_tryget() isn't called from
      IRQ context on the same CPU while lower context is between CSS
      deactivation and setting CSS_REMOVED as css_tryget() would hang
      forever in such cases waiting for CSS to be re-activated or
      CSS_REMOVED set.  This will go away soon.
      
      v2: cgroup_call_pre_destroy() removal dropped per Michal.  Commit
          message updated to explain local_irq_disable/enable() conversion.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      ed957793
  5. 20 10月, 2012 2 次提交
    • T
      Revert "cgroup: Remove task_lock() from cgroup_post_fork()" · d8783832
      Tejun Heo 提交于
      This reverts commit 7e3aa30a.
      
      The commit incorrectly assumed that fork path always performed
      threadgroup_change_begin/end() and depended on that for
      synchronization against task exit and cgroup migration paths instead
      of explicitly grabbing task_lock().
      
      threadgroup_change is not locked when forking a new process (as
      opposed to a new thread in the same process) and even if it were it
      wouldn't be effective as different processes use different threadgroup
      locks.
      
      Revert the incorrect optimization.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <20121008020000.GB2575@localhost>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: stable@vger.kernel.org
      d8783832
    • T
      Revert "cgroup: Drop task_lock(parent) on cgroup_fork()" · 9bb71308
      Tejun Heo 提交于
      This reverts commit 7e381b0e.
      
      The commit incorrectly assumed that fork path always performed
      threadgroup_change_begin/end() and depended on that for
      synchronization against task exit and cgroup migration paths instead
      of explicitly grabbing task_lock().
      
      threadgroup_change is not locked when forking a new process (as
      opposed to a new thread in the same process) and even if it were it
      wouldn't be effective as different processes use different threadgroup
      locks.
      
      Revert the incorrect optimization.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <20121008020000.GB2575@localhost>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Bitterly-Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: stable@vger.kernel.org
      9bb71308
  6. 17 10月, 2012 2 次提交
    • D
      cgroup: notify_on_release may not be triggered in some cases · 1f5320d5
      Daisuke Nishimura 提交于
      notify_on_release must be triggered when the last process in a cgroup is
      move to another. But if the first(and only) process in a cgroup is moved to
      another, notify_on_release is not triggered.
      
      	# mkdir /cgroup/cpu/SRC
      	# mkdir /cgroup/cpu/DST
      	#
      	# echo 1 >/cgroup/cpu/SRC/notify_on_release
      	# echo 1 >/cgroup/cpu/DST/notify_on_release
      	#
      	# sleep 300 &
      	[1] 8629
      	#
      	# echo 8629 >/cgroup/cpu/SRC/tasks
      	# echo 8629 >/cgroup/cpu/DST/tasks
      	-> notify_on_release for /SRC must be triggered at this point,
      	   but it isn't.
      
      This is because put_css_set() is called before setting CGRP_RELEASABLE
      in cgroup_task_migrate(), and is a regression introduce by the
      commit:74a1166d(cgroups: make procs file writable), which was merged
      into v3.0.
      
      Cc: Ben Blum <bblum@andrew.cmu.edu>
      Cc: <stable@vger.kernel.org> # v3.0.x and later
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      1f5320d5
    • T
      cgroup: cgroup_subsys->fork() should be called after the task is added to css_set · 5edee61e
      Tejun Heo 提交于
      cgroup core has a bug which violates a basic rule about event
      notifications - when a new entity needs to be added, you add that to
      the notification list first and then make the new entity conform to
      the current state.  If done in the reverse order, an event happening
      inbetween will be lost.
      
      cgroup_subsys->fork() is invoked way before the new task is added to
      the css_set.  Currently, cgroup_freezer is the only user of ->fork()
      and uses it to make new tasks conform to the current state of the
      freezer.  If FROZEN state is requested while fork is in progress
      between cgroup_fork_callbacks() and cgroup_post_fork(), the child
      could escape freezing - the cgroup isn't frozen when ->fork() is
      called and the freezer couldn't see the new task on the css_set.
      
      This patch moves cgroup_subsys->fork() invocation to
      cgroup_post_fork() after the new task is added to the css_set.
      cgroup_fork_callbacks() is removed.
      
      Because now a task may be migrated during cgroup_subsys->fork(),
      freezer_fork() is updated so that it adheres to the usual RCU locking
      and the rather pointless comment on why locking can be different there
      is removed (if it doesn't make anything simpler, why even bother?).
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: stable@vger.kernel.org
      5edee61e
  7. 15 9月, 2012 5 次提交
    • T
      cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them · 8c7f6edb
      Tejun Heo 提交于
      Currently, cgroup hierarchy support is a mess.  cpu related subsystems
      behave correctly - configuration, accounting and control on a parent
      properly cover its children.  blkio and freezer completely ignore
      hierarchy and treat all cgroups as if they're directly under the root
      cgroup.  Others show yet different behaviors.
      
      These differing interpretations of cgroup hierarchy make using cgroup
      confusing and it impossible to co-mount controllers into the same
      hierarchy and obtain sane behavior.
      
      Eventually, we want full hierarchy support from all subsystems and
      probably a unified hierarchy.  Users using separate hierarchies
      expecting completely different behaviors depending on the mounted
      subsystem is deterimental to making any progress on this front.
      
      This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
      for controllers which are lacking in hierarchy support.  The goal of
      this patch is two-fold.
      
      * Move users away from using hierarchy on currently non-hierarchical
        subsystems, so that implementing proper hierarchy support on those
        doesn't surprise them.
      
      * Keep track of which controllers are broken how and nudge the
        subsystems to implement proper hierarchy support.
      
      For now, start with a single warning message.  We can whine louder
      later on.
      
      v2: Fixed a typo spotted by Michal. Warning message updated.
      
      v3: Updated memcg part so that it doesn't generate warning in the
          cases where .use_hierarchy=false doesn't make the behavior
          different from root.use_hierarchy=true.  Fixed a typo spotted by
          Glauber.
      
      v4: Check ->broken_hierarchy after cgroup creation is complete so that
          ->create() can affect the result per Michal.  Dropped unnecessary
          memcg root handling per Michal.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      8c7f6edb
    • D
      cgroup: Assign subsystem IDs during compile time · 8a8e04df
      Daniel Wagner 提交于
      WARNING: With this change it is impossible to load external built
      controllers anymore.
      
      In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is
      set, corresponding subsys_id should also be a constant. Up to now,
      net_prio_subsys_id and net_cls_subsys_id would be of the type int and
      the value would be assigned during runtime.
      
      By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
      to IS_ENABLED, all *_subsys_id will have constant value. That means we
      need to remove all the code which assumes a value can be assigned to
      net_prio_subsys_id and net_cls_subsys_id.
      
      A close look is necessary on the RCU part which was introduces by
      following patch:
      
        commit f8451725
        Author:	Herbert Xu <herbert@gondor.apana.org.au>  Mon May 24 09:12:34 2010
        Committer:	David S. Miller <davem@davemloft.net>  Mon May 24 09:12:34 2010
      
        cls_cgroup: Store classid in struct sock
      
        Tis code was added to init_cgroup_cls()
      
      	  /* We can't use rcu_assign_pointer because this is an int. */
      	  smp_wmb();
      	  net_cls_subsys_id = net_cls_subsys.subsys_id;
      
        respectively to exit_cgroup_cls()
      
      	  net_cls_subsys_id = -1;
      	  synchronize_rcu();
      
        and in module version of task_cls_classid()
      
      	  rcu_read_lock();
      	  id = rcu_dereference(net_cls_subsys_id);
      	  if (id >= 0)
      		  classid = container_of(task_subsys_state(p, id),
      					 struct cgroup_cls_state, css)->classid;
      	  rcu_read_unlock();
      
      Without an explicit explaination why the RCU part is needed. (The
      rcu_deference was fixed by exchanging it to rcu_derefence_index_check()
      in a later commit, but that is a minor detail.)
      
      So here is my pondering why it was introduced and why it safe to
      remove it now. Note that this code was copied over to net_prio the
      reasoning holds for that subsystem too.
      
      The idea behind the RCU use for net_cls_subsys_id is to make sure we
      get a valid pointer back from task_subsys_state(). task_subsys_state()
      is just blindly accessing the subsys array and returning the
      pointer. Obviously, passing in -1 as id into task_subsys_state()
      returns an invalid value (out of lower bound).
      
      So this code makes sure that only after module is loaded and the
      subsystem registered, the id is assigned.
      
      Before unregistering the module all old readers must have left the
      critical section. This is done by assigning -1 to the id and issuing a
      synchronized_rcu(). Any new readers wont call task_subsys_state()
      anymore and therefore it is safe to unregister the subsystem.
      
      The new code relies on the same trick, but it looks at the subsys
      pointer return by task_subsys_state() (remember the id is constant
      and therefore we allways have a valid index into the subsys
      array).
      
      No precautions need to be taken during module loading
      module. Eventually, all CPUs will get a valid pointer back from
      task_subsys_state() because rebind_subsystem() which is called after
      the module init() function will assigned subsys[net_cls_subsys_id] the
      newly loaded module subsystem pointer.
      
      When the subsystem is about to be removed, rebind_subsystem() will
      called before the module exit() function. In this case,
      rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer
      and then it calls synchronize_rcu(). All old readers have left by then
      the critical section. Any new reader wont access the subsystem
      anymore.  At this point we are safe to unregister the subsystem. No
      synchronize_rcu() call is needed.
      Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      8a8e04df
    • D
      cgroup: Do not depend on a given order when populating the subsys array · 80f4c877
      Daniel Wagner 提交于
      The *_subsys_id will be used as index to access the subsys. Therefore
      we need to care we populate the subsystem at the correct position by
      using designated initialization.
      
      With this change we are able to interleave builtin and modules in the subsys
      array.
      Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      80f4c877
    • D
      cgroup: Wrap subsystem selection macro · 5fc0b025
      Daniel Wagner 提交于
      Before we are able to define all subsystem ids at compile time we need
      a more fine grained control what gets defined when we include
      cgroup_subsys.h. For example we define the enums for the subsystems or
      to declare for struct cgroup_subsys (builtin subsystem) by including
      cgroup_subsys.h and defining SUBSYS accordingly.
      
      Currently, the decision if a subsys is used is defined inside the
      header by testing if CONFIG_*=y is true. By moving this test outside
      of cgroup_subsys.h we are able to control it on the include level.
      
      This is done by introducing IS_SUBSYS_ENABLED which then is defined
      according the task, e.g. is CONFIG_*=y or CONFIG_*=m.
      Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      5fc0b025
    • D
      cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT · be45c900
      Daniel Wagner 提交于
      CGROUP_BUILTIN_SUBSYS_COUNT is used as start index or stop index when
      looping over the subsys array looking either at the builtin or the
      module subsystems. Since all the builtin subsystems have an id which
      is lower then CGROUP_BUILTIN_SUBSYS_COUNT we know that any module will
      have an id larger than CGROUP_BUILTIN_SUBSYS_COUNT. In short the ids
      are sorted.
      
      We are about to change id assignment to happen only at compile time
      later in this series. That means we can't rely on the above trick
      since all ids will always be defined at compile time. Furthermore,
      ordering the builtin subsystems and the module subsystems is not
      really necessary.
      
      So we need a different way to know which subsystem is a builtin or a
      module one. We can use the subsys[]->module pointer for this. Any
      place where we need to know if a subsys is module we just check for
      the pointer. If it is NULL then the subsystem is a builtin one.
      
      With this we are able to drop the CGROUP_BUILTIN_SUBSYS_COUNT
      enum. Though we need to introduce a temporary placeholder so that we
      don't get a compilation error when only CONFIG_CGROUP is selected and
      no single controller. An empty enum definition is not valid. Later in
      this series we are able to remove the placeholder again.
      
      And with this change we get a fix for this:
      
      kernel/cgroup.c: In function ‘cgroup_load_subsys’:
      kernel/cgroup.c:4326:38: warning: array subscript is below array bounds [-Warray-bounds]
      
      when CONFIG_CGROUP=y and no built in controller was enabled.
      Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      be45c900
  8. 25 8月, 2012 3 次提交
    • A
      cgroup: rename subsys_bits to subsys_mask · a1a71b45
      Aristeu Rozanski 提交于
      In a previous discussion, Tejun Heo suggested to rename references to
      subsys_bits (added_bits, removed_bits, etc) by something more meaningful.
      
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Signed-off-by: NAristeu Rozanski <aris@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      a1a71b45
    • A
      cgroup: add xattr support · 03b1cde6
      Aristeu Rozanski 提交于
      This is one of the items in the plumber's wish list.
      
      For use cases:
      
      >> What would the use case be for this?
      >
      > Attaching meta information to services, in an easily discoverable
      > way. For example, in systemd we create one cgroup for each service, and
      > could then store data like the main pid of the specific service as an
      > xattr on the cgroup itself. That way we'd have almost all service state
      > in the cgroupfs, which would make it possible to terminate systemd and
      > later restart it without losing any state information. But there's more:
      > for example, some very peculiar services cannot be terminated on
      > shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
      > services in question could just mark that on their cgroup, by setting an
      > xattr. On the more desktopy side of things there are other
      > possibilities: for example there are plans defining what an application
      > is along the lines of a cgroup (i.e. an app being a collection of
      > processes). With xattrs one could then attach an icon or human readable
      > program name on the cgroup.
      >
      > The key idea is that this would allow attaching runtime meta information
      > to cgroups and everything they model (services, apps, vms), that doesn't
      > need any complex userspace infrastructure, has good access control
      > (i.e. because the file system enforces that anyway, and there's the
      > "trusted." xattr namespace), notifications (inotify), and can easily be
      > shared among applications.
      >
      > Lennart
      
      v7:
      - no changes
      v6:
      - remove user xattr namespace, only allow trusted and security
      v5:
      - check for capabilities before setting/removing xattrs
      v4:
      - no changes
      v3:
      - instead of config option, use mount option to enable xattr support
      Original-patch-by: NLi Zefan <lizefan@huawei.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NAristeu Rozanski <aris@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      03b1cde6
    • A
      cgroup: revise how we re-populate root directory · 13af07df
      Aristeu Rozanski 提交于
      When remounting cgroupfs with some subsystems added to it and some
      removed, cgroup will remove all the files in root directory and then
      re-popluate it.
      
      What I'm doing here is, only remove files which belong to subsystems that
      are to be unbinded, and only create files for newly-added subsystems.
      The purpose is to have all other files untouched.
      
      This is a preparation for cgroup xattr support.
      
      v7:
      - checkpatch warnings fixed
      v6:
      - no changes
      v5:
      - no changes
      v4:
      - refactored cgroup_clear_directory() to not use cgroup_rm_file()
      - instead of going thru the list of files, get the file list using the
        subsystems
      - use 'subsys_mask' instead of {added,removed}_bits and made
        cgroup_populate_dir() to match the parameters with cgroup_clear_directory()
      v3:
      - refresh patches after recent refactoring
      Original-patch-by: NLi Zefan <lizefan@huawei.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NAristeu Rozanski <aris@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      13af07df
  9. 14 7月, 2012 2 次提交