1. 20 November 2012, 16 commits
    • cgroup: rename ->create/post_create/pre_destroy/destroy() to ->css_alloc/online/offline/free() · 92fb9748
      Committed by Tejun Heo
      Rename cgroup_subsys css lifetime related callbacks to better describe
      what their roles are.  Also, update documentation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      92fb9748
    • cgroup: allow ->post_create() to fail · b1929db4
      Committed by Tejun Heo
      There could be cases where controllers want to perform, from
      ->post_create(), initialization operations that may fail.  This patch
      makes ->post_create() return -errno to indicate failure and makes
      online_css() relay such failures.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Glauber Costa <glommer@parallels.com>
      b1929db4
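
      Illustration (not kernel code): a minimal, self-contained user-space C sketch
      of the error-relay pattern this commit describes - an online hook that may
      return -errno and a caller that passes the failure up.  All toy_* identifiers
      are made up for the example.

        #include <errno.h>
        #include <stdio.h>

        struct toy_css {
            unsigned int flags;          /* bit 0 plays the role of "online" */
        };

        /* hypothetical controller hook: returns 0 on success or -errno */
        static int toy_post_create(struct toy_css *css)
        {
            (void)css;
            return -ENOMEM;              /* pretend some allocation failed */
        }

        /* caller in the spirit of online_css(): relay the hook's result */
        static int toy_online(struct toy_css *css)
        {
            int ret = toy_post_create(css);

            if (!ret)
                css->flags |= 1;         /* mark online only if the hook succeeded */
            return ret;
        }

        int main(void)
        {
            struct toy_css css = { 0 };
            int ret = toy_online(&css);

            printf("online returned %d, online flag %u\n", ret, css.flags & 1);
            return 0;
        }
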
    • cgroup: update cgroup_create() failure path · 4b8b47eb
      Committed by Tejun Heo
      cgroup_create() was ignoring failures when creating cgroupfs files.
      Update it such that, if file creation fails, it rolls back by calling
      cgroup_destroy_locked() and returns the failure.
      
      Note that the error-out goto labels are renamed.  The labels are a bit
      confusing but will become clearer with later cgroup operation renames.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      4b8b47eb
    • cgroup: use mutex_trylock() when grabbing i_mutex of a new cgroup directory · b8a2df6a
      Committed by Tejun Heo
      All cgroup directory i_mutexes nest outside cgroup_mutex; however, new
      directory creation is a special case.  A new cgroup directory is
      created while holding cgroup_mutex.  Populating the new directory
      requires both the new directory's i_mutex and cgroup_mutex.  Because
      all directory i_mutexes nest outside cgroup_mutex, grabbing both
      requires releasing cgroup_mutex first, which isn't a good idea as the
      new cgroup isn't yet ready to be manipulated by other cgroup
      operations.
      
      This is worked around by grabbing the new directory's i_mutex while
      holding cgroup_mutex before making it visible.  As there's no other
      user at that point, grabbing the i_mutex under cgroup_mutex can't lead
      to deadlock.
      
      cgroup_create_file() was using I_MUTEX_CHILD to tell lockdep not to
      worry about the reverse locking order; however, this creates pseudo
      locking dependency cgroup_mutex -> I_MUTEX_CHILD, which isn't true -
      all directory i_mutexes are still nested outside cgroup_mutex.  This
      pseudo locking dependency can lead to spurious lockdep warnings.
      
      Use mutex_trylock() instead.  This will always succeed and lockdep
      doesn't create any locking dependency for it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      b8a2df6a
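
      Illustration (user-space sketch, not the kernel code): while holding an outer
      lock, a trylock on a lock nobody else can possibly see yet always succeeds, so
      no blocking acquisition in the "wrong" order is ever needed.  The pthread
      mutexes here merely stand in for cgroup_mutex and the new directory's i_mutex.

        #include <assert.h>
        #include <pthread.h>
        #include <stdio.h>

        int main(void)
        {
            pthread_mutex_t outer = PTHREAD_MUTEX_INITIALIZER;    /* plays cgroup_mutex */
            pthread_mutex_t new_dir = PTHREAD_MUTEX_INITIALIZER;  /* plays the new dir's i_mutex */

            pthread_mutex_lock(&outer);

            /*
             * The "directory" guarded by new_dir isn't visible to anyone
             * else yet, so this trylock can't fail and no lock ordering
             * outer -> new_dir needs to be declared.
             */
            int ret = pthread_mutex_trylock(&new_dir);
            assert(ret == 0);
            printf("trylock on the not-yet-visible lock succeeded (ret=%d)\n", ret);

            pthread_mutex_unlock(&new_dir);
            pthread_mutex_unlock(&outer);
            return 0;
        }
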
    • cgroup: simplify cgroup_load_subsys() failure path · d19e19de
      Committed by Tejun Heo
      Now that cgroup_unload_subsys() can tell whether the root css is
      online or not, we can safely call cgroup_unload_subsys() after idr
      init failure in cgroup_load_subsys().
      
      Replace the manual unrolling and invoke cgroup_unload_subsys() on
      failure.  This drops cgroup_mutex in between but should be safe as the
      subsystem will fail try_module_get() and thus can't be mounted in the
      meantime.  As this means that cgroup_unload_subsys() can be called
      before css_sets are rehashed, remove BUG_ON() on %NULL
      css_set->subsys[] from cgroup_unload_subsys().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      d19e19de
    • cgroup: introduce CSS_ONLINE flag and on/offline_css() helpers · a31f2d3f
      Committed by Tejun Heo
      New helpers on/offline_css() respectively wrap ->post_create() and
      ->pre_destroy() invocations.  online_css() sets CSS_ONLINE after
      ->post_create() is complete and offline_css() invokes ->pre_destroy()
      iff CSS_ONLINE is set and clears it while also handling the temporary
      dropping of cgroup_mutex.
      
      This patch doesn't introduce any behavior change at the moment but
      will be used to improve cgroup_create() failure path and allow
      ->post_create() to fail.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      a31f2d3f
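
      Illustration (not the kernel implementation): a sketch of how an ONLINE flag
      can make the offline path idempotent, invoking the teardown hook only if the
      bring-up hook actually completed.  All names and the flag value are made up.

        #include <stdio.h>

        #define TOY_ONLINE (1 << 0)

        struct toy_css { unsigned int flags; };

        static void toy_pre_destroy(struct toy_css *css)
        {
            (void)css;
            printf("pre_destroy called\n");
        }

        static void toy_online(struct toy_css *css)
        {
            /* ... the ->post_create()-style hook would run here ... */
            css->flags |= TOY_ONLINE;
        }

        static void toy_offline(struct toy_css *css)
        {
            if (!(css->flags & TOY_ONLINE))   /* never brought online: nothing to undo */
                return;
            toy_pre_destroy(css);
            css->flags &= ~TOY_ONLINE;
        }

        int main(void)
        {
            struct toy_css a = { 0 }, b = { 0 };

            toy_online(&a);
            toy_offline(&a);   /* calls pre_destroy exactly once */
            toy_offline(&a);   /* second call is a no-op */
            toy_offline(&b);   /* never onlined: no-op */
            return 0;
        }
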
    • cgroup: separate out cgroup_destroy_locked() · 42809dd4
      Committed by Tejun Heo
      Separate out cgroup_destroy_locked() from cgroup_destroy().  This will
      be later used in cgroup_create() failure path.
      
      While at it, add lockdep asserts on i_mutex and cgroup_mutex, and move
      @d and @parent assignments to their declarations.
      
      This patch doesn't introduce any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      42809dd4
    • cgroup: fix harmless bugs in cgroup_load_subsys() fail path and cgroup_unload_subsys() · 02ae7486
      Committed by Tejun Heo
      * If idr init fails, cgroup_load_subsys() cleared dummytop->subsys[]
        before calling ->destroy(), making the CSS inaccessible to the callback,
        and didn't unlink ss->sibling.  As no modular controller uses
        ->use_id, this doesn't cause any actual problems.
      
      * cgroup_unload_subsys() was forgetting to free idr, call
        ->pre_destroy() and clear ->active.  As there currently is no
        modular controller which uses ->use_id, ->pre_destroy() or ->active,
        this doesn't cause any actual problems.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      02ae7486
    • cgroup: lock cgroup_mutex in cgroup_init_subsys() · 648bb56d
      Committed by Tejun Heo
      Make cgroup_init_subsys() grab cgroup_mutex while initializing a
      subsystem so that all helpers and callbacks are called under the
      context they expect.  This isn't strictly necessary as
      cgroup_init_subsys() doesn't race with anybody but will allow adding
      lockdep assertions.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      648bb56d
    • cgroup: trivial cleanup for cgroup_init/load_subsys() · b48c6a80
      Committed by Tejun Heo
      Consistently use @css and @dummytop in these two functions instead of
      referring to them indirectly.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      b48c6a80
    • cgroup: make CSS_* flags bit masks instead of bit positions · 38b53aba
      Committed by Tejun Heo
      Currently, CSS_* flags are defined as bit positions and manipulated
      using atomic bitops.  There's no reason to use atomic bitops for them
      and bit positions are clunkier to deal with than bit masks.  Make
      CSS_* bit masks instead and use the usual C bitwise operators to
      access them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      38b53aba
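
      Illustration of the two styles (user-space sketch; the flag names echo the
      cgroup ones but the values are made up): bit positions need a shift or a
      bitop helper at every use, while masks work directly with the ordinary C
      bitwise operators.

        #include <stdio.h>

        /* old style: flags defined as bit positions */
        enum { POS_ROOT = 0, POS_REMOVED = 1 };

        /* new style: flags defined as bit masks */
        enum { TOY_CSS_ROOT = (1 << 0), TOY_CSS_ONLINE = (1 << 1) };

        int main(void)
        {
            unsigned long flags = 0;

            /* bit positions: every access needs a shift (or set_bit()/test_bit()) */
            flags |= 1UL << POS_REMOVED;
            printf("removed? %d\n", !!(flags & (1UL << POS_REMOVED)));

            /* bit masks: plain C bitwise operators are enough */
            flags = 0;
            flags |= TOY_CSS_ONLINE;
            printf("online?  %d\n", !!(flags & TOY_CSS_ONLINE));
            flags &= ~TOY_CSS_ONLINE;
            printf("online?  %d\n", !!(flags & TOY_CSS_ONLINE));
            return 0;
        }
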
    • cgroup: cgroup->dentry isn't a RCU pointer · febfcef6
      Committed by Tejun Heo
      cgroup->dentry is marked and used as a RCU pointer; however, it isn't
      one - the final dentry put doesn't go through call_rcu().  cgroup and
      dentry share the same RCU freeing rule via synchronize_rcu() in
      cgroup_diput() (kfree_rcu() used on cgrp is unnecessary).  If cgrp is
      accessible under RCU read lock, so is its dentry and dereferencing
      cgrp->dentry doesn't need any further RCU protection or annotation.
      
      While not strictly accurate, before the previous patch the RCU accessors
      served a purpose as memory barriers - cgroup->dentry used to be
      assigned after the cgroup was made visible to cgroup_path(), so the
      assignment and dereferencing in cgroup_path() needed the memory
      barrier pair.  Now that list_add_tail_rcu() happens after
      cgroup->dentry is assigned, this no longer is necessary.
      
      Remove the now unnecessary and misleading RCU annotations from
      cgroup->dentry.  To make up for the removal of rcu_dereference_check()
      in cgroup_path(), add an explicit rcu_lockdep_assert(), which asserts
      the dereference rule of @cgrp, not cgrp->dentry.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      febfcef6
    • cgroup: create directory before linking while creating a new cgroup · 4e139afc
      Committed by Tejun Heo
      While creating a new cgroup, cgroup_create() links the newly allocated
      cgroup into various places before trying to create its directory.
      Because cgroup life-cycle is tied to the vfs objects, this makes it
      impossible to use cgroup_rmdir() for rolling back creation - the
      removal logic depends on having full vfs objects.
      
      This patch moves directory creation above linking and collects the
      linking operations in one place.  This allows directory creation failure to
      share error exit path with css allocation failures and any failure
      sites afterwards (to be added later) can use cgroup_rmdir() logic to
      undo creation.
      
      Note that this also makes the memory barriers around cgroup->dentry,
      which currently is misleadingly using RCU operations, unnecessary.
      This will be handled in the next patch.
      
      While at it, locking BUG_ON() on i_mutex is converted to
      lockdep_assert_held().
      
      v2: Patch originally removed %NULL dentry check in cgroup_path();
          however, Li pointed out that this patch doesn't make it
          unnecessary as ->create() may call cgroup_path().  Drop the
          change for now.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      4e139afc
    • cgroup: open-code cgroup_create_dir() · 28fd6f30
      Committed by Tejun Heo
      The operation order of cgroup creation is about to change and
      cgroup_create_dir() is more of a hindrance than a proper abstraction.
      Open-code it by moving the parent nlink adjustment next to self nlink
      adjustment in cgroup_create_file() and the rest to cgroup_create().
      
      This patch doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      28fd6f30
    • cgroup: initialize cgrp->allcg_node in init_cgroup_housekeeping() · 2243076a
      Committed by Tejun Heo
      Not strictly necessary, but it's annoying to have an uninitialized
      list_head lying around.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      2243076a
    • cgroup: remove incorrect dget/dput() pair in cgroup_create_dir() · 17543163
      Committed by Tejun Heo
      cgroup_create_dir() does weird dancing with the dentry refcnt.  On
      success, it gets and then puts it, achieving nothing.  On failure, it
      puts it, but there is no matching get anywhere, leading to the following
      oops if cgroup_create_file() fails for whatever reason.
      
        ------------[ cut here ]------------
        kernel BUG at /work/os/work/fs/dcache.c:552!
        invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in:
        CPU 2
        Pid: 697, comm: mkdir Not tainted 3.7.0-rc4-work+ #3 Bochs Bochs
        RIP: 0010:[<ffffffff811d9c0c>]  [<ffffffff811d9c0c>] dput+0x1dc/0x1e0
        RSP: 0018:ffff88001a3ebef8  EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff88000e5b1ef8 RCX: 0000000000000403
        RDX: 0000000000000303 RSI: 2000000000000000 RDI: ffff88000e5b1f58
        RBP: ffff88001a3ebf18 R08: ffffffff82c76960 R09: 0000000000000001
        R10: ffff880015022080 R11: ffd9bed70f48a041 R12: 00000000ffffffea
        R13: 0000000000000001 R14: ffff88000e5b1f58 R15: 00007fff57656d60
        FS:  00007ff05fcb3800(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000004046f0 CR3: 000000001315f000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
        Process mkdir (pid: 697, threadinfo ffff88001a3ea000, task ffff880015022080)
        Stack:
         ffff88001a3ebf48 00000000ffffffea 0000000000000001 0000000000000000
         ffff88001a3ebf38 ffffffff811cc889 0000000000000001 ffff88000e5b1ef8
         ffff88001a3ebf68 ffffffff811d1fc9 ffff8800198d7f18 ffff880019106ef8
        Call Trace:
         [<ffffffff811cc889>] done_path_create+0x19/0x50
         [<ffffffff811d1fc9>] sys_mkdirat+0x59/0x80
         [<ffffffff811d2009>] sys_mkdir+0x19/0x20
         [<ffffffff81be1e02>] system_call_fastpath+0x16/0x1b
        Code: 00 48 8d 90 18 01 00 00 48 89 93 c0 00 00 00 4c 89 a0 18 01 00 00 48 8b 83 a0 00 00 00 83 80 28 01 00 00 01 e8 e6 6f a0 00 eb 92 <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 fe 41
        RIP  [<ffffffff811d9c0c>] dput+0x1dc/0x1e0
         RSP <ffff88001a3ebef8>
        ---[ end trace 1277bcfd9561ddb0 ]---
      
      Fix it by dropping the unnecessary dget/dput() pair.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      17543163
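
      Illustration of the bug class described above (a toy user-space refcount,
      not the dentry code): every put must pair with a get, and an unmatched put
      drops a reference the caller never owned and trips the underflow check,
      much like the dput() here that had no matching dget().

        #include <assert.h>
        #include <stdio.h>

        struct toy_ref { int count; };

        static void toy_get(struct toy_ref *r) { r->count++; }

        static void toy_put(struct toy_ref *r)
        {
            assert(r->count > 0);     /* underflow is the analogue of the dput() BUG */
            r->count--;
        }

        int main(void)
        {
            struct toy_ref r = { .count = 1 };   /* creator's reference */

            toy_get(&r);     /* balanced get/put pair: harmless but achieves nothing */
            toy_put(&r);

            toy_put(&r);     /* drop the creator's reference */
            /* toy_put(&r); */  /* an unmatched put would trip the assert */
            printf("final count %d\n", r.count);
            return 0;
        }
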
  2. 10 November 2012, 9 commits
    • cgroup_freezer: implement proper hierarchy support · ef9fe980
      Committed by Tejun Heo
      Up until now, cgroup_freezer didn't implement hierarchy properly.
      cgroups could be arranged in hierarchy but it didn't make any
      difference in how each cgroup_freezer behaved.  They all operated
      separately.
      
      This patch implements proper hierarchy support.  If a cgroup is
      frozen, all its descendants are frozen.  A cgroup is thawed iff it and
      all its ancestors are THAWED.  freezer.self_freezing shows the current
      freezing state for the cgroup itself.  freezer.parent_freezing shows
      whether the cgroup is freezing because any of its ancestors is
      freezing.
      
      freezer_post_create() locks the parent and new cgroup and inherits the
      parent's state and freezer_change_state() applies new state top-down
      using cgroup_for_each_descendant_pre() which guarantees that no child
      can escape its parent's state.  update_if_frozen() uses
      cgroup_for_each_descendant_post() to propagate frozen states
      bottom-up.
      
      Synchronization could be coarser and easier by using a single mutex to
      protect all hierarchy operations.  The finer grained approach was used
      because it wasn't too difficult for cgroup_freezer and I think it's
      beneficial to have an example implementation; cgroup_freezer is rather
      simple and can serve as a good one.
      
      As this makes cgroup_freezer properly hierarchical,
      freezer_subsys.broken_hierarchy marking is removed.
      
      Note that this patch changes userland visible behavior - freezing a
      cgroup now freezes all its descendants too.  This behavior change is
      intended and has been warned via .broken_hierarchy.
      
      v2: Michal spotted a bug in freezer_change_state() - descendants were
          inheriting from the wrong ancestor.  Fixed.
      
      v3: Documentation/cgroups/freezer-subsystem.txt updated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      ef9fe980
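
      Illustration of the hierarchy rule described above (toy user-space model,
      not the freezer code): a node's effective state is freezing if it or any
      ancestor is freezing, which is exactly what a top-down propagation
      guarantees.  The toy_* names are invented for the sketch.

        #include <stdbool.h>
        #include <stdio.h>

        struct toy_cg {
            const char *name;
            bool self_freezing;          /* set directly on this node */
            struct toy_cg *parent;
        };

        /* effective state: freezing if the node or any ancestor is freezing */
        static bool toy_freezing(const struct toy_cg *cg)
        {
            for (; cg; cg = cg->parent)
                if (cg->self_freezing)
                    return true;
            return false;
        }

        int main(void)
        {
            struct toy_cg root = { "root", false, NULL };
            struct toy_cg mid  = { "mid",  false, &root };
            struct toy_cg leaf = { "leaf", false, &mid };

            mid.self_freezing = true;    /* freeze the middle cgroup */

            printf("root freezing: %d\n", toy_freezing(&root));  /* 0 */
            printf("mid  freezing: %d\n", toy_freezing(&mid));   /* 1 */
            printf("leaf freezing: %d\n", toy_freezing(&leaf));  /* 1: inherited */
            return 0;
        }
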
    • cgroup_freezer: add ->post_create() and ->pre_destroy() and track online state · 5300a9b3
      Committed by Tejun Heo
      A cgroup is online and visible to iteration between ->post_create()
      and ->pre_destroy().  This patch introduces CGROUP_FREEZER_ONLINE and
      toggles it from the newly added freezer_post_create() and
      freezer_pre_destroy() while holding freezer->lock such that a
      cgroup_freezer can be reliably distinguished as being online.  This will
      be used by full hierarchy support.
      
      ONLINE test is added to freezer_apply_state() but it currently doesn't
      make any difference as freezer_write() can only be called for an
      online cgroup.
      
      Adjusting system_freezing_cnt on destruction is moved from
      freezer_destroy() to the new freezer_pre_destroy() for consistency.
      
      This patch doesn't introduce any noticeable behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      5300a9b3
    • cgroup_freezer: introduce CGROUP_FREEZING_[SELF|PARENT] · a2252180
      Committed by Tejun Heo
      Introduce FREEZING_SELF and FREEZING_PARENT and make FREEZING OR of
      the two flags.  This is to prepare for full hierarchy support.
      
      freezer_apply_state() is updated such that it can handle setting and
      clearing of both flags.  The two flags are also exposed to userland
      via read-only files self_freezing and parent_freezing.
      
      Other than the added cgroupfs files, this patch doesn't introduce any
      behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      a2252180
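
      Illustration of the flag scheme in this and the following state-mask commit
      (a user-space sketch; the bit values are assumptions, not copied from the
      kernel): FREEZING is simply the OR of the self and parent bits, and FROZEN
      only makes sense while FREEZING is set.

        #include <stdio.h>

        #define TOY_FREEZING_SELF   (1 << 1)   /* this cgroup is freezing */
        #define TOY_FREEZING_PARENT (1 << 2)   /* an ancestor is freezing */
        #define TOY_FROZEN          (1 << 3)   /* freezing has completed */
        #define TOY_FREEZING        (TOY_FREEZING_SELF | TOY_FREEZING_PARENT)

        static void report(unsigned int state)
        {
            printf("self=%d parent=%d freezing=%d frozen=%d\n",
                   !!(state & TOY_FREEZING_SELF),
                   !!(state & TOY_FREEZING_PARENT),
                   !!(state & TOY_FREEZING),
                   !!(state & TOY_FROZEN));
        }

        int main(void)
        {
            unsigned int state = 0;                 /* thawed */
            report(state);

            state |= TOY_FREEZING_PARENT;           /* an ancestor started freezing */
            report(state);

            state |= TOY_FROZEN;                    /* all tasks are now frozen */
            report(state);

            state &= ~(TOY_FREEZING | TOY_FROZEN);  /* thawing clears everything */
            report(state);
            return 0;
        }
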
    • cgroup_freezer: make freezer->state mask of flags · d6a2fe13
      Committed by Tejun Heo
      freezer->state was an enum value - one of THAWED, FREEZING and FROZEN.
      As the scheduled full hierarchy support requires more than one
      freezing condition, switch it to mask of flags.  If FREEZING is not
      set, it's thawed.  FREEZING is set if freezing or frozen.  If frozen,
      both FREEZING and FROZEN are set.  Now that tasks can be attached to
      an already frozen cgroup, this also makes freezing condition checks
      more natural.
      
      This patch doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      d6a2fe13
    • cgroup_freezer: prepare freezer_change_state() for full hierarchy support · 04a4ec32
      Committed by Tejun Heo
      * Make freezer_change_state() take bool @freeze instead of enum
        freezer_state.
      
      * Separate out freezer_apply_state() out of freezer_change_state().
        This makes freezer_change_state() a rather silly thin wrapper.  It
        will be filled with hierarchy handling later on.
      
      This patch doesn't introduce any behavior change.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      04a4ec32
    • cgroup_freezer: trivial cleanups · bcd66c89
      Committed by Tejun Heo
      * Clean-up indentation and line-breaks.  Drop the invalid comment
        about freezer->lock.
      
      * Make all internal functions take @freezer instead of both @cgroup
        and @freezer.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      bcd66c89
    • cgroup: implement generic child / descendant walk macros · 574bd9f7
      Committed by Tejun Heo
      Currently, cgroup doesn't provide any generic helper for walking a
      given cgroup's children or descendants.  This patch adds the following
      three macros.
      
      * cgroup_for_each_child() - walk immediate children of a cgroup.
      
      * cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
        in pre-order tree traversal.
      
      * cgroup_for_each_descendant_post() - visit all descendants of a
        cgroup in post-order tree traversal.
      
      All three only require the user to hold RCU read lock during
      traversal.  Verifying that each iterated cgroup is online is the
      responsibility of the user.  When used with proper synchronization,
      cgroup_for_each_descendant_pre() can be used to propagate state
      updates to descendants in a reliable way.  See comments for details.
      
      v2: s/config/state/ in commit message and comments per Michal.  More
          documentation on synchronization rules.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      574bd9f7
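
      Illustration of the two traversal orders the new macros provide (a plain
      user-space tree walk, not the cgroup iterators themselves): pre-order visits
      a parent before its children, which is why it can propagate state downward,
      while post-order visits children first, which suits bottom-up aggregation.

        #include <stdio.h>

        struct node {
            const char *name;
            struct node *child;      /* first child */
            struct node *sibling;    /* next sibling */
        };

        static void walk_pre(struct node *n)
        {
            for (; n; n = n->sibling) {
                printf("pre:  %s\n", n->name);   /* parent before descendants */
                walk_pre(n->child);
            }
        }

        static void walk_post(struct node *n)
        {
            for (; n; n = n->sibling) {
                walk_post(n->child);
                printf("post: %s\n", n->name);   /* descendants before parent */
            }
        }

        int main(void)
        {
            struct node c2 = { "child2", NULL, NULL };
            struct node g1 = { "grandchild", NULL, NULL };
            struct node c1 = { "child1", &g1, &c2 };
            struct node root = { "root", &c1, NULL };

            walk_pre(&root);
            walk_post(&root);
            return 0;
        }
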
    • cgroup: use rculist ops for cgroup->children · eb6fd504
      Committed by Tejun Heo
      Use RCU safe list operations for cgroup->children.  This will be used
      to implement cgroup children / descendant walking which can be used by
      controllers.
      
      Note that cgroup_create() now puts a new cgroup at the end of the
      ->children list instead of head.  This isn't strictly necessary but is
      done so that the iteration order is more conventional.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      eb6fd504
    • cgroup: add cgroup_subsys->post_create() · a8638030
      Committed by Tejun Heo
      Currently, there's no way for a controller to find out whether a new
      cgroup finished all ->create() allocations successfully and is
      considered "live" by cgroup.
      
      This becomes a problem later when we add generic descendants walking
      to cgroup which can be used by controllers, as controllers don't have a
      synchronization point where they can synchronize against new cgroups
      appearing in such walks.
      
      This patch adds ->post_create().  It's called after all ->create()
      succeeded and the cgroup is linked into the generic cgroup hierarchy.
      This plays the counterpart of ->pre_destroy().
      
      When used in combination with the to-be-added generic descendant
      iterators, ->post_create() can be used to implement reliable state
      inheritance.  It will be explained with the descendant iterators.
      
      v2: Added a paragraph about its future use w/ descendant iterators per
          Michal.
      
      v3: Forgot to add ->post_create() invocation to cgroup_load_subsys().
          Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Glauber Costa <glommer@parallels.com>
      a8638030
  3. 8 November 2012, 1 commit
  4. 6 November 2012, 6 commits
    • cgroup: make ->pre_destroy() return void · bcf6de1b
      Committed by Tejun Heo
      All ->pre_destroy() implementations return 0 now, which is the only
      allowed return value.  Make it return void.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      bcf6de1b
    • cgroup: remove CGRP_WAIT_ON_RMDIR, cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir() · b25ed609
      Committed by Tejun Heo
      CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup
      destruction rollback somewhat work.  cgroup_rmdir() used to drain
      CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and
      helpers were used to allow the task performing rmdir to wait for the
      next relevant event.
      
      Unfortunately, the wait is visible to controllers too and the
      mechanism got exposed to memcg by 88703267 ("cgroup avoid permanent
      sleep at rmdir").
      
      Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is
      unnecessary.  Remove it and all the mechanisms supporting it.  Note
      that memcontrol.c changes are essentially revert of 88703267
      ("cgroup avoid permanent sleep at rmdir").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      b25ed609
    • cgroup: deactivate CSS's and mark cgroup dead before invoking ->pre_destroy() · 1a90dd50
      Committed by Tejun Heo
      Because ->pre_destroy() could fail and can't be called under
      cgroup_mutex, cgroup destruction did something very ugly.
      
        1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.
      
        2. Release cgroup_mutex and call ->pre_destroy().
      
        3. Re-grab cgroup_mutex and verify it can still be destroyed; fail
           otherwise.
      
        4. Continue destroying.
      
      In addition to being ugly, it has been always broken in various ways.
      For example, memcg ->pre_destroy() expects the cgroup to be inactive
      after it's done but tasks can be attached and detached between #2 and
      #3 and the conditions that memcg verified in ->pre_destroy() might no
      longer hold by the time control reaches #3.
      
      Now that ->pre_destroy() is no longer allowed to fail, we can switch
      to the following.
      
        1. Grab cgroup_mutex and verify it can be destroyed; fail otherwise.
      
        2. Deactivate CSS's and mark the cgroup removed thus preventing any
           further operations which can invalidate the verification from #1.
      
        3. Release cgroup_mutex and call ->pre_destroy().
      
        4. Re-grab cgroup_mutex and continue destroying.
      
      After this change, controllers can safely assume that ->pre_destroy()
      will be called only once for a given cgroup and, once
      ->pre_destroy() is called, the cgroup will stay dormant till it's
      destroyed.
      
      This removes the only reason ->pre_destroy() can fail - a new task
      being attached or a child cgroup being created in between.  The error
      out path is removed and the ->pre_destroy() invocation is open coded
      in cgroup_rmdir().
      
      v2: cgroup_call_pre_destroy() removal moved to this patch per Michal.
          Commit message updated per Glauber.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Glauber Costa <glommer@parallels.com>
      1a90dd50
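
      Illustration of the reworked sequence above (user-space sketch with a
      pthread mutex standing in for cgroup_mutex and made-up toy_* names): the
      object is marked dead while the lock is held, so the hook can run after
      the lock is dropped without invalidating the earlier verification.

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>

        static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER; /* plays cgroup_mutex */

        struct toy_cg {
            bool dead;      /* once set, no further operations are admitted */
            int users;
        };

        static void toy_pre_destroy(struct toy_cg *cg)
        {
            /* runs without big_lock; safe because cg is already marked dead */
            printf("pre_destroy: users=%d dead=%d\n", cg->users, cg->dead);
        }

        static int toy_destroy(struct toy_cg *cg)
        {
            pthread_mutex_lock(&big_lock);
            if (cg->users) {                       /* step 1: verify removability */
                pthread_mutex_unlock(&big_lock);
                return -1;
            }
            cg->dead = true;                       /* step 2: block further operations */
            pthread_mutex_unlock(&big_lock);

            toy_pre_destroy(cg);                   /* step 3: hook runs without the lock */

            pthread_mutex_lock(&big_lock);         /* step 4: finish destruction */
            printf("destroyed\n");
            pthread_mutex_unlock(&big_lock);
            return 0;
        }

        int main(void)
        {
            struct toy_cg cg = { false, 0 };
            return toy_destroy(&cg);
        }
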
    • cgroup: use cgroup_lock_live_group(parent) in cgroup_create() · 976c06bc
      Committed by Tejun Heo
      This patch makes cgroup_create() fail if @parent is marked removed.
      This is to prepare for further updates to cgroup_rmdir() path.
      
      Note that this change isn't strictly necessary.  cgroup can only be
      created via mkdir and the removed marking and dentry removal happen
      without releasing cgroup_mutex, so cgroup_create() can never race with
      cgroup_rmdir().  Even after the scheduled updates to cgroup_rmdir(),
      cgroup_mkdir() and cgroup_rmdir() are synchronized by i_mutex
      rendering the added liveliness check unnecessary.
      
      Do it anyway such that locking is contained inside cgroup proper and
      we don't get nasty surprises if we ever grow another caller of
      cgroup_create().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      976c06bc
    • cgroup: kill CSS_REMOVED · e9316080
      Committed by Tejun Heo
      CSS_REMOVED is one of the several contortions which were necessary to
      support css reference draining on cgroup removal.  All css->refcnts
      which need draining should be deactivated and verified to equal zero
      atomically w.r.t. css_tryget().  If any one isn't zero, all refcnts
      needed to be re-activated and css_tryget() shouldn't fail in the
      process.
      
      This was achieved by letting css_tryget() busy-loop until either the
      refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set
      (committing to removal).
      
      Now that css refcnt draining is no longer used, there's no need for
      atomic rollback mechanism.  css_tryget() can simply look at the
      reference count and fail if it's deactivated - it's never getting
      re-activated.
      
      This patch removes CSS_REMOVED and updates __css_tryget() to fail if
      the refcnt is deactivated.  As deactivation and removal are a single
      step now, they no longer need to be protected against css_tryget()
      happening from irq context.  Remove local_irq_disable/enable() from
      cgroup_rmdir().
      
      Note that this removes css_is_removed() whose only user is VM_BUG_ON()
      in memcontrol.c.  We can replace it with a check on the refcnt but
      given that the only use case is a debug assert, I think it's better to
      simply unexport it.
      
      v2: Comment updated and explanation on local_irq_disable/enable()
          added per Michal Hocko.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      e9316080
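
      Illustration of the simplified tryget described above (user-space C11
      sketch, not __css_tryget() itself; the bias value and toy_* names are
      assumptions): the count is biased far negative on deactivation, so a
      tryget can refuse as soon as it sees a non-positive value instead of
      spinning and waiting for a rollback.

        #include <limits.h>
        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        #define DEACT_BIAS INT_MIN      /* large negative bias marks "deactivated" */

        static bool toy_tryget(atomic_int *refcnt)
        {
            int v = atomic_load(refcnt);

            while (v > 0) {             /* deactivated (or zero): fail immediately */
                if (atomic_compare_exchange_weak(refcnt, &v, v + 1))
                    return true;        /* got a reference */
            }
            return false;
        }

        static void toy_deactivate(atomic_int *refcnt)
        {
            atomic_fetch_add(refcnt, DEACT_BIAS);   /* never re-activated afterwards */
        }

        int main(void)
        {
            atomic_int refcnt = 1;      /* base reference */

            printf("tryget before deactivation: %d\n", toy_tryget(&refcnt)); /* 1 */
            printf("count is now %d\n", atomic_load(&refcnt));               /* 2 */

            toy_deactivate(&refcnt);
            printf("tryget after deactivation:  %d\n", toy_tryget(&refcnt)); /* 0 */
            return 0;
        }
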
    • cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs · ed957793
      Committed by Tejun Heo
      2ef37d3f ("memcg: Simplify mem_cgroup_force_empty_list error
      handling") removed the last user of __DEPRECATED_clear_css_refs.  This
      patch removes __DEPRECATED_clear_css_refs and mechanisms to support
      it.
      
      * Conditionals dependent on __DEPRECATED_clear_css_refs removed.
      
      * cgroup_clear_css_refs() can no longer fail.  All that needs to be
        done is deactivating refcnts, setting CSS_REMOVED and putting the
        base reference on each css.  Remove cgroup_clear_css_refs() and the
        failure path, and open-code the loops into cgroup_rmdir().
      
      This patch keeps the two for_each_subsys() loops separate while open
      coding them.  They can be merged now but there are scheduled changes
      which need them to be separate, so keep them separate to reduce the
      amount of churn.
      
      local_irq_save/restore() from cgroup_clear_css_refs() are replaced
      with local_irq_disable/enable() for simplicity.  This is safe as
      cgroup_rmdir() is always called with IRQ enabled.  Note that this IRQ
      switching is necessary to ensure that css_tryget() isn't called from
      IRQ context on the same CPU while lower context is between CSS
      deactivation and setting CSS_REMOVED as css_tryget() would hang
      forever in such cases waiting for CSS to be re-activated or
      CSS_REMOVED set.  This will go away soon.
      
      v2: cgroup_call_pre_destroy() removal dropped per Michal.  Commit
          message updated to explain local_irq_disable/enable() conversion.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      ed957793
  5. 27 October 2012, 1 commit
    • freezer: change ptrace_stop/do_signal_stop to use freezable_schedule() · 5d8f72b5
      Committed by Oleg Nesterov
      try_to_freeze_tasks() and cgroup_freezer rely on scheduler locks
      to ensure that a task doing STOPPED/TRACED -> RUNNING transition
      can't escape freezing. This mostly works, but ptrace_stop() does
      not necessarily call schedule(); it can change task->state back to
      RUNNING and check freezing() without any lock/barrier in between.
      
      We could add the necessary barrier, but this patch changes
      ptrace_stop() and do_signal_stop() to use freezable_schedule().
      This fixes the race: freezer_count() and freezer_should_skip()
      carefully avoid it.
      
      This also simplifies the code: try_to_freeze_tasks()/update_if_frozen()
      no longer need to use task_is_stopped_or_traced() checks with their
      non-trivial assumptions.  We can rely on the mechanism which was
      specially designed to mark a sleeping task as "frozen enough".
      
      v2: As Tejun pointed out, we can also change get_signal_to_deliver()
      and move try_to_freeze() up before 'relock' label.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      5d8f72b5
  6. 26 October 2012, 2 commits
    • Makefile: Documentation for external tool should be correct · 2008713c
      Committed by H. Peter Anvin
      If one includes documentation for an external tool, it should be
      correct.  This is not:
      
      1. Overriding the input to rngd should typically be neither
         necessary nor desired.  This is especially so since newer
         versions of rngd support a number of different *types* of sources.
      2. The default kernel-exported device is called /dev/hwrng not
         /dev/hwrandom nor /dev/hw_random (both of which were used in the
         past; however, kernel and udev seem to have converged on
         /dev/hwrng.)
      
      Overall it is better if the documentation for rngd is kept with rngd
      rather than in a kernel Makefile.
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jeff Garzik <jgarzik@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2008713c
    • pidns: limit the nesting depth of pid namespaces · f2302505
      Committed by Andrew Vagin
      'struct pid' is a "variable sized struct" - a header with an array of
      upids at the end.
      
      The size of the array depends on the level (depth) of pid namespaces.  Currently
      the level of a pidns is not limited, so 'struct pid' can be more than one page.
      
      It seems reasonable that it should be less than a page.  MAX_PID_NS_LEVEL is
      not calculated from PAGE_SIZE, because in that case it would depend on the
      architecture and config options, and it would have to be reduced if someone
      adds new fields to struct pid or struct upid.
      
      I suggest setting MAX_PID_NS_LEVEL = 32, because it preserves the ability to
      expand "struct pid" and it's more than enough for all use cases known to me.
      If someone finds a reasonable use case, we can add a config option or a
      sysctl parameter.
      
      In addition it will reduce the effect of another problem, when we have
      many nested namespaces and the oldest one starts dying.
      zap_pid_ns_processes() will be called for each namespace and find_vpid()
      will be called for each process in a namespace.  find_vpid() will be
      called a minimum of max_level^2 / 2 times.  The reason for that is that
      when we find a bit in the pidmap, we can't determine whether this pidns
      is the top one for this process or not.
      
      find_vpid() is a heavy operation, so a fork bomb which creates many nested
      namespaces can make a system inaccessible for a long time.  For example, my
      system becomes inaccessible for a few minutes with 4000 processes.
      
      [akpm@linux-foundation.org: return -EINVAL in response to excessive nesting, not -ENOMEM]
      Signed-off-by: Andrew Vagin <avagin@openvz.org>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2302505
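
      Illustration of the limit described above (toy user-space sketch; the
      constant name follows the commit, the function is only a stand-in for the
      real namespace-creation path): nesting beyond the limit is rejected with
      -EINVAL rather than -ENOMEM, per the akpm note.

        #include <errno.h>
        #include <stdio.h>

        #define MAX_PID_NS_LEVEL 32     /* nesting limit suggested by the patch */

        /* toy stand-in for creating a pid namespace one level below its parent */
        static int toy_create_pid_ns(unsigned int parent_level)
        {
            unsigned int level = parent_level + 1;

            if (level > MAX_PID_NS_LEVEL)
                return -EINVAL;         /* excessive nesting is rejected */
            return 0;                   /* the real code would allocate the namespace here */
        }

        int main(void)
        {
            printf("parent level 30 -> %d\n", toy_create_pid_ns(30));  /* 0 */
            printf("parent level 32 -> %d\n", toy_create_pid_ns(32));  /* -EINVAL (-22) */
            return 0;
        }
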
  7. 25 October 2012, 1 commit
  8. 22 October 2012, 1 commit
  9. 21 October 2012, 3 commits
    • cgroup_freezer: don't use cgroup_lock_live_group() · ead5c473
      Committed by Tejun Heo
      freezer_read/write() used cgroup_lock_live_group() to synchronize
      against task migration into and out of the target cgroup.
      cgroup_lock_live_group() grabs the internal cgroup lock and using it
      from outside cgroup core leads to complex and fragile locking
      dependency issues which are difficult to resolve.
      
      Now that freezer_can_attach() is replaced with freezer_attach() and
      update_if_frozen() updated, nothing requires excluding migration
      against freezer state reads and changes.
      
      This patch removes cgroup_lock_live_group() and the matching
      cgroup_unlock() usages.  The prone-to-bitrot, already outdated and
      unnecessary global lock hierarchy documentation is replaced with
      documentation in local scope.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Li Zefan <lizefan@huawei.com>
      ead5c473
    • cgroup_freezer: prepare update_if_frozen() for locking change · b4d18311
      Committed by Tejun Heo
      Locking will change such that migration can happen while
      freezer_read/write() is in progress.  This means that
      update_if_frozen() can no longer assume that all tasks in the cgroup
      conform to the current freezer state - newly migrated tasks which
      haven't finished freezer_attach() yet might be in any state.
      
      This patch updates update_if_frozen() such that it no longer verifies
      task states against freezer state.  It now simply decides whether
      FREEZING stage is complete.
      
      This removal of verification makes it meaningless to call
      update_if_frozen() from freezer_change_state().  Drop that call and move
      the fast exit test from freezer_read() - the only remaining caller - to
      update_if_frozen().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Li Zefan <lizefan@huawei.com>
      b4d18311
    • cgroup_freezer: allow moving tasks in and out of a frozen cgroup · 8755ade6
      Committed by Tejun Heo
      cgroup_freezer is one of the few users of cgroup_subsys->can_attach()
      and uses it to prevent tasks from being migrated into or out of a
      frozen cgroup.  This makes cgroup_freezer cumbersome to use especially
      when co-mounted with other controllers.
      
      ->can_attach() is problematic in general as it can make co-mounting
      multiple cgroups difficult - migrating tasks may fail for reasons
      completely irrelevant for other controllers.  freezer_can_attach() in
      particular is more problematic because it messes with cgroup internal
      locking to ensure that the state verification performed at
      freezer_can_attach() stays valid until migration is complete.
      
      This patch replaces freezer_can_attach() with freezer_attach() so that
      tasks are always allowed to migrate - they are nudged into the
      conforming state from freezer_attach().  This means that there can be
      tasks which are being migrated which don't conform to the current
      cgroup_freezer state until freezer_attach() is complete.  Under the
      current locking scheme, the only such place is freezer_fork(), which is
      updated to handle this window.
      
      While this patch doesn't remove the use of internal cgroup locking
      from freezer_read/write() paths, it removes the requirement to keep
      the freezer state constant while migrating and enables such change.
      
      Note that this creates a userland visible behavior change - FROZEN
      cgroup can no longer be used to lock migrations in and out of the
      cgroup.  This behavior change is intended.  I don't think the feature
      is necessary - userland should coordinate accesses to cgroup fs anyway
      - and even if the feature is needed cgroup_freezer is the completely
      wrong place to implement it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      LKML-Reference: <1350426526-14254-1-git-send-email-tj@kernel.org>
      Cc: Matt Helsley <matthltc@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Li Zefan <lizefan@huawei.com>
      8755ade6