1. 31 Jul, 2013 4 commits
  2. 16 Jul, 2013 2 commits
    • cgroup: remove gratuitous BUG_ON()s from rebind_subsystems() · a698b448
      Committed by Tejun Heo
      rebind_subsystems() performs sanity checks even on subsystems which
      aren't specified to be added or removed and the checks aren't all that
      useful given that these are in a very cold path while the violations
      they check would trip up in much hotter paths.
      
      Let's remove these from rebind_subsystems().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: move module ref handling into rebind_subsystems() · 1d5be6b2
      Committed by Tejun Heo
      Module ref handling in cgroup is rather weird.
      parse_cgroupfs_options() grabs all the modules for the specified
      subsystems.  A module ref is kept if the specified subsystem is newly
      bound to the hierarchy.  If not, or the operation fails, the refs are
      dropped.  This scatters module ref handling across multiple functions
      making it difficult to track.  It also makes the function nasty to use
      for dynamic subsystem binding which is necessary for the planned
      unified hierarchy.
      
      There's nothing which requires the subsystem modules to be pinned
      between parse_cgroupfs_options() and rebind_subsystems() in both mount
      and remount paths.  parse_cgroupfs_options() can just parse and
      rebind_subsystems() can handle pinning the subsystems that it wants to
      bind, which is a natural part of its task - binding - anyway.
      
      Move module ref handling into rebind_subsystems() which makes the code
      a lot simpler - modules are gotten iff it's gonna be bound and put iff
      unbound or binding fails.
      
      v2: Li pointed out that if a controller module is unloaded between
          parsing and binding, rebind_subsystems() won't notice the missing
          controller as it only iterates through existing controllers.  Fix
          it by updating rebind_subsystems() to compare @added_mask to
          @pinned and fail with -ENOENT if they don't match.
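A minimal userspace sketch of the pinning rule described above, not the kernel code itself. The `toy_subsys` struct, `subsys_table`, `toy_pin()`, `toy_unpin()` and `toy_rebind()` are hypothetical names; `toy_pin()` only models a module reference grab that fails once the module is gone. A reference is taken iff a controller is about to be bound, and a mismatch between the requested mask and the pinned mask fails with -ENOENT.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define NR_SUBSYS 4

/* Hypothetical stand-in for a modular controller; not the kernel's struct. */
struct toy_subsys {
	bool loaded;  /* false once the module has been unloaded */
	int modrefs;  /* models the module reference count */
	bool bound;   /* currently bound to the hierarchy */
};

static struct toy_subsys subsys_table[NR_SUBSYS];

/* Models a module reference grab: fails when the module is gone. */
static bool toy_pin(struct toy_subsys *ss)
{
	if (!ss->loaded)
		return false;
	ss->modrefs++;
	return true;
}

static void toy_unpin(struct toy_subsys *ss)
{
	assert(ss->modrefs > 0);
	ss->modrefs--;
}

/*
 * Bind the controllers in @added_mask and unbind those in @removed_mask.
 * A reference is taken only for controllers being bound and dropped only
 * on unbind or on failure.
 */
static int toy_rebind(unsigned long added_mask, unsigned long removed_mask)
{
	unsigned long pinned = 0;
	int i;

	for (i = 0; i < NR_SUBSYS; i++)
		if ((added_mask & (1UL << i)) && toy_pin(&subsys_table[i]))
			pinned |= 1UL << i;

	/* A requested controller went away between parsing and binding. */
	if (pinned != added_mask) {
		for (i = 0; i < NR_SUBSYS; i++)
			if (pinned & (1UL << i))
				toy_unpin(&subsys_table[i]);
		return -ENOENT;
	}

	for (i = 0; i < NR_SUBSYS; i++) {
		if (added_mask & (1UL << i)) {
			subsys_table[i].bound = true;
		} else if (removed_mask & (1UL << i)) {
			subsys_table[i].bound = false;
			toy_unpin(&subsys_table[i]);
		}
	}
	return 0;
}
```

In this model the parsing step never touches `modrefs`, so there is nothing for callers to drop when parsing succeeds but binding is never attempted.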
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  3. 13 Jul, 2013 8 commits
    • cgroup: move number_of_cgroups test out of rebind_subsystems() into cgroup_remount() · f172e67c
      Committed by Tejun Heo
      rebind_subsystems() currently fails if the hierarchy has any !root
      cgroups; however, on the planned unified hierarchy,
      rebind_subsystems() will be used while populated.  Move the test to
      cgroup_remount(), which is the only place the test is necessary
      anyway.
      
      As it's impossible for the other two callers of rebind_subsystems() to
      have a populated hierarchy, this doesn't make any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: make rebind_subsystems() handle file additions and removals with proper error handling · 3126121f
      Committed by Tejun Heo
      Currently, creating and removing cgroup files in the root directory
      are handled separately from the actual subsystem binding and unbinding
      which happens in rebind_subsystems().  Also, rebind_subsystems() users
      aren't handling file creation errors properly.  Let's integrate
      top_cgroup file handling into rebind_subsystems() so that it's simpler
      to use and everyone handles file creation errors correctly.
      
      * On a successful return, rebind_subsystems() is guaranteed to have
        created all files of the new subsystems and deleted the ones
        belonging to the removed subsystems.  After a failure, no file is
        created or removed.
      
      * cgroup_remount() no longer needs to make explicit populate/clear
        calls as it's all handled by rebind_subsystems(), and it gets proper
        error handling automatically.
      
      * cgroup_mount() has been updated such that the root dentry and cgroup
        are linked before rebind_subsystems().  Also, the init_cred dancing
        and base file handling are moved right above rebind_subsystems()
        call and proper error handling for the base files is added.  While
        at it, add a comment explaining what's going on with the cred thing.
      
      * cgroup_kill_sb() calls rebind_subsystems() to unbind all subsystems
        which now implies removing all subsystem files which requires the
        directory's i_mutex.  Grab it.  This means that files on the root
        cgroup are removed earlier - they used to be deleted from generic
        super_block cleanup from vfs.  This doesn't lead to any functional
        difference and it's cleaner to do the clean up explicitly for all
        files.
      
      Combined with the previous changes, this makes all cgroup file
      creation errors handled correctly.
      
      v2: Added comment on init_cred.
      
      v3: Li spotted that cgroup_mount() wasn't freeing tmp_links after base
          file addition failure.  Fix it by adding free_tmp_links error
          handling label.
      
      v4: v3 introduced build bugs which got noticed by Fengguang's awesome
          kbuild test robot.  Fixed, and shame on me.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
    • cgroup: use for_each_subsys() instead of for_each_root_subsys() in cgroup_populate/clear_dir() · b420ba7d
      Committed by Tejun Heo
      rebind_subsystems() will be updated to handle file creations and
      removals with proper error handling and to do that will need to
      perform file operations before actually adding the subsystem to the
      hierarchy.
      
      To enable such usage, update cgroup_populate/clear_dir() to use
      for_each_subsys() instead of for_each_root_subsys() so that they
      operate on all subsystems specified by @subsys_mask whether that
      subsystem is currently bound to the hierarchy or not.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: update error handling in cgroup_populate_dir() · bee55099
      Committed by Tejun Heo
      cgroup_populate_dir() didn't use to check whether the actual file
      creations were successful and could return success with only a subset of
      the requested files created, which is nasty.
      
      This patch updates cgroup_populate_dir() so that it either succeeds
      with all files or fails with no file.
      
      v2: The original patch also converted for_each_root_subsys() usages to
          for_each_subsys() without explaining why.  That part has been
          moved to a separate patch.
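The "all files or no file" guarantee can be sketched in userspace as a create loop with rollback. This is an illustration under assumptions, not the kernel implementation: `toy_dir`, `toy_add_file()`, `toy_rm_file()` and `toy_populate_dir()` are hypothetical, and `fail_at` exists only to inject a creation failure.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FILES 8

/* Hypothetical directory model: one "created" flag per file index. */
struct toy_dir {
	bool created[MAX_FILES];
	int fail_at;  /* file index whose creation fails, or -1 for never */
};

static int toy_add_file(struct toy_dir *d, int idx)
{
	assert(idx >= 0 && idx < MAX_FILES);
	if (idx == d->fail_at)
		return -ENOMEM;
	d->created[idx] = true;
	return 0;
}

static void toy_rm_file(struct toy_dir *d, int idx)
{
	d->created[idx] = false;
}

/* Create every file in @files, or leave the directory as it was. */
static int toy_populate_dir(struct toy_dir *d, const int *files, size_t n)
{
	size_t i;
	int ret = 0;

	for (i = 0; i < n; i++) {
		ret = toy_add_file(d, files[i]);
		if (ret)
			goto rollback;
	}
	return 0;

rollback:
	/* Undo only the files created by this call, newest first. */
	while (i-- > 0)
		toy_rm_file(d, files[i]);
	return ret;
}
```

A caller that sees a non-zero return can then assume nothing was left behind, which is the property the later rebind_subsystems() changes in this series rely on.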
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: separate out cgroup_base_files[] handling out of cgroup_populate/clear_dir() · 628f7cd4
      Committed by Tejun Heo
      cgroup_populate/clear_dir() currently take @base_files and add and
      remove, respectively, cgroup_base_files[] to and from the directory.  File
      additions and removals are being reorganized for proper error handling
      and more dynamic handling for the unified hierarchy, and mixing base
      and subsys file handling into the same functions gets a bit confusing.
      
      This patch moves base file handling out of cgroup_populate/clear_dir()
      into their users - cgroup_mount(), cgroup_create() and
      cgroup_destroy_locked().
      
      Note that this changes the behavior of base file removal.  If
      @base_files is %true, cgroup_clear_dir() used to delete files
      regardless of cftype until there are no files left.  Now, only files
      with matching cfts are removed.  As files can only be created by the
      base or registered cftypes, this shouldn't result in any behavior
      difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: fix cgroup_add_cftypes() error handling · 9ccece80
      Committed by Tejun Heo
      cgroup_add_cftypes() uses cgroup_cfts_commit() to actually create the
      files; however, both functions ignore actual file creation errors and
      just assume success.  This can lead to, for example, a blkio hierarchy
      in which some of the cgroups have only a subset of interface files
      populated after cfq-iosched is loaded under heavy memory pressure, which is
      nasty.
      
      This patch updates cgroup_cfts_commit() and cgroup_add_cftypes() to
      guarantee that all files are created on success and no file is created
      on failure.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: fix error path of cgroup_addrm_files() · b1f28d31
      Committed by Tejun Heo
      cgroup_addrm_files() mishandles the error return value from
      cgroup_add_file() and returns an error iff the last file fails to create.
      As we're in the process of cleaning up file add/rm error handling and
      will reliably propagate file creation failures, there's no point in
      continuing to add files after a failure.
      
      Replace the broken error collection logic with immediate error return.
      While at it, add lockdep assertions and function comment.
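To make the failure mode concrete, here is a small userspace sketch contrasting an error-collection loop that only reports the last file's result with an immediate-return loop. It reproduces the symptom the message describes rather than the exact kernel code; `toy_add_file()` and both `toy_addrm_files_*()` functions are hypothetical, and the `results` array stands in for per-file creation outcomes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-file outcome: 0 on success, negative errno on failure. */
static int toy_add_file(const int *results, size_t i)
{
	return results[i];
}

/* Broken pattern: each iteration overwrites err, so only the last result survives. */
static int toy_addrm_files_broken(const int *results, size_t n)
{
	int err = 0;
	size_t i;

	for (i = 0; i < n; i++)
		err = toy_add_file(results, i);
	return err;
}

/* Fixed pattern: stop at the first failure and propagate it. */
static int toy_addrm_files_fixed(const int *results, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int ret = toy_add_file(results, i);

		if (ret)
			return ret;
	}
	return 0;
}
```

With outcomes `{ -ENOMEM, 0 }` the broken variant reports success even though the first file was never created, while the fixed variant returns -ENOMEM.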
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: minor updates around cgroup_clear_directory() · 8f89140a
      Committed by Tejun Heo
      * Rename it to cgroup_clear_dir() and make it take the pointer to the
        target cgroup instead of the dentry.  This makes the function
        consistent with its counterpart - cgroup_populate_dir().
      
      * Move cgroup_clear_directory() invocation from cgroup_d_remove_dir()
        to cgroup_remount() so that the function doesn't have to determine
        the cgroup pointer back from the dentry.  cgroup_d_remove_dir() now
        only deals with vfs, which is slightly cleaner.
      
      This patch doesn't introduce any functional differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  4. 30 Jun, 2013 1 commit
    • cgroup: CGRP_ROOT_SUBSYS_BOUND should also be ignored when mounting an existing hierarchy · c7ba8287
      Committed by Tejun Heo
      0ce6cba3 ("cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when
      comparing mount options") only updated the remount path but
      CGRP_ROOT_SUBSYS_BOUND should also be ignored when comparing options
      while mounting an existing hierarchy.  As option mismatch triggers a
      warning but doesn't fail the mount without sane_behavior, this only
      triggers a spurious warning message.
      
      Fix it by only comparing CGRP_ROOT_OPTION_MASK bits when comparing new
      and existing root options.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  5. 28 Jun, 2013 2 commits
    • cgroup: CGRP_ROOT_SUBSYS_BOUND should be ignored when comparing mount options · 0ce6cba3
      Committed by Tejun Heo
      1672d040 ("cgroup: fix cgroupfs_root early destruction path")
      introduced CGRP_ROOT_SUBSYS_BOUND which is used to mark completion of
      subsys binding on a new root; however, this broke remounts.
      cgroup_remount() doesn't allow changing root options via remount and
      CGRP_ROOT_SUBSYS_BOUND, which is set on all fully initialized roots,
      makes the function reject all remounts.
      
      Fix it by putting the options part in the lower 16 bits of root->flags
      and masking the comparisons.  While at it, make cgroup_remount() emit
      an error message explaining why it's rejecting a remount request, so
      that it's less of a mystery.
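The masked comparison can be sketched as follows. The flag names and bit positions here are illustrative assumptions (only "options live in the lower 16 bits" comes from the message); `toy_options_match()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag layout; names and values are assumptions for this sketch. */
enum {
	TOY_ROOT_NOPREFIX     = 1 << 1,         /* example user-visible option */
	TOY_ROOT_XATTR        = 1 << 2,         /* example user-visible option */
	TOY_ROOT_OPTION_MASK  = (1 << 16) - 1,  /* lower 16 bits hold options */
	TOY_ROOT_SUBSYS_BOUND = 1 << 16,        /* internal state bit */
};

/* Compare only the option bits, so internal state never causes a mismatch. */
static bool toy_options_match(unsigned long requested, unsigned long root_flags)
{
	return ((requested ^ root_flags) & TOY_ROOT_OPTION_MASK) == 0;
}
```

A plain `requested != root_flags` test would reject every remount of a fully initialized root in this model, because the internal bit is always set there and never appears in the requested options.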
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: fix deadlock on cgroup_mutex via drop_parsed_module_refcounts() · e2bd416f
      Committed by Tejun Heo
      eb178d06 ("cgroup: grab cgroup_mutex in
      drop_parsed_module_refcounts()") made drop_parsed_module_refcounts()
      grab cgroup_mutex to make lockdep assertion in for_each_subsys()
      happy.  Unfortunately, cgroup_remount() calls the function while
      holding cgroup_mutex in its failure path leading to the following
      deadlock.
      
      # mount -t cgroup -o remount,memory,blkio cgroup blkio
      
       cgroup: option changes via remount are deprecated (pid=525 comm=mount)
      
       =============================================
       [ INFO: possible recursive locking detected ]
       3.10.0-rc4-work+ #1 Not tainted
       ---------------------------------------------
       mount/525 is trying to acquire lock:
        (cgroup_mutex){+.+.+.}, at: [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0
      
       but task is already holding lock:
        (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
      	CPU0
      	----
         lock(cgroup_mutex);
         lock(cgroup_mutex);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       4 locks held by mount/525:
        #0:  (&type->s_umount_key#30){+.+...}, at: [<ffffffff811e9a0d>] do_mount+0x2bd/0xa30
        #1:  (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [<ffffffff8110e4d3>] cgroup_remount+0x43/0x200
        #2:  (cgroup_mutex){+.+.+.}, at: [<ffffffff8110e4e1>] cgroup_remount+0x51/0x200
        #3:  (cgroup_root_mutex){+.+.+.}, at: [<ffffffff8110e4ef>] cgroup_remount+0x5f/0x200
      
       stack backtrace:
       CPU: 2 PID: 525 Comm: mount Not tainted 3.10.0-rc4-work+ #1
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
        ffffffff829651f0 ffff88000ec2fc28 ffffffff81c24bb1 ffff88000ec2fce8
        ffffffff810f420d 0000000000000006 0000000000000001 0000000000000056
        ffff8800153b4640 ffff880000000000 ffffffff81c2e468 ffff8800153b4640
       Call Trace:
        [<ffffffff81c24bb1>] dump_stack+0x19/0x1b
        [<ffffffff810f420d>] __lock_acquire+0x15dd/0x1e60
        [<ffffffff810f531c>] lock_acquire+0x9c/0x1f0
        [<ffffffff81c2a805>] mutex_lock_nested+0x65/0x410
        [<ffffffff8110a3e1>] drop_parsed_module_refcounts+0x21/0xb0
        [<ffffffff8110e63e>] cgroup_remount+0x1ae/0x200
        [<ffffffff811c9bb2>] do_remount_sb+0x82/0x190
        [<ffffffff811e9d41>] do_mount+0x5f1/0xa30
        [<ffffffff811ea203>] SyS_mount+0x83/0xc0
        [<ffffffff81c2fb82>] system_call_fastpath+0x16/0x1b
      
      Fix it by moving the drop_parsed_module_refcounts() invocation outside
      cgroup_mutex.
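The shape of the bug and the fix can be modelled in userspace with a toy non-recursive lock that reports a second acquisition instead of hanging. Everything here is hypothetical (`toy_mutex`, `toy_drop_refs()`, the two `toy_remount_fail_*()` paths); it only illustrates why calling a self-locking helper with the lock held fails, and why moving the call after the unlock resolves it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy non-recursive lock: reports recursion with -EDEADLK instead of blocking. */
struct toy_mutex {
	bool held;
};

static int toy_lock(struct toy_mutex *m)
{
	if (m->held)
		return -EDEADLK;
	m->held = true;
	return 0;
}

static void toy_unlock(struct toy_mutex *m)
{
	m->held = false;
}

static struct toy_mutex toy_cgroup_mutex;

/* Helper that takes the lock itself, like the function named in the message. */
static int toy_drop_refs(void)
{
	int ret = toy_lock(&toy_cgroup_mutex);

	if (ret)
		return ret;
	/* ... module references would be dropped here ... */
	toy_unlock(&toy_cgroup_mutex);
	return 0;
}

/* Failure path before the fix: helper is called with the lock still held. */
static int toy_remount_fail_before(void)
{
	int ret;

	toy_lock(&toy_cgroup_mutex);
	ret = toy_drop_refs();
	toy_unlock(&toy_cgroup_mutex);
	return ret;
}

/* Failure path after the fix: helper is called once the lock is released. */
static int toy_remount_fail_after(void)
{
	toy_lock(&toy_cgroup_mutex);
	/* ... remount work that fails ... */
	toy_unlock(&toy_cgroup_mutex);
	return toy_drop_refs();
}
```

A real mutex would block forever in the first path; the toy lock returns an error so the difference can be observed without hanging.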
      Signed-off-by: Tejun Heo <tj@kernel.org>
  6. 27 Jun, 2013 4 commits
    • cgroup: always use RCU accessors for protected accesses · a4ea1cc9
      Committed by Tejun Heo
      kernel/cgroup.c still has places where an RCU pointer is set and
      accessed directly without going through RCU_INIT_POINTER() or
      rcu_dereference_protected().  They're all properly protected accesses
      so nothing is broken but it leads to spurious sparse RCU address space
      warnings.
      
      Substitute direct accesses with RCU_INIT_POINTER() and
      rcu_dereference_protected().  Note that %true is specified as the
      extra condition for all dereference updates.  This isn't ideal as all
      it does is suppress the warnings without actually policing
      synchronization rules; however, most are scheduled to be removed
      pretty soon along with css_id itself, so no reason to be more
      elaborate.
      
      Combined with the previous changes, this removes all RCU related
      sparse warnings from cgroup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: fix RCU accesses around task->cgroups · a8ad805c
      Committed by Tejun Heo
      There are several places in kernel/cgroup.c where task->cgroups is
      accessed and modified without going through proper RCU accessors.
      None is broken as they're all lock protected accesses; however, this
      still triggers sparse RCU address space warnings.
      
      * Consistently use task_css_set() for task->cgroups dereferencing.
      
      * Use RCU_INIT_POINTER() to clear task->cgroups to &init_css_set on
        exit.
      
      * Remove unnecessary rcu_dereference_raw() from cset->subsys[]
        dereference in cgroup_exit().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: grab cgroup_mutex in drop_parsed_module_refcounts() · eb178d06
      Committed by Tejun Heo
      This isn't strictly necessary as all subsystems specified in
      @subsys_mask are guaranteed to be pinned; however, it does spuriously
      trigger a lockdep warning.  Let's grab cgroup_mutex around it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: fix cgroupfs_root early destruction path · 1672d040
      Committed by Tejun Heo
      cgroupfs_root used to have ->actual_subsys_mask in addition to
      ->subsys_mask.  a8a648c4 ("cgroup: remove
      cgroup->actual_subsys_mask") removed it noting that the subsys_mask is
      essentially temporary and doesn't belong in cgroupfs_root; however,
      the patch made it impossible to tell whether a cgroupfs_root actually
      has the subsystems bound or just has the bits set, leading to the
      following BUG when trying to mount with subsystems which are already
      mounted elsewhere.
      
       kernel BUG at kernel/cgroup.c:1038!
       invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       ...
       CPU: 1 PID: 7973 Comm: mount Tainted: G        W    3.10.0-rc7-next-20130625-sasha-00011-g1c1dc0e #1105
       task: ffff880fc0ae8000 ti: ffff880fc0b9a000 task.ti: ffff880fc0b9a000
       RIP: 0010:[<ffffffff81249b29>]  [<ffffffff81249b29>] rebind_subsystems+0x409/0x5f0
       ...
       Call Trace:
        [<ffffffff8124bd4f>] cgroup_kill_sb+0xff/0x210
        [<ffffffff813d21af>] deactivate_locked_super+0x4f/0x90
        [<ffffffff8124f3b3>] cgroup_mount+0x673/0x6e0
        [<ffffffff81257169>] cpuset_mount+0xd9/0x110
        [<ffffffff813d2580>] mount_fs+0xb0/0x2d0
        [<ffffffff81404afd>] vfs_kern_mount+0xbd/0x180
        [<ffffffff814070b5>] do_new_mount+0x145/0x2c0
        [<ffffffff814085d6>] do_mount+0x356/0x3c0
        [<ffffffff8140873d>] SyS_mount+0xfd/0x140
        [<ffffffff854eb600>] tracesys+0xdd/0xe2
      
      We still want rebind_subsystems() to take added/removed masks, so
      let's fix it by marking whether a cgroupfs_root has finished binding
      or not.  Also, document what's going on around ->subsys_mask
      initialization so that similar mistakes aren't repeated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
  7. 26 Jun, 2013 3 commits
    • cgroup: reserve ID 0 for dummy_root and 1 for unified hierarchy · fc76df70
      Committed by Tejun Heo
      Before 1a574231 ("cgroup: make hierarchy_id use cyclic idr"),
      hierarchy IDs were allocated from 0.  As the dummy hierarchy was
      always the one first initialized, it got assigned 0 and all other
      hierarchies from 1.  The patch accidentally changed the minimum
      useable ID to 2.
      
      Let's restore ID 0 for dummy_root and while at it reserve 1 for
      unified hierarchy.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
    • cgroup: implement for_each_[builtin_]subsys() · 30159ec7
      Committed by Tejun Heo
      There are quite a few places where all loaded [builtin] subsys are
      iterated.  Implement for_each_[builtin_]subsys() and replace manual
      iterations with those to simplify those places a bit.  The new
      iterators automatically skip NULL subsystems.  This shouldn't cause
      any functional difference.
      
      Iteration loops which scan all subsystems and then skip modular
      ones explicitly are converted to use for_each_builtin_subsys().
      
      While at it, reorder variable declarations and adjust whitespaces a
      bit in the affected functions.
      
      v2: Add lockdep_assert_held() in for_each_subsys() and add comments
          about synchronization as suggested by Li.
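An iterator macro that skips NULL slots can be sketched in portable C as below. This is an assumption-laden illustration, not the kernel macro: `toy_subsys`, `toy_subsys_table`, `for_each_loaded_subsys()` and `toy_loaded_names()` are hypothetical, and the lockdep assertion mentioned in v2 is omitted because it has no userspace counterpart here.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SUBSYS_COUNT 4

struct toy_subsys {
	const char *name;
};

static struct toy_subsys toy_cpu_ss = { "cpu" };
static struct toy_subsys toy_mem_ss = { "memory" };

/* Slots 1 and 3 are NULL, modelling modular controllers that are not loaded. */
static struct toy_subsys *toy_subsys_table[TOY_SUBSYS_COUNT] = {
	&toy_cpu_ss, NULL, &toy_mem_ss, NULL,
};

/*
 * Walk every slot and run the loop body only for non-NULL entries.
 * The empty "if" branch swallows NULL slots; the body attaches to "else".
 */
#define for_each_loaded_subsys(ss, i)				\
	for ((i) = 0; (i) < TOY_SUBSYS_COUNT; (i)++)		\
		if (!((ss) = toy_subsys_table[(i)])) { } else

/* Collect the names of loaded subsystems into @out, up to @max entries. */
static size_t toy_loaded_names(const char **out, size_t max)
{
	struct toy_subsys *ss;
	size_t n = 0;
	int i;

	for_each_loaded_subsys(ss, i) {
		if (n < max)
			out[n++] = ss->name;
	}
	return n;
}
```

Callers no longer need an explicit `if (!ss) continue;` in every loop, which is the simplification the message refers to.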
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: move init_css_set initialization inside cgroup_mutex · 82fe9b0d
      Committed by Tejun Heo
      cgroup_init() was doing init_css_set initialization outside
      cgroup_mutex, which is fine but we want to add lockdep annotation on
      subsystem iterations and cgroup_init() will trigger it spuriously.
      Move init_css_set initialization inside cgroup_mutex.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  8. 25 Jun, 2013 4 commits
    • cgroup: s/for_each_subsys()/for_each_root_subsys()/ · 5549c497
      Committed by Tejun Heo
      for_each_subsys() walks over subsystems attached to a hierarchy and
      we're gonna add iterators which walk over all available subsystems.
      Rename for_each_subsys() to for_each_root_subsys() so that it's more
      appropriately named and for_each_subsys() can be used to iterate all
      subsystems.
      
      While at it, remove unnecessary underbar prefix from macro arguments,
      put them inside parentheses, and adjust indentation for the two
      for_each_*() macros.
      
      This patch is purely cosmetic.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: clean up find_css_set() and friends · b326f9d0
      Committed by Tejun Heo
      find_css_set() passes uninitialized on-stack template[] array to
      find_existing_css_set() which sets the entries for all subsystems.
      Passing around an uninitialized array is a bit icky and we want to
      introduce an iterator which only iterates loaded subsystems.  Let's
      initialize it on definition.
      
      While at it, also make the following cosmetic cleanups.
      
      * Convert to proper /** comments.
      
      * Reorder variable declarations.
      
      * Replace comment on synchronization with lockdep_assert_held().
      
      This patch doesn't make any functional differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: remove cgroup->actual_subsys_mask · a8a648c4
      Committed by Tejun Heo
      cgroup curiously has two subsystem masks, ->subsys_mask and
      ->actual_subsys_mask.  The latter only exists because the new target
      subsys_mask is passed into rebind_subsystems() via @root->subsys_mask.
      rebind_subsystems() needs to know what the current mask is to decide
      how to reach the target mask so ->actual_subsys_mask is used as the
      temp location to remember the current state.
      
      Adding a temporary field to a permanent data structure is rather silly
      and can be misleading.  Update rebind_subsystems() to take @added_mask
      and @removed_mask instead and remove @root->actual_subsys_mask.
      
      This patch shouldn't introduce any behavior changes.
      
      v2: Comment and description updated as suggested by Li.
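Passing explicit added and removed masks means the caller derives them from the current and target masks. A minimal sketch of that derivation, with a hypothetical `toy_mask_delta` struct and `toy_compute_delta()` helper:

```c
#include <assert.h>

/* Hypothetical result type: which subsystem bits to bind and which to unbind. */
struct toy_mask_delta {
	unsigned long added;
	unsigned long removed;
};

/* Derive the bits to add and remove when moving from @cur to @target. */
static struct toy_mask_delta toy_compute_delta(unsigned long cur,
					       unsigned long target)
{
	struct toy_mask_delta d = {
		.added = target & ~cur,    /* set in target, not yet bound */
		.removed = cur & ~target,  /* bound now, absent from target */
	};

	return d;
}
```

Applying the delta to the current mask, `(cur | added) & ~removed`, yields the target again, so no temporary field is needed to remember the old state.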
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: prefix global variables with "cgroup_" · 9871bf95
      Committed by Tejun Heo
      Global variable names in kernel/cgroup.c are asking for trouble -
      subsys, roots, rootnode and so on.  Rename them to have "cgroup_"
      prefix.
      
      * s/subsys/cgroup_subsys/
      
      * s/rootnode/cgroup_dummy_root/
      
      * s/dummytop/cgroup_dummy_top/
      
      * s/roots/cgroup_roots/
      
      * s/root_count/cgroup_root_count/
      
      This patch is purely cosmetic.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  9. 19 Jun, 2013 7 commits
  10. 18 Jun, 2013 1 commit
    • cgroup: disallow rename(2) if sane_behavior · 6db8e85c
      Committed by Tejun Heo
      cgroup's rename(2) isn't a proper migration implementation - it can't
      move the cgroup to a different parent in the hierarchy.  All it can do
      is swap the name string for that cgroup.  This isn't useful and
      can mislead users to think that cgroup supports proper cgroup-level
      migration.  Disallow rename(2) if sane_behavior.
      
      v2: Fail with -EPERM instead of -EINVAL so that it matches the vfs
          return value when ->rename is not implemented as suggested by Li.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  11. 14 Jun, 2013 4 commits
    • cgroup: use percpu refcnt for cgroup_subsys_states · d3daf28d
      Committed by Tejun Heo
      A css (cgroup_subsys_state) is how each cgroup is represented to a
      controller.  As such, it can be used in hot paths across the various
      subsystems different controllers are associated with.
      
      One of the common operations is reference counting, which up until now
      has been implemented using a global atomic counter and can have
      significant adverse impact on scalability.  For example, css refcnt
      can be gotten and put multiple times by blkcg for each IO request.
      For high-ops configurations which try to do as much per-cpu as
      possible, the global frequent refcnting can be very expensive.
      
      In general, given the various and hugely diverse paths css's end up
      being used from, we need to make it cheap and highly scalable.  In its
      usage, css refcnting isn't very different from module refcnting.
      
      This patch converts css refcnting to use the recently added
      percpu_ref.  css_get/tryget/put() directly maps to the matching
      percpu_ref operations and the deactivation logic is no longer
      necessary as percpu_ref already has refcnt killing.
      
      The only complication is that as the refcnt is per-cpu,
      percpu_ref_kill() in itself doesn't ensure that further tryget
      operations will fail, which we need to guarantee before invoking
      ->css_offline()'s.  This is resolved by collecting kill confirmation
      using percpu_ref_kill_and_confirm() and initiating the offline phase
      of destruction after all css refcnt's are confirmed to be seen as
      killed on all CPUs.  The previous patches already split destruction
      into two phases, so percpu_ref_kill_and_confirm() can be hooked up
      easily.
      
      This patch removes css_refcnt() which is used for rcu dereference
      sanity check in css_id().  While we can add a percpu refcnt API to ask
      the same question, css_id() itself is scheduled to be removed fairly
      soon, so let's not bother with it.  Just drop the sanity check and use
      rcu_dereference_raw() instead.
      
      v2: - init_cgroup_css() was calling percpu_ref_init() without checking
            the return value.  This causes two problems - the obvious lack
            of error handling and percpu_ref_init() being called from
            cgroup_init_subsys() before the allocators are up, which
            triggers warnings but doesn't cause actual problems as the
            refcnt isn't used for roots anyway.  Fix both by moving
            percpu_ref_init() to cgroup_create().
      
          - The base references were put too early by
            percpu_ref_kill_and_confirm() and cgroup_offline_fn() put the
            refs one extra time.  This wasn't noticeable because css's go
            through another RCU grace period before being freed.  Update
            cgroup_destroy_locked() to grab an extra reference before
            killing the refcnts.  This problem was noticed by Kent.
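The idea behind a per-CPU reference count can be illustrated with a single-threaded toy model. This is not the kernel's percpu_ref: `toy_ref` and its functions are hypothetical, there is no real concurrency, and the kill is visible immediately, whereas the message explains that the real mechanism must wait for confirmation that every CPU has seen it.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

/* Toy, single-threaded model of a per-CPU reference count. */
struct toy_ref {
	long percpu[TOY_NR_CPUS];  /* per-CPU deltas, used while alive */
	long total;                /* shared count, authoritative once killed */
	bool dead;                 /* set by kill; tryget fails afterwards */
	bool released;             /* set when the last reference is dropped */
};

static void toy_ref_init(struct toy_ref *r)
{
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		r->percpu[cpu] = 0;
	r->total = 1;  /* base reference held by the owner */
	r->dead = false;
	r->released = false;
}

/* Cheap while alive: touches only the local CPU's counter. */
static bool toy_ref_tryget(struct toy_ref *r, int cpu)
{
	if (r->dead)
		return false;
	r->percpu[cpu]++;
	return true;
}

/* A put may land on a different CPU than the get; only the sum matters. */
static void toy_ref_put(struct toy_ref *r, int cpu)
{
	if (!r->dead) {
		r->percpu[cpu]--;
		return;
	}
	if (--r->total == 0)
		r->released = true;
}

/* Switch to the shared counter, fold in per-CPU deltas, drop the base ref. */
static void toy_ref_kill(struct toy_ref *r)
{
	int cpu;

	r->dead = true;
	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		r->total += r->percpu[cpu];
		r->percpu[cpu] = 0;
	}
	toy_ref_put(r, 0);
}
```

Individual per-CPU slots can go negative in this model (a get on one CPU, the put on another); the count is only meaningful once the deltas have been summed after the kill.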
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Kent Overstreet <koverstreet@google.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: "Alasdair G. Kergon" <agk@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
    • cgroup: split cgroup destruction into two steps · ea15f8cc
      Committed by Tejun Heo
      Split cgroup_destroy_locked() into two steps and put the latter half
      into cgroup_offline_fn() which is executed from a work item.  The
      latter half is responsible for offlining the css's, removing the
      cgroup from internal lists, and propagating release notification to
      the parent.  The separation is to allow using percpu refcnt for css.
      
      Note that this allows for other cgroup operations to happen between
      the first and second halves of destruction, including creating a new
      cgroup with the same name.  As the target cgroup is marked DEAD in the
      first half and cgroup internals don't care about the names of cgroups,
      this should be fine.  A comment explaining this will be added by the
      next patch which implements the actual percpu refcnting.
      
      As RCU freeing is guaranteed to happen after the second step of
      destruction, we can use the same work item for both.  This patch
      renames cgroup->free_work to ->destroy_work and uses it for both
      purposes.  INIT_WORK() is now performed right before queueing the work
      item.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: reorder the operations in cgroup_destroy_locked() · 455050d2
      Committed by Tejun Heo
      This patch reorders the operations in cgroup_destroy_locked() such
      that the userland visible parts happen before css offlining and
      removal from the ->sibling list.  This will be used to make css use
      percpu refcnt.
      
      While at it, split out CGRP_DEAD related comment from the refcnt
      deactivation one and correct / clarify how different guarantees are
      met.
      
      While this patch changes the specific order of operations, it
      shouldn't cause any noticeable behavior difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: remove cgroup->count and use · 6f3d828f
      Committed by Tejun Heo
      cgroup->count tracks the number of css_sets associated with the cgroup
      and is used only to verify that no css_set is associated when the cgroup
      is being destroyed.  It's superfluous as the destruction path can
      simply check whether cgroup->cset_links is empty instead.
      
      Drop cgroup->count and check ->cset_links directly from
      cgroup_destroy_locked().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>