1. 15 7月, 2014 2 次提交
    • T
      cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes · 5577964e
      Tejun Heo 提交于
      Currently, cgroup_subsys->base_cftypes is used for both the unified
      default hierarchy and legacy ones and subsystems can mark each file
      with either CFTYPE_ONLY_ON_DFL or CFTYPE_INSANE if it has to appear
      only on one of them.  This is quite hairy and error-prone.  Also, we
      may end up exposing interface files to the default hierarchy without
      thinking it through.
      
      cgroup_subsys will grow two separate cftype arrays and apply each only
      on the hierarchies of the matching type.  This will allow organizing
      cftypes in a lot clearer way and encourage subsystems to scrutinize
      the interface which is being exposed in the new default hierarchy.
      
      In preparation, this patch renames cgroup_subsys->base_cftypes to
      cgroup_subsys->legacy_cftypes.  This patch is pure rename.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Aristeu Rozanski <aris@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      5577964e
    • T
      cgroup: split cgroup_base_files[] into cgroup_{dfl|legacy}_base_files[] · a14c6874
      Tejun Heo 提交于
      Currently cgroup_base_files[] contains the cgroup core interface files
      for both legacy and default hierarchies with each file tagged with
      CFTYPE_INSANE and CFTYPE_ONLY_ON_DFL.  This is difficult to read.
      
      Let's separate it out to two separate tables, cgroup_dfl_base_files[]
      and cgroup_legacy_base_files[], and use the appropriate one in
      cgroup_mkdir() depending on the hierarchy type.  This makes tagging
      each file unnecessary.
      
      This patch doesn't introduce any behavior changes.
      
      v2: cgroup_dfl_base_files[] was missing the termination entry
          triggering WARN in cgroup_init_cftypes() for 0day kernel testing
          robot.  Fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Jet Chen <jet.chen@intel.com>
      a14c6874
  2. 09 7月, 2014 9 次提交
    • T
      cgroup: clean up sane_behavior handling · 7b9a6ba5
      Tejun Heo 提交于
      After the previous patch to remove sane_behavior support from
      non-default hierarchies, CGRP_ROOT_SANE_BEHAVIOR is used only to
      indicate the default hierarchy while parsing mount options.  This
      patch makes the following cleanups around it.
      
      * Don't show it in the mount option.  Eventually the default hierarchy
        will be assigned a different filesystem type.
      
      * As sane_behavior is no longer effective on non-default hierarchies
        and the default hierarchy doesn't accept any mount options,
        parse_cgroupfs_options() can consider sane_behavior mount option as
        indicating the default hierarchy and fail if any other options are
        specified with it.  While at it, remove one of the double blank
        lines in the function.
      
      * cgroup_mount() can now simply test CGRP_ROOT_SANE_BEHAVIOR to tell
        whether to mount the default hierarchy or not.
      
      * As CGROUP_ROOT_SANE_BEHAVIOR's only role now is indicating whether
        to select the default hierarchy or not during mount, it doesn't need
        to be set in the default hierarchy itself.  cgroup_init_early()
        updated accordingly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      7b9a6ba5
    • T
      cgroup: remove sane_behavior support on non-default hierarchies · aa6ec29b
      Tejun Heo 提交于
      sane_behavior has been used as a development vehicle for the default
      unified hierarchy.  Now that the default hierarchy is in place, the
      flag became redundant and confusing as its usage is allowed on all
      hierarchies.  There are gonna be either the default hierarchy or
      legacy ones.  Let's make that clear by removing sane_behavior support
      on non-default hierarchies.
      
      This patch replaces cgroup_sane_behavior() with cgroup_on_dfl().  The
      comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
      cgroup_on_dfl() with sane_behavior specific part dropped.
      
      On the default and legacy hierarchies w/o sane_behavior, this
      shouldn't cause any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      aa6ec29b
    • T
      cgroup: make interface file "cgroup.sane_behavior" legacy-only · c1d5d42e
      Tejun Heo 提交于
      "cgroup.sane_behavior" is added to help distinguishing whether
      sane_behavior is in effect or not.  We now have the default hierarchy
      where the flag is always in effect and are planning to remove
      supporting sane behavior on the legacy hierarchies making this file on
      the default hierarchy rather pointless.  Let's make it legacy only and
      thus always zero.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      c1d5d42e
    • T
      cgroup: remove CGRP_ROOT_OPTION_MASK · 7450e90b
      Tejun Heo 提交于
      cgroup_root->flags only contains CGRP_ROOT_* flags and there's no
      reason to mask the flags.  Remove CGRP_ROOT_OPTION_MASK.
      
      This doesn't cause any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      7450e90b
    • T
      cgroup: implement cgroup_subsys->depends_on · af0ba678
      Tejun Heo 提交于
      Currently, the blkio subsystem attributes all of writeback IOs to the
      root.  One of the issues is that there's no way to tell who originated
      a writeback IO from block layer.  Those IOs are usually issued
      asynchronously from a task which didn't have anything to do with
      actually generating the dirty pages.  The memory subsystem, when
      enabled, already keeps track of the ownership of each dirty page and
      it's desirable for blkio to piggyback instead of adding its own
      per-page tag.
      
      blkio piggybacking on memory is an implementation detail which
      preferably should be handled automatically without requiring explicit
      userland action.  To achieve that, this patch implements
      cgroup_subsys->depends_on which contains the mask of subsystems which
      should be enabled together when the subsystem is enabled.
      
      The previous patches already implemented the support for enabled but
      invisible subsystems and cgroup_subsys->depends_on can be easily
      implemented by updating cgroup_refresh_child_subsys_mask() so that it
      calculates cgroup->child_subsys_mask considering
      cgroup_subsys->depends_on of the explicitly enabled subsystems.
      
      Documentation/cgroups/unified-hierarchy.txt is updated to explain that
      subsystems may not become immediately available after being unused
      from userland and that dependency could be a factor in it.  As
      subsystems may already keep residual references, this doesn't
      significantly change how subsystem rebinding can be used.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      af0ba678
    • T
      cgroup: implement cgroup_subsys->css_reset() · b4536f0c
      Tejun Heo 提交于
      cgroup is implementing support for subsystem dependency which would
      require a way to enable a subsystem even when it's not directly
      configured through "cgroup.subtree_control".
      
      The previous patches added support for explicitly and implicitly
      enabled subsystems and showing/hiding their interface files.  An
      explicitly enabled subsystem may become implicitly enabled if it's
      turned off through "cgroup.subtree_control" but there are subsystems
      depending on it.  In such cases, the subsystem, as it's turned off
      when seen from userland, shouldn't enforce any resource control.
      Also, the subsystem may be explicitly turned on later again and its
      interface files should be as close to the intial state as possible.
      
      This patch adds cgroup_subsys->css_reset() which is invoked when a css
      is hidden.  The callback should disable resource control and reset the
      state to the vanilla state.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      b4536f0c
    • T
      cgroup: make interface files visible iff enabled on cgroup->subtree_control · f63070d3
      Tejun Heo 提交于
      cgroup is implementing support for subsystem dependency which would
      require a way to enable a subsystem even when it's not directly
      configured through "cgroup.subtree_control".
      
      The preceding patch distinguished cgroup->subtree_control and
      ->child_subsys_mask where the former is the subsystems explicitly
      configured by the userland and the latter is all enabled subsystems
      currently is equal to the former but will include subsystems
      implicitly enabled through dependency.
      
      Subsystems which are enabled due to dependency shouldn't be visible to
      userland.  This patch updates cgroup_subtree_control_write() and
      create_css() such that interface files are not created for implicitly
      enabled subsytems.
      
      * @visible paramter is added to create_css().  Interface files are
        created only when true.
      
      * If an already implicitly enabled subsystem is turned on through
        "cgroup.subtree_control", the existing css should be used.  css
        draining is skipped.
      
      * cgroup_subtree_control_write() computes the new target
        cgroup->child_subsys_mask and create/kill or show/hide csses
        accordingly.
      
      As the two subsystem masks are still kept identical, this patch
      doesn't introduce any behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      f63070d3
    • T
      cgroup: introduce cgroup->subtree_control · 667c2491
      Tejun Heo 提交于
      cgroup is implementing support for subsystem dependency which would
      require a way to enable a subsystem even when it's not directly
      configured through "cgroup.subtree_control".
      
      Previously, cgroup->child_subsys_mask directly reflected
      "cgroup.subtree_control" and the enabled subsystems in the child
      cgroups.  This patch adds cgroup->subtree_control which
      "cgroup.subtree_control" operates on.  cgroup->child_subsys_mask is
      now calculated from cgroup->subtree_control by
      cgroup_refresh_child_subsys_mask(), which sets it identical to
      cgroup->subtree_control for now.
      
      This will allow using cgroup->child_subsys_mask for all the enabled
      subsystems including the implicit ones and ->subtree_control for
      tracking the explicitly requested ones.  This patch keeps the two
      masks identical and doesn't introduce any behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      667c2491
    • T
      cgroup: reorganize cgroup_subtree_control_write() · c29adf24
      Tejun Heo 提交于
      Make the following two reorganizations to
      cgroup_subtree_control_write().  These are to prepare for future
      changes and shouldn't cause any functional difference.
      
      * Move availability above css offlining wait.
      
      * Move cgrp->child_subsys_mask update above new css creation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      c29adf24
  3. 05 6月, 2014 2 次提交
    • L
      cgroup: disallow disabled controllers on the default hierarchy · c731ae1d
      Li Zefan 提交于
      After booting with cgroup_disable=memory, I still saw memcg files
      in the default hierarchy, and I can write to them, though it won't
      take effect.
      
        # dmesg
        ...
        Disabling memory control group subsystem
        ...
        # mount -t cgroup -o __DEVEL__sane_behavior xxx /cgroup
        # ls /cgroup
        ...
        memory.failcnt                   memory.move_charge_at_immigrate
        memory.force_empty               memory.numa_stat
        memory.limit_in_bytes            memory.oom_control
        ...
        # cat /cgroup/memory.usage_in_bytes
        0
      
      tj: Minor comment update.
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      c731ae1d
    • L
      cgroup: don't destroy the default root · 1f779fb2
      Li Zefan 提交于
      The default root is allocated and initialized at boot phase, so we
      shouldn't destroy the default root when it's umounted, otherwise
      it will lead to disaster.
      
      Just try mount and then umount the default root, and the kernel will
      crash immediately.
      
      v2:
      - No need to check for CSS_NO_REF in cgroup_get/put(). (Tejun)
      - Better call cgroup_put() for the default root in kill_sb(). (Tejun)
      - Add a comment.
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      1f779fb2
  4. 03 6月, 2014 1 次提交
  5. 28 5月, 2014 1 次提交
  6. 20 5月, 2014 1 次提交
    • T
      cgroup: disallow debug controller on the default hierarchy · 5533e011
      Tejun Heo 提交于
      The debug controller, as its name suggests, exposes cgroup core
      internals to userland to aid debugging.  Unfortunately, except for the
      name, there's no provision to prevent its usage in production
      configurations and the controller is widely enabled and mounted
      leaking internal details to userland.  Like most other debug
      information, the information exposed by debug isn't interesting even
      for debugging itself once the related parts are working reliably.
      
      This controller has no reason for existing.  This patch implements
      cgrp_dfl_root_inhibit_ss_mask which can suppress specific subsystems
      on the default hierarchy and adds the debug subsystem to it so that it
      can be gradually deprecated as usages move towards the unified
      hierarchy.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      5533e011
  7. 17 5月, 2014 10 次提交
    • T
      cgroup: convert cgroup_has_live_children() into css_has_online_children() · f3d46500
      Tejun Heo 提交于
      Now that cgroup liveliness and css onliness are the same state,
      convert cgroup_has_live_children() into css_has_online_children() so
      that it can be used for actual csses too.  The function now uses
      css_for_each_child() for iteration and is published.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      f3d46500
    • T
      cgroup: use CSS_ONLINE instead of CGRP_DEAD · 184faf32
      Tejun Heo 提交于
      Use CSS_ONLINE on the self css to indicate whether a cgroup has been
      killed instead of CGRP_DEAD.  This will allow re-using css online test
      for cgroup liveliness test.  This doesn't introduce any functional
      change.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      184faf32
    • T
      cgroup: iterate cgroup_subsys_states directly · c2931b70
      Tejun Heo 提交于
      Currently, css_next_child() is implemented as finding the next child
      cgroup which has the css enabled, which used to be the only way to do
      it as only cgroups participated in sibling lists and thus could be
      iteratd.  This works as long as what's required during iteration is
      not missing online csses; however, it turns out that there are use
      cases where offlined but not yet released csses need to be iterated.
      This is difficult to implement through cgroup iteration the unified
      hierarchy as there may be multiple dying csses for the same subsystem
      associated with single cgroup.
      
      After the recent changes, the cgroup self and regular csses behave
      identically in how they're linked and unlinked from the sibling lists
      including assertion of CSS_RELEASED and css_next_child() can simply
      switch to iterating csses directly.  This both simplifies the logic
      and ensures that all visible non-released csses are included in the
      iteration whether there are multiple dying csses for a subsystem or
      not.
      
      As all other iterators depend on css_next_child() for sibling
      iteration, this changes behaviors of all css iterators.  Add and
      update explanations on the css states which are included in traversal
      to all iterators.
      
      As css iteration could always contain offlined csses, this shouldn't
      break any of the current users and new usages which need iteration of
      all on and offline csses can make use of the new semantics.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      c2931b70
    • T
      cgroup: introduce CSS_RELEASED and reduce css iteration fallback window · de3f0341
      Tejun Heo 提交于
      css iterations allow the caller to drop RCU read lock.  As long as the
      caller keeps the current position accessible, it can simply re-grab
      RCU read lock later and continue iteration.  This is achieved by using
      CGRP_DEAD to detect whether the current positions next pointer is safe
      to dereference and if not re-iterate from the beginning to the next
      position using ->serial_nr.
      
      CGRP_DEAD is used as the marker to invalidate the next pointer and the
      only requirement is that the marker is set before the next sibling
      starts its RCU grace period.  Because CGRP_DEAD is set at the end of
      cgroup_destroy_locked() but the cgroup is unlinked when the reference
      count reaches zero, we currently have a rather large window where this
      fallback re-iteration logic can be triggered.
      
      This patch introduces CSS_RELEASED which is set when a css is unlinked
      from its sibling list.  This still keeps the re-iteration logic
      working while drastically reducing the window of its activation.
      While at it, rewrite the comment in css_next_child() to reflect the
      new flag and better explain the synchronization.
      
      This will also enable iterating csses directly instead of through
      cgroups.
      
      v2: CSS_RELEASED now assigned to 1 << 2 as 1 << 0 is used by
          CSS_NO_REF.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      de3f0341
    • T
      cgroup: move cgroup->serial_nr into cgroup_subsys_state · 0cb51d71
      Tejun Heo 提交于
      We're moving towards using cgroup_subsys_states as the fundamental
      structural blocks.  All csses including the cgroup->self and actual
      ones now form trees through css->children and ->sibling which follow
      the same rules as what cgroup->children and ->sibling followed.  This
      patch moves cgroup->serial_nr which is used to implement css iteration
      into css.
      
      Note that all csses, regardless of their types, allocate their serial
      numbers from the same monotonically increasing counter.  This doesn't
      affect the ordering needed by css iteration or cause any other
      material behavior changes.  This will be used to update css iteration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      0cb51d71
    • T
      cgroup: link all cgroup_subsys_states in their sibling lists · 1fed1b2e
      Tejun Heo 提交于
      Currently, while all csses have ->children and ->sibling, only the
      self csses of cgroups make use of them.  This patch makes all other
      csses to link themselves on the sibling lists too.  This will be used
      to update css iteration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      1fed1b2e
    • T
      cgroup: move cgroup->sibling and ->children into cgroup_subsys_state · d5c419b6
      Tejun Heo 提交于
      We're moving towards using cgroup_subsys_states as the fundamental
      structural blocks.  Let's move cgroup->sibling and ->children into
      cgroup_subsys_state.  This is pure move without functional change and
      only cgroup->self's fields are actually used.  Other csses will make
      use of the fields later.
      
      While at it, update init_and_link_css() so that it zeroes the whole
      css before initializing it and remove explicit zeroing of ->flags.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      d5c419b6
    • T
      cgroup: remove cgroup->parent · d51f39b0
      Tejun Heo 提交于
      cgroup->parent is redundant as cgroup->self.parent can also be used to
      determine the parent cgroup and we're moving towards using
      cgroup_subsys_states as the fundamental structural blocks.  This patch
      introduces cgroup_parent() which follows cgroup->self.parent and
      removes cgroup->parent.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      d51f39b0
    • T
      cgroup: remove css_parent() · 5c9d535b
      Tejun Heo 提交于
      cgroup in general is moving towards using cgroup_subsys_state as the
      fundamental structural component and css_parent() was introduced to
      convert from using cgroup->parent to css->parent.  It was quite some
      time ago and we're moving forward with making css more prominent.
      
      This patch drops the trivial wrapper css_parent() and let the users
      dereference css->parent.  While at it, explicitly mark fields of css
      which are public and immutable.
      
      v2: New usage from device_cgroup.c converted.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      5c9d535b
    • T
      cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css · 3b514d24
      Tejun Heo 提交于
      9395a450 ("cgroup: enable refcnting for root csses") enabled
      reference counting for root csses (cgroup_subsys_states) so that
      cgroup's self csses can be used to manage the lifetime of the
      containing cgroups.
      
      Unfortunately, this change was incorrect.  During early init,
      cgrp_dfl_root self css refcnt is used.  percpu_ref can't initialized
      during early init and its initialization is deferred till
      cgroup_init() time.  This means that cpu was using percpu_ref which
      wasn't properly initialized.  Due to the way percpu variables are laid
      out on x86, this didn't blow up immediately on x86 but ended up
      incrementing and decrementing the percpu variable at offset zero,
      whatever it may be; however, on other archs, this caused fault and
      early boot failure.
      
      As cgroup self csses for root cgroups of non-dfl hierarchies need
      working refcounting, we can't revert 9395a450.  This patch adds
      CSS_NO_REF which explicitly inhibits reference counting on the css and
      sets it on all normal (non-self) csses and cgroup_dfl_root self css.
      
      v2: cgrp_dfl_root.self is the offending one.  Set the flag on it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NStephen Warren <swarren@nvidia.com>
      Tested-by: NStephen Warren <swarren@nvidia.com>
      Fixes: 9395a450 ("cgroup: enable refcnting for root csses")
      3b514d24
  8. 14 5月, 2014 14 次提交
    • T
      cgroup: use cgroup->self.refcnt for cgroup refcnting · 9d755d33
      Tejun Heo 提交于
      Currently cgroup implements refcnting separately using atomic_t
      cgroup->refcnt.  The destruction paths of cgroup and css are rather
      complex and bear a lot of similiarities including the use of RCU and
      bouncing to a work item.
      
      This patch makes cgroup use the refcnt of self css for refcnting
      instead of using its own.  This makes cgroup refcnting use css's
      percpu refcnt and share the destruction mechanism.
      
      * css_release_work_fn() and css_free_work_fn() are updated to handle
        both csses and cgroups.  This is a bit messy but should do until we
        can make cgroup->self a full css, which currently can't be done
        thanks to multiple hierarchies.
      
      * cgroup_destroy_locked() now performs
        percpu_ref_kill(&cgrp->self.refcnt) instead of cgroup_put(cgrp).
      
      * Negative refcnt sanity check in cgroup_get() is no longer necessary
        as percpu_ref already handles it.
      
      * Similarly, as a cgroup which hasn't been killed will never be
        released regardless of its refcnt value and percpu_ref has sanity
        check on kill, cgroup_is_dead() sanity check in cgroup_put() is no
        longer necessary.
      
      * As whether a refcnt reached zero or not can only be decided after
        the reference count is killed, cgroup_root->cgrp's refcnting can no
        longer be used to decide whether to kill the root or not.  Let's
        make cgroup_kill_sb() explicitly initiate destruction if the root
        doesn't have any children.  This makes sense anyway as unmounted
        cgroup hierarchy without any children should be destroyed.
      
      While this is a bit messy, this will allow pushing more bookkeeping
      towards cgroup->self and thus handling cgroups and csses in more
      uniform way.  In the very long term, it should be possible to
      introduce a base subsystem and convert the self css to a proper one
      making things whole lot simpler and unified.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      9d755d33
    • T
      cgroup: enable refcnting for root csses · 9395a450
      Tejun Heo 提交于
      Currently, css_get(), css_tryget() and css_tryget_online() are noops
      for root csses as an optimization; however, we're planning to use css
      refcnts to track of cgroup lifetime too and root cgroups also need to
      be reference counted.  Since css has been converted to percpu_refcnt,
      the overhead of refcnting is miniscule and this optimization isn't too
      meaningful anymore.  Furthermore, controllers which optimize the root
      cgroup often never even invoke these functions in their hot paths.
      
      This patch enables refcnting for root csses too.  This makes CSS_ROOT
      flag unused and removes it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      9395a450
    • T
      cgroup: bounce css release through css->destroy_work · 25e15d83
      Tejun Heo 提交于
      css release is planned to do more and would require process context.
      Bounce it through css->destroy_work.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      25e15d83
    • T
      cgroup: remove cgroup_destory_css_killed() · 249f3468
      Tejun Heo 提交于
      cgroup_destroy_css_killed() is cgroup destruction stage which happens
      after all csses are offlined.  After the recent updates, it no longer
      does anything other than putting the base reference.  This patch
      removes the function and makes cgroup_destroy_locked() put the base
      ref at the end isntead.
      
      This also makes cgroup->nr_css unnecessary.  Removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      249f3468
    • T
      cgroup: move cgroup->sibling unlinking to cgroup_put() · 4e4e2847
      Tejun Heo 提交于
      Move cgroup->sibling unlinking from cgroup_destroy_css_killed() to
      cgroup_put().  This is later but still before the RCU grace period, so
      it doesn't break css_next_child() although there now is a larger
      window in which a dead cgroup is visible during css iteration.  As css
      iteration always could have included offline csses, this doesn't
      affect correctness; however, it does make css_next_child() fall back
      to reiterting mode more often.  This also makes cgroup_put() directly
      take cgroup_mutex, which limits where it can be called from.  These
      are not immediately problematic and will be dealt with later.
      
      This change enables simplification of cgroup destruction path.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      4e4e2847
    • T
      cgroup: move check_for_release(parent) call to the end of cgroup_destroy_locked() · 9e4173e1
      Tejun Heo 提交于
      Currently, check_for_release() on the parent of a destroyed cgroup is
      invoked from cgroup_destroy_css_killed().  This is because this is
      where the destroyed cgroup can be removed from the parent's children
      list.  check_for_release() tests the emptiness of the list directly,
      so invoking it before removing the cgroup from the list makes it think
      that the parent still has children even when it no longer does.
      
      This patch updates check_for_release() to use
      cgroup_has_live_children() instead of directly testing ->children
      emptiness and moves check_for_release(parent) earlier to the end of
      cgroup_destroy_locked().  As cgroup_has_live_children() ignores
      cgroups marked DEAD, check_for_release() functions correctly as long
      as it's called after asserting DEAD.
      
      This makes release notification slightly more timely and more
      importantly enables further simplification of cgroup destruction path.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      9e4173e1
    • T
      cgroup: separate out cgroup_has_live_children() from cgroup_destroy_locked() · cbc125ef
      Tejun Heo 提交于
      We're expecting another user.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      cbc125ef
    • T
      cgroup: rename cgroup->dummy_css to ->self and move it to the top · 9d800df1
      Tejun Heo 提交于
      cgroup->dummy_css is used as the placeholder css when performing css
      oriended operations on the cgroup.  We're gonna shift more cgroup
      management to this css.  Let's rename it to ->self and move it to the
      top.
      
      This is pure rename and field relocation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      9d800df1
    • T
      cgroup: use restart_syscall() for mount retries · a015edd2
      Tejun Heo 提交于
      cgroup_mount() uses dumb delay-and-retry logic to wait for cgroup_root
      which is being destroyed.  The retry currently loops inside
      cgroup_mount() proper.  This patch makes it return with
      restart_syscall() instead so that retry travels out to userland
      boundary.
      
      This slightly simplifies the logic and more importantly makes the
      retry logic behave better when the wait for some reason becomes
      lengthy or infinite by allowing the operation to be suspended or
      terminated from userland.
      
      v2: The original patch forgot to free memory allocated for @opts.
          Fixed.  Caught by Li Zefan.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      a015edd2
    • T
      cgroup: remove cgroup_tree_mutex · 8353da1f
      Tejun Heo 提交于
      cgroup_tree_mutex was introduced to work around the circular
      dependency between cgroup_mutex and kernfs active protection - some
      kernfs file and directory operations needed cgroup_mutex putting
      cgroup_mutex under active protection but cgroup also needs to be able
      to access cgroup hierarchies and cftypes to determine which
      kernfs_nodes need to be removed.  cgroup_tree_mutex nested above both
      cgroup_mutex and kernfs active protection and used to protect the
      hierarchy and cftypes.  While this worked, it added a lot of double
      lockings and was generally cumbersome.
      
      kernfs provides a mechanism to opt out of active protection and cgroup
      was already using it for removal and subtree_control.  There's no
      reason to mix both methods of avoiding circular locking dependency and
      the preceding cgroup_kn_lock_live() changes applied it to all relevant
      cgroup kernfs operations making it unnecessary to nest cgroup_mutex
      under kernfs active protection.  The previous patch reversed the
      original lock ordering and put cgroup_mutex above kernfs active
      protection.
      
      After these changes, all cgroup_tree_mutex usages are now accompanied
      by cgroup_mutex making the former completely redundant.  This patch
      removes cgroup_tree_mutex and all its usages.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      8353da1f
    • T
      cgroup: nest kernfs active protection under cgroup_mutex · 01f6474c
      Tejun Heo 提交于
      After the recent cgroup_kn_lock_live() changes, cgroup_mutex is no
      longer nested below kernfs active protection.  The two don't have any
      relationship now.
      
      This patch nests kernfs active protection under cgroup_mutex.  All
      cftype operations now require both cgroup_tree_mutex and cgroup_mutex,
      temporary cgroup_mutex releases over kernfs operations are removed,
      and cgroup_add/rm_cftypes() grab both mutexes.
      
      This makes cgroup_tree_mutex redundant, which will be removed by the
      next patch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      01f6474c
    • T
      cgroup: use cgroup_kn_lock_live() in other cgroup kernfs methods · e76ecaee
      Tejun Heo 提交于
      Make __cgroup_procs_write() and cgroup_release_agent_write() use
      cgroup_kn_lock_live() and cgroup_kn_unlock() instead of
      cgroup_lock_live_group().  This puts the operations under both
      cgroup_tree_mutex and cgroup_mutex protection without circular
      dependency from kernfs active protection.  Also, this means that
      cgroup_mutex is no longer nested below kernfs active protection.
      There is no longer any place where the two locks interact.
      
      This leaves cgroup_lock_live_group() without any user.  Removed.
      
      This will help simplifying cgroup locking.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      e76ecaee
    • T
      cgroup: factor out cgroup_kn_lock_live() and cgroup_kn_unlock() · a9746d8d
      Tejun Heo 提交于
      cgroup_mkdir(), cgroup_rmdir() and cgroup_subtree_control_write()
      share the logic to break active protection so that they can grab
      cgroup_tree_mutex which nests above active protection and/or remove
      self.  Factor out this logic into cgroup_kn_lock_live() and
      cgroup_kn_unlock().
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      a9746d8d
    • T
      cgroup: move cgroup->kn->priv clearing to cgroup_rmdir() · cfc79d5b
      Tejun Heo 提交于
      The ->priv field of a cgroup directory kernfs_node points back to the
      cgroup.  This field is RCU cleared in cgroup_destroy_locked() for
      non-kernfs accesses from css_tryget_from_dir() and
      cgroupstats_build().
      
      As these are only applicable to cgroups which finished creation
      successfully and fully initialized cgroups are always removed by
      cgroup_rmdir(), this can be safely moved to the end of cgroup_rmdir().
      
      This will help simplifying cgroup locking and shouldn't introduce any
      behavior difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      cfc79d5b