1. 05 5月, 2014 1 次提交
    • T
      cgroup: make flags and subsys_masks unsigned int · 69dfa00c
      Tejun Heo 提交于
      There's no reason to use atomic bitops for cgroup_subsys_state->flags,
      cgroup_root->flags and various subsys_masks.  This patch updates those
      to use bitwise and/or operations instead and converts them form
      unsigned long to unsigned int.
      
      This makes the fields occupy (marginally) smaller space and makes it
      clear that they don't require atomicity.
      
      This patch doesn't cause any behavior difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      69dfa00c
  2. 26 4月, 2014 1 次提交
    • T
      cgroup: implement cgroup.populated for the default hierarchy · 842b597e
      Tejun Heo 提交于
      cgroup users often need a way to determine when a cgroup's
      subhierarchy becomes empty so that it can be cleaned up.  cgroup
      currently provides release_agent for it; unfortunately, this mechanism
      is riddled with issues.
      
      * It delivers events by forking and execing a userland binary
        specified as the release_agent.  This is a long deprecated method of
        notification delivery.  It's extremely heavy, slow and cumbersome to
        integrate with larger infrastructure.
      
      * There is single monitoring point at the root.  There's no way to
        delegate management of a subtree.
      
      * The event isn't recursive.  It triggers when a cgroup doesn't have
        any tasks or child cgroups.  Events for internal nodes trigger only
        after all children are removed.  This again makes it impossible to
        delegate management of a subtree.
      
      * Events are filtered from the kernel side.  "notify_on_release" file
        is used to subscribe to or suppress release event.  This is
        unnecessarily complicated and probably done this way because event
        delivery itself was expensive.
      
      This patch implements interface file "cgroup.populated" which can be
      used to monitor whether the cgroup's subhierarchy has tasks in it or
      not.  Its value is 0 if there is no task in the cgroup and its
      descendants; otherwise, 1, and kernfs_notify() notificaiton is
      triggers when the value changes, which can be monitored through poll
      and [di]notify.
      
      This is a lot ligther and simpler and trivially allows delegating
      management of subhierarchy - subhierarchy monitoring can block further
      propgation simply by putting itself or another process in the root of
      the subhierarchy and monitor events that it's interested in from there
      without interfering with monitoring higher in the tree.
      
      v2: Patch description updated as per Serge.
      
      v3: "cgroup.subtree_populated" renamed to "cgroup.populated".  The
          subtree_ prefix was a bit confusing because
          "cgroup.subtree_control" uses it to denote the tree rooted at the
          cgroup sans the cgroup itself while the populated state includes
          the cgroup itself.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NSerge Hallyn <serge.hallyn@ubuntu.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Lennart Poettering <lennart@poettering.net>
      842b597e
  3. 23 4月, 2014 6 次提交
    • T
      cgroup: implement dynamic subtree controller enable/disable on the default hierarchy · f8f22e53
      Tejun Heo 提交于
      cgroup is switching away from multiple hierarchies and will use one
      unified default hierarchy where controllers can be dynamically enabled
      and disabled per subtree.  The default hierarchy will serve as the
      unified hierarchy to which all controllers are attached and a css on
      the default hierarchy would need to also serve the tasks of descendant
      cgroups which don't have the controller enabled - ie. the tree may be
      collapsed from leaf towards root when viewed from specific
      controllers.  This has been implemented through effective css in the
      previous patches.
      
      This patch finally implements dynamic subtree controller
      enable/disable on the default hierarchy via a new knob -
      "cgroup.subtree_control" which controls which controllers are enabled
      on the child cgroups.  Let's assume a hierarchy like the following.
      
        root - A - B - C
                     \ D
      
      root's "cgroup.subtree_control" determines which controllers are
      enabled on A.  A's on B.  B's on C and D.  This coincides with the
      fact that controllers on the immediate sub-level are used to
      distribute the resources of the parent.  In fact, it's natural to
      assume that resource control knobs of a child belong to its parent.
      Enabling a controller in "cgroup.subtree_control" declares that
      distribution of the respective resources of the cgroup will be
      controlled.  Note that this means that controller enable states are
      shared among siblings.
      
      The default hierarchy has an extra restriction - only cgroups which
      don't contain any task may have controllers enabled in
      "cgroup.subtree_control".  Combined with the other properties of the
      default hierarchy, this guarantees that, from the view point of
      controllers, tasks are only on the leaf cgroups.  In other words, only
      leaf csses may contain tasks.  This rules out situations where child
      cgroups compete against internal tasks of the parent, which is a
      competition between two different types of entities without any clear
      way to determine resource distribution between the two.  Different
      controllers handle it differently and all the implemented behaviors
      are ambiguous, ad-hoc, cumbersome and/or just wrong.  Having this
      structural constraints imposed from cgroup core removes the burden
      from controller implementations and enables showing one consistent
      behavior across all controllers.
      
      When a controller is enabled or disabled, css associations for the
      controller in the subtrees of each child should be updated.  After
      enabling, the whole subtree of a child should point to the new css of
      the child.  After disabling, the whole subtree of a child should point
      to the cgroup's css.  This is implemented by first updating cgroup
      states such that cgroup_e_css() result points to the appropriate css
      and then invoking cgroup_update_dfl_csses() which migrates all tasks
      in the affected subtrees to the self cgroup on the default hierarchy.
      
      * When read, "cgroup.subtree_control" lists all the currently enabled
        controllers on the children of the cgroup.
      
      * White-space separated list of controller names prefixed with either
        '+' or '-' can be written to "cgroup.subtree_control".  The ones
        prefixed with '+' are enabled on the controller and '-' disabled.
      
      * A controller can be enabled iff the parent's
        "cgroup.subtree_control" enables it and disabled iff no child's
        "cgroup.subtree_control" has it enabled.
      
      * If a cgroup has tasks, no controller can be enabled via
        "cgroup.subtree_control".  Likewise, if "cgroup.subtree_control" has
        some controllers enabled, tasks can't be migrated into the cgroup.
      
      * All controllers which aren't bound on other hierarchies are
        automatically associated with the root cgroup of the default
        hierarchy.  All the controllers which are bound to the default
        hierarchy are listed in the read-only file "cgroup.controllers" in
        the root directory.
      
      * "cgroup.controllers" in all non-root cgroups is read-only file whose
        content is equal to that of "cgroup.subtree_control" of the parent.
        This indicates which controllers can be used in the cgroup's
        "cgroup.subtree_control".
      
      This is still experimental and there are some holes, one of which is
      that ->can_attach() failure during cgroup_update_dfl_csses() may leave
      the cgroups in an undefined state.  The issues will be addressed by
      future patches.
      
      v2: Non-root cgroups now also have "cgroup.controllers".
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      f8f22e53
    • T
      cgroup: add css_set->dfl_cgrp · 6803c006
      Tejun Heo 提交于
      To implement the unified hierarchy behavior, we'll need to be able to
      determine the associated cgroup on the default hierarchy from css_set.
      Let's add css_set->dfl_cgrp so that it can be accessed conveniently
      and efficiently.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      6803c006
    • T
      cgroup: teach css_task_iter about effective csses · 3ebb2b6e
      Tejun Heo 提交于
      Currently, css_task_iter iterates tasks associated with a css by
      visiting each css_set associated with the owning cgroup and walking
      tasks of each of them.  This works fine for !unified hierarchies as
      each cgroup has its own css for each associated subsystem on the
      hierarchy; however, on the planned unified hierarchy, a cgroup may not
      have csses associated and its tasks would be considered associated
      with the matching css of the nearest ancestor which has the subsystem
      enabled.
      
      This means that on the default unified hierarchy, just walking all
      tasks associated with a cgroup isn't enough to walk all tasks which
      are associated with the specified css.  If any of its children doesn't
      have the matching css enabled, task iteration should also include all
      tasks from the subtree.  We already added cgroup->e_csets[] to list
      all css_sets effectively associated with a given css and walk css_sets
      on that list instead to achieve such iteration.
      
      This patch updates css_task_iter iteration such that it walks css_sets
      on cgroup->e_csets[] instead of cgroup->cset_links if iteration is
      requested on an non-dummy css.  Thanks to the previous iteration
      update, this change can be achieved with the addition of
      css_task_iter->ss and minimal updates to css_advance_task_iter() and
      css_task_iter_start().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      3ebb2b6e
    • T
      cgroup: reorganize css_task_iter · 0f0a2b4f
      Tejun Heo 提交于
      This patch reorganizes css_task_iter so that adding effective css
      support is easier.
      
      * s/->cset_link/->cset_pos/ and s/->task/->task_pos/ for consistency
      
      * ->origin_css is used to determine whether the iteration reached the
        last css_set.  Replace it with explicit ->cset_head so that
        css_advance_task_iter() doesn't have to know the termination
        condition directly.
      
      * css_task_iter_next() currently assumes that it's walking list of
        cgrp_cset_link and reaches into the current cset through the current
        link to determine the termination conditions for task walking.  As
        this won't always be true for effective css walking, add
        ->tasks_head and ->mg_tasks_head and use them to control task
        walking so that css_task_iter_next() doesn't have to know how
        css_sets are being walked.
      
      This patch doesn't make any behavior changes.  The iteration logic
      stays unchanged after the patch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      0f0a2b4f
    • T
      cgroup: implement cgroup->e_csets[] · 2d8f243a
      Tejun Heo 提交于
      On the default unified hierarchy, a cgroup may be associated with
      csses of its ancestors, which means that a css of a given cgroup may
      be associated with css_sets of descendant cgroups.  This means that we
      can't walk all tasks associated with a css by iterating the css_sets
      associated with the cgroup as there are css_sets which are pointing to
      the css but linked on the descendants.
      
      This patch adds per-subsystem list heads cgroup->e_csets[].  Any
      css_set which is pointing to a css is linked to
      css->cgroup->e_csets[$SUBSYS_ID] through
      css_set->e_cset_node[$SUBSYS_ID].  The lists are protected by
      css_set_rwsem and will allow us to walk all css_sets associated with a
      given css so that we can find out all associated tasks.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      2d8f243a
    • T
      cgroup: update cgroup->subsys_mask to ->child_subsys_mask and restore cgroup_root->subsys_mask · f392e51c
      Tejun Heo 提交于
      94419627 ("cgroup: move ->subsys_mask from cgroupfs_root to
      cgroup") moved ->subsys_mask from cgroup_root to cgroup to prepare for
      the unified hierarhcy; however, it turns out that carrying the
      subsys_mask of the children in the parent, instead of itself, is a lot
      more natural.  This patch restores cgroup_root->subsys_mask and morphs
      cgroup->subsys_mask into cgroup->child_subsys_mask.
      
      * Uses of root->cgrp.subsys_mask are restored to root->subsys_mask.
      
      * Remove automatic setting and clearing of cgrp->subsys_mask and
        instead just inherit ->child_subsys_mask from the parent during
        cgroup creation.  Note that this doesn't affect any current
        behaviors.
      
      * Undo __kill_css() separation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      f392e51c
  4. 29 3月, 2014 1 次提交
  5. 19 3月, 2014 6 次提交
    • T
      cgroup: implement CFTYPE_ONLY_ON_DFL · 8cbbf2c9
      Tejun Heo 提交于
      This cftype flag makes the file only appear on the default hierarchy.
      This will later be used for cgroup.controllers file.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      8cbbf2c9
    • T
      cgroup: make cgrp_dfl_root mountable · a2dd4247
      Tejun Heo 提交于
      cgrp_dfl_root will be used as the default unified hierarchy.  This
      patch makes cgrp_dfl_root mountable by making the following changes.
      
      * cgroup_init_early() now initializes cgrp_dfl_root w/
        CGRP_ROOT_SANE_BEHAVIOR.  The default hierarchy is always sane.
      
      * parse_cgroupfs_options() and cgroup_mount() are updated such that
        cgrp_dfl_root is mounted if sane_behavior is specified w/o any
        subsystems.
      
      * rebind_subsystems() now populates the root directory of
        cgrp_dfl_root.  Note that the function still guarantees success of
        rebinding subsystems to cgrp_dfl_root.  If populating fails while
        rebinding to cgrp_dfl_root, it whines but ignores the error.
      
      * For backward compatibility, the default hierarchy shows up in
        /proc/$PID/cgroup only after it's explicitly mounted so that
        userland which doesn't make use of it doesn't see any change.
      
      * "current_css_set_cg_links" file of debug cgroup now treats the
        default hierarchy the same as other hierarchies.  This is visible to
        userland.  Given that it's for debug controller, this should be
        fine.
      
      * While at it, implement cgroup_on_dfl() which tests whether a give
        cgroup is on the default hierarchy or not.
      
      The above changes make cgrp_dfl_root mostly equivalent to other
      controllers but the actual unified hierarchy behaviors are not
      implemented yet.  Let's plug child cgroup creation in cgrp_dfl_root
      from create_cgroup() for now.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      a2dd4247
    • T
      cgroup: drop const from @buffer of cftype->write_string() · 4d3bb511
      Tejun Heo 提交于
      cftype->write_string() just passes on the writeable buffer from kernfs
      and there's no reason to add const restriction on the buffer.  The
      only thing const achieves is unnecessarily complicating parsing of the
      buffer.  Drop const from @buffer.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>                                           
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      4d3bb511
    • T
      cgroup: rename cgroup_dummy_root and related names · 3dd06ffa
      Tejun Heo 提交于
      The dummy root will be repurposed to serve as the default unified
      hierarchy.  Let's rename things in preparation.
      
      * s/cgroup_dummy_root/cgrp_dfl_root/
      * s/cgroupfs_root/cgroup_root/ as we don't do fs part directly anymore
      * s/cgroup_root->top_cgroup/cgroup_root->cgrp/ for brevity
      
      This is pure rename.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      3dd06ffa
    • T
      cgroup: move ->subsys_mask from cgroupfs_root to cgroup · 94419627
      Tejun Heo 提交于
      cgroupfs_root->subsys_mask represents the controllers attached to the
      hierarchy.  This patch moves the field to cgroup.  Subsystem
      initialization and rebinding updates the top cgroup's subsys_mask.
      For !root cgroups, the subsys_mask bits are set from create_css() and
      cleared from kill_css(), which effectively means that all cgroups will
      have the same subsys_mask as the top cgroup.
      
      While this doesn't make any difference now, this will help
      implementation of the default unified hierarchy where !root cgroups
      may have subsets of the top_cgroup's subsys_mask.
      
      While at it, __kill_css() is split out of kill_css().  The former
      doesn't care about the subsys_mask while the latter becomes noop if
      the controller is already killed and clears the matching bit if not
      before proceeding to killing the css.  This will be used later by the
      default unified hierarchy implementation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      94419627
    • T
      cgroup: remove NULL checks from [pr_cont_]cgroup_{name|path}() · fdce6bf8
      Tejun Heo 提交于
      The dummy hierarchy is now a fully functional one and dummy_top has a
      kernfs_node associated with it.  Drop the NULL checks in
      [pr_cont_]cont_{name|path}() which are no longer necessary.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      fdce6bf8
  6. 25 2月, 2014 4 次提交
    • T
      cgroup: drop task_lock() protection around task->cgroups · 0e1d768f
      Tejun Heo 提交于
      For optimization, task_lock() is additionally used to protect
      task->cgroups.  The optimization is pretty dubious as either
      css_set_rwsem is grabbed anyway or PF_EXITING already protects
      task->cgroups.  It adds only overhead and confusion at this point.
      Let's drop task_[un]lock() and update comments accordingly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      0e1d768f
    • T
      cgroup: split process / task migration into four steps · 1958d2d5
      Tejun Heo 提交于
      Currently, process / task migration is a single operation which may
      fail depending on memory pressure or the involved controllers'
      ->can_attach() callbacks.  One problem with this approach is migration
      of multiple targets.  It's impossible to tell whether a given target
      will be successfully migrated beforehand and cgroup core can't keep
      track of enough states to roll back after intermediate failure.
      
      This is already an issue with cgroup_transfer_tasks().  Also, we're
      gonna need multiple target migration for unified hierarchy.
      
      This patch splits migration into four stages -
      cgroup_migrate_add_src(), cgroup_migrate_prepare_dst(),
      cgroup_migrate() and cgroup_migrate_finish(), where
      cgroup_migrate_prepare_dst() performs all the operations which may
      fail due to allocation failure without actually migrating the target.
      
      The four separate stages mean that, disregarding ->can_attach()
      failures, the success or failure of multi target migration can be
      determined before performing any actual migration.  If preparations of
      all targets succeed, the whole thing will succeed.  If not, the whole
      operation can fail without any side-effect.
      
      Since the previous patch to use css_set->mg_tasks to keep track of
      migration targets, the only thing which may need memory allocation
      during migration is the target css_sets.  cgroup_migrate_prepare()
      pins all source and target css_sets and link them up.  Note that this
      can be performed without holding threadgroup_lock even if the target
      is a process.  As long as cgroup_mutex is held, no new css_set can be
      put into play.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      1958d2d5
    • T
      cgroup: use css_set->mg_tasks to track target tasks during migration · b3dc094e
      Tejun Heo 提交于
      Currently, while migrating tasks from one cgroup to another,
      cgroup_attach_task() builds a flex array of all target tasks;
      unfortunately, this has a couple issues.
      
      * Flex array has size limit.  On 64bit, struct task_and_cgroup is
        24bytes making the flex element limit around 87k.  It is a high
        number but not impossible to hit.  This means that the current
        cgroup implementation can't migrate a process with more than 87k
        threads.
      
      * Process migration involves memory allocation whose size is dependent
        on the number of threads the process has.  This means that cgroup
        core can't guarantee success or failure of multi-process migrations
        as memory allocation failure can happen in the middle.  This is in
        part because cgroup can't grab threadgroup locks of multiple
        processes at the same time, so when there are multiple processes to
        migrate, it is imposible to tell how many tasks are to be migrated
        beforehand.
      
        Note that this already affects cgroup_transfer_tasks().  cgroup
        currently cannot guarantee atomic success or failure of the
        operation.  It may fail in the middle and after such failure cgroup
        doesn't have enough information to roll back properly.  It just
        aborts with some tasks migrated and others not.
      
      To resolve the situation, this patch updates the migration path to use
      task->cg_list to track target tasks.  The previous patch already added
      css_set->mg_tasks and updated iterations in non-migration paths to
      include them during task migration.  This patch updates migration path
      to actually make use of it.
      
      Instead of putting onto a flex_array, each target task is moved from
      its css_set->tasks list to css_set->mg_tasks and the migration path
      keeps trace of all the source css_sets and the associated cgroups.
      Once all source css_sets are determined, the destination css_set for
      each is determined, linked to the matching source css_set and put on a
      separate list.
      
      To iterate the target tasks, migration path just needs to iterat
      through either the source or target css_sets, depending on whether
      migration has been committed or not, and the tasks on their ->mg_tasks
      lists.  cgroup_taskset is updated to contain the list_heads for source
      and target css_sets and the iteration cursor.  cgroup_taskset_*() are
      accordingly updated to walk through css_sets and their ->mg_tasks.
      
      This resolves the above listed issues with moderate additional
      complexity.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      b3dc094e
    • T
      cgroup: add css_set->mg_tasks · c7561128
      Tejun Heo 提交于
      Currently, while migrating tasks from one cgroup to another,
      cgroup_attach_task() builds a flex array of all target tasks;
      unfortunately, this has a couple issues.
      
      * Flex array has size limit.  On 64bit, struct task_and_cgroup is
        24bytes making the flex element limit around 87k.  It is a high
        number but not impossible to hit.  This means that the current
        cgroup implementation can't migrate a process with more than 87k
        threads.
      
      * Process migration involves memory allocation whose size is dependent
        on the number of threads the process has.  This means that cgroup
        core can't guarantee success or failure of multi-process migrations
        as memory allocation failure can happen in the middle.  This is in
        part because cgroup can't grab threadgroup locks of multiple
        processes at the same time, so when there are multiple processes to
        migrate, it is imposible to tell how many tasks are to be migrated
        beforehand.
      
        Note that this already affects cgroup_transfer_tasks().  cgroup
        currently cannot guarantee atomic success or failure of the
        operation.  It may fail in the middle and after such failure cgroup
        doesn't have enough information to roll back properly.  It just
        aborts with some tasks migrated and others not.
      
      To resolve the situation, we're going to use task->cg_list during
      migration too.  Instead of building a separate array, target tasks
      will be linked into a dedicated migration list_head on the owning
      css_set.  Tasks on the migration list are treated the same as tasks on
      the usual tasks list; however, being on a separate list allows cgroup
      migration code path to keep track of the target tasks by simply
      keeping the list of css_sets with tasks being migrated, making
      unpredictable dynamic allocation unnecessary.
      
      In prepartion of such migration path update, this patch introduces
      css_set->mg_tasks list and updates css_set task iterations so that
      they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
      isn't used yet.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      c7561128
  7. 14 2月, 2014 1 次提交
  8. 13 2月, 2014 6 次提交
  9. 12 2月, 2014 11 次提交
    • T
      cgroup: remove cgroupfs_root->refcnt · 776f02fa
      Tejun Heo 提交于
      Currently, cgroupfs_root and its ->top_cgroup are separated reference
      counted and the latter's is ignored.  There's no reason to do this
      separately.  This patch removes cgroupfs_root->refcnt and destroys
      cgroupfs_root when the top_cgroup is released.
      
      * cgroup_put() updated to ignore cgroup_is_dead() test for top
        cgroups.  cgroup_free_fn() updated to handle root destruction when
        releasing a top cgroup.
      
      * As root destruction is now bounced through cgroup destruction, it is
        asynchronous.  Update cgroup_mount() so that it waits for pending
        release which is currently implemented using msleep().  Converting
        this to proper wait_queue isn't hard but likely unnecessary.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      776f02fa
    • T
      cgroup: rename cgroupfs_root->number_of_cgroups to ->nr_cgrps and make it atomic_t · 3c9c825b
      Tejun Heo 提交于
      root->number_of_cgroups is currently an integer protected with
      cgroup_mutex.  Except for sanity checks and proc reporting, the only
      place it's used is to check whether the root has any child during
      remount; however, this is a bit flawed as the counter is not
      decremented when the cgroup is unlinked but when it's released,
      meaning that there could be an extended period where all cgroups are
      removed but remount is still not allowed because some internal objects
      are lingering.  While not perfect either, it'd be better to use
      emptiness test on root->top_cgroup.children.
      
      This patch updates cgroup_remount() to test top_cgroup's children
      instead, which makes number_of_cgroups only actual usage statistics
      printing in proc implemented in proc_cgroupstats_show().  Let's
      shorten its name and make it an atomic_t so that we don't have to
      worry about its synchronization.  It's purely auxiliary at this point.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      3c9c825b
    • T
      cgroup: remove cgroup->name · e61734c5
      Tejun Heo 提交于
      cgroup->name handling became quite complicated over time involving
      dedicated struct cgroup_name for RCU protection.  Now that cgroup is
      on kernfs, we can drop all of it and simply use kernfs_name/path() and
      friends.  Replace cgroup->name and all related code with kernfs
      name/path constructs.
      
      * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
        of kernfs counterparts, which involves semantic changes.
        pr_cont_cgroup_name() and pr_cont_cgroup_path() added.
      
      * cgroup->name handling dropped from cgroup_rename().
      
      * All users of cgroup_name/path() updated to the new semantics.  Users
        which were formatting the string just to printk them are converted
        to use pr_cont_cgroup_name/path() instead, which simplifies things
        quite a bit.  As cgroup_name() no longer requires RCU read lock
        around it, RCU lockings which were protecting only cgroup_name() are
        removed.
      
      v2: Comment above oom_info_lock updated as suggested by Michal.
      
      v3: dummy_top doesn't have a kn associated and
          pr_cont_cgroup_name/path() ended up calling the matching kernfs
          functions with NULL kn leading to oops.  Test for NULL kn and
          print "/" if so.  This issue was reported by Fengguang Wu.
      
      v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      e61734c5
    • T
      cgroup: remove cftype_set · 0adb0704
      Tejun Heo 提交于
      cftype_set was added primarily to allow registering the same cftype
      array more than once for different subsystems.  Nobody uses or needs
      such thing and it's already broken because each cftype has ->ss
      pointer which is initialized during registration.
      
      Let's add list_head ->node to cftype and use the first cftype entry in
      the array to link them instead of allocating separate cftype_set.
      While at it, trigger WARN if cft seems previously initialized during
      registration.
      
      This simplifies cftype handling a bit.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      0adb0704
    • T
      cgroup: warn if "xattr" is specified with "sane_behavior" · 86bf4b68
      Tejun Heo 提交于
      Mount option "xattr" is no longer necessary as it's enabled by default
      on kernfs.  Warn if "xattr" is specified with "sane_behavior" so that
      the option can be removed in the future.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      86bf4b68
    • T
      cgroup: convert to kernfs · 2bd59d48
      Tejun Heo 提交于
      cgroup filesystem code was derived from the original sysfs
      implementation which was heavily intertwined with vfs objects and
      locking with the goal of re-using the existing vfs infrastructure.
      That experiment turned out rather disastrous and sysfs switched, a
      long time ago, to distributed filesystem model where a separate
      representation is maintained which is queried by vfs.  Unfortunately,
      cgroup stuck with the failed experiment all these years and
      accumulated even more problems over time.
      
      Locking and object lifetime management being entangled with vfs is
      probably the most egregious.  vfs is never designed to be misused like
      this and cgroup ends up jumping through various convoluted dancing to
      make things work.  Even then, operations across multiple cgroups can't
      be done safely as it'll deadlock with rename locking.
      
      Recently, kernfs is separated out from sysfs so that it can be used by
      users other than sysfs.  This patch converts cgroup to use kernfs,
      which will bring the following benefits.
      
      * Separation from vfs internals.  Locking and object lifetime
        management is contained in cgroup proper making things a lot
        simpler.  This removes significant amount of locking convolutions,
        hairy object lifetime rules and the restriction on multi-cgroup
        operations.
      
      * Can drop a lot of code to implement filesystem interface as most are
        provided by kernfs.
      
      * Proper "severing" semantics, which allows controllers to not worry
        about lingering file accesses after offline.
      
      While the preceding patches did as much as possible to make the
      transition less painful, large part of the conversion has to be one
      discrete step making this patch rather large.  The rest of the commit
      message lists notable changes in different areas.
      
      Overall
      -------
      
      * vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
        cgroupfs_root->sb w/ ->kf_root.
      
      * All dentry accessors are removed.  Helpers to map from kernfs
        constructs are added.
      
      * All vfs plumbing around dentry, inode and bdi removed.
      
      * cgroup_mount() now directly looks for matching root and then
        proceeds to create a new one if not found.
      
      Synchronization and object lifetime
      -----------------------------------
      
      * vfs inode locking removed.  Among other things, this removes the
        need for the convolution in cgroup_cfts_commit().  Future patches
        will further simplify it.
      
      * vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
        cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
        root->refcnt and when it reaches zero proceeds to destroy it thus
        merging cgroup_put_root() and the former cgroup_kill_sb().
        Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
        when refcnt reaches zero.
      
      * Unlike before, kernfs objects don't hold onto cgroup objects.  When
        cgroup destroys a kernfs node, all existing operations are drained
        and the association is broken immediately.  The same for
        cgroupfs_roots and mounts.
      
      * All operations which come through kernfs guarantee that the
        associated cgroup is and stays valid for the duration of operation;
        however, there are two paths which need to find out the associated
        cgroup from dentry without going through kernfs -
        css_tryget_from_dir() and cgroupstats_build().  For these two,
        kernfs_node->priv is RCU managed so that they can dereference it
        under RCU read lock.
      
      File and directory handling
      ---------------------------
      
      * File and directory operations converted to kernfs_ops and
        kernfs_syscall_ops.
      
      * xattrs is implicitly supported by kernfs.  No need to worry about it
        from cgroup.  This means that "xattr" mount option is no longer
        necessary.  A future patch will add a deprecated warning message
        when sane_behavior.
      
      * When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
        private copy of one of the kernfs_ops to set its atomic_write_len.
        cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
        to handle it.
      
      * cftype->lockdep_key added so that kernfs lockdep annotation can be
        per cftype.
      
      * Inidividual file entries and open states are now managed by kernfs.
        No need to worry about them from cgroup.  cfent, cgroup_open_file
        and their friends are removed.
      
      * kernfs_nodes are created deactivated and kernfs_activate()
        invocations added to places where creation of new nodes are
        committed.
      
      * cgroup_rmdir() uses kernfs_[un]break_active_protection() for
        self-removal.
      
      v2: - Li pointed out in an earlier patch that specifying "name="
            during mount without subsystem specification should succeed if
            there's an existing hierarchy with a matching name although it
            should fail with -EINVAL if a new hierarchy should be created.
            Prior to the conversion, this used by handled by deferring
            failure from NULL return from cgroup_root_from_opts(), which was
            necessary because root was being created before checking for
            existing ones.  Note that cgroup_root_from_opts() returned an
            ERR_PTR() value for error conditions which require immediate
            mount failure.
      
            As we now have separate search and creation steps, deferring
            failure from cgroup_root_from_opts() is no longer necessary.
            cgroup_root_from_opts() is updated to always return ERR_PTR()
            value on failure.
      
          - The logic to match existing roots is updated so that a mount
            attempt with a matching name but different subsys_mask are
            rejected.  This was handled by a separate matching loop under
            the comment "Check for name clashes with existing mounts" but
            got lost during conversion.  Merge the check into the main
            search loop.
      
          - Add __rcu __force casting in RCU_INIT_POINTER() in
            cgroup_destroy_locked() to avoid the sparse address space
            warning reported by kbuild test bot.  Maybe we want an explicit
            interface to use kn->priv as RCU protected pointer?
      
      v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
      
      v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: kbuild test robot fengguang.wu@intel.com>
      2bd59d48
    • T
      cgroup: misc preps for kernfs conversion · 59f5296b
      Tejun Heo 提交于
      * Un-inline seq_css().  After kernfs conversion, the function will
        need to dereference internal data structures.
      
      * Add cgroup_get/put_root() and replace direct super_block->s_active
        manipulatinos with them.  These will be converted to kernfs_root
        refcnting.
      
      * Add cgroup_get/put() and replace dget/put() on cgrp->dentry with
        them.  These will be converted to kernfs refcnting.
      
      * Update current_css_set_cg_links_read() to use cgroup_name() instead
        of reaching into the dentry name.  The end result is the same.
      
      These changes don't make functional differences but will make
      transition to kernfs easier.
      
      v2: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      59f5296b
    • T
      cgroup: introduce cgroup_ino() · b1664924
      Tejun Heo 提交于
      mm/memory-failure.c::hwpoison_filter_task() has been reaching into
      cgroup to extract the associated ino to be used as a filtering
      criterion.  This is an implementation detail which shouldn't be
      depended upon from outside cgroup proper and is about to change with
      the scheduled kernfs conversion.
      
      This patch introduces a proper interface to determine the associated
      ino, cgroup_ino(), and updates hwpoison_filter_task() to use it
      instead of reaching directly into cgroup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      b1664924
    • T
      cgroup: update the meaning of cftype->max_write_len · 5f469907
      Tejun Heo 提交于
      cftype->max_write_len is used to extend the maximum size of writes.
      It's interpreted in such a way that the actual maximum size is one
      less than the specified value.  The default size is defined by
      CGROUP_LOCAL_BUFFER_SIZE.  Its interpretation is quite confusing - its
      value is decremented by 1 and then compared for equality with max
      size, which means that the actual default size is
      CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars.
      
      There's no point in having a limit that low.  Update its definition so
      that it means the actual string length sans termination and anything
      below PAGE_SIZE-1 is treated as PAGE_SIZE-1.
      
      .max_write_len for "release_agent" is updated to PATH_MAX-1 and
      cgroup_release_agent_write() is updated so that the redundant strlen()
      check is removed and it uses strlcpy() instead of strcpy().
      .max_write_len initializations in blk-throttle.c and cfq-iosched.c are
      no longer necessary and removed.  The one in cpuset is kept unchanged
      as it's an approximated value to begin with.
      
      This will also make transition to kernfs smoother.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      5f469907
    • T
      cgroup: make cgroup_subsys->base_cftypes use cgroup_add_cftypes() · de00ffa5
      Tejun Heo 提交于
      Currently, cgroup_subsys->base_cftypes registration is different from
      dynamic cftypes registartion.  Instead of going through
      cgroup_add_cftypes(), cgroup_init_subsys() invokes
      cgroup_init_cftsets() which makes use of cgroup_subsys->base_cftset
      which doesn't involve dynamic allocation.
      
      While avoiding dynamic allocation is somewhat nice, having two
      separate paths for cftypes registration is nasty, especially as we're
      planning to add more operations during cftypes registration.
      
      This patch drops cgroup_init_cftsets() and cgroup_subsys->base_cftset
      and registers base_cftypes using cgroup_add_cftypes().  This is done
      as a separate step in cgroup_init() instead of a part of
      cgroup_init_subsys().  This is because cgroup_init_subsys() can be
      called very early during boot when kmalloc() isn't available yet.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      de00ffa5
    • T
      cgroup: improve css_from_dir() into css_tryget_from_dir() · 5a17f543
      Tejun Heo 提交于
      css_from_dir() returns the matching css (cgroup_subsys_state) given a
      dentry and subsystem.  The function doesn't pin the css before
      returning and requires the caller to be holding RCU read lock or
      cgroup_mutex and handling pinning on the caller side.
      
      Given that users of the function are likely to want to pin the
      returned css (both existing users do) and that getting and putting
      css's are very cheap, there's no reason for the interface to be tricky
      like this.
      
      Rename css_from_dir() to css_tryget_from_dir() and make it try to pin
      the found css and return it only if pinning succeeded.  The callers
      are updated so that they no longer do RCU locking and pinning around
      the function and just use the returned css.
      
      This will also ease converting cgroup to kernfs.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      5a17f543
  10. 11 2月, 2014 1 次提交
    • L
      cgroup: protect modifications to cgroup_idr with cgroup_mutex · 0ab02ca8
      Li Zefan 提交于
      Setup cgroupfs like this:
        # mount -t cgroup -o cpuacct xxx /cgroup
        # mkdir /cgroup/sub1
        # mkdir /cgroup/sub2
      
      Then run these two commands:
        # for ((; ;)) { mkdir /cgroup/sub1/tmp && rmdir /mnt/sub1/tmp; } &
        # for ((; ;)) { mkdir /cgroup/sub2/tmp && rmdir /mnt/sub2/tmp; } &
      
      After seconds you may see this warning:
      
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 25243 at lib/idr.c:527 sub_remove+0x87/0x1b0()
      idr_remove called for id=6 which is not allocated.
      ...
      Call Trace:
       [<ffffffff8156063c>] dump_stack+0x7a/0x96
       [<ffffffff810591ac>] warn_slowpath_common+0x8c/0xc0
       [<ffffffff81059296>] warn_slowpath_fmt+0x46/0x50
       [<ffffffff81300aa7>] sub_remove+0x87/0x1b0
       [<ffffffff810f3f02>] ? css_killed_work_fn+0x32/0x1b0
       [<ffffffff81300bf5>] idr_remove+0x25/0xd0
       [<ffffffff810f2bab>] cgroup_destroy_css_killed+0x5b/0xc0
       [<ffffffff810f4000>] css_killed_work_fn+0x130/0x1b0
       [<ffffffff8107cdbc>] process_one_work+0x26c/0x550
       [<ffffffff8107eefe>] worker_thread+0x12e/0x3b0
       [<ffffffff81085f96>] kthread+0xe6/0xf0
       [<ffffffff81570bac>] ret_from_fork+0x7c/0xb0
      ---[ end trace 2d1577ec10cf80d0 ]---
      
      It's because allocating/removing cgroup ID is not properly synchronized.
      
      The bug was introduced when we converted cgroup_ida to cgroup_idr.
      While synchronization is already done inside ida_simple_{get,remove}(),
      users are responsible for concurrent calls to idr_{alloc,remove}().
      
      tj: Refreshed on top of b58c8998 ("cgroup: fix error return from
      cgroup_create()").
      
      Fixes: 4e96ee8e ("cgroup: convert cgroup_ida to cgroup_idr")
      Cc: <stable@vger.kernel.org> #3.12+
      Reported-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      0ab02ca8
  11. 08 2月, 2014 2 次提交
    • T
      cgroup: rename cgroup_subsys->subsys_id to ->id · aec25020
      Tejun Heo 提交于
      It's no longer referenced outside cgroup core, so renaming is easy.
      Let's rename it for consistency & brevity.
      
      This patch is pure rename.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      aec25020
    • T
      cgroup: clean up cgroup_subsys names and initialization · 073219e9
      Tejun Heo 提交于
      cgroup_subsys is a bit messier than it needs to be.
      
      * The name of a subsys can be different from its internal identifier
        defined in cgroup_subsys.h.  Most subsystems use the matching name
        but three - cpu, memory and perf_event - use different ones.
      
      * cgroup_subsys_id enums are postfixed with _subsys_id and each
        cgroup_subsys is postfixed with _subsys.  cgroup.h is widely
        included throughout various subsystems, it doesn't and shouldn't
        have claim on such generic names which don't have any qualifier
        indicating that they belong to cgroup.
      
      * cgroup_subsys->subsys_id should always equal the matching
        cgroup_subsys_id enum; however, we require each controller to
        initialize it and then BUG if they don't match, which is a bit
        silly.
      
      This patch cleans up cgroup_subsys names and initialization by doing
      the followings.
      
      * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
        cgroup_subsys with _cgrp_subsys.
      
      * With the above, renaming subsys identifiers to match the userland
        visible names doesn't cause any naming conflicts.  All non-matching
        identifiers are renamed to match the official names.
      
        cpu_cgroup -> cpu
        mem_cgroup -> memory
        perf -> perf_event
      
      * controllers no longer need to initialize ->subsys_id and ->name.
        They're generated in cgroup core and set automatically during boot.
      
      * Redundant cgroup_subsys declarations removed.
      
      * While updating BUG_ON()s in cgroup_init_early(), convert them to
        WARN()s.  BUGging that early during boot is stupid - the kernel
        can't print anything, even through serial console and the trap
        handler doesn't even link stack frame properly for back-tracing.
      
      This patch doesn't introduce any behavior changes.
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Acked-by: N"Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Acked-by: NIngo Molnar <mingo@redhat.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      073219e9