1. 25 2月, 2014 3 次提交
    • T
      cgroup: use css_set->mg_tasks to track target tasks during migration · b3dc094e
      Tejun Heo 提交于
      Currently, while migrating tasks from one cgroup to another,
      cgroup_attach_task() builds a flex array of all target tasks;
      unfortunately, this has a couple issues.
      
      * Flex array has size limit.  On 64bit, struct task_and_cgroup is
        24bytes making the flex element limit around 87k.  It is a high
        number but not impossible to hit.  This means that the current
        cgroup implementation can't migrate a process with more than 87k
        threads.
      
      * Process migration involves memory allocation whose size is dependent
        on the number of threads the process has.  This means that cgroup
        core can't guarantee success or failure of multi-process migrations
        as memory allocation failure can happen in the middle.  This is in
        part because cgroup can't grab threadgroup locks of multiple
        processes at the same time, so when there are multiple processes to
        migrate, it is imposible to tell how many tasks are to be migrated
        beforehand.
      
        Note that this already affects cgroup_transfer_tasks().  cgroup
        currently cannot guarantee atomic success or failure of the
        operation.  It may fail in the middle and after such failure cgroup
        doesn't have enough information to roll back properly.  It just
        aborts with some tasks migrated and others not.
      
      To resolve the situation, this patch updates the migration path to use
      task->cg_list to track target tasks.  The previous patch already added
      css_set->mg_tasks and updated iterations in non-migration paths to
      include them during task migration.  This patch updates migration path
      to actually make use of it.
      
      Instead of putting onto a flex_array, each target task is moved from
      its css_set->tasks list to css_set->mg_tasks and the migration path
      keeps trace of all the source css_sets and the associated cgroups.
      Once all source css_sets are determined, the destination css_set for
      each is determined, linked to the matching source css_set and put on a
      separate list.
      
      To iterate the target tasks, migration path just needs to iterat
      through either the source or target css_sets, depending on whether
      migration has been committed or not, and the tasks on their ->mg_tasks
      lists.  cgroup_taskset is updated to contain the list_heads for source
      and target css_sets and the iteration cursor.  cgroup_taskset_*() are
      accordingly updated to walk through css_sets and their ->mg_tasks.
      
      This resolves the above listed issues with moderate additional
      complexity.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      b3dc094e
    • T
      cgroup: add css_set->mg_tasks · c7561128
      Tejun Heo 提交于
      Currently, while migrating tasks from one cgroup to another,
      cgroup_attach_task() builds a flex array of all target tasks;
      unfortunately, this has a couple issues.
      
      * Flex array has size limit.  On 64bit, struct task_and_cgroup is
        24bytes making the flex element limit around 87k.  It is a high
        number but not impossible to hit.  This means that the current
        cgroup implementation can't migrate a process with more than 87k
        threads.
      
      * Process migration involves memory allocation whose size is dependent
        on the number of threads the process has.  This means that cgroup
        core can't guarantee success or failure of multi-process migrations
        as memory allocation failure can happen in the middle.  This is in
        part because cgroup can't grab threadgroup locks of multiple
        processes at the same time, so when there are multiple processes to
        migrate, it is imposible to tell how many tasks are to be migrated
        beforehand.
      
        Note that this already affects cgroup_transfer_tasks().  cgroup
        currently cannot guarantee atomic success or failure of the
        operation.  It may fail in the middle and after such failure cgroup
        doesn't have enough information to roll back properly.  It just
        aborts with some tasks migrated and others not.
      
      To resolve the situation, we're going to use task->cg_list during
      migration too.  Instead of building a separate array, target tasks
      will be linked into a dedicated migration list_head on the owning
      css_set.  Tasks on the migration list are treated the same as tasks on
      the usual tasks list; however, being on a separate list allows cgroup
      migration code path to keep track of the target tasks by simply
      keeping the list of css_sets with tasks being migrated, making
      unpredictable dynamic allocation unnecessary.
      
      In prepartion of such migration path update, this patch introduces
      css_set->mg_tasks list and updates css_set task iterations so that
      they walk both css_set->tasks and ->mg_tasks.  Note that ->mg_tasks
      isn't used yet.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      c7561128
    • T
      Merge branch 'cgroup/for-3.14-fixes' into cgroup/for-3.15 · f153ad11
      Tejun Heo 提交于
      Pull in for-3.14-fixes to receive 532de3fc ("cgroup: update
      cgroup_enable_task_cg_lists() to grab siglock") which conflicts with
      afeb0f9f ("cgroup: relocate cgroup_enable_task_cg_lists()") and
      the following cg_lists updates.  This is likely to cause further
      conflicts down the line too, so let's merge it early.
      
      As cgroup_enable_task_cg_lists() is relocated in for-3.15, this merge
      causes conflict in the original position.  It's resolved by applying
      siglock changes to the updated version in the new location.
      
      Conflicts:
      	kernel/cgroup.c
      Signed-off-by: NTejun Heo <tj@kernel.org>
      f153ad11
  2. 19 2月, 2014 2 次提交
    • T
      cgroup: update cgroup_enable_task_cg_lists() to grab siglock · 532de3fc
      Tejun Heo 提交于
      Currently, there's nothing preventing cgroup_enable_task_cg_lists()
      from missing set PF_EXITING and race against cgroup_exit().  Depending
      on the timing, cgroup_exit() may finish with the task still linked on
      css_set leading to list corruption.  Fix it by grabbing siglock in
      cgroup_enable_task_cg_lists() so that PF_EXITING is guaranteed to be
      visible.
      
      This whole on-demand cg_list optimization is extremely fragile and has
      ample possibility to lead to bugs which can cause things like
      once-a-year oops during boot.  I'm wondering whether the better
      approach would be just adding "cgroup_disable=all" handling which
      disables the whole cgroup rather than tempting fate with this
      on-demand craziness.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      532de3fc
    • L
      cgroup: add a validation check to cgroup_add_cftyps() · dc5736ed
      Li Zefan 提交于
      Fengguang reported this bug:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000003c
      IP: [<cc90b4ad>] cgroup_cfts_commit+0x27/0x1c1
      ...
      Call Trace:
        [<cc9d1129>] ? kmem_cache_alloc_trace+0x33f/0x3b7
        [<cc90c6fc>] cgroup_add_cftypes+0x8f/0xca
        [<cd78b646>] cgroup_init+0x6a/0x26a
        [<cd764d7d>] start_kernel+0x4d7/0x57a
        [<cd7642ef>] i386_start_kernel+0x92/0x96
      
      This happens in a corner case. If CGROUP_SCHED=y but CFS_BANDWIDTH=n &&
      FAIR_GROUP_SCHED=n && RT_GROUP_SCHED=n, we have:
      
      cpu_files[] = {
      	{ }	/* terminate */
      }
      
      When we pass cpu_files to cgroup_apply_cftypes(), as cpu_files[0].ss
      is NULL, we'll access NULL pointer.
      
      The bug was introduced by commit de00ffa5
      ("cgroup: make cgroup_subsys->base_cftypes use cgroup_add_cftypes()").
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      dc5736ed
  3. 14 2月, 2014 5 次提交
  4. 13 2月, 2014 18 次提交
    • T
      cgroup: unexport functions · 8541fecc
      Tejun Heo 提交于
      With module support gone, a lot of functions no longer need to be
      exported.  Unexport them.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      8541fecc
    • T
      cgroup: cosmetic updates to cgroup_attach_task() · 9db8de37
      Tejun Heo 提交于
      cgroup_attach_task() is planned to go through restructuring.  Let's
      tidy it up a bit in preparation.
      
      * Update cgroup_attach_task() to receive the target task argument in
        @leader instead of @tsk.
      
      * Rename @tsk to @task.
      
      * Rename @retval to @ret.
      
      This is purely cosmetic.
      
      v2: get_nr_threads() was using uninitialized @task instead of @leader.
          Fixed.  Reported by Dan Carpenter.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      9db8de37
    • T
      cgroup: remove cgroup_taskset_cur_css() and cgroup_taskset_size() · bc668c75
      Tejun Heo 提交于
      The two functions don't have any users left.  Remove them along with
      cgroup_taskset->cur_cgrp.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      bc668c75
    • T
      cpuset: don't use cgroup_taskset_cur_css() · 57fce0a6
      Tejun Heo 提交于
      cgroup_taskset_cur_css() will be removed during the planned
      resturcturing of migration path.  The only use of
      cgroup_taskset_cur_css() is finding out the old cgroup_subsys_state of
      the leader in cpuset_attach().  This usage can easily be removed by
      remembering the old value from cpuset_can_attach().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      57fce0a6
    • T
      cgroup: drop @skip_css from cgroup_taskset_for_each() · 924f0d9a
      Tejun Heo 提交于
      If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching
      css.  The intention of the interface is to make it easy to skip css's
      (cgroup_subsys_states) which already match the migration target;
      however, this is entirely unnecessary as migration taskset doesn't
      include tasks which are already in the target cgroup.  Drop @skip_css
      from cgroup_taskset_for_each().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Daniel Borkmann <dborkman@redhat.com>
      924f0d9a
    • T
      cgroup: move css_set_rwsem locking outside of cgroup_task_migrate() · cb0f1fe9
      Tejun Heo 提交于
      Instead of repeatedly locking and unlocking css_set_rwsem inside
      cgroup_task_migrate(), update cgroup_attach_task() to grab it outside
      of the loop and update cgroup_task_migrate() to use
      put_css_set_locked().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      cb0f1fe9
    • T
      cgroup: separate out put_css_set_locked() and remove put_css_set_taskexit() · 89c5509b
      Tejun Heo 提交于
      put_css_set() is performed in two steps - it first tries to put
      without grabbing css_set_rwsem if such put wouldn't make the count
      zero.  If that fails, it puts after write-locking css_set_rwsem.  This
      patch separates out the second phase into put_css_set_locked() which
      should be called with css_set_rwsem locked.
      
      Also, put_css_set_taskexit() is droped and put_css_set() is made to
      take @taskexit.  There are only a handful users of these functions.
      No point in providing different variants.
      
      put_css_locked() will be used by later changes.  This patch doesn't
      introduce any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      89c5509b
    • T
      cgroup: remove css_scan_tasks() · 889ed9ce
      Tejun Heo 提交于
      css_scan_tasks() doesn't have any user left.  Remove it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      889ed9ce
    • T
      cpuset: use css_task_iter_start/next/end() instead of css_scan_tasks() · d66393e5
      Tejun Heo 提交于
      Now that css_task_iter_start/next_end() supports blocking while
      iterating, there's no reason to use css_scan_tasks() which is more
      cumbersome to use and scheduled to be removed.
      
      Convert all css_scan_tasks() usages in cpuset to
      css_task_iter_start/next/end().  This simplifies the code by removing
      heap allocation and callbacks.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      d66393e5
    • T
      cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem · 96d365e0
      Tejun Heo 提交于
      Currently there are two ways to walk tasks of a cgroup -
      css_task_iter_start/next/end() and css_scan_tasks().  The latter
      builds on the former but allows blocking while iterating.
      Unfortunately, the way css_scan_tasks() is implemented is rather
      nasty, it uses a priority heap of pointers to extract some number of
      tasks in task creation order and loops over them invoking the callback
      and repeats that until it reaches the end.  It requires either
      preallocated heap or may fail under memory pressure, while unlikely to
      be problematic, the complexity is O(N^2), and in general just nasty.
      
      We're gonna convert all css_scan_users() to
      css_task_iter_start/next/end() and remove css_scan_users().  As
      css_scan_tasks() users may block, let's convert css_set_lock to a
      rwsem so that tasks can block during css_task_iter_*() is in progress.
      
      While this does increase the chance of possible deadlock scenarios,
      given the current usage, the probability is relatively low, and even
      if that happens, the right thing to do is updating the iteration in
      the similar way to css iterators so that it can handle blocking.
      
      Most conversions are trivial; however, task_cgroup_path() now expects
      to be called with css_set_rwsem locked instead of locking itself.
      This is because the function is called with RCU read lock held and
      rwsem locking should nest outside RCU read lock.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      96d365e0
    • T
      cgroup: reimplement cgroup_transfer_tasks() without using css_scan_tasks() · e406d1cf
      Tejun Heo 提交于
      Reimplement cgroup_transfer_tasks() so that it repeatedly fetches the
      first task in the cgroup and then tranfers it.  This achieves the same
      result without using css_scan_tasks() which is scheduled to be
      removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      e406d1cf
    • T
      cgroup: implement cgroup_has_tasks() and unexport cgroup_task_count() · 07bc356e
      Tejun Heo 提交于
      cgroup_task_count() read-locks css_set_lock and walks all tasks to
      count them and then returns the result.  The only thing all the users
      want is determining whether the cgroup is empty or not.  This patch
      implements cgroup_has_tasks() which tests whether cgroup->cset_links
      is empty, replaces all cgroup_task_count() usages and unexports it.
      
      Note that the test isn't synchronized.  This is the same as before.
      The test has always been racy.
      
      This will help planned css_set locking update.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      07bc356e
    • T
      cgroup: relocate cgroup_enable_task_cg_lists() · afeb0f9f
      Tejun Heo 提交于
      Move it above so that prototype isn't necessary.  Let's also move the
      definition of use_task_css_set_links next to it.
      
      This is purely cosmetic.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      afeb0f9f
    • T
      cgroup: enable task_cg_lists on the first cgroup mount · 56fde9e0
      Tejun Heo 提交于
      Tasks are not linked on their css_sets until cgroup task iteration is
      actually used.  This is to avoid incurring overhead on the fork and
      exit paths for systems which have cgroup compiled in but don't use it.
           
      This lazy binding also affects the task migration path.  It has to be
      careful so that it doesn't link tasks to css_sets when task_cg_lists
      linking is not enabled yet.  Unfortunately, this conditional linking
      in the migration path interferes with planned migration updates.
      
      This patch moves the lazy binding a bit earlier, to the first cgroup
      mount.  It's a clear indication that cgroup is being used on the
      system and task_cg_lists linking is highly likely to be enabled soon
      anyway through "tasks" and "cgroup.procs" files.
      
      This allows cgroup_task_migrate() to always link @tsk->cg_list.  Note
      that it may still race with cgroup_post_fork() but who wins that race
      is inconsequential.
      
      While at it, make use_task_css_set_links a bool, add sanity checks in
      cgroup_enable_task_cg_lists() and css_task_iter_start(), and update
      the former so that it's guaranteed and assumes to run only once.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      56fde9e0
    • T
      cgroup: drop CGRP_ROOT_SUBSYS_BOUND · 35585573
      Tejun Heo 提交于
      Before kernfs conversion, due to the way super_block lookup works,
      cgroup roots were created and made visible before being fully
      initialized.  This in turn required a special flag to mark that the
      root hasn't been fully initialized so that the destruction path can
      tell fully bound ones from half initialized.
      
      That flag is CGRP_ROOT_SUBSYS_BOUND and no longer necessary after the
      kernfs conversion as the lookup and creation of new root are atomic
      w.r.t. cgroup_mutex.  This patch removes the flag and passes the
      requests subsystem mask to cgroup_setup_root() so that it can set the
      respective mask bits as subsystems are bound.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      35585573
    • T
      cgroup: disallow xattr, release_agent and name if sane_behavior · d3ba07c3
      Tejun Heo 提交于
      Disallow more mount options if sane_behavior.  Note that xattr used to
      generate warning.
      
      While at it, simplify option check in cgroup_mount() and update
      sane_behavior comment in cgroup.h.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      d3ba07c3
    • T
      Revert "cgroup: use an ordered workqueue for cgroup destruction" · 1a11533f
      Tejun Heo 提交于
      This reverts commit ab3f5faa.
      Explanation from Hugh:
      
        It's because more thorough testing, by others here, found that it
        wasn't always solving the problem: so I asked Tejun privately to
        hold off from sending it in, until we'd worked out why not.
      
        Most of our testing being on a v3,11-based kernel, it was perfectly
        possible that the problem was merely our own e.g. missing Tejun's
        8a2b7538 ("workqueue: fix ordered workqueues in NUMA setups").
      
        But that turned out not to be enough to fix it either. Then Filipe
        pointed out how percpu_ref_kill_and_confirm() uses call_rcu_sched()
        before we ever get to put the offline on to the workqueue: by the
        time we get to the workqueue, the ordering has already been lost.
      
        So, thanks for the Acks, but I'm afraid that this ordered workqueue
        solution is just not good enough: we should simply forget that patch
        and provide a different answer."
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      1a11533f
    • S
      sun4M: add include of slab.h for kzalloc · a755180b
      Stephen Rothwell 提交于
      This was being included implicitly via cgroup.h's inclusion of xattr.h
      (which has now been removed).
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NSam Ravnborg <sam@ravnborg.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      a755180b
  5. 12 2月, 2014 12 次提交
    • T
      cgroup: remove cgroupfs_root->refcnt · 776f02fa
      Tejun Heo 提交于
      Currently, cgroupfs_root and its ->top_cgroup are separated reference
      counted and the latter's is ignored.  There's no reason to do this
      separately.  This patch removes cgroupfs_root->refcnt and destroys
      cgroupfs_root when the top_cgroup is released.
      
      * cgroup_put() updated to ignore cgroup_is_dead() test for top
        cgroups.  cgroup_free_fn() updated to handle root destruction when
        releasing a top cgroup.
      
      * As root destruction is now bounced through cgroup destruction, it is
        asynchronous.  Update cgroup_mount() so that it waits for pending
        release which is currently implemented using msleep().  Converting
        this to proper wait_queue isn't hard but likely unnecessary.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      776f02fa
    • T
      cgroup: rename cgroupfs_root->number_of_cgroups to ->nr_cgrps and make it atomic_t · 3c9c825b
      Tejun Heo 提交于
      root->number_of_cgroups is currently an integer protected with
      cgroup_mutex.  Except for sanity checks and proc reporting, the only
      place it's used is to check whether the root has any child during
      remount; however, this is a bit flawed as the counter is not
      decremented when the cgroup is unlinked but when it's released,
      meaning that there could be an extended period where all cgroups are
      removed but remount is still not allowed because some internal objects
      are lingering.  While not perfect either, it'd be better to use
      emptiness test on root->top_cgroup.children.
      
      This patch updates cgroup_remount() to test top_cgroup's children
      instead, which makes number_of_cgroups only actual usage statistics
      printing in proc implemented in proc_cgroupstats_show().  Let's
      shorten its name and make it an atomic_t so that we don't have to
      worry about its synchronization.  It's purely auxiliary at this point.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      3c9c825b
    • T
      cgroup: remove cgroup->name · e61734c5
      Tejun Heo 提交于
      cgroup->name handling became quite complicated over time involving
      dedicated struct cgroup_name for RCU protection.  Now that cgroup is
      on kernfs, we can drop all of it and simply use kernfs_name/path() and
      friends.  Replace cgroup->name and all related code with kernfs
      name/path constructs.
      
      * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top
        of kernfs counterparts, which involves semantic changes.
        pr_cont_cgroup_name() and pr_cont_cgroup_path() added.
      
      * cgroup->name handling dropped from cgroup_rename().
      
      * All users of cgroup_name/path() updated to the new semantics.  Users
        which were formatting the string just to printk them are converted
        to use pr_cont_cgroup_name/path() instead, which simplifies things
        quite a bit.  As cgroup_name() no longer requires RCU read lock
        around it, RCU lockings which were protecting only cgroup_name() are
        removed.
      
      v2: Comment above oom_info_lock updated as suggested by Michal.
      
      v3: dummy_top doesn't have a kn associated and
          pr_cont_cgroup_name/path() ended up calling the matching kernfs
          functions with NULL kn leading to oops.  Test for NULL kn and
          print "/" if so.  This issue was reported by Fengguang Wu.
      
      v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      e61734c5
    • T
      cgroup: make cgroup hold onto its kernfs_node · 6f30558f
      Tejun Heo 提交于
      cgroup currently releases its kernfs_node when it gets removed.  While
      not buggy, this makes cgroup->kn access rules complicated than
      necessary and leads to things like get/put protection around
      kernfs_remove() in cgroup_destroy_locked().  In addition, we want to
      use kernfs_name/path() and friends but also want to be able to
      determine a cgroup's name between removal and release.
      
      This patch makes cgroup hold onto its kernfs_node until freed so that
      cgroup->kn is always accessible.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      6f30558f
    • T
      cgroup: simplify dynamic cftype addition and removal · 21a2d343
      Tejun Heo 提交于
      Dynamic cftype addition and removal using cgroup_add/rm_cftypes()
      respectively has been quite hairy due to vfs i_mutex.  As i_mutex
      nests outside cgroup_mutex, cgroup_mutex has to be released and
      regrabbed on each iteration through the hierarchy complicating the
      process.  Now that i_mutex is no longer in play, it can be simplified.
      
      * Just holding cgroup_tree_mutex is enough.  No need to meddle with
        cgroup_mutex.
      
      * No reason to play the unlock - relock - check serial_nr dancing.
        Everything can be atomically while holding cgroup_tree_mutex.
      
      * cgroup_cfts_prepare() is replaced with direct locking of
        cgroup_tree_mutex.
      
      * cgroup_cfts_commit() no longer fiddles with locking.  It just
        applies the cftypes change to the existing cgroups in the hierarchy.
        Renamed to cgroup_cfts_apply().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      21a2d343
    • T
      cgroup: remove cftype_set · 0adb0704
      Tejun Heo 提交于
      cftype_set was added primarily to allow registering the same cftype
      array more than once for different subsystems.  Nobody uses or needs
      such thing and it's already broken because each cftype has ->ss
      pointer which is initialized during registration.
      
      Let's add list_head ->node to cftype and use the first cftype entry in
      the array to link them instead of allocating separate cftype_set.
      While at it, trigger WARN if cft seems previously initialized during
      registration.
      
      This simplifies cftype handling a bit.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      0adb0704
    • T
      cgroup: relocate cgroup_rm_cftypes() · 80b13586
      Tejun Heo 提交于
      cftype handling is about to be revamped.  Relocate cgroup_rm_cftypes()
      above cgroup_add_cftypes() in preparation.  This is pure relocation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      80b13586
    • T
      cgroup: warn if "xattr" is specified with "sane_behavior" · 86bf4b68
      Tejun Heo 提交于
      Mount option "xattr" is no longer necessary as it's enabled by default
      on kernfs.  Warn if "xattr" is specified with "sane_behavior" so that
      the option can be removed in the future.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      86bf4b68
    • T
      cgroup: convert to kernfs · 2bd59d48
      Tejun Heo 提交于
      cgroup filesystem code was derived from the original sysfs
      implementation which was heavily intertwined with vfs objects and
      locking with the goal of re-using the existing vfs infrastructure.
      That experiment turned out rather disastrous and sysfs switched, a
      long time ago, to distributed filesystem model where a separate
      representation is maintained which is queried by vfs.  Unfortunately,
      cgroup stuck with the failed experiment all these years and
      accumulated even more problems over time.
      
      Locking and object lifetime management being entangled with vfs is
      probably the most egregious.  vfs is never designed to be misused like
      this and cgroup ends up jumping through various convoluted dancing to
      make things work.  Even then, operations across multiple cgroups can't
      be done safely as it'll deadlock with rename locking.
      
      Recently, kernfs is separated out from sysfs so that it can be used by
      users other than sysfs.  This patch converts cgroup to use kernfs,
      which will bring the following benefits.
      
      * Separation from vfs internals.  Locking and object lifetime
        management is contained in cgroup proper making things a lot
        simpler.  This removes significant amount of locking convolutions,
        hairy object lifetime rules and the restriction on multi-cgroup
        operations.
      
      * Can drop a lot of code to implement filesystem interface as most are
        provided by kernfs.
      
      * Proper "severing" semantics, which allows controllers to not worry
        about lingering file accesses after offline.
      
      While the preceding patches did as much as possible to make the
      transition less painful, large part of the conversion has to be one
      discrete step making this patch rather large.  The rest of the commit
      message lists notable changes in different areas.
      
      Overall
      -------
      
      * vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
        cgroupfs_root->sb w/ ->kf_root.
      
      * All dentry accessors are removed.  Helpers to map from kernfs
        constructs are added.
      
      * All vfs plumbing around dentry, inode and bdi removed.
      
      * cgroup_mount() now directly looks for matching root and then
        proceeds to create a new one if not found.
      
      Synchronization and object lifetime
      -----------------------------------
      
      * vfs inode locking removed.  Among other things, this removes the
        need for the convolution in cgroup_cfts_commit().  Future patches
        will further simplify it.
      
      * vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
        cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
        root->refcnt and when it reaches zero proceeds to destroy it thus
        merging cgroup_put_root() and the former cgroup_kill_sb().
        Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
        when refcnt reaches zero.
      
      * Unlike before, kernfs objects don't hold onto cgroup objects.  When
        cgroup destroys a kernfs node, all existing operations are drained
        and the association is broken immediately.  The same for
        cgroupfs_roots and mounts.
      
      * All operations which come through kernfs guarantee that the
        associated cgroup is and stays valid for the duration of operation;
        however, there are two paths which need to find out the associated
        cgroup from dentry without going through kernfs -
        css_tryget_from_dir() and cgroupstats_build().  For these two,
        kernfs_node->priv is RCU managed so that they can dereference it
        under RCU read lock.
      
      File and directory handling
      ---------------------------
      
      * File and directory operations converted to kernfs_ops and
        kernfs_syscall_ops.
      
      * xattrs is implicitly supported by kernfs.  No need to worry about it
        from cgroup.  This means that "xattr" mount option is no longer
        necessary.  A future patch will add a deprecated warning message
        when sane_behavior.
      
      * When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
        private copy of one of the kernfs_ops to set its atomic_write_len.
        cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
        to handle it.
      
      * cftype->lockdep_key added so that kernfs lockdep annotation can be
        per cftype.
      
      * Inidividual file entries and open states are now managed by kernfs.
        No need to worry about them from cgroup.  cfent, cgroup_open_file
        and their friends are removed.
      
      * kernfs_nodes are created deactivated and kernfs_activate()
        invocations added to places where creation of new nodes are
        committed.
      
      * cgroup_rmdir() uses kernfs_[un]break_active_protection() for
        self-removal.
      
      v2: - Li pointed out in an earlier patch that specifying "name="
            during mount without subsystem specification should succeed if
            there's an existing hierarchy with a matching name although it
            should fail with -EINVAL if a new hierarchy should be created.
            Prior to the conversion, this used by handled by deferring
            failure from NULL return from cgroup_root_from_opts(), which was
            necessary because root was being created before checking for
            existing ones.  Note that cgroup_root_from_opts() returned an
            ERR_PTR() value for error conditions which require immediate
            mount failure.
      
            As we now have separate search and creation steps, deferring
            failure from cgroup_root_from_opts() is no longer necessary.
            cgroup_root_from_opts() is updated to always return ERR_PTR()
            value on failure.
      
          - The logic to match existing roots is updated so that a mount
            attempt with a matching name but different subsys_mask are
            rejected.  This was handled by a separate matching loop under
            the comment "Check for name clashes with existing mounts" but
            got lost during conversion.  Merge the check into the main
            search loop.
      
          - Add __rcu __force casting in RCU_INIT_POINTER() in
            cgroup_destroy_locked() to avoid the sparse address space
            warning reported by kbuild test bot.  Maybe we want an explicit
            interface to use kn->priv as RCU protected pointer?
      
      v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
      
      v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: kbuild test robot fengguang.wu@intel.com>
      2bd59d48
    • T
      cgroup: relocate functions in preparation of kernfs conversion · f2e85d57
      Tejun Heo 提交于
      Relocate cgroup_init/exit_root_id(), cgroup_free_root(),
      cgroup_kill_sb() and cgroup_file_name() in preparation of kernfs
      conversion.
      
      These are pure relocations to make kernfs conversion easier to follow.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      f2e85d57
    • T
      cgroup: misc preps for kernfs conversion · 59f5296b
      Tejun Heo 提交于
      * Un-inline seq_css().  After kernfs conversion, the function will
        need to dereference internal data structures.
      
      * Add cgroup_get/put_root() and replace direct super_block->s_active
        manipulatinos with them.  These will be converted to kernfs_root
        refcnting.
      
      * Add cgroup_get/put() and replace dget/put() on cgrp->dentry with
        them.  These will be converted to kernfs refcnting.
      
      * Update current_css_set_cg_links_read() to use cgroup_name() instead
        of reaching into the dentry name.  The end result is the same.
      
      These changes don't make functional differences but will make
      transition to kernfs easier.
      
      v2: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      59f5296b
    • T
      cgroup: introduce cgroup_ino() · b1664924
      Tejun Heo 提交于
      mm/memory-failure.c::hwpoison_filter_task() has been reaching into
      cgroup to extract the associated ino to be used as a filtering
      criterion.  This is an implementation detail which shouldn't be
      depended upon from outside cgroup proper and is about to change with
      the scheduled kernfs conversion.
      
      This patch introduces a proper interface to determine the associated
      ino, cgroup_ino(), and updates hwpoison_filter_task() to use it
      instead of reaching directly into cgroup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      b1664924