1. 12 2月, 2014 14 次提交
    • T
      cgroup: warn if "xattr" is specified with "sane_behavior" · 86bf4b68
      Tejun Heo 提交于
      Mount option "xattr" is no longer necessary as it's enabled by default
      on kernfs.  Warn if "xattr" is specified with "sane_behavior" so that
      the option can be removed in the future.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      86bf4b68
    • T
      cgroup: convert to kernfs · 2bd59d48
      Tejun Heo 提交于
      cgroup filesystem code was derived from the original sysfs
      implementation which was heavily intertwined with vfs objects and
      locking with the goal of re-using the existing vfs infrastructure.
      That experiment turned out rather disastrous and sysfs switched, a
      long time ago, to distributed filesystem model where a separate
      representation is maintained which is queried by vfs.  Unfortunately,
      cgroup stuck with the failed experiment all these years and
      accumulated even more problems over time.
      
      Locking and object lifetime management being entangled with vfs is
      probably the most egregious.  vfs is never designed to be misused like
      this and cgroup ends up jumping through various convoluted dancing to
      make things work.  Even then, operations across multiple cgroups can't
      be done safely as it'll deadlock with rename locking.
      
      Recently, kernfs is separated out from sysfs so that it can be used by
      users other than sysfs.  This patch converts cgroup to use kernfs,
      which will bring the following benefits.
      
      * Separation from vfs internals.  Locking and object lifetime
        management is contained in cgroup proper making things a lot
        simpler.  This removes significant amount of locking convolutions,
        hairy object lifetime rules and the restriction on multi-cgroup
        operations.
      
      * Can drop a lot of code to implement filesystem interface as most are
        provided by kernfs.
      
      * Proper "severing" semantics, which allows controllers to not worry
        about lingering file accesses after offline.
      
      While the preceding patches did as much as possible to make the
      transition less painful, large part of the conversion has to be one
      discrete step making this patch rather large.  The rest of the commit
      message lists notable changes in different areas.
      
      Overall
      -------
      
      * vfs constructs replaced with kernfs ones.  cgroup->dentry w/ ->kn,
        cgroupfs_root->sb w/ ->kf_root.
      
      * All dentry accessors are removed.  Helpers to map from kernfs
        constructs are added.
      
      * All vfs plumbing around dentry, inode and bdi removed.
      
      * cgroup_mount() now directly looks for matching root and then
        proceeds to create a new one if not found.
      
      Synchronization and object lifetime
      -----------------------------------
      
      * vfs inode locking removed.  Among other things, this removes the
        need for the convolution in cgroup_cfts_commit().  Future patches
        will further simplify it.
      
      * vfs refcnting replaced with cgroup internal ones.  cgroup->refcnt,
        cgroupfs_root->refcnt added.  cgroup_put_root() now directly puts
        root->refcnt and when it reaches zero proceeds to destroy it thus
        merging cgroup_put_root() and the former cgroup_kill_sb().
        Simliarly, cgroup_put() now directly schedules cgroup_free_rcu()
        when refcnt reaches zero.
      
      * Unlike before, kernfs objects don't hold onto cgroup objects.  When
        cgroup destroys a kernfs node, all existing operations are drained
        and the association is broken immediately.  The same for
        cgroupfs_roots and mounts.
      
      * All operations which come through kernfs guarantee that the
        associated cgroup is and stays valid for the duration of operation;
        however, there are two paths which need to find out the associated
        cgroup from dentry without going through kernfs -
        css_tryget_from_dir() and cgroupstats_build().  For these two,
        kernfs_node->priv is RCU managed so that they can dereference it
        under RCU read lock.
      
      File and directory handling
      ---------------------------
      
      * File and directory operations converted to kernfs_ops and
        kernfs_syscall_ops.
      
      * xattrs is implicitly supported by kernfs.  No need to worry about it
        from cgroup.  This means that "xattr" mount option is no longer
        necessary.  A future patch will add a deprecated warning message
        when sane_behavior.
      
      * When cftype->max_write_len > PAGE_SIZE, it's necessary to make a
        private copy of one of the kernfs_ops to set its atomic_write_len.
        cftype->kf_ops is added and cgroup_init/exit_cftypes() are updated
        to handle it.
      
      * cftype->lockdep_key added so that kernfs lockdep annotation can be
        per cftype.
      
      * Inidividual file entries and open states are now managed by kernfs.
        No need to worry about them from cgroup.  cfent, cgroup_open_file
        and their friends are removed.
      
      * kernfs_nodes are created deactivated and kernfs_activate()
        invocations added to places where creation of new nodes are
        committed.
      
      * cgroup_rmdir() uses kernfs_[un]break_active_protection() for
        self-removal.
      
      v2: - Li pointed out in an earlier patch that specifying "name="
            during mount without subsystem specification should succeed if
            there's an existing hierarchy with a matching name although it
            should fail with -EINVAL if a new hierarchy should be created.
            Prior to the conversion, this used by handled by deferring
            failure from NULL return from cgroup_root_from_opts(), which was
            necessary because root was being created before checking for
            existing ones.  Note that cgroup_root_from_opts() returned an
            ERR_PTR() value for error conditions which require immediate
            mount failure.
      
            As we now have separate search and creation steps, deferring
            failure from cgroup_root_from_opts() is no longer necessary.
            cgroup_root_from_opts() is updated to always return ERR_PTR()
            value on failure.
      
          - The logic to match existing roots is updated so that a mount
            attempt with a matching name but different subsys_mask are
            rejected.  This was handled by a separate matching loop under
            the comment "Check for name clashes with existing mounts" but
            got lost during conversion.  Merge the check into the main
            search loop.
      
          - Add __rcu __force casting in RCU_INIT_POINTER() in
            cgroup_destroy_locked() to avoid the sparse address space
            warning reported by kbuild test bot.  Maybe we want an explicit
            interface to use kn->priv as RCU protected pointer?
      
      v3: Make CONFIG_CGROUPS select CONFIG_KERNFS.
      
      v4: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: kbuild test robot fengguang.wu@intel.com>
      2bd59d48
    • T
      cgroup: relocate functions in preparation of kernfs conversion · f2e85d57
      Tejun Heo 提交于
      Relocate cgroup_init/exit_root_id(), cgroup_free_root(),
      cgroup_kill_sb() and cgroup_file_name() in preparation of kernfs
      conversion.
      
      These are pure relocations to make kernfs conversion easier to follow.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      f2e85d57
    • T
      cgroup: misc preps for kernfs conversion · 59f5296b
      Tejun Heo 提交于
      * Un-inline seq_css().  After kernfs conversion, the function will
        need to dereference internal data structures.
      
      * Add cgroup_get/put_root() and replace direct super_block->s_active
        manipulatinos with them.  These will be converted to kernfs_root
        refcnting.
      
      * Add cgroup_get/put() and replace dget/put() on cgrp->dentry with
        them.  These will be converted to kernfs refcnting.
      
      * Update current_css_set_cg_links_read() to use cgroup_name() instead
        of reaching into the dentry name.  The end result is the same.
      
      These changes don't make functional differences but will make
      transition to kernfs easier.
      
      v2: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      59f5296b
    • T
      cgroup: introduce cgroup_ino() · b1664924
      Tejun Heo 提交于
      mm/memory-failure.c::hwpoison_filter_task() has been reaching into
      cgroup to extract the associated ino to be used as a filtering
      criterion.  This is an implementation detail which shouldn't be
      depended upon from outside cgroup proper and is about to change with
      the scheduled kernfs conversion.
      
      This patch introduces a proper interface to determine the associated
      ino, cgroup_ino(), and updates hwpoison_filter_task() to use it
      instead of reaching directly into cgroup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      b1664924
    • T
      cgroup: introduce cgroup_init/exit_cftypes() · 2da440a2
      Tejun Heo 提交于
      Factor out cft->ss initialization into cgroup_init_cftypes() from
      cgroup_add_cftypes() and add cft->ss clearing to cgroup_rm_cftypes()
      through cgroup_exit_cftypes().
      
      This doesn't make any meaningful difference now but the two new
      functions will be expanded during kernfs transition.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      2da440a2
    • T
      cgroup: update the meaning of cftype->max_write_len · 5f469907
      Tejun Heo 提交于
      cftype->max_write_len is used to extend the maximum size of writes.
      It's interpreted in such a way that the actual maximum size is one
      less than the specified value.  The default size is defined by
      CGROUP_LOCAL_BUFFER_SIZE.  Its interpretation is quite confusing - its
      value is decremented by 1 and then compared for equality with max
      size, which means that the actual default size is
      CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars.
      
      There's no point in having a limit that low.  Update its definition so
      that it means the actual string length sans termination and anything
      below PAGE_SIZE-1 is treated as PAGE_SIZE-1.
      
      .max_write_len for "release_agent" is updated to PATH_MAX-1 and
      cgroup_release_agent_write() is updated so that the redundant strlen()
      check is removed and it uses strlcpy() instead of strcpy().
      .max_write_len initializations in blk-throttle.c and cfq-iosched.c are
      no longer necessary and removed.  The one in cpuset is kept unchanged
      as it's an approximated value to begin with.
      
      This will also make transition to kernfs smoother.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      5f469907
    • T
      cgroup: make cgroup_subsys->base_cftypes use cgroup_add_cftypes() · de00ffa5
      Tejun Heo 提交于
      Currently, cgroup_subsys->base_cftypes registration is different from
      dynamic cftypes registartion.  Instead of going through
      cgroup_add_cftypes(), cgroup_init_subsys() invokes
      cgroup_init_cftsets() which makes use of cgroup_subsys->base_cftset
      which doesn't involve dynamic allocation.
      
      While avoiding dynamic allocation is somewhat nice, having two
      separate paths for cftypes registration is nasty, especially as we're
      planning to add more operations during cftypes registration.
      
      This patch drops cgroup_init_cftsets() and cgroup_subsys->base_cftset
      and registers base_cftypes using cgroup_add_cftypes().  This is done
      as a separate step in cgroup_init() instead of a part of
      cgroup_init_subsys().  This is because cgroup_init_subsys() can be
      called very early during boot when kmalloc() isn't available yet.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      de00ffa5
    • T
      cgroup: update cgroup name handling · 8d7e6fb0
      Tejun Heo 提交于
      Straightforward updates to cgroup name handling in preparation of
      kernfs conversion.
      
      * cgroup_alloc_name() is updated to take const char * isntead of
        dentry * for name source.
      
      * cgroup name formatting is separated out into cgroup_file_name().
        While at it, buffer length protection is added.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      8d7e6fb0
    • T
      cgroup: factor out cgroup_setup_root() from cgroup_mount() · d427dfeb
      Tejun Heo 提交于
      Factor out new root initialization into cgroup_setup_root() from
      cgroup_mount().  This makes it easier to follow and will ease kernfs
      conversion.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      d427dfeb
    • T
      cgroup: restructure locking and error handling in cgroup_mount() · 8e30e2b8
      Tejun Heo 提交于
      cgroup is scheduled to be converted to kernfs.  After conversion,
      cgroup_mount() won't use the sget() machinery for finding out existing
      super_blocks but instead would do that directly.  It'll search the
      existing cgroupfs_roots for a matching one and create a new one iff a
      match doesn't exist.  To ease such conversion, this patch restructures
      locking and error handling of the function.
      
      cgroup_tree_mutex and cgroup_mutex are grabbed from the get-go and
      held until return.  For now, due to the way vfs locks nest outside
      cgroup mutexes, the two cgroup mutexes are temporarily dropped across
      sget() and inode mutex locking, which looks quite ridiculous; however,
      these will be removed through kernfs conversion and structuring the
      code this way makes the conversion less painful.
      
      The error goto labels are consolidated to two.  This looks unwieldy
      now but the next patch will factor out creation of new root into a
      separate function with accompanying error handling and it'll look a
      lot better.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      8e30e2b8
    • T
      cgroup: release cgroup_mutex over file removals · 4ac06017
      Tejun Heo 提交于
      Now that cftypes and all tree modification operations are protected by
      cgroup_tree_mutex, we can drop cgroup_mutex while deleting files and
      directories.  Drop cgroup_mutex over removals.
      
      This doesn't make any noticeable difference now but is to help kernfs
      conversion.  In kernfs, removals are sync points which drain in-flight
      operations as those operations would grab cgroup_mutex, trying to
      delete under cgroup_mutex would deadlock.  This can be resolved by
      just holding the outer cgroup_tree_mutex which nests outside both
      kernfs active reference and cgroup_mutex.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      4ac06017
    • T
      cgroup: introduce cgroup_tree_mutex · ace2bee8
      Tejun Heo 提交于
      Currently cgroup uses combination of inode->i_mutex'es and
      cgroup_mutex for synchronization.  With the scheduled kernfs
      conversion, i_mutex'es will be removed.  Unfortunately, just using
      cgroup_mutex isn't possible.  All kernfs file and syscall operations,
      most of which require grabbing cgroup_mutex, will be called with
      kernfs active ref held and, if we try to perform kernfs removals under
      cgroup_mutex, it can deadlock as kernfs_remove() tries to drain the
      target node.
      
      Let's introduce a new outer mutex, cgroup_tree_mutex, which protects
      stuff used during hierarchy changing operations - cftypes and all the
      operations which may affect the cgroupfs.  It also covers css
      association and iteration.  This allows cgroup_css(), for_each_css()
      and other css iterators to be called under cgroup_tree_mutex.  The new
      mutex will nest above both kernfs's active ref protection and
      cgroup_mutex.  By protecting tree modifications with a separate outer
      mutex, we can get rid of the forementioned deadlock condition.
      
      Actual file additions and removals now require cgroup_tree_mutex
      instead of cgroup_mutex.  Currently, cgroup_tree_mutex is never used
      without cgroup_mutex; however, we'll soon add hierarchy modification
      sections which are only protected by cgroup_tree_mutex.  In the
      future, we might want to make the locking more granular by better
      splitting the coverages of the two mutexes.  For now, this should do.
      
      v2: Rebased on top of 0ab02ca8 ("cgroup: protect modifications to
          cgroup_idr with cgroup_mutex").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      ace2bee8
    • T
      cgroup: improve css_from_dir() into css_tryget_from_dir() · 5a17f543
      Tejun Heo 提交于
      css_from_dir() returns the matching css (cgroup_subsys_state) given a
      dentry and subsystem.  The function doesn't pin the css before
      returning and requires the caller to be holding RCU read lock or
      cgroup_mutex and handling pinning on the caller side.
      
      Given that users of the function are likely to want to pin the
      returned css (both existing users do) and that getting and putting
      css's are very cheap, there's no reason for the interface to be tricky
      like this.
      
      Rename css_from_dir() to css_tryget_from_dir() and make it try to pin
      the found css and return it only if pinning succeeded.  The callers
      are updated so that they no longer do RCU locking and pinning around
      the function and just use the returned css.
      
      This will also ease converting cgroup to kernfs.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      5a17f543
  2. 11 2月, 2014 1 次提交
    • L
      cgroup: protect modifications to cgroup_idr with cgroup_mutex · 0ab02ca8
      Li Zefan 提交于
      Setup cgroupfs like this:
        # mount -t cgroup -o cpuacct xxx /cgroup
        # mkdir /cgroup/sub1
        # mkdir /cgroup/sub2
      
      Then run these two commands:
        # for ((; ;)) { mkdir /cgroup/sub1/tmp && rmdir /mnt/sub1/tmp; } &
        # for ((; ;)) { mkdir /cgroup/sub2/tmp && rmdir /mnt/sub2/tmp; } &
      
      After seconds you may see this warning:
      
      ------------[ cut here ]------------
      WARNING: CPU: 1 PID: 25243 at lib/idr.c:527 sub_remove+0x87/0x1b0()
      idr_remove called for id=6 which is not allocated.
      ...
      Call Trace:
       [<ffffffff8156063c>] dump_stack+0x7a/0x96
       [<ffffffff810591ac>] warn_slowpath_common+0x8c/0xc0
       [<ffffffff81059296>] warn_slowpath_fmt+0x46/0x50
       [<ffffffff81300aa7>] sub_remove+0x87/0x1b0
       [<ffffffff810f3f02>] ? css_killed_work_fn+0x32/0x1b0
       [<ffffffff81300bf5>] idr_remove+0x25/0xd0
       [<ffffffff810f2bab>] cgroup_destroy_css_killed+0x5b/0xc0
       [<ffffffff810f4000>] css_killed_work_fn+0x130/0x1b0
       [<ffffffff8107cdbc>] process_one_work+0x26c/0x550
       [<ffffffff8107eefe>] worker_thread+0x12e/0x3b0
       [<ffffffff81085f96>] kthread+0xe6/0xf0
       [<ffffffff81570bac>] ret_from_fork+0x7c/0xb0
      ---[ end trace 2d1577ec10cf80d0 ]---
      
      It's because allocating/removing cgroup ID is not properly synchronized.
      
      The bug was introduced when we converted cgroup_ida to cgroup_idr.
      While synchronization is already done inside ida_simple_{get,remove}(),
      users are responsible for concurrent calls to idr_{alloc,remove}().
      
      tj: Refreshed on top of b58c8998 ("cgroup: fix error return from
      cgroup_create()").
      
      Fixes: 4e96ee8e ("cgroup: convert cgroup_ida to cgroup_idr")
      Cc: <stable@vger.kernel.org> #3.12+
      Reported-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      0ab02ca8
  3. 08 2月, 2014 8 次提交
    • T
      cgroup: remove cgroup_root_mutex · 3417ae1f
      Tejun Heo 提交于
      cgroup_root_mutex was added to avoid deadlock involving namespace_sem
      via cgroup_show_options().  It added a lot of overhead for the small
      purpose of it and, because it's nested under cgroup_mutex, it has very
      limited usefulness.  The previous patch made cgroup_show_options() not
      use cgroup_root_mutex, so nobody needs it anymore.  Remove it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      3417ae1f
    • T
      cgroup: update locking in cgroup_show_options() · 69e943b7
      Tejun Heo 提交于
      cgroup_show_options() grabs cgroup_root_mutex to protect the options
      changing while printing; however, holding root_mutex or not doesn't
      really make much difference for the function.  subsys_mask can be
      atomically tested and most of the options aren't allowed to change
      anyway once mounted.
      
      The only field which needs synchronization is ->release_agent_path.
      This patch introduces a dedicated spinlock to synchronize accesses to
      the field and drops cgroup_root_mutex locking from
      cgroup_show_options().  The next patch will remove cgroup_root_mutex.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      69e943b7
    • T
      cgroup: rename cgroup_subsys->subsys_id to ->id · aec25020
      Tejun Heo 提交于
      It's no longer referenced outside cgroup core, so renaming is easy.
      Let's rename it for consistency & brevity.
      
      This patch is pure rename.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      aec25020
    • T
      cgroup: clean up cgroup_subsys names and initialization · 073219e9
      Tejun Heo 提交于
      cgroup_subsys is a bit messier than it needs to be.
      
      * The name of a subsys can be different from its internal identifier
        defined in cgroup_subsys.h.  Most subsystems use the matching name
        but three - cpu, memory and perf_event - use different ones.
      
      * cgroup_subsys_id enums are postfixed with _subsys_id and each
        cgroup_subsys is postfixed with _subsys.  cgroup.h is widely
        included throughout various subsystems, it doesn't and shouldn't
        have claim on such generic names which don't have any qualifier
        indicating that they belong to cgroup.
      
      * cgroup_subsys->subsys_id should always equal the matching
        cgroup_subsys_id enum; however, we require each controller to
        initialize it and then BUG if they don't match, which is a bit
        silly.
      
      This patch cleans up cgroup_subsys names and initialization by doing
      the followings.
      
      * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
        cgroup_subsys with _cgrp_subsys.
      
      * With the above, renaming subsys identifiers to match the userland
        visible names doesn't cause any naming conflicts.  All non-matching
        identifiers are renamed to match the official names.
      
        cpu_cgroup -> cpu
        mem_cgroup -> memory
        perf -> perf_event
      
      * controllers no longer need to initialize ->subsys_id and ->name.
        They're generated in cgroup core and set automatically during boot.
      
      * Redundant cgroup_subsys declarations removed.
      
      * While updating BUG_ON()s in cgroup_init_early(), convert them to
        WARN()s.  BUGging that early during boot is stupid - the kernel
        can't print anything, even through serial console and the trap
        handler doesn't even link stack frame properly for back-tracing.
      
      This patch doesn't introduce any behavior changes.
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Acked-by: N"Rafael J. Wysocki" <rjw@rjwysocki.net>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Acked-by: NAristeu Rozanski <aris@redhat.com>
      Acked-by: NIngo Molnar <mingo@redhat.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      073219e9
    • T
      cgroup: drop module support · 3ed80a62
      Tejun Heo 提交于
      With module supported dropped from net_prio, no controller is using
      cgroup module support.  None of actual resource controllers can be
      built as a module and we aren't gonna add new controllers which don't
      control resources.  This patch drops module support from cgroup.
      
      * cgroup_[un]load_subsys() and cgroup_subsys->module removed.
      
      * As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(),
        cgroup_subsys.h now uses IS_ENABLED() directly.
      
      * enum cgroup_subsys_id now exactly matches the list of enabled
        controllers as ordered in cgroup_subsys.h.
      
      * cgroup_subsys[] is now a contiguously occupied array.  Size
        specification is no longer necessary and dropped.
      
      * for_each_builtin_subsys() is removed and for_each_subsys() is
        updated to not require any locking.
      
      * module ref handling is removed from rebind_subsystems().
      
      * Module related comments dropped.
      
      v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
          classid handling into core").
      
      v3: Added {} around the if (need_forkexit_callback) block in
          cgroup_post_fork() for readability as suggested by Li.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      3ed80a62
    • T
      cgroup: fix locking in cgroup_cfts_commit() · 48573a89
      Tejun Heo 提交于
      cgroup_cfts_commit() walks the cgroup hierarchy that the target
      subsystem is attached to and tries to apply the file changes.  Due to
      the convolution with inode locking, it can't keep cgroup_mutex locked
      while iterating.  It currently holds only RCU read lock around the
      actual iteration and then pins the found cgroup using dget().
      
      Unfortunately, this is incorrect.  Although the iteration does check
      cgroup_is_dead() before invoking dget(), there's nothing which
      prevents the dentry from going away inbetween.  Note that this is
      different from the usual css iterations where css_tryget() is used to
      pin the css - css_tryget() tests whether the css can be pinned and
      fails if not.
      
      The problem can be solved by simply holding cgroup_mutex instead of
      RCU read lock around the iteration, which actually reduces LOC.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      48573a89
    • T
      cgroup: fix error return from cgroup_create() · b58c8998
      Tejun Heo 提交于
      cgroup_create() was returning 0 after allocation failures.  Fix it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      b58c8998
    • T
      cgroup: fix error return value in cgroup_mount() · eb46bf89
      Tejun Heo 提交于
      When cgroup_mount() fails to allocate an id for the root, it didn't
      set ret before jumping to unlock_drop ending up returning 0 after a
      failure.  Fix it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: stable@vger.kernel.org
      eb46bf89
  4. 07 2月, 2014 1 次提交
    • H
      cgroup: use an ordered workqueue for cgroup destruction · ab3f5faa
      Hugh Dickins 提交于
      Sometimes the cleanup after memcg hierarchy testing gets stuck in
      mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
      
      There may turn out to be several causes, but a major cause is this: the
      workitem to offline parent can get run before workitem to offline child;
      parent's mem_cgroup_reparent_charges() circles around waiting for the
      child's pages to be reparented to its lrus, but it's holding cgroup_mutex
      which prevents the child from reaching its mem_cgroup_reparent_charges().
      
      Just use an ordered workqueue for cgroup_destroy_wq.
      
      tj: Committing as the temporary fix until the reverse dependency can
          be removed from memcg.  Comment updated accordingly.
      
      Fixes: e5fca243 ("cgroup: use a dedicated workqueue for cgroup destruction")
      Suggested-by: NFilipe Brandenburger <filbranden@google.com>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org # 3.10+
      Signed-off-by: NTejun Heo <tj@kernel.org>
      ab3f5faa
  5. 18 1月, 2014 1 次提交
  6. 17 12月, 2013 1 次提交
    • L
      cgroup: don't recycle cgroup id until all csses' have been destroyed · c1a71504
      Li Zefan 提交于
      Hugh reported this bug:
      
      > CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
      >
      > mkdir -p /tmp/tmpfs /tmp/memcg
      > mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
      > mount -t cgroup -o memory memcg /tmp/memcg
      > mkdir /tmp/memcg/old
      > echo 512M >/tmp/memcg/old/memory.limit_in_bytes
      > echo $$ >/tmp/memcg/old/tasks
      > cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
      > echo $$ >/tmp/memcg/tasks
      > rmdir /tmp/memcg/old
      > sleep 1	# let rmdir work complete
      > mkdir /tmp/memcg/new
      > umount /tmp/tmpfs
      > dmesg | grep WARNING
      > rmdir /tmp/memcg/new
      > umount /tmp/memcg
      >
      > Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
      >                            res_counter_uncharge_locked+0x1f/0x2f()
      >
      > Breakage comes from 34c00c31 ("memcg: convert to use cgroup id").
      >
      > The lifetime of a cgroup id is different from the lifetime of the
      > css id it replaced: memsw's css_get()s do nothing to hold on to the
      > old cgroup id, it soon gets recycled to a new cgroup, which then
      > mysteriously inherits the old's swap, without any charge for it.
      
      Instead of removing cgroup id right after all the csses have been
      offlined, we should do that after csses have been destroyed.
      
      To make sure an invalid css pointer won't be returned after the css
      is destroyed, make sure css_from_id() returns NULL in this case.
      
      tj: Updated comment to note planned changes for cgrp->id.
      Reported-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      c1a71504
  7. 14 12月, 2013 1 次提交
  8. 12 12月, 2013 1 次提交
  9. 07 12月, 2013 8 次提交
    • T
      cgroup: remove for_each_root_subsys() · b85d2040
      Tejun Heo 提交于
      After the previous patch which introduced for_each_css(),
      for_each_root_subsys() only has two users left.  This patch replaces
      it with for_each_subsys() + explicit subsys_mask testing and remove
      for_each_root_subsys() along with cgroupfs_root->subsys_list handling.
      
      This patch doesn't introduce any behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      b85d2040
    • T
      cgroup: implement for_each_css() · 1c6727af
      Tejun Heo 提交于
      There are enough places where css's of a cgroup are iterated, which
      currently uses for_each_root_subsys() + explicit cgroup_css().  This
      patch implements for_each_css() and replaces the above combination
      with it.
      
      This patch doesn't introduce any behavior changes.
      
      v2: Updated to apply cleanly on top of v2 of "cgroup: fix css leaks on
          online_css() failure"
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      1c6727af
    • T
      cgroup: factor out cgroup_subsys_state creation into create_css() · c81c925a
      Tejun Heo 提交于
      Now that all opertations to create a css (cgroup_subsys_state) are
      collected into a single loop in cgroup_create(), it's easy to factor
      it out into its own function.  Factor out css creation into
      create_css().  This makes the code easier to follow and will enable
      decoupling css creation from cgroup creation which is necessary for
      the planned unified hierarchy.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      c81c925a
    • T
      cgroup: combine css handling loops in cgroup_create() · 9d403e99
      Tejun Heo 提交于
      Now that css operations in cgroup_create() are back-to-back, there
      isn't much point in allocating css's in one loop and onlining them in
      another.  Merge the two loops so that a css is allocated and onlined
      on each iteration.
      
      css_ar[] is no longer necessary and replaced with a single pointer.
      This also simplifies the error handling path.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      9d403e99
    • T
      cgroup: reorder operations in cgroup_create() · 0d80255e
      Tejun Heo 提交于
      cgroup_create() currently does the followings.
      
      1. alloc cgroup
      2. alloc css's
      3. create the directory and commit to cgroup creation
      4. online css's
      5. create cgroup and css files
      
      The sequence performs allocations before other operations but it
      doesn't buy anything because each of the above steps may fail and
      should be unrollable.  Reorganize the sequence such that cgroup
      operations are done before css operations.
      
      1. alloc cgroup
      2. create the directory and files and commit to cgroup creation
      3. alloc css's
      4. create files for and online css's
      
      This simplifies the code a bit and enables further simplification and
      separating out css creation from cgroup creation which is necessary
      for the planned unified hierarchy where css's will be created and
      destroyed dynamically across the lifetime of a cgroup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      0d80255e
    • T
      cgroup: make for_each_subsys() useable under cgroup_root_mutex · 780cd8b3
      Tejun Heo 提交于
      We want to use for_each_subsys() in cgroupfs_root handling where only
      cgroup_root_mutex is held.  The only way cgroup_subsys[] can change is
      through module load/unload, make cgroup_[un]load_subsys() grab
      cgroup_root_mutex too and update the lockdep annotation in
      for_each_subsys() to allow either cgroup_mutex or cgroup_root_mutex.
      
      * Lockdep annotation is moved from inner 'if' condition to outer 'for'
        init caluse.  There's no reason to execute the assertion every loop.
      
      * Loop index @i is renamed to @ssid.  Indices iterating through subsys
        will be [re]named to @ssid gradually.
      
      v2: cgroup_assert_mutex_or_root_locked() caused build failure if
          !CONFIG_LOCKEDP.  Conditionalize its definition.  The build failure
          was reported by kbuild test bot.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: kbuild test robot <fengguang.wu@intel.com>
      780cd8b3
    • T
      cgroup: css iterations and css_from_dir() are safe under cgroup_mutex · 87fb54f1
      Tejun Heo 提交于
      Currently, all css iterations and css_from_dir() require RCU read lock
      whether the caller is holding cgroup_mutex or not, which is
      unnecessarily restrictive.  They are all safe to use under
      cgroup_mutex without holding RCU read lock.
      
      Factor out cgroup_assert_mutex_or_rcu_locked() from css_from_id() and
      apply it to all css iteration functions and css_from_dir().
      
      v2: cgroup_assert_mutex_or_rcu_locked() definition doesn't need to be
          inside CONFIG_PROVE_RCU ifdef as rcu_lockdep_assert() is always
          defined and conditionalized.  Move it outside of the ifdef block.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      87fb54f1
    • T
      cgroup: fix cgroup_create() error handling path · 266ccd50
      Tejun Heo 提交于
      ae7f164a ("cgroup: move cgroup->subsys[] assignment to
      online_css()") moved cgroup->subsys[] assignements later in
      cgroup_create() but didn't update error handling path accordingly
      leading to the following oops and leaking later css's after an
      online_css() failure.  The oops is from cgroup destruction path being
      invoked on the partially constructed cgroup which is not ready to
      handle empty slots in cgrp->subsys[] array.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
        PGD a780a067 PUD aadbe067 PMD 0
        Oops: 0000 [#1] SMP
        Modules linked in:
        CPU: 6 PID: 7360 Comm: mkdir Not tainted 3.13.0-rc2+ #69
        Hardware name:
        task: ffff8800b9dbec00 ti: ffff8800a781a000 task.ti: ffff8800a781a000
        RIP: 0010:[<ffffffff810eeaa8>]  [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
        RSP: 0018:ffff8800a781bd98  EFLAGS: 00010282
        RAX: ffff880586903878 RBX: ffff880586903800 RCX: ffff880586903820
        RDX: ffff880586903860 RSI: ffff8800a781bdb0 RDI: ffff880586903820
        RBP: ffff8800a781bde8 R08: ffff88060e0b8048 R09: ffffffff811d7bc1
        R10: 000000000000008c R11: 0000000000000001 R12: ffff8800a72286c0
        R13: 0000000000000000 R14: ffffffff81cf7a40 R15: 0000000000000001
        FS:  00007f60ecda57a0(0000) GS:ffff8806272c0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000008 CR3: 00000000a7a03000 CR4: 00000000000007e0
        Stack:
         ffff880586903860 ffff880586903910 ffff8800a72286c0 ffff880586903820
         ffffffff81cf7a40 ffff880586903800 ffff88060e0b8018 ffffffff81cf7a40
         ffff8800b9dbec00 ffff8800b9dbf098 ffff8800a781bec8 ffffffff810ef5bf
        Call Trace:
         [<ffffffff810ef5bf>] cgroup_mkdir+0x55f/0x5f0
         [<ffffffff811c90ae>] vfs_mkdir+0xee/0x140
         [<ffffffff811cb07e>] SyS_mkdirat+0x6e/0xf0
         [<ffffffff811c6a19>] SyS_mkdir+0x19/0x20
         [<ffffffff8169e569>] system_call_fastpath+0x16/0x1b
      
      This patch moves reference bumping inside online_css() loop, clears
      css_ar[] as css's are brought online successfully, and updates
      err_destroy path so that either a css is fully online and destroyed by
      cgroup_destroy_locked() or the error path frees it.  This creates a
      duplicate css free logic in the error path but it will be cleaned up
      soon.
      
      v2: Li pointed out that cgroup_destroy_locked() would do NULL-deref if
          invoked with a cgroup which doesn't have all css's populated.
          Update cgroup_destroy_locked() so that it skips NULL css's.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Reported-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: stable@vger.kernel.org # v3.12+
      266ccd50
  10. 06 12月, 2013 4 次提交
    • T
      cgroup: unify pidlist and other file handling · 6612f05b
      Tejun Heo 提交于
      In preparation of conversion to kernfs, cgroup file handling is
      updated so that it can be easily mapped to kernfs.  With the previous
      changes, the difference between pidlist and other files are very
      small.  Both are served by seq_file in a pretty standard way with the
      only difference being !pidlist files use single_open().
      
      This patch adds cftype->seq_start(), ->seq_next and ->seq_stop() and
      implements the matching cgroup_seqfile_start/next/stop() which either
      emulates single_open() behavior or invokes cftype->seq_*() operations
      if specified.  This allows using single seq_operations for both
      pidlist and other files and makes cgroup_pidlist_operations and
      cgorup_pidlist_open() no longer necessary.  As cgroup_pidlist_open()
      was the only user of cftype->open(), the method is dropped together.
      
      This brings cftype file interface very close to kernfs interface and
      mapping shouldn't be too difficult.  Once converted to kernfs, most of
      the plumbing code including cgroup_seqfile_*() will be removed as
      kernfs provides those facilities.
      
      This patch does not introduce any behavior changes.
      
      v2: Refreshed on top of the updated "cgroup: introduce struct
          cgroup_pidlist_open_file".
      
      v3: Refreshed on top of the updated "cgroup: attach cgroup_open_file
          to all cgroup files".
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      6612f05b
    • T
      cgroup: replace cftype->read_seq_string() with cftype->seq_show() · 2da8ca82
      Tejun Heo 提交于
      In preparation of conversion to kernfs, cgroup file handling is
      updated so that it can be easily mapped to kernfs.  This patch
      replaces cftype->read_seq_string() with cftype->seq_show() which is
      not limited to single_open() operation and will map directcly to
      kernfs seq_file interface.
      
      The conversions are mechanical.  As ->seq_show() doesn't have @css and
      @cft, the functions which make use of them are converted to use
      seq_css() and seq_cft() respectively.  In several occassions, e.f. if
      it has seq_string in its name, the function name is updated to fit the
      new method better.
      
      This patch does not introduce any behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NAristeu Rozanski <arozansk@redhat.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      2da8ca82
    • T
      cgroup: attach cgroup_open_file to all cgroup files · 7da11279
      Tejun Heo 提交于
      In preparation of conversion to kernfs, cgroup file handling is
      updated so that it can be easily mapped to kernfs.  This patch
      attaches cgroup_open_file, which used to be attached to pidlist files,
      to all cgroup files, introduces seq_css/cft() accessors to determine
      the cgroup_subsys_state and cftype associated with a given cgroup
      seq_file, exports them as public interface.
      
      This doesn't cause any behavior changes but unifies cgroup file
      handling across different file types and will help converting them to
      kernfs seq_show() interface.
      
      v2: Li pointed out that the original patch was using
          single_open_size() incorrectly assuming that the size param is
          private data size.  Fix it by allocating @of separately and
          passing it to single_open() and explicitly freeing it in the
          release path.  This isn't the prettiest but this path is gonna be
          restructured by the following patches pretty soon.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      7da11279
    • T
      cgroup: generalize cgroup_pidlist_open_file · 5d22444f
      Tejun Heo 提交于
      In preparation of conversion to kernfs, cgroup file handling is
      updated so that it can be easily mapped to kernfs.  This patch renames
      cgroup_pidlist_open_file to cgroup_open_file and updates it so that it
      only contains a field to identify the specific file, ->cfe, and an
      opaque ->priv pointer.  When cgroup is converted to kernfs, this will
      be replaced by kernfs_open_file which contains about the same
      information.
      
      As whether the file is "cgroup.procs" or "tasks" should now be
      determined from cgroup_open_file->cfe, the cftype->private for the two
      files now carry the file type and cgroup_pidlist_start() reads the
      type through cfe->type->private.  This makes the distinction between
      cgroup_tasks_open() and cgroup_procs_open() unnecessary.
      cgroup_pidlist_open() is now directly used as the open method.
      
      This patch doesn't make any behavior changes.
      
      v2: Refreshed on top of the updated "cgroup: introduce struct
          cgroup_pidlist_open_file".
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      5d22444f