1. 16 Apr, 2013 1 commit
  2. 15 Apr, 2013 3 commits
    • cgroup: remove cgrp->top_cgroup · 05fb22ec
      Li Zefan committed
      It's not used, and it can be retrieved via cgrp->root->top_cgroup.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: introduce sane_behavior mount option · 873fe09e
      Tejun Heo committed
      It's a sad fact that at this point various cgroup controllers are
      carrying so many idiosyncrasies and pure insanities that it simply
      isn't possible to reach any sort of sane consistent behavior while
      staying fully compatible with what already has been
      exposed to userland.
      
      As we can't break exposed userland interface, transitioning to sane
      behaviors can only be done in steps while maintaining backwards
      compatibility.  This patch introduces a new mount option -
      __DEVEL__sane_behavior - which disables crazy features and enforces
      consistent behaviors in cgroup core proper and various controllers.
      As exactly which behaviors it changes are still being determined, the
      mount option, at this point, is useful only for development of the new
      behaviors.  As such, the mount option is prefixed with __DEVEL__ and
      generates a warning message when used.
      
      Eventually, once we get to the point where all controllers' behaviors
      are consistent enough to implement unified hierarchy, the __DEVEL__
      prefix will be dropped, and more importantly, unified-hierarchy will
      enforce sane_behavior by default.  Maybe we'll be able to completely drop
      the crazy stuff after a while, maybe not, but we at least have a
      strategy to move on to saner behaviors.
      
      This patch introduces the mount option and changes the following
      behaviors in cgroup core.
      
      * Mount options "noprefix" and "clone_children" are disallowed.  Also,
        cgroupfs file cgroup.clone_children is not created.
      
      * When mounting an existing superblock, mount options should match.
        This is currently pretty crazy.  If one mounts a cgroup, creates a
        subdirectory, unmounts it and then mounts it again with different
        options, it looks like the new options are applied but they aren't.
      
      * Remount is disallowed.
      
      The behavior changes are documented in the comment above
      CGRP_ROOT_SANE_BEHAVIOR enum and will be expanded as different
      controllers are converted and planned improvements progress.
      
      v2: Dropped the unnecessary explicit file permission setting on the
          sane_behavior cftype entry as suggested by Li Zefan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vivek Goyal <vgoyal@redhat.com>
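The three core behavior changes listed in the sane_behavior commit can be sketched as a small option validator. This is only an illustration in user-space C: the flag names, the helper names and the choice of `-EINVAL` as the error code are assumptions made for the sketch, not the kernel's actual definitions.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits, named only for this sketch. */
enum {
	OPT_SANE_BEHAVIOR  = 1u << 0,
	OPT_NOPREFIX       = 1u << 1,
	OPT_CLONE_CHILDREN = 1u << 2,
};

/* Fresh mount: "noprefix" and "clone_children" are rejected once
 * sane behavior is requested. */
static int check_mount_opts(unsigned int opts)
{
	if ((opts & OPT_SANE_BEHAVIOR) &&
	    (opts & (OPT_NOPREFIX | OPT_CLONE_CHILDREN)))
		return -EINVAL;
	return 0;
}

/* Mounting an existing superblock: if either side asks for sane
 * behavior, the option sets have to match exactly. */
static int check_existing_root(unsigned int root_opts, unsigned int new_opts)
{
	if (((root_opts | new_opts) & OPT_SANE_BEHAVIOR) &&
	    root_opts != new_opts)
		return -EINVAL;
	return 0;
}

/* Remount is refused outright under sane behavior. */
static int check_remount(unsigned int root_opts)
{
	return (root_opts & OPT_SANE_BEHAVIOR) ? -EINVAL : 0;
}
```

Without the sane-behavior flag all three helpers accept everything, which mirrors the backward-compatibility goal stated in the commit message.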
    • move cgroupfs_root to include/linux/cgroup.h · 25a7e684
      Tejun Heo committed
      While controllers shouldn't be accessing cgroupfs_root directly, it
      being hidden inside kernel/cgroup.c makes some things pretty silly.  This
      makes routing hierarchy-wide settings which need to be visible to
      controllers cumbersome.
      
      We're gonna add another hierarchy-wide setting which needs to be
      accessed from controllers.  Move cgroupfs_root and its flags to the
      header file so that we can access root settings with inline helpers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
  3. 13 Apr, 2013 1 commit
  4. 11 Apr, 2013 2 commits
    • cgroup: implement cgroup_is_descendant() · 78574cf9
      Li Zefan committed
      A couple controllers want to determine whether two cgroups are in
      ancestor/descendant relationship.  As it's more likely that the
      descendant is the primary subject of interest and there are other
      operations focusing on the descendants, let's ask is_descendant rather
      than is_ancestor.
      
      Implementation is trivial as the previous patch guarantees that all
      ancestors of a cgroup stay accessible as long as the cgroup is
      accessible.
      
      tj: Removed depth optimization, renamed from cgroup_is_ancestor(),
          rewrote descriptions.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
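The descendant test described in the cgroup_is_descendant() commit amounts to following parent pointers upward. A minimal sketch, assuming a hypothetical `struct node` with only a parent link, and assuming that a node counts as its own descendant (a choice made for this sketch). It also ignores the lifetime guarantee the commit relies on, namely that ancestors stay accessible while the node is.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup: just a parent link. */
struct node {
	struct node *parent;	/* NULL for the root */
};

/* Walk from cgrp toward the root; true if ancestor is met on the way. */
static bool node_is_descendant(const struct node *cgrp,
			       const struct node *ancestor)
{
	while (cgrp) {
		if (cgrp == ancestor)
			return true;
		cgrp = cgrp->parent;
	}
	return false;
}
```

The walk costs one step per level of depth, which is why no extra bookkeeping is needed for the simple version.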
    • cgroup: remove bind() method from cgroup_subsys. · 84cfb6ab
      Rami Rosen committed
      The bind() method of cgroup_subsys is not used in any of the
      controllers (cpuset, freezer, blkio, net_cls, memcg, net_prio,
      devices, perf, hugetlb, cpu and cpuacct).
      
      tj: Removed the entry on ->bind() from
          Documentation/cgroups/cgroups.txt.  Also updated a couple
          paragraphs which were suggesting that dynamic re-binding may be
          implemented.  It's not gonna.
      Signed-off-by: Rami Rosen <ramirose@gmail.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  5. 08 Apr, 2013 3 commits
    • cgroup: remove cgroup_lock_is_held() · 2219449a
      Tejun Heo committed
      We don't want controllers to assume that the information is officially
      available and do funky things with it.
      
      The only user is task_subsys_state_check() which uses it to verify RCU
      access context.  We can move cgroup_lock_is_held() inside
      CONFIG_PROVE_RCU but that doesn't add meaningful protection compared
      to conditionally exposing cgroup_mutex.
      
      Remove cgroup_lock_is_held(), export cgroup_mutex iff CONFIG_PROVE_RCU
      and use lockdep_is_held() directly on the mutex in
      task_subsys_state_check().
      
      While at it, add parentheses around macro arguments in
      task_subsys_state_check().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: unexport locking interface and cgroup_attach_task() · b9777cf8
      Tejun Heo committed
      Now that all external cgroup_lock() users are gone, we can finally
      unexport the locking interface and prevent future abuse of
      cgroup_mutex.
      
      Make cgroup_[un]lock() and cgroup_lock_live_group() static.  Also,
      cgroup_attach_task() doesn't have any user left and can't be used
      without locking interface anyway.  Make it static too.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup, cpuset: replace move_member_tasks_to_cpuset() with cgroup_transfer_tasks() · 8cc99345
      Tejun Heo committed
      When a cpuset becomes empty (no CPU or memory), its tasks are
      transferred to the nearest ancestor with execution resources.  This
      is implemented using cgroup_scan_tasks() with a callback which grabs
      cgroup_mutex and invokes cgroup_attach_task() on each task.
      
      Both cgroup_mutex and cgroup_attach_task() are scheduled to be
      unexported.  Implement cgroup_transfer_tasks() in cgroup proper which
      is essentially the same as move_member_tasks_to_cpuset() except that
      it takes cgroups instead of cpusets and @to comes before @from like
      normal functions with those arguments, and replace
      move_member_tasks_to_cpuset() with it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  6. 20 Mar, 2013 1 commit
  7. 13 Mar, 2013 1 commit
  8. 06 Mar, 2013 1 commit
  9. 05 Mar, 2013 1 commit
    • cgroup: fix cgroup_path() vs rename() race · 65dff759
      Li Zefan committed
      rename() will change dentry->d_name. The result of this race can
      be worse than seeing a partially rewritten name: we might access
      a stale pointer because rename() will re-allocate memory to hold
      a longer name.
      
      As accessing dentry->d_name must be protected by dentry->d_lock or
      parent inode's i_mutex, while on the other hand cgroup_path() can
      be called with some irq-safe spinlocks held, we can't generate
      cgroup path using dentry->d_name.
      
      Instead, we make a copy of dentry->d_name and save it in
      cgrp->name when a cgroup is created, and update cgrp->name at
      rename().
      
      v5: use flexible array instead of zero-size array.
      v4: - allocate root_cgroup_name and make all root_cgroup->name point to it.
          - add cgroup_name() wrapper.
      v3: use kfree_rcu() instead of synchronize_rcu() in user-visible path.
      v2: make cgrp->name RCU safe.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  10. 25 Jan, 2013 1 commit
  11. 08 Jan, 2013 2 commits
  12. 20 Nov, 2012 9 commits
  13. 10 Nov, 2012 3 commits
    • cgroup: implement generic child / descendant walk macros · 574bd9f7
      Tejun Heo committed
      Currently, cgroup doesn't provide any generic helper for walking a
      given cgroup's children or descendants.  This patch adds the following
      three macros.
      
      * cgroup_for_each_child() - walk immediate children of a cgroup.
      
      * cgroup_for_each_descendant_pre() - visit all descendants of a cgroup
        in pre-order tree traversal.
      
      * cgroup_for_each_descendant_post() - visit all descendants of a
        cgroup in post-order tree traversal.
      
      All three only require the user to hold RCU read lock during
      traversal.  Verifying that each iterated cgroup is online is the
      responsibility of the user.  When used with proper synchronization,
      cgroup_for_each_descendant_pre() can be used to propagate state
      updates to descendants in a reliable way.  See comments for details.
      
      v2: s/config/state/ in commit message and comments per Michal.  More
          documentation on synchronization rules.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
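A pre-order descendant walk of the kind the macros above provide can be written without recursion by stepping from each position to its successor. The following is a rough user-space sketch: `struct node`, its link fields and `next_descendant_pre()` are hypothetical names, and the RCU read locking and online checks that the commit message requires of real users are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tree node with the three links the walk needs. */
struct node {
	struct node *parent;
	struct node *first_child;
	struct node *next_sibling;
};

/*
 * Return the node visited after pos in a pre-order walk of root's
 * descendants, or NULL when the walk is finished. Pass pos == NULL to
 * start; root itself is never returned.
 */
static struct node *next_descendant_pre(struct node *pos, struct node *root)
{
	if (!pos)
		pos = root;

	/* Going down comes first: a parent is visited before its children. */
	if (pos->first_child)
		return pos->first_child;

	/* No children: move right, climbing until a sibling exists. */
	while (pos != root) {
		if (pos->next_sibling)
			return pos->next_sibling;
		pos = pos->parent;
	}
	return NULL;
}
```

Because a parent is always returned before any of its children, a caller that pushes state downward sees each node only after its parent has been handled, which is the property the commit message points to for state propagation.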
    • cgroup: use rculist ops for cgroup->children · eb6fd504
      Tejun Heo committed
      Use RCU safe list operations for cgroup->children.  This will be used
      to implement cgroup children / descendant walking which can be used by
      controllers.
      
      Note that cgroup_create() now puts a new cgroup at the end of the
      ->children list instead of head.  This isn't strictly necessary but is
      done so that the iteration order is more conventional.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: add cgroup_subsys->post_create() · a8638030
      Tejun Heo committed
      Currently, there's no way for a controller to find out whether a new
      cgroup finished all ->create() allocations successfully and is
      considered "live" by cgroup.
      
      This becomes a problem later when we add generic descendants walking
      to cgroup which can be used by controllers as controllers don't have a
      synchronization point where they can synchronize against new cgroups
      appearing in such walks.
      
      This patch adds ->post_create().  It's called after all ->create()
      succeeded and the cgroup is linked into the generic cgroup hierarchy.
      This plays the counterpart of ->pre_destroy().
      
      When used in combination with the to-be-added generic descendant
      iterators, ->post_create() can be used to implement reliable state
      inheritance.  It will be explained with the descendant iterators.
      
      v2: Added a paragraph about its future use w/ descendant iterators per
          Michal.
      
      v3: Forgot to add ->post_create() invocation to cgroup_load_subsys().
          Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Glauber Costa <glommer@parallels.com>
  14. 06 Nov, 2012 4 commits
    • cgroup: make ->pre_destroy() return void · bcf6de1b
      Tejun Heo committed
      All ->pre_destroy() implementations return 0 now, which is the only
      allowed return value.  Make it return void.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
    • cgroup: remove CGRP_WAIT_ON_RMDIR, cgroup_exclude_rmdir() and cgroup_release_and_wakeup_rmdir() · b25ed609
      Tejun Heo committed
      CGRP_WAIT_ON_RMDIR is another kludge which was added to make cgroup
      destruction rollback somewhat working.  cgroup_rmdir() used to drain
      CSS references and CGRP_WAIT_ON_RMDIR and the associated waitqueue and
      helpers were used to allow the task performing rmdir to wait for the
      next relevant event.
      
      Unfortunately, the wait is visible to controllers too and the
      mechanism got exposed to memcg by 88703267 ("cgroup avoid permanent
      sleep at rmdir").
      
      Now that the draining and retries are gone, CGRP_WAIT_ON_RMDIR is
      unnecessary.  Remove it and all the mechanisms supporting it.  Note
      that memcontrol.c changes are essentially a revert of 88703267
      ("cgroup avoid permanent sleep at rmdir").
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
    • cgroup: kill CSS_REMOVED · e9316080
      Tejun Heo committed
      CSS_REMOVED is one of the several contortions which were necessary to
      support css reference draining on cgroup removal.  All css->refcnts
      which needed draining had to be deactivated and verified to equal zero
      atomically w.r.t. css_tryget().  If any one wasn't zero, all refcnts
      needed to be re-activated and css_tryget() shouldn't fail in the
      process.
      
      This was achieved by letting css_tryget() busy-loop until either the
      refcnt is reactivated (failed removal attempt) or CSS_REMOVED is set
      (committing to removal).
      
      Now that css refcnt draining is no longer used, there's no need for
      atomic rollback mechanism.  css_tryget() simply can look at the
      reference count and fail if it's deactivated - it's never getting
      re-activated.
      
      This patch removes CSS_REMOVED and updates __css_tryget() to fail if
      the refcnt is deactivated.  As deactivation and removal are a single
      step now, they no longer need to be protected against css_tryget()
      happening from irq context.  Remove local_irq_disable/enable() from
      cgroup_rmdir().
      
      Note that this removes css_is_removed() whose only user is VM_BUG_ON()
      in memcontrol.c.  We can replace it with a check on the refcnt but
      given that the only use case is a debug assert, I think it's better to
      simply unexport it.
      
      v2: Comment updated and explanation on local_irq_disable/enable()
          added per Michal Hocko.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
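The simplified tryget rule from the CSS_REMOVED commit (look at the count, fail if it is deactivated, never wait for re-activation) might look roughly like this with C11 atomics. The `struct ref` type, the helper names and the use of a negative bias to mark deactivation are assumptions of this sketch; the commit message does not spell out the representation.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference counter; a negative value means "deactivated". */
struct ref {
	atomic_int cnt;
};

/* Take a reference unless the counter has been deactivated. */
static bool ref_tryget(struct ref *r)
{
	int v = atomic_load(&r->cnt);

	while (v >= 0) {
		/* On failure v is reloaded and the sign is checked again. */
		if (atomic_compare_exchange_weak(&r->cnt, &v, v + 1))
			return true;
	}
	return false;	/* deactivated: fail at once, no busy-wait */
}

/* One-way deactivation: bias the counter so it reads as negative. */
static void ref_deactivate(struct ref *r)
{
	atomic_fetch_add(&r->cnt, INT_MIN);
}
```

Since deactivation is never undone in this scheme, the failure path needs no loop that waits for a rollback, which is the simplification the commit describes.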
    • cgroup: kill cgroup_subsys->__DEPRECATED_clear_css_refs · ed957793
      Tejun Heo committed
      2ef37d3f ("memcg: Simplify mem_cgroup_force_empty_list error
      handling") removed the last user of __DEPRECATED_clear_css_refs.  This
      patch removes __DEPRECATED_clear_css_refs and mechanisms to support
      it.
      
      * Conditionals dependent on __DEPRECATED_clear_css_refs removed.
      
      * cgroup_clear_css_refs() can no longer fail.  All that needs to be
        done are deactivating refcnts, setting CSS_REMOVED and putting the
        base reference on each css.  Remove cgroup_clear_css_refs() and the
        failure path, and open-code the loops into cgroup_rmdir().
      
      This patch keeps the two for_each_subsys() loops separate while open
      coding them.  They can be merged now but there are scheduled changes
      which need them to be separate, so keep them separate to reduce the
      amount of churn.
      
      local_irq_save/restore() from cgroup_clear_css_refs() are replaced
      with local_irq_disable/enable() for simplicity.  This is safe as
      cgroup_rmdir() is always called with IRQ enabled.  Note that this IRQ
      switching is necessary to ensure that css_tryget() isn't called from
      IRQ context on the same CPU while lower context is between CSS
      deactivation and setting CSS_REMOVED as css_tryget() would hang
      forever in such cases waiting for CSS to be re-activated or
      CSS_REMOVED set.  This will go away soon.
      
      v2: cgroup_call_pre_destroy() removal dropped per Michal.  Commit
          message updated to explain local_irq_disable/enable() conversion.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
  15. 17 Oct, 2012 1 commit
    • cgroup: cgroup_subsys->fork() should be called after the task is added to css_set · 5edee61e
      Tejun Heo committed
      cgroup core has a bug which violates a basic rule about event
      notifications - when a new entity needs to be added, you add that to
      the notification list first and then make the new entity conform to
      the current state.  If done in the reverse order, an event happening
      in between will be lost.
      
      cgroup_subsys->fork() is invoked way before the new task is added to
      the css_set.  Currently, cgroup_freezer is the only user of ->fork()
      and uses it to make new tasks conform to the current state of the
      freezer.  If FROZEN state is requested while fork is in progress
      between cgroup_fork_callbacks() and cgroup_post_fork(), the child
      could escape freezing - the cgroup isn't frozen when ->fork() is
      called and the freezer couldn't see the new task on the css_set.
      
      This patch moves cgroup_subsys->fork() invocation to
      cgroup_post_fork() after the new task is added to the css_set.
      cgroup_fork_callbacks() is removed.
      
      Because now a task may be migrated during cgroup_subsys->fork(),
      freezer_fork() is updated so that it adheres to the usual RCU locking
      and the rather pointless comment on why locking can be different there
      is removed (if it doesn't make anything simpler, why even bother?).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: stable@vger.kernel.org
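The ordering rule from the ->fork() commit (make the new entity visible to notifications first, then bring it in line with the current state) can be illustrated with a toy freezer. This is a single-threaded sketch with hypothetical `struct group` and `struct task` types; real code would also need locking around both steps, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task {
	bool frozen;
	struct task *next;
};

struct group {
	bool frozen;		/* current group-wide state */
	struct task *members;	/* tasks that receive state changes */
};

/* An "event": change the group state and notify every listed member. */
static void group_set_frozen(struct group *g, bool frozen)
{
	g->frozen = frozen;
	for (struct task *t = g->members; t; t = t->next)
		t->frozen = frozen;
}

static void group_add_task(struct group *g, struct task *t)
{
	/* Step 1: join the list so later events reach this task. */
	t->next = g->members;
	g->members = t;

	/* Step 2: catch up with whatever state already exists. */
	t->frozen = g->frozen;
}
```

With the steps reversed, a freeze request arriving between them would update the group state after the task copied it but before the task was on the list, so the task would stay unfrozen. That is the escape the commit message describes.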
  16. 15 Sep, 2012 5 commits
    • cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them · 8c7f6edb
      Tejun Heo committed
      Currently, cgroup hierarchy support is a mess.  cpu related subsystems
      behave correctly - configuration, accounting and control on a parent
      properly cover its children.  blkio and freezer completely ignore
      hierarchy and treat all cgroups as if they're directly under the root
      cgroup.  Others show yet different behaviors.
      
      These differing interpretations of cgroup hierarchy make using cgroup
      confusing and make it impossible to co-mount controllers into the same
      hierarchy and obtain sane behavior.
      
      Eventually, we want full hierarchy support from all subsystems and
      probably a unified hierarchy.  Users using separate hierarchies
      expecting completely different behaviors depending on the mounted
      subsystem is detrimental to making any progress on this front.
      
      This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
      for controllers which are lacking in hierarchy support.  The goal of
      this patch is two-fold.
      
      * Move users away from using hierarchy on currently non-hierarchical
        subsystems, so that implementing proper hierarchy support on those
        doesn't surprise them.
      
      * Keep track of which controllers are broken how and nudge the
        subsystems to implement proper hierarchy support.
      
      For now, start with a single warning message.  We can whine louder
      later on.
      
      v2: Fixed a typo spotted by Michal. Warning message updated.
      
      v3: Updated memcg part so that it doesn't generate warning in the
          cases where .use_hierarchy=false doesn't make the behavior
          different from root.use_hierarchy=true.  Fixed a typo spotted by
          Glauber.
      
      v4: Check ->broken_hierarchy after cgroup creation is complete so that
          ->create() can affect the result per Michal.  Dropped unnecessary
          memcg root handling per Michal.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
    • cgroup: Define CGROUP_SUBSYS_COUNT according the configuration · a6f00298
      Daniel Wagner committed
      Since we know exactly how many subsystems exist at compile time we are
      able to define CGROUP_SUBSYS_COUNT correctly. CGROUP_SUBSYS_COUNT will
      be at max 12 (all controllers enabled). Depending on the architecture
      we save either 32 - 12 pointers (80 bytes) or 64 - 12 pointers (416
      bytes) per cgroup.
      
      With this change we can also remove the temporary placeholder to avoid
      compilation errors.
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
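The savings quoted in the CGROUP_SUBSYS_COUNT commit follow from the number of pointer slots dropped times the pointer size. A tiny sketch of that arithmetic, where `bytes_saved()` is a hypothetical helper and the 4-byte and 8-byte pointer sizes are the assumption behind the two figures:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes saved per cgroup when a pointer array shrinks from old_slots
 * entries to new_slots entries. */
static size_t bytes_saved(size_t old_slots, size_t new_slots, size_t ptr_size)
{
	return (old_slots - new_slots) * ptr_size;
}
```

With 32 slots and 4-byte pointers, `bytes_saved(32, 12, 4)` gives 20 pointers, or 80 bytes; with 64 slots and 8-byte pointers, `bytes_saved(64, 12, 8)` gives 52 pointers, or 416 bytes, matching the numbers in the commit message.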
    • cgroup: Assign subsystem IDs during compile time · 8a8e04df
      Daniel Wagner committed
      WARNING: With this change it is impossible to load externally built
      controllers anymore.
      
      In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is
      set, corresponding subsys_id should also be a constant. Up to now,
      net_prio_subsys_id and net_cls_subsys_id would be of the type int and
      the value would be assigned during runtime.
      
      By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
      to IS_ENABLED, all *_subsys_id will have constant value. That means we
      need to remove all the code which assumes a value can be assigned to
      net_prio_subsys_id and net_cls_subsys_id.
      
      A close look is necessary on the RCU part which was introduced by the
      following patch:
      
        commit f8451725
        Author:	Herbert Xu <herbert@gondor.apana.org.au>  Mon May 24 09:12:34 2010
        Committer:	David S. Miller <davem@davemloft.net>  Mon May 24 09:12:34 2010
      
        cls_cgroup: Store classid in struct sock
      
        This code was added to init_cgroup_cls()
      
      	  /* We can't use rcu_assign_pointer because this is an int. */
      	  smp_wmb();
      	  net_cls_subsys_id = net_cls_subsys.subsys_id;
      
        respectively to exit_cgroup_cls()
      
      	  net_cls_subsys_id = -1;
      	  synchronize_rcu();
      
        and in module version of task_cls_classid()
      
      	  rcu_read_lock();
      	  id = rcu_dereference(net_cls_subsys_id);
      	  if (id >= 0)
      		  classid = container_of(task_subsys_state(p, id),
      					 struct cgroup_cls_state, css)->classid;
      	  rcu_read_unlock();
      
      There is no explicit explanation of why the RCU part is needed. (The
      rcu_dereference was fixed by exchanging it to rcu_dereference_index_check()
      in a later commit, but that is a minor detail.)
      
      So here is my pondering why it was introduced and why it is safe to
      remove it now. Note that this code was copied over to net_prio, so the
      reasoning holds for that subsystem too.
      
      The idea behind the RCU use for net_cls_subsys_id is to make sure we
      get a valid pointer back from task_subsys_state(). task_subsys_state()
      is just blindly accessing the subsys array and returning the
      pointer. Obviously, passing in -1 as id into task_subsys_state()
      returns an invalid value (out of lower bound).
      
      So this code makes sure that only after module is loaded and the
      subsystem registered, the id is assigned.
      
      Before unregistering the module all old readers must have left the
      critical section. This is done by assigning -1 to the id and issuing a
      synchronize_rcu(). Any new readers won't call task_subsys_state()
      anymore and therefore it is safe to unregister the subsystem.
      
      The new code relies on the same trick, but it looks at the subsys
      pointer returned by task_subsys_state() (remember the id is constant
      and therefore we always have a valid index into the subsys
      array).
      
      No precautions need to be taken during module loading.
      Eventually, all CPUs will get a valid pointer back from
      task_subsys_state() because rebind_subsystem() which is called after
      the module init() function will assign subsys[net_cls_subsys_id] the
      newly loaded module subsystem pointer.
      
      When the subsystem is about to be removed, rebind_subsystem() will
      be called before the module exit() function. In this case,
      rebind_subsystem() will assign subsys[net_cls_subsys_id] a NULL pointer
      and then it calls synchronize_rcu(). By then all old readers have left
      the critical section. Any new reader won't access the subsystem
      anymore.  At this point we are safe to unregister the subsystem. No
      synchronize_rcu() call is needed.
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
    • cgroup: Wrap subsystem selection macro · 5fc0b025
      Daniel Wagner committed
      Before we are able to define all subsystem ids at compile time we need
      finer-grained control over what gets defined when we include
      cgroup_subsys.h. For example, we define the enums for the subsystems or
      declare the struct cgroup_subsys instances (builtin subsystems) by
      including cgroup_subsys.h and defining SUBSYS accordingly.
      
      Currently, the decision if a subsys is used is defined inside the
      header by testing if CONFIG_*=y is true. By moving this test outside
      of cgroup_subsys.h we are able to control it on the include level.
      
      This is done by introducing IS_SUBSYS_ENABLED which then is defined
      according to the task, e.g. is CONFIG_*=y or CONFIG_*=m.
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
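The idea of including one list and redefining SUBSYS for each purpose is the classic X-macro pattern. The sketch below squeezes it into a single file, so a `SUBSYS_LIST` macro stands in for cgroup_subsys.h; the three controller names and the name table are illustrative only, and the IS_SUBSYS_ENABLED test that the commit introduces is not modelled here.

```c
#include <assert.h>
#include <string.h>

/* Stand-in for cgroup_subsys.h: one SUBSYS() entry per controller. */
#define SUBSYS_LIST \
	SUBSYS(cpuset)  \
	SUBSYS(freezer) \
	SUBSYS(net_cls)

/* First expansion: every entry becomes a compile-time constant ID. */
#define SUBSYS(name) name##_subsys_id,
enum subsys_id {
	SUBSYS_LIST
	SUBSYS_COUNT	/* number of entries, known at compile time */
};
#undef SUBSYS

/* Second expansion: the same list fills a table indexed by those IDs. */
#define SUBSYS(name) [name##_subsys_id] = #name,
static const char *const subsys_names[SUBSYS_COUNT] = {
	SUBSYS_LIST
};
#undef SUBSYS
```

Keeping the list in one place means the enum and every table derived from it cannot drift apart, and the count falls out of the enum at no cost.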
    • cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT · be45c900
      Daniel Wagner committed
      CGROUP_BUILTIN_SUBSYS_COUNT is used as start index or stop index when
      looping over the subsys array looking either at the builtin or the
      module subsystems. Since all the builtin subsystems have an id which
      is lower than CGROUP_BUILTIN_SUBSYS_COUNT we know that any module will
      have an id larger than CGROUP_BUILTIN_SUBSYS_COUNT. In short the ids
      are sorted.
      
      We are about to change id assignment to happen only at compile time
      later in this series. That means we can't rely on the above trick
      since all ids will always be defined at compile time. Furthermore,
      ordering the builtin subsystems and the module subsystems is not
      really necessary.
      
      So we need a different way to know which subsystem is a builtin or a
      module one. We can use the subsys[]->module pointer for this. Any
      place where we need to know if a subsys is a module we just check for
      the pointer. If it is NULL then the subsystem is a builtin one.
      
      With this we are able to drop the CGROUP_BUILTIN_SUBSYS_COUNT
      enum. Though we need to introduce a temporary placeholder so that we
      don't get a compilation error when only CONFIG_CGROUP is selected and
      not a single controller. An empty enum definition is not valid. Later in
      this series we are able to remove the placeholder again.
      
      And with this change we get a fix for this:
      
      kernel/cgroup.c: In function ‘cgroup_load_subsys’:
      kernel/cgroup.c:4326:38: warning: array subscript is below array bounds [-Warray-bounds]
      
      when CONFIG_CGROUP=y and no built in controller was enabled.
      Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
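Telling builtin and modular subsystems apart by the module pointer, rather than by where an ID falls relative to a boundary constant, could be sketched as follows. The `struct module_stub` and `struct subsys` types and both helpers are hypothetical stand-ins; only the "NULL module pointer means builtin" rule comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's struct module. */
struct module_stub {
	const char *name;
};

struct subsys {
	const char *name;
	struct module_stub *module;	/* NULL for a builtin subsystem */
};

static bool subsys_is_module(const struct subsys *ss)
{
	return ss->module != NULL;
}

/* Count modular subsystems in an array whose unused slots are NULL.
 * No assumption is made about how the IDs are ordered. */
static size_t count_module_subsys(struct subsys *const subsys[], size_t n)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (subsys[i] && subsys_is_module(subsys[i]))
			count++;
	return count;
}
```

The loop visits every slot and filters per entry, so it keeps working once IDs are fixed at compile time and builtin and modular entries are interleaved.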
  17. 25 Aug, 2012 1 commit
    • cgroup: add xattr support · 03b1cde6
      Aristeu Rozanski committed
      This is one of the items in the plumber's wish list.
      
      For use cases:
      
      >> What would the use case be for this?
      >
      > Attaching meta information to services, in an easily discoverable
      > way. For example, in systemd we create one cgroup for each service, and
      > could then store data like the main pid of the specific service as an
      > xattr on the cgroup itself. That way we'd have almost all service state
      > in the cgroupfs, which would make it possible to terminate systemd and
      > later restart it without losing any state information. But there's more:
      > for example, some very peculiar services cannot be terminated on
      > shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
      > services in question could just mark that on their cgroup, by setting an
      > xattr. On the more desktopy side of things there are other
      > possibilities: for example there are plans defining what an application
      > is along the lines of a cgroup (i.e. an app being a collection of
      > processes). With xattrs one could then attach an icon or human readable
      > program name on the cgroup.
      >
      > The key idea is that this would allow attaching runtime meta information
      > to cgroups and everything they model (services, apps, vms), that doesn't
      > need any complex userspace infrastructure, has good access control
      > (i.e. because the file system enforces that anyway, and there's the
      > "trusted." xattr namespace), notifications (inotify), and can easily be
      > shared among applications.
      >
      > Lennart
      
      v7:
      - no changes
      v6:
      - remove user xattr namespace, only allow trusted and security
      v5:
      - check for capabilities before setting/removing xattrs
      v4:
      - no changes
      v3:
      - instead of config option, use mount option to enable xattr support
      Original-patch-by: Li Zefan <lizefan@huawei.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Aristeu Rozanski <aris@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>