1. 17 10月, 2012 1 次提交
    • D
      cgroup: notify_on_release may not be triggered in some cases · 1f5320d5
      Daisuke Nishimura 提交于
      notify_on_release must be triggered when the last process in a cgroup is
      move to another. But if the first(and only) process in a cgroup is moved to
      another, notify_on_release is not triggered.
      
      	# mkdir /cgroup/cpu/SRC
      	# mkdir /cgroup/cpu/DST
      	#
      	# echo 1 >/cgroup/cpu/SRC/notify_on_release
      	# echo 1 >/cgroup/cpu/DST/notify_on_release
      	#
      	# sleep 300 &
      	[1] 8629
      	#
      	# echo 8629 >/cgroup/cpu/SRC/tasks
      	# echo 8629 >/cgroup/cpu/DST/tasks
      	-> notify_on_release for /SRC must be triggered at this point,
      	   but it isn't.
      
      This is because put_css_set() is called before setting CGRP_RELEASABLE
      in cgroup_task_migrate(), and is a regression introduce by the
      commit:74a1166d(cgroups: make procs file writable), which was merged
      into v3.0.
      
      Cc: Ben Blum <bblum@andrew.cmu.edu>
      Cc: <stable@vger.kernel.org> # v3.0.x and later
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      1f5320d5
  2. 15 9月, 2012 5 次提交
    • T
      cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them · 8c7f6edb
      Tejun Heo 提交于
      Currently, cgroup hierarchy support is a mess.  cpu related subsystems
      behave correctly - configuration, accounting and control on a parent
      properly cover its children.  blkio and freezer completely ignore
      hierarchy and treat all cgroups as if they're directly under the root
      cgroup.  Others show yet different behaviors.
      
      These differing interpretations of cgroup hierarchy make using cgroup
      confusing and it impossible to co-mount controllers into the same
      hierarchy and obtain sane behavior.
      
      Eventually, we want full hierarchy support from all subsystems and
      probably a unified hierarchy.  Users using separate hierarchies
      expecting completely different behaviors depending on the mounted
      subsystem is deterimental to making any progress on this front.
      
      This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
      for controllers which are lacking in hierarchy support.  The goal of
      this patch is two-fold.
      
      * Move users away from using hierarchy on currently non-hierarchical
        subsystems, so that implementing proper hierarchy support on those
        doesn't surprise them.
      
      * Keep track of which controllers are broken how and nudge the
        subsystems to implement proper hierarchy support.
      
      For now, start with a single warning message.  We can whine louder
      later on.
      
      v2: Fixed a typo spotted by Michal. Warning message updated.
      
      v3: Updated memcg part so that it doesn't generate warning in the
          cases where .use_hierarchy=false doesn't make the behavior
          different from root.use_hierarchy=true.  Fixed a typo spotted by
          Glauber.
      
      v4: Check ->broken_hierarchy after cgroup creation is complete so that
          ->create() can affect the result per Michal.  Dropped unnecessary
          memcg root handling per Michal.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Thomas Graf <tgraf@suug.ch>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      8c7f6edb
    • D
      cgroup: Assign subsystem IDs during compile time · 8a8e04df
      Daniel Wagner 提交于
      WARNING: With this change it is impossible to load external built
      controllers anymore.
      
      In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is
      set, corresponding subsys_id should also be a constant. Up to now,
      net_prio_subsys_id and net_cls_subsys_id would be of the type int and
      the value would be assigned during runtime.
      
      By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
      to IS_ENABLED, all *_subsys_id will have constant value. That means we
      need to remove all the code which assumes a value can be assigned to
      net_prio_subsys_id and net_cls_subsys_id.
      
      A close look is necessary on the RCU part which was introduces by
      following patch:
      
        commit f8451725
        Author:	Herbert Xu <herbert@gondor.apana.org.au>  Mon May 24 09:12:34 2010
        Committer:	David S. Miller <davem@davemloft.net>  Mon May 24 09:12:34 2010
      
        cls_cgroup: Store classid in struct sock
      
        Tis code was added to init_cgroup_cls()
      
      	  /* We can't use rcu_assign_pointer because this is an int. */
      	  smp_wmb();
      	  net_cls_subsys_id = net_cls_subsys.subsys_id;
      
        respectively to exit_cgroup_cls()
      
      	  net_cls_subsys_id = -1;
      	  synchronize_rcu();
      
        and in module version of task_cls_classid()
      
      	  rcu_read_lock();
      	  id = rcu_dereference(net_cls_subsys_id);
      	  if (id >= 0)
      		  classid = container_of(task_subsys_state(p, id),
      					 struct cgroup_cls_state, css)->classid;
      	  rcu_read_unlock();
      
      Without an explicit explaination why the RCU part is needed. (The
      rcu_deference was fixed by exchanging it to rcu_derefence_index_check()
      in a later commit, but that is a minor detail.)
      
      So here is my pondering why it was introduced and why it safe to
      remove it now. Note that this code was copied over to net_prio the
      reasoning holds for that subsystem too.
      
      The idea behind the RCU use for net_cls_subsys_id is to make sure we
      get a valid pointer back from task_subsys_state(). task_subsys_state()
      is just blindly accessing the subsys array and returning the
      pointer. Obviously, passing in -1 as id into task_subsys_state()
      returns an invalid value (out of lower bound).
      
      So this code makes sure that only after module is loaded and the
      subsystem registered, the id is assigned.
      
      Before unregistering the module all old readers must have left the
      critical section. This is done by assigning -1 to the id and issuing a
      synchronized_rcu(). Any new readers wont call task_subsys_state()
      anymore and therefore it is safe to unregister the subsystem.
      
      The new code relies on the same trick, but it looks at the subsys
      pointer return by task_subsys_state() (remember the id is constant
      and therefore we allways have a valid index into the subsys
      array).
      
      No precautions need to be taken during module loading
      module. Eventually, all CPUs will get a valid pointer back from
      task_subsys_state() because rebind_subsystem() which is called after
      the module init() function will assigned subsys[net_cls_subsys_id] the
      newly loaded module subsystem pointer.
      
      When the subsystem is about to be removed, rebind_subsystem() will
      called before the module exit() function. In this case,
      rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer
      and then it calls synchronize_rcu(). All old readers have left by then
      the critical section. Any new reader wont access the subsystem
      anymore.  At this point we are safe to unregister the subsystem. No
      synchronize_rcu() call is needed.
      Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      8a8e04df
    • D
      cgroup: Do not depend on a given order when populating the subsys array · 80f4c877
      Daniel Wagner 提交于
      The *_subsys_id will be used as index to access the subsys. Therefore
      we need to care we populate the subsystem at the correct position by
      using designated initialization.
      
      With this change we are able to interleave builtin and modules in the subsys
      array.
      Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      80f4c877
    • D
      cgroup: Wrap subsystem selection macro · 5fc0b025
      Daniel Wagner 提交于
      Before we are able to define all subsystem ids at compile time we need
      a more fine grained control what gets defined when we include
      cgroup_subsys.h. For example we define the enums for the subsystems or
      to declare for struct cgroup_subsys (builtin subsystem) by including
      cgroup_subsys.h and defining SUBSYS accordingly.
      
      Currently, the decision if a subsys is used is defined inside the
      header by testing if CONFIG_*=y is true. By moving this test outside
      of cgroup_subsys.h we are able to control it on the include level.
      
      This is done by introducing IS_SUBSYS_ENABLED which then is defined
      according the task, e.g. is CONFIG_*=y or CONFIG_*=m.
      Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      5fc0b025
    • D
      cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT · be45c900
      Daniel Wagner 提交于
      CGROUP_BUILTIN_SUBSYS_COUNT is used as start index or stop index when
      looping over the subsys array looking either at the builtin or the
      module subsystems. Since all the builtin subsystems have an id which
      is lower then CGROUP_BUILTIN_SUBSYS_COUNT we know that any module will
      have an id larger than CGROUP_BUILTIN_SUBSYS_COUNT. In short the ids
      are sorted.
      
      We are about to change id assignment to happen only at compile time
      later in this series. That means we can't rely on the above trick
      since all ids will always be defined at compile time. Furthermore,
      ordering the builtin subsystems and the module subsystems is not
      really necessary.
      
      So we need a different way to know which subsystem is a builtin or a
      module one. We can use the subsys[]->module pointer for this. Any
      place where we need to know if a subsys is module we just check for
      the pointer. If it is NULL then the subsystem is a builtin one.
      
      With this we are able to drop the CGROUP_BUILTIN_SUBSYS_COUNT
      enum. Though we need to introduce a temporary placeholder so that we
      don't get a compilation error when only CONFIG_CGROUP is selected and
      no single controller. An empty enum definition is not valid. Later in
      this series we are able to remove the placeholder again.
      
      And with this change we get a fix for this:
      
      kernel/cgroup.c: In function ‘cgroup_load_subsys’:
      kernel/cgroup.c:4326:38: warning: array subscript is below array bounds [-Warray-bounds]
      
      when CONFIG_CGROUP=y and no built in controller was enabled.
      Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Cc: Gao feng <gaofeng@cn.fujitsu.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: netdev@vger.kernel.org
      Cc: cgroups@vger.kernel.org
      be45c900
  3. 25 8月, 2012 3 次提交
    • A
      cgroup: rename subsys_bits to subsys_mask · a1a71b45
      Aristeu Rozanski 提交于
      In a previous discussion, Tejun Heo suggested to rename references to
      subsys_bits (added_bits, removed_bits, etc) by something more meaningful.
      
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Signed-off-by: NAristeu Rozanski <aris@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      a1a71b45
    • A
      cgroup: add xattr support · 03b1cde6
      Aristeu Rozanski 提交于
      This is one of the items in the plumber's wish list.
      
      For use cases:
      
      >> What would the use case be for this?
      >
      > Attaching meta information to services, in an easily discoverable
      > way. For example, in systemd we create one cgroup for each service, and
      > could then store data like the main pid of the specific service as an
      > xattr on the cgroup itself. That way we'd have almost all service state
      > in the cgroupfs, which would make it possible to terminate systemd and
      > later restart it without losing any state information. But there's more:
      > for example, some very peculiar services cannot be terminated on
      > shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
      > services in question could just mark that on their cgroup, by setting an
      > xattr. On the more desktopy side of things there are other
      > possibilities: for example there are plans defining what an application
      > is along the lines of a cgroup (i.e. an app being a collection of
      > processes). With xattrs one could then attach an icon or human readable
      > program name on the cgroup.
      >
      > The key idea is that this would allow attaching runtime meta information
      > to cgroups and everything they model (services, apps, vms), that doesn't
      > need any complex userspace infrastructure, has good access control
      > (i.e. because the file system enforces that anyway, and there's the
      > "trusted." xattr namespace), notifications (inotify), and can easily be
      > shared among applications.
      >
      > Lennart
      
      v7:
      - no changes
      v6:
      - remove user xattr namespace, only allow trusted and security
      v5:
      - check for capabilities before setting/removing xattrs
      v4:
      - no changes
      v3:
      - instead of config option, use mount option to enable xattr support
      Original-patch-by: NLi Zefan <lizefan@huawei.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NAristeu Rozanski <aris@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      03b1cde6
    • A
      cgroup: revise how we re-populate root directory · 13af07df
      Aristeu Rozanski 提交于
      When remounting cgroupfs with some subsystems added to it and some
      removed, cgroup will remove all the files in root directory and then
      re-popluate it.
      
      What I'm doing here is, only remove files which belong to subsystems that
      are to be unbinded, and only create files for newly-added subsystems.
      The purpose is to have all other files untouched.
      
      This is a preparation for cgroup xattr support.
      
      v7:
      - checkpatch warnings fixed
      v6:
      - no changes
      v5:
      - no changes
      v4:
      - refactored cgroup_clear_directory() to not use cgroup_rm_file()
      - instead of going thru the list of files, get the file list using the
        subsystems
      - use 'subsys_mask' instead of {added,removed}_bits and made
        cgroup_populate_dir() to match the parameters with cgroup_clear_directory()
      v3:
      - refresh patches after recent refactoring
      Original-patch-by: NLi Zefan <lizefan@huawei.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Lennart Poettering <lpoetter@redhat.com>
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NAristeu Rozanski <aris@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      13af07df
  4. 14 7月, 2012 2 次提交
  5. 10 7月, 2012 1 次提交
  6. 08 7月, 2012 2 次提交
    • T
      cgroup: fix cgroup hierarchy umount race · 5db9a4d9
      Tejun Heo 提交于
      48ddbe19 "cgroup: make css->refcnt clearing on cgroup removal
      optional" allowed a css to linger after the associated cgroup is
      removed.  As a css holds a reference on the cgroup's dentry, it means
      that cgroup dentries may linger for a while.
      
      Destroying a superblock which has dentries with positive refcnts is a
      critical bug and triggers BUG() in vfs code.  As each cgroup dentry
      holds an s_active reference, any lingering cgroup has both its dentry
      and the superblock pinned and thus preventing premature release of
      superblock.
      
      Unfortunately, after 48ddbe19, there's a small window while
      releasing a cgroup which is directly under the root of the hierarchy.
      When a cgroup directory is released, vfs layer first deletes the
      corresponding dentry and then invokes dput() on the parent, which may
      recurse further, so when a cgroup directly below root cgroup is
      released, the cgroup is first destroyed - which releases the s_active
      it was holding - and then the dentry for the root cgroup is dput().
      
      This creates a window where the root dentry's refcnt isn't zero but
      superblock's s_active is.  If umount happens before or during this
      window, vfs will see the root dentry with non-zero refcnt and trigger
      BUG().
      
      Before 48ddbe19, this problem didn't exist because the last dentry
      reference was guaranteed to be put synchronously from rmdir(2)
      invocation which holds s_active around the whole process.
      
      Fix it by holding an extra superblock->s_active reference across
      dput() from css release, which is the dput() path added by 48ddbe19
      and the only one which doesn't hold an extra s_active ref across the
      final cgroup dput().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <4FEEA5CB.8070809@huawei.com>
      Reported-by: Nshyju pv <shyju.pv@huawei.com>
      Tested-by: Nshyju pv <shyju.pv@huawei.com>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      5db9a4d9
    • T
      Revert "cgroup: superblock can't be released with active dentries" · 7db5b3ca
      Tejun Heo 提交于
      This reverts commit fa980ca8.  The
      commit was an attempt to fix a race condition where a cgroup hierarchy
      may be unmounted with positive dentry reference on root cgroup.  While
      the commit made the race condition slightly more difficult to trigger,
      the race was still there and could be reliably triggered using a
      different test case.
      
      Revert the incorrect fix.  The next commit will describe the race and
      fix it correctly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      LKML-Reference: <4FEEA5CB.8070809@huawei.com>
      Reported-by: Nshyju pv <shyju.pv@huawei.com>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      7db5b3ca
  7. 19 6月, 2012 1 次提交
  8. 07 6月, 2012 2 次提交
  9. 30 5月, 2012 1 次提交
  10. 28 5月, 2012 1 次提交
    • T
      cgroup: superblock can't be released with active dentries · fa980ca8
      Tejun Heo 提交于
      48ddbe19 "cgroup: make css->refcnt clearing on cgroup removal
      optional" allowed a css to linger after the associated cgroup is
      removed.  As a css holds a reference on the cgroup's dentry, it means
      that cgroup dentries may linger for a while.
      
      cgroup_create() does grab an active reference on the superblock to
      prevent it from going away while there are !root cgroups; however, the
      reference is put from cgroup_diput() which is invoked on cgroup
      removal, so cgroup dentries which are removed but persisting due to
      lingering csses already have released their superblock active refs
      allowing superblock to be killed while those dentries are around.
      
      Given the right condition, this makes cgroup_kill_sb() call
      kill_litter_super() with dentries with non-zero d_count leading to
      BUG() in shrink_dcache_for_umount_subtree().
      
      Fix it by adding cgroup_dops->d_release() operation and moving
      deactivate_super() to it.  cgroup_diput() now marks dentry->d_fsdata
      with itself if superblock should be deactivated and cgroup_d_release()
      deactivates the superblock on dentry release.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NSasha Levin <levinsasha928@gmail.com>
      Tested-by: NSasha Levin <levinsasha928@gmail.com>
      LKML-Reference: <CA+1xoqe5hMuxzCRhMy7J0XchDk2ZnuxOHJKikROk1-ReAzcT6g@mail.gmail.com>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      fa980ca8
  11. 16 5月, 2012 1 次提交
  12. 24 4月, 2012 1 次提交
  13. 12 4月, 2012 1 次提交
  14. 02 4月, 2012 12 次提交
    • T
      cgroup: make css->refcnt clearing on cgroup removal optional · 48ddbe19
      Tejun Heo 提交于
      Currently, cgroup removal tries to drain all css references.  If there
      are active css references, the removal logic waits and retries
      ->pre_detroy() until either all refs drop to zero or removal is
      cancelled.
      
      This semantics is unusual and adds non-trivial complexity to cgroup
      core and IMHO is fundamentally misguided in that it couples internal
      implementation details (references to internal data structure) with
      externally visible operation (rmdir).  To userland, this is a behavior
      peculiarity which is unnecessary and difficult to expect (css refs is
      otherwise invisible from userland), and, to policy implementations,
      this is an unnecessary restriction (e.g. blkcg wants to hold css refs
      for caching purposes but can't as that becomes visible as rmdir hang).
      
      Unfortunately, memcg currently depends on ->pre_destroy() retrials and
      cgroup removal vetoing and can't be immmediately switched to the new
      behavior.  This patch introduces the new behavior of not waiting for
      css refs to drain and maintains the old behavior for subsystems which
      have __DEPRECATED_clear_css_refs set.
      
      Once, memcg is updated, we can drop the code paths for the old
      behavior as proposed in the following patch.  Note that the following
      patch is incorrect in that dput work item is in cgroup and may lose
      some of dputs when multiples css's are released back-to-back, and
      __css_put() triggers check_for_release() when refcnt reaches 0 instead
      of 1; however, it shows what part can be removed.
      
        http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251
      
      Note that, in not-too-distant future, cgroup core will start emitting
      warning messages for subsys which require the old behavior, so please
      get moving.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      48ddbe19
    • T
      cgroup: use negative bias on css->refcnt to block css_tryget() · 28b4c27b
      Tejun Heo 提交于
      When a cgroup is about to be removed, cgroup_clear_css_refs() is
      called to check and ensure that there are no active css references.
      
      This is currently achieved by dropping the refcnt to zero iff it has
      only the base ref.  If all css refs could be dropped to zero, ref
      clearing is successful and CSS_REMOVED is set on all css.  If not, the
      base ref is restored.  While css ref is zero w/o CSS_REMOVED set, any
      css_tryget() attempt on it busy loops so that they are atomic
      w.r.t. the whole css ref clearing.
      
      This does work but dropping and re-instating the base ref is somewhat
      hairy and makes it difficult to add more logic to the put path as
      there are two of them - the regular css_put() and the reversible base
      ref clearing.
      
      This patch updates css ref clearing such that blocking new
      css_tryget() and putting the base ref are separate operations.
      CSS_DEACT_BIAS, defined as INT_MIN, is added to css->refcnt and
      css_tryget() busy loops while refcnt is negative.  After all css refs
      are deactivated, if they were all one, ref clearing succeeded and
      CSS_REMOVED is set and the base ref is put using the regular
      css_put(); otherwise, CSS_DEACT_BIAS is subtracted from the refcnts
      and the original postive values are restored.
      
      css_refcnt() accessor which always returns the unbiased positive
      reference counts is added and used to simplify refcnt usages.  While
      at it, relocate and reformat comments in cgroup_has_css_refs().
      
      This separates css->refcnt deactivation and putting the base ref,
      which enables the next patch to make ref clearing optional.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      28b4c27b
    • T
      cgroup: implement cgroup_rm_cftypes() · 79578621
      Tejun Heo 提交于
      Implement cgroup_rm_cftypes() which removes an array of cftypes from a
      subsystem.  It can be called whether the target subsys is attached or
      not.  cgroup core will remove the specified file from all existing
      cgroups.
      
      This will be used to improve sub-subsys modularity and will be helpful
      for unified hierarchy.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      79578621
    • T
      cgroup: introduce struct cfent · 05ef1d7c
      Tejun Heo 提交于
      This patch adds cfent (cgroup file entry) which is the association
      between a cgroup and a file.  This is in-cgroup representation of
      files under a cgroup directory.  This simplifies walking walking
      cgroup files and thus cgroup_clear_directory(), which is now
      implemented in two parts - cgroup_rm_file() and a loop around it.
      
      cgroup_rm_file() will be used to implement cftype removal and cfent is
      scheduled to serve cgroup specific per-file data (e.g. for sysfs-like
      "sever" semantics).
      
      v2: - cfe was freed from cgroup_rm_file() which led to use-after-free
            if the file had openers at the time of removal.  Moved to
            cgroup_diput().
      
          - cgroup_clear_directory() triggered WARN_ON_ONCE() if d_subdirs
            wasn't empty after removing all files.  This triggered
            spuriously if some files were open during directory clearing.
            Removed.
      
      v3: - In cgroup_diput(), WARN_ONCE(!list_empty(&cfe->node)) could be
            spuriously triggered for root cgroups because they don't go
            through cgroup_clear_directory() on unmount.  Don't trigger WARN
            for root cgroups.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Glauber Costa <glommer@parallels.com>
      05ef1d7c
    • T
      cgroup: relocate __d_cgrp() and __d_cft() · f6ea9372
      Tejun Heo 提交于
      Move the two macros upwards as they'll be used earlier in the file.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      f6ea9372
    • T
      cgroup: remove cgroup_add_file[s]() · db0416b6
      Tejun Heo 提交于
      No controller is using cgroup_add_files[s]().  Unexport them, and
      convert cgroup_add_files() to handle NULL entry terminated array
      instead of taking count explicitly and continue creation on failure
      for internal use.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      db0416b6
    • T
      cgroup: convert all non-memcg controllers to the new cftype interface · 4baf6e33
      Tejun Heo 提交于
      Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
      net_cls and device controllers to use the new cftype based interface.
      Termination entry is added to cftype arrays and populate callbacks are
      replaced with cgroup_subsys->base_cftypes initializations.
      
      This is functionally identical transformation.  There shouldn't be any
      visible behavior change.
      
      memcg is rather special and will be converted separately.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <paul@paulmenage.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      4baf6e33
    • T
      cgroup: merge cft_release_agent cftype array into the base files array · 6e6ff25b
      Tejun Heo 提交于
      Now that cftype can express whether a file should only be on root,
      cft_release_agent can be merged into the base files cftypes array.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      6e6ff25b
    • T
      cgroup: implement cgroup_add_cftypes() and friends · 8e3f6541
      Tejun Heo 提交于
      Currently, cgroup directories are populated by subsys->populate()
      callback explicitly creating files on each cgroup creation.  This
      level of flexibility isn't needed or desirable.  It provides largely
      unused flexibility which call for abuses while severely limiting what
      the core layer can do through the lack of structure and conventions.
      
      Per each cgroup file type, the only distinction that cgroup users is
      making is whether a cgroup is root or not, which can easily be
      expressed with flags.
      
      This patch introduces cgroup_add_cftypes().  These deal with cftypes
      instead of individual files - controllers indicate that certain types
      of files exist for certain subsystem.  Newly added CFTYPE_*_ON_ROOT
      flags indicate whether a cftype should be excluded or created only on
      the root cgroup.
      
      cgroup_add_cftypes() can be called any time whether the target
      subsystem is currently attached or not.  cgroup core will create files
      on the existing cgroups as necessary.
      
      Also, cgroup_subsys->base_cftypes is added to ease registration of the
      base files for the subsystem.  If non-NULL on subsys init, the cftypes
      pointed to by ->base_cftypes are automatically registered on subsys
      init / load.
      
      Further patches will convert the existing users and remove the file
      based interface.  Note that this interface allows dynamic addition of
      files to an active controller.  This will be used for sub-controller
      modularity and unified hierarchy in the longer term.
      
      This patch implements the new mechanism but doesn't apply it to any
      user.
      
      v2: replaced DECLARE_CGROUP_CFTYPES[_COND]() with
          cgroup_subsys->base_cftypes, which works better for cgroup_subsys
          which is loaded as module.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      8e3f6541
    • T
      cgroup: build list of all cgroups under a given cgroupfs_root · b0ca5a84
      Tejun Heo 提交于
      Build a list of all cgroups anchored at cgroupfs_root->allcg_list and
      going through cgroup->allcg_node.  The list is protected by
      cgroup_mutex and will be used to improve cgroup file handling.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      b0ca5a84
    • T
      cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir() · ff4c8d50
      Tejun Heo 提交于
      cgroup_populate_dir() currently clears all files and then repopulate
      the directory; however, the clearing part is only useful when it's
      called from cgroup_remount().  Relocate the invocation to
      cgroup_remount().
      
      This is to prepare for further cgroup file handling updates.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      ff4c8d50
    • T
      cgroup: deprecate remount option changes · 8b5a5a9d
      Tejun Heo 提交于
      This patch marks the following features for deprecation.
      
      * Rebinding subsys by remount: Never reached useful state - only works
        on empty hierarchies.
      
      * release_agent update by remount: release_agent itself will be
        replaced with conventional fsnotify notification.
      
      v2: Lennart pointed out that "name=" is necessary for mounts w/o any
          controller attached.  Drop "name=" deprecation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Lennart Poettering <mzxreary@0pointer.de>
      8b5a5a9d
  15. 30 3月, 2012 1 次提交
    • T
      cgroup: cgroup_attach_task() could return -errno after success · 8f121918
      Tejun Heo 提交于
      61d1d219 "cgroup: remove extra calls to find_existing_css_set" made
      cgroup_task_migrate() return void.  An unfortunate side effect was
      that cgroup_attach_task() was depending on that function's return
      value to clear its @retval on the success path.  On cgroup mounts
      without any subsystem with ->can_attach() callback,
      cgroup_attach_task() ended up returning @retval without initializing
      it on success.
      
      For some reason, gcc failed to warn about it and it didn't cause
      cgroup_attach_task() to return non-zero value in many cases, probably
      due to difference in register allocation.  When the problem
      materializes, systemd fails to populate /systemd cgroup mount and
      fails to boot.
      
      Fix it by initializing @retval to zero on declaration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJiri Kosina <jkosina@suse.cz>
      LKML-Reference: <alpine.LNX.2.00.1203282354440.25526@pobox.suse.cz>
      Reviewed-by: NMandeep Singh Baines <msb@chromium.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      8f121918
  16. 22 3月, 2012 2 次提交
  17. 21 3月, 2012 1 次提交
  18. 22 2月, 2012 2 次提交
    • F
      cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list · 3ce3230a
      Frederic Weisbecker 提交于
      Walking through the tasklist in cgroup_enable_task_cg_list() inside
      an RCU read side critical section is not enough because:
      
      - RCU is not (yet) safe against while_each_thread()
      
      - If we use only RCU, a forking task that has passed cgroup_post_fork()
        without seeing use_task_css_set_links == 1 is not guaranteed to have
        its child immediately visible in the tasklist if we walk through it
        remotely with RCU. In this case it will be missing in its css_set's
        task list.
      
      Thus we need to traverse the list (unfortunately) under the
      tasklist_lock. It makes us safe against while_each_thread() and also
      make sure we see all forked task that have been added to the tasklist.
      
      As a secondary effect, reading and writing use_task_css_set_links are
      now well ordered against tasklist traversing and modification. The new
      layout is:
      
      CPU 0                                      CPU 1
      
      use_task_css_set_links = 1                write_lock(tasklist_lock)
      read_lock(tasklist_lock)                  add task to tasklist
      do_each_thread() {                        write_unlock(tasklist_lock)
      	add thread to css set links       if (use_task_css_set_links)
      } while_each_thread()                         add thread to css set links
      read_unlock(tasklist_lock)
      
      If CPU 0 traverse the list after the task has been added to the tasklist
      then it is correctly added to the css set links. OTOH if CPU 0 traverse
      the tasklist before the new task had the opportunity to be added to the
      tasklist because it was too early in the fork process, then CPU 1
      catches up and add the task to the css set links after it added the task
      to the tasklist. The right value of use_task_css_set_links is guaranteed
      to be visible from CPU 1 due to the LOCK/UNLOCK implicit barrier properties:
      the read_unlock on CPU 0 makes the write on use_task_css_set_links happening
      and the write_lock on CPU 1 make the read of use_task_css_set_links that comes
      afterward to return the correct value.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      3ce3230a
    • F
      cgroup: Remove wrong comment on cgroup_enable_task_cg_list() · 9a4b4304
      Frederic Weisbecker 提交于
      Remove the stale comment about RCU protection. Many callers
      (all of them?) of cgroup_enable_task_cg_list() don't seem
      to be in an RCU read side critical section. Besides, RCU is
      not helpful to protect against while_each_thread().
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Mandeep Singh Baines <msb@chromium.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      9a4b4304