提交 · 1f5320d5972aa50d3e8d2b227b636b370e608359 · openeuler / raspberrypi-kernel

17 10月, 2012 1 次提交

cgroup: notify_on_release may not be triggered in some cases · 1f5320d5

由 Daisuke Nishimura 提交于 10月 04, 2012

notify_on_release must be triggered when the last process in a cgroup is
move to another. But if the first(and only) process in a cgroup is moved to
another, notify_on_release is not triggered.

	# mkdir /cgroup/cpu/SRC
	# mkdir /cgroup/cpu/DST
	#
	# echo 1 >/cgroup/cpu/SRC/notify_on_release
	# echo 1 >/cgroup/cpu/DST/notify_on_release
	#
	# sleep 300 &
	[1] 8629
	#
	# echo 8629 >/cgroup/cpu/SRC/tasks
	# echo 8629 >/cgroup/cpu/DST/tasks
	-> notify_on_release for /SRC must be triggered at this point,
	   but it isn't.

This is because put_css_set() is called before setting CGRP_RELEASABLE
in cgroup_task_migrate(), and is a regression introduce by the
commit:74a1166d(cgroups: make procs file writable), which was merged
into v3.0.

Cc: Ben Blum <bblum@andrew.cmu.edu>
Cc: <stable@vger.kernel.org> # v3.0.x and later
Acked-by: NLi Zefan <lizefan@huawei.com>
Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: NTejun Heo <tj@kernel.org>

1f5320d5

15 9月, 2012 5 次提交

cgroup: mark subsystems with broken hierarchy support and whine if cgroups are nested for them · 8c7f6edb

由 Tejun Heo 提交于 9月 13, 2012

Currently, cgroup hierarchy support is a mess.  cpu related subsystems
behave correctly - configuration, accounting and control on a parent
properly cover its children.  blkio and freezer completely ignore
hierarchy and treat all cgroups as if they're directly under the root
cgroup.  Others show yet different behaviors.

These differing interpretations of cgroup hierarchy make using cgroup
confusing and it impossible to co-mount controllers into the same
hierarchy and obtain sane behavior.

Eventually, we want full hierarchy support from all subsystems and
probably a unified hierarchy.  Users using separate hierarchies
expecting completely different behaviors depending on the mounted
subsystem is deterimental to making any progress on this front.

This patch adds cgroup_subsys.broken_hierarchy and sets it to %true
for controllers which are lacking in hierarchy support.  The goal of
this patch is two-fold.

* Move users away from using hierarchy on currently non-hierarchical
  subsystems, so that implementing proper hierarchy support on those
  doesn't surprise them.

* Keep track of which controllers are broken how and nudge the
  subsystems to implement proper hierarchy support.

For now, start with a single warning message.  We can whine louder
later on.

v2: Fixed a typo spotted by Michal. Warning message updated.

v3: Updated memcg part so that it doesn't generate warning in the
    cases where .use_hierarchy=false doesn't make the behavior
    different from root.use_hierarchy=true.  Fixed a typo spotted by
    Glauber.

v4: Check ->broken_hierarchy after cgroup creation is complete so that
    ->create() can affect the result per Michal.  Dropped unnecessary
    memcg root handling per Michal.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Acked-by: NLi Zefan <lizefan@huawei.com>
Acked-by: NSerge E. Hallyn <serue@us.ibm.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Thomas Graf <tgraf@suug.ch>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

8c7f6edb

cgroup: Assign subsystem IDs during compile time · 8a8e04df

由 Daniel Wagner 提交于 9月 12, 2012

WARNING: With this change it is impossible to load external built
controllers anymore.

In case where CONFIG_NETPRIO_CGROUP=m and CONFIG_NET_CLS_CGROUP=m is
set, corresponding subsys_id should also be a constant. Up to now,
net_prio_subsys_id and net_cls_subsys_id would be of the type int and
the value would be assigned during runtime.

By switching the macro definition IS_SUBSYS_ENABLED from IS_BUILTIN
to IS_ENABLED, all *_subsys_id will have constant value. That means we
need to remove all the code which assumes a value can be assigned to
net_prio_subsys_id and net_cls_subsys_id.

A close look is necessary on the RCU part which was introduces by
following patch:

  commit f8451725
  Author:	Herbert Xu <herbert@gondor.apana.org.au>  Mon May 24 09:12:34 2010
  Committer:	David S. Miller <davem@davemloft.net>  Mon May 24 09:12:34 2010

  cls_cgroup: Store classid in struct sock

  Tis code was added to init_cgroup_cls()

	  /* We can't use rcu_assign_pointer because this is an int. */
	  smp_wmb();
	  net_cls_subsys_id = net_cls_subsys.subsys_id;

  respectively to exit_cgroup_cls()

	  net_cls_subsys_id = -1;
	  synchronize_rcu();

  and in module version of task_cls_classid()

	  rcu_read_lock();
	  id = rcu_dereference(net_cls_subsys_id);
	  if (id >= 0)
		  classid = container_of(task_subsys_state(p, id),
					 struct cgroup_cls_state, css)->classid;
	  rcu_read_unlock();

Without an explicit explaination why the RCU part is needed. (The
rcu_deference was fixed by exchanging it to rcu_derefence_index_check()
in a later commit, but that is a minor detail.)

So here is my pondering why it was introduced and why it safe to
remove it now. Note that this code was copied over to net_prio the
reasoning holds for that subsystem too.

The idea behind the RCU use for net_cls_subsys_id is to make sure we
get a valid pointer back from task_subsys_state(). task_subsys_state()
is just blindly accessing the subsys array and returning the
pointer. Obviously, passing in -1 as id into task_subsys_state()
returns an invalid value (out of lower bound).

So this code makes sure that only after module is loaded and the
subsystem registered, the id is assigned.

Before unregistering the module all old readers must have left the
critical section. This is done by assigning -1 to the id and issuing a
synchronized_rcu(). Any new readers wont call task_subsys_state()
anymore and therefore it is safe to unregister the subsystem.

The new code relies on the same trick, but it looks at the subsys
pointer return by task_subsys_state() (remember the id is constant
and therefore we allways have a valid index into the subsys
array).

No precautions need to be taken during module loading
module. Eventually, all CPUs will get a valid pointer back from
task_subsys_state() because rebind_subsystem() which is called after
the module init() function will assigned subsys[net_cls_subsys_id] the
newly loaded module subsystem pointer.

When the subsystem is about to be removed, rebind_subsystem() will
called before the module exit() function. In this case,
rebind_subsys() will assign subsys[net_cls_subsys_id] a NULL pointer
and then it calls synchronize_rcu(). All old readers have left by then
the critical section. Any new reader wont access the subsystem
anymore.  At this point we are safe to unregister the subsystem. No
synchronize_rcu() call is needed.
Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizefan@huawei.com>
Acked-by: NNeil Horman <nhorman@tuxdriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org

8a8e04df

cgroup: Do not depend on a given order when populating the subsys array · 80f4c877

由 Daniel Wagner 提交于 9月 12, 2012

The *_subsys_id will be used as index to access the subsys. Therefore
we need to care we populate the subsystem at the correct position by
using designated initialization.

With this change we are able to interleave builtin and modules in the subsys
array.
Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizefan@huawei.com>
Acked-by: NNeil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org

80f4c877

cgroup: Wrap subsystem selection macro · 5fc0b025

由 Daniel Wagner 提交于 9月 12, 2012

Before we are able to define all subsystem ids at compile time we need
a more fine grained control what gets defined when we include
cgroup_subsys.h. For example we define the enums for the subsystems or
to declare for struct cgroup_subsys (builtin subsystem) by including
cgroup_subsys.h and defining SUBSYS accordingly.

Currently, the decision if a subsys is used is defined inside the
header by testing if CONFIG_*=y is true. By moving this test outside
of cgroup_subsys.h we are able to control it on the include level.

This is done by introducing IS_SUBSYS_ENABLED which then is defined
according the task, e.g. is CONFIG_*=y or CONFIG_*=m.
Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizefan@huawei.com>
Acked-by: NNeil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org

5fc0b025

cgroup: Remove CGROUP_BUILTIN_SUBSYS_COUNT · be45c900

由 Daniel Wagner 提交于 9月 13, 2012

CGROUP_BUILTIN_SUBSYS_COUNT is used as start index or stop index when
looping over the subsys array looking either at the builtin or the
module subsystems. Since all the builtin subsystems have an id which
is lower then CGROUP_BUILTIN_SUBSYS_COUNT we know that any module will
have an id larger than CGROUP_BUILTIN_SUBSYS_COUNT. In short the ids
are sorted.

We are about to change id assignment to happen only at compile time
later in this series. That means we can't rely on the above trick
since all ids will always be defined at compile time. Furthermore,
ordering the builtin subsystems and the module subsystems is not
really necessary.

So we need a different way to know which subsystem is a builtin or a
module one. We can use the subsys[]->module pointer for this. Any
place where we need to know if a subsys is module we just check for
the pointer. If it is NULL then the subsystem is a builtin one.

With this we are able to drop the CGROUP_BUILTIN_SUBSYS_COUNT
enum. Though we need to introduce a temporary placeholder so that we
don't get a compilation error when only CONFIG_CGROUP is selected and
no single controller. An empty enum definition is not valid. Later in
this series we are able to remove the placeholder again.

And with this change we get a fix for this:

kernel/cgroup.c: In function ‘cgroup_load_subsys’:
kernel/cgroup.c:4326:38: warning: array subscript is below array bounds [-Warray-bounds]

when CONFIG_CGROUP=y and no built in controller was enabled.
Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizefan@huawei.com>
Acked-by: NNeil Horman <nhorman@tuxdriver.com>
Cc: Gao feng <gaofeng@cn.fujitsu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: John Fastabend <john.r.fastabend@intel.com>
Cc: netdev@vger.kernel.org
Cc: cgroups@vger.kernel.org

be45c900

25 8月, 2012 3 次提交

cgroup: rename subsys_bits to subsys_mask · a1a71b45

由 Aristeu Rozanski 提交于 8月 23, 2012

In a previous discussion, Tejun Heo suggested to rename references to
subsys_bits (added_bits, removed_bits, etc) by something more meaningful.

Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: NAristeu Rozanski <aris@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

a1a71b45

cgroup: add xattr support · 03b1cde6

由 Aristeu Rozanski 提交于 8月 23, 2012

This is one of the items in the plumber's wish list.

For use cases:

>> What would the use case be for this?
>
> Attaching meta information to services, in an easily discoverable
> way. For example, in systemd we create one cgroup for each service, and
> could then store data like the main pid of the specific service as an
> xattr on the cgroup itself. That way we'd have almost all service state
> in the cgroupfs, which would make it possible to terminate systemd and
> later restart it without losing any state information. But there's more:
> for example, some very peculiar services cannot be terminated on
> shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> services in question could just mark that on their cgroup, by setting an
> xattr. On the more desktopy side of things there are other
> possibilities: for example there are plans defining what an application
> is along the lines of a cgroup (i.e. an app being a collection of
> processes). With xattrs one could then attach an icon or human readable
> program name on the cgroup.
>
> The key idea is that this would allow attaching runtime meta information
> to cgroups and everything they model (services, apps, vms), that doesn't
> need any complex userspace infrastructure, has good access control
> (i.e. because the file system enforces that anyway, and there's the
> "trusted." xattr namespace), notifications (inotify), and can easily be
> shared among applications.
>
> Lennart

v7:
- no changes
v6:
- remove user xattr namespace, only allow trusted and security
v5:
- check for capabilities before setting/removing xattrs
v4:
- no changes
v3:
- instead of config option, use mount option to enable xattr support
Original-patch-by: NLi Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: NLi Zefan <lizefan@huawei.com>
Signed-off-by: NAristeu Rozanski <aris@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

03b1cde6

cgroup: revise how we re-populate root directory · 13af07df

由 Aristeu Rozanski 提交于 8月 23, 2012

When remounting cgroupfs with some subsystems added to it and some
removed, cgroup will remove all the files in root directory and then
re-popluate it.

What I'm doing here is, only remove files which belong to subsystems that
are to be unbinded, and only create files for newly-added subsystems.
The purpose is to have all other files untouched.

This is a preparation for cgroup xattr support.

v7:
- checkpatch warnings fixed
v6:
- no changes
v5:
- no changes
v4:
- refactored cgroup_clear_directory() to not use cgroup_rm_file()
- instead of going thru the list of files, get the file list using the
  subsystems
- use 'subsys_mask' instead of {added,removed}_bits and made
  cgroup_populate_dir() to match the parameters with cgroup_clear_directory()
v3:
- refresh patches after recent refactoring
Original-patch-by: NLi Zefan <lizefan@huawei.com>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: Lennart Poettering <lpoetter@redhat.com>
Signed-off-by: NLi Zefan <lizefan@huawei.com>
Signed-off-by: NAristeu Rozanski <aris@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

13af07df

14 7月, 2012 2 次提交

VFS: Pass mount flags to sget() · 9249e17f

由 David Howells 提交于 6月 25, 2012

Pass mount flags to sget() so that it can use them in initialising a new
superblock before the set function is called.  They could also be passed to the
compare function.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

9249e17f

stop passing nameidata to ->lookup() · 00cd8dd3

由 Al Viro 提交于 6月 10, 2012

Just the flags; only NFS cares even about that, but there are
legitimate uses for such argument.  And getting rid of that
completely would require splitting ->lookup() into a couple
of methods (at least), so let's leave that alone for now...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

00cd8dd3

10 7月, 2012 1 次提交

cgroup: cgroup_rm_files() was calling simple_unlink() with the wrong inode · ce27e317

由 Tejun Heo 提交于 7月 03, 2012

While refactoring cgroup file removal path, 05ef1d7c "cgroup:
introduce struct cfent" incorrectly changed the @dir argument of
simple_unlink() to the inode of the file being deleted instead of that
of the containing directory.

The effect of this bug is minor - ctime and mtime of the parent
weren't properly updated on file deletion.

Fix it by using @cgrp->dentry->d_inode instead.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NAl Viro <viro@ZenIV.linux.org.uk>
Acked-by: NLi Zefan <lizefan@huawei.com>
Cc: stable@vger.kernel.org

ce27e317

08 7月, 2012 2 次提交

cgroup: fix cgroup hierarchy umount race · 5db9a4d9

由 Tejun Heo 提交于 7月 07, 2012

48ddbe19 "cgroup: make css->refcnt clearing on cgroup removal
optional" allowed a css to linger after the associated cgroup is
removed.  As a css holds a reference on the cgroup's dentry, it means
that cgroup dentries may linger for a while.

Destroying a superblock which has dentries with positive refcnts is a
critical bug and triggers BUG() in vfs code.  As each cgroup dentry
holds an s_active reference, any lingering cgroup has both its dentry
and the superblock pinned and thus preventing premature release of
superblock.

Unfortunately, after 48ddbe19, there's a small window while
releasing a cgroup which is directly under the root of the hierarchy.
When a cgroup directory is released, vfs layer first deletes the
corresponding dentry and then invokes dput() on the parent, which may
recurse further, so when a cgroup directly below root cgroup is
released, the cgroup is first destroyed - which releases the s_active
it was holding - and then the dentry for the root cgroup is dput().

This creates a window where the root dentry's refcnt isn't zero but
superblock's s_active is.  If umount happens before or during this
window, vfs will see the root dentry with non-zero refcnt and trigger
BUG().

Before 48ddbe19, this problem didn't exist because the last dentry
reference was guaranteed to be put synchronously from rmdir(2)
invocation which holds s_active around the whole process.

Fix it by holding an extra superblock->s_active reference across
dput() from css release, which is the dput() path added by 48ddbe19
and the only one which doesn't hold an extra s_active ref across the
final cgroup dput().
Signed-off-by: NTejun Heo <tj@kernel.org>
LKML-Reference: <4FEEA5CB.8070809@huawei.com>
Reported-by: Nshyju pv <shyju.pv@huawei.com>
Tested-by: Nshyju pv <shyju.pv@huawei.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Acked-by: NLi Zefan <lizefan@huawei.com>

5db9a4d9

Revert "cgroup: superblock can't be released with active dentries" · 7db5b3ca

由 Tejun Heo 提交于 7月 07, 2012

This reverts commit fa980ca8.  The
commit was an attempt to fix a race condition where a cgroup hierarchy
may be unmounted with positive dentry reference on root cgroup.  While
the commit made the race condition slightly more difficult to trigger,
the race was still there and could be reliably triggered using a
different test case.

Revert the incorrect fix.  The next commit will describe the race and
fix it correctly.
Signed-off-by: NTejun Heo <tj@kernel.org>
LKML-Reference: <4FEEA5CB.8070809@huawei.com>
Reported-by: Nshyju pv <shyju.pv@huawei.com>
Cc: Sasha Levin <levinsasha928@gmail.com>
Acked-by: NLi Zefan <lizefan@huawei.com>

7db5b3ca

19 6月, 2012 1 次提交

cgroups: Account for CSS_DEACT_BIAS in __css_put · 8e3bbf42

由 Salman Qazi 提交于 6月 14, 2012

When we fixed the race between atomic_dec and css_refcnt, we missed
the fact that css_refcnt internally subtracts CSS_DEACT_BIAS to get
the actual reference count.  This can potentially cause a refcount leak
if __css_put races with cgroup_clear_css_refs.
Signed-off-by: NSalman Qazi <sqazi@google.com>
Acked-by: NLi Zefan <lizefan@huawei.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

8e3bbf42

07 6月, 2012 2 次提交

cgroup: remove hierarchy_mutex · 6be96a5c

由 Li Zefan 提交于 6月 06, 2012

It was introduced for memcg to iterate cgroup hierarchy without
holding cgroup_mutex, but soon after that it was replaced with
a lockless way in memcg.

No one used hierarchy_mutex since that, so remove it.
Signed-off-by: NLi Zefan <lizefan@huawei.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

6be96a5c

cgroup: make sure that decisions in __css_put are atomic · 967db0ea

由 Salman Qazi 提交于 6月 06, 2012

__css_put is using atomic_dec on the ref count, and then
looking at the ref count to make decisions.  This is prone
to races, as someone else may decrement ref count between
our decrement and our decision.  Instead, we should base our
decisions on the value that we decremented the ref count to.

(This results in an actual race on Google's kernel which I
haven't been able to reproduce on the upstream kernel.  Having
said that, it's still incorrect by inspection).
Signed-off-by: NSalman Qazi <sqazi@google.com>
Acked-by: NLi Zefan <lizefan@huawei.com>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org

967db0ea

30 5月, 2012 1 次提交

kernel: cgroup: push rcu read locking from css_is_ancestor() to callsite · 91c63734

由 Johannes Weiner 提交于 5月 29, 2012

Library functions should not grab locks when the callsites can do it,
even if the lock nests like the rcu read-side lock does.

Push the rcu_read_lock() from css_is_ancestor() to its single user,
mem_cgroup_same_or_subtree() in preparation for another user that may
already hold the rcu read-side lock.
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: NMichal Hocko <mhocko@suse.cz>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

91c63734

28 5月, 2012 1 次提交

cgroup: superblock can't be released with active dentries · fa980ca8

由 Tejun Heo 提交于 5月 24, 2012

48ddbe19 "cgroup: make css->refcnt clearing on cgroup removal
optional" allowed a css to linger after the associated cgroup is
removed.  As a css holds a reference on the cgroup's dentry, it means
that cgroup dentries may linger for a while.

cgroup_create() does grab an active reference on the superblock to
prevent it from going away while there are !root cgroups; however, the
reference is put from cgroup_diput() which is invoked on cgroup
removal, so cgroup dentries which are removed but persisting due to
lingering csses already have released their superblock active refs
allowing superblock to be killed while those dentries are around.

Given the right condition, this makes cgroup_kill_sb() call
kill_litter_super() with dentries with non-zero d_count leading to
BUG() in shrink_dcache_for_umount_subtree().

Fix it by adding cgroup_dops->d_release() operation and moving
deactivate_super() to it.  cgroup_diput() now marks dentry->d_fsdata
with itself if superblock should be deactivated and cgroup_d_release()
deactivates the superblock on dentry release.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NSasha Levin <levinsasha928@gmail.com>
Tested-by: NSasha Levin <levinsasha928@gmail.com>
LKML-Reference: <CA+1xoqe5hMuxzCRhMy7J0XchDk2ZnuxOHJKikROk1-ReAzcT6g@mail.gmail.com>
Acked-by: NLi Zefan <lizefan@huawei.com>

fa980ca8

16 5月, 2012 1 次提交
- E
  userns: Convert cgroup permission checks to use uid_eq · 14a590c3
  由 Eric W. Biederman 提交于 3月 12, 2012
```
Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
```
  14a590c3
24 4月, 2012 1 次提交

cgroups: disallow attaching kthreadd or PF_THREAD_BOUND threads · c4c27fbd

由 Mike Galbraith 提交于 4月 21, 2012

Allowing kthreadd to be moved to a non-root group makes no sense, it being
a global resource, and needlessly leads unsuspecting users toward trouble.

1. An RT workqueue worker thread spawned in a task group with no rt_runtime
allocated is not schedulable.  Simple user error, but harmful to the box.

2. A worker thread which acquires PF_THREAD_BOUND can never leave a cpuset,
rendering the cpuset immortal.

Save the user some unexpected trouble, just say no.
Signed-off-by: NMike Galbraith <mgalbraith@suse.de>
Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NLi Zefan <lizefan@huawei.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

c4c27fbd

12 4月, 2012 1 次提交

cgroup: remove cgroup_subsys->populate() · 86f82d56

由 Tejun Heo 提交于 4月 10, 2012

With memcg converted, cgroup_subsys->populate() doesn't have any user
left.  Remove it.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizefan@huawei.com>

86f82d56

02 4月, 2012 12 次提交

cgroup: make css->refcnt clearing on cgroup removal optional · 48ddbe19

由 Tejun Heo 提交于 4月 01, 2012

Currently, cgroup removal tries to drain all css references.  If there
are active css references, the removal logic waits and retries
->pre_detroy() until either all refs drop to zero or removal is
cancelled.

This semantics is unusual and adds non-trivial complexity to cgroup
core and IMHO is fundamentally misguided in that it couples internal
implementation details (references to internal data structure) with
externally visible operation (rmdir).  To userland, this is a behavior
peculiarity which is unnecessary and difficult to expect (css refs is
otherwise invisible from userland), and, to policy implementations,
this is an unnecessary restriction (e.g. blkcg wants to hold css refs
for caching purposes but can't as that becomes visible as rmdir hang).

Unfortunately, memcg currently depends on ->pre_destroy() retrials and
cgroup removal vetoing and can't be immmediately switched to the new
behavior.  This patch introduces the new behavior of not waiting for
css refs to drain and maintains the old behavior for subsystems which
have __DEPRECATED_clear_css_refs set.

Once, memcg is updated, we can drop the code paths for the old
behavior as proposed in the following patch.  Note that the following
patch is incorrect in that dput work item is in cgroup and may lose
some of dputs when multiples css's are released back-to-back, and
__css_put() triggers check_for_release() when refcnt reaches 0 instead
of 1; however, it shows what part can be removed.

  http://thread.gmane.org/gmane.linux.kernel.containers/22559/focus=75251

Note that, in not-too-distant future, cgroup core will start emitting
warning messages for subsys which require the old behavior, so please
get moving.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

48ddbe19

cgroup: use negative bias on css->refcnt to block css_tryget() · 28b4c27b

由 Tejun Heo 提交于 4月 01, 2012

When a cgroup is about to be removed, cgroup_clear_css_refs() is
called to check and ensure that there are no active css references.

This is currently achieved by dropping the refcnt to zero iff it has
only the base ref.  If all css refs could be dropped to zero, ref
clearing is successful and CSS_REMOVED is set on all css.  If not, the
base ref is restored.  While css ref is zero w/o CSS_REMOVED set, any
css_tryget() attempt on it busy loops so that they are atomic
w.r.t. the whole css ref clearing.

This does work but dropping and re-instating the base ref is somewhat
hairy and makes it difficult to add more logic to the put path as
there are two of them - the regular css_put() and the reversible base
ref clearing.

This patch updates css ref clearing such that blocking new
css_tryget() and putting the base ref are separate operations.
CSS_DEACT_BIAS, defined as INT_MIN, is added to css->refcnt and
css_tryget() busy loops while refcnt is negative.  After all css refs
are deactivated, if they were all one, ref clearing succeeded and
CSS_REMOVED is set and the base ref is put using the regular
css_put(); otherwise, CSS_DEACT_BIAS is subtracted from the refcnts
and the original postive values are restored.

css_refcnt() accessor which always returns the unbiased positive
reference counts is added and used to simplify refcnt usages.  While
at it, relocate and reformat comments in cgroup_has_css_refs().

This separates css->refcnt deactivation and putting the base ref,
which enables the next patch to make ref clearing optional.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>

28b4c27b

cgroup: implement cgroup_rm_cftypes() · 79578621

由 Tejun Heo 提交于 4月 01, 2012

Implement cgroup_rm_cftypes() which removes an array of cftypes from a
subsystem.  It can be called whether the target subsys is attached or
not.  cgroup core will remove the specified file from all existing
cgroups.

This will be used to improve sub-subsys modularity and will be helpful
for unified hierarchy.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>

79578621

cgroup: introduce struct cfent · 05ef1d7c

由 Tejun Heo 提交于 4月 01, 2012

This patch adds cfent (cgroup file entry) which is the association
between a cgroup and a file.  This is in-cgroup representation of
files under a cgroup directory.  This simplifies walking walking
cgroup files and thus cgroup_clear_directory(), which is now
implemented in two parts - cgroup_rm_file() and a loop around it.

cgroup_rm_file() will be used to implement cftype removal and cfent is
scheduled to serve cgroup specific per-file data (e.g. for sysfs-like
"sever" semantics).

v2: - cfe was freed from cgroup_rm_file() which led to use-after-free
      if the file had openers at the time of removal.  Moved to
      cgroup_diput().

    - cgroup_clear_directory() triggered WARN_ON_ONCE() if d_subdirs
      wasn't empty after removing all files.  This triggered
      spuriously if some files were open during directory clearing.
      Removed.

v3: - In cgroup_diput(), WARN_ONCE(!list_empty(&cfe->node)) could be
      spuriously triggered for root cgroups because they don't go
      through cgroup_clear_directory() on unmount.  Don't trigger WARN
      for root cgroups.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Glauber Costa <glommer@parallels.com>

05ef1d7c

cgroup: relocate __d_cgrp() and __d_cft() · f6ea9372

由 Tejun Heo 提交于 4月 01, 2012

Move the two macros upwards as they'll be used earlier in the file.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>

f6ea9372

cgroup: remove cgroup_add_file[s]() · db0416b6

由 Tejun Heo 提交于 4月 01, 2012

No controller is using cgroup_add_files[s]().  Unexport them, and
convert cgroup_add_files() to handle NULL entry terminated array
instead of taking count explicitly and continue creation on failure
for internal use.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>

db0416b6

cgroup: convert all non-memcg controllers to the new cftype interface · 4baf6e33

由 Tejun Heo 提交于 4月 01, 2012

Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
net_cls and device controllers to use the new cftype based interface.
Termination entry is added to cftype arrays and populate callbacks are
replaced with cgroup_subsys->base_cftypes initializations.

This is functionally identical transformation.  There shouldn't be any
visible behavior change.

memcg is rather special and will be converted separately.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Vivek Goyal <vgoyal@redhat.com>

4baf6e33

cgroup: merge cft_release_agent cftype array into the base files array · 6e6ff25b

由 Tejun Heo 提交于 4月 01, 2012

Now that cftype can express whether a file should only be on root,
cft_release_agent can be merged into the base files cftypes array.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>

6e6ff25b

cgroup: implement cgroup_add_cftypes() and friends · 8e3f6541

由 Tejun Heo 提交于 4月 01, 2012

Currently, cgroup directories are populated by subsys->populate()
callback explicitly creating files on each cgroup creation.  This
level of flexibility isn't needed or desirable.  It provides largely
unused flexibility which call for abuses while severely limiting what
the core layer can do through the lack of structure and conventions.

Per each cgroup file type, the only distinction that cgroup users is
making is whether a cgroup is root or not, which can easily be
expressed with flags.

This patch introduces cgroup_add_cftypes().  These deal with cftypes
instead of individual files - controllers indicate that certain types
of files exist for certain subsystem.  Newly added CFTYPE_*_ON_ROOT
flags indicate whether a cftype should be excluded or created only on
the root cgroup.

cgroup_add_cftypes() can be called any time whether the target
subsystem is currently attached or not.  cgroup core will create files
on the existing cgroups as necessary.

Also, cgroup_subsys->base_cftypes is added to ease registration of the
base files for the subsystem.  If non-NULL on subsys init, the cftypes
pointed to by ->base_cftypes are automatically registered on subsys
init / load.

Further patches will convert the existing users and remove the file
based interface.  Note that this interface allows dynamic addition of
files to an active controller.  This will be used for sub-controller
modularity and unified hierarchy in the longer term.

This patch implements the new mechanism but doesn't apply it to any
user.

v2: replaced DECLARE_CGROUP_CFTYPES[_COND]() with
    cgroup_subsys->base_cftypes, which works better for cgroup_subsys
    which is loaded as module.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>

8e3f6541

cgroup: build list of all cgroups under a given cgroupfs_root · b0ca5a84

由 Tejun Heo 提交于 4月 01, 2012

Build a list of all cgroups anchored at cgroupfs_root->allcg_list and
going through cgroup->allcg_node.  The list is protected by
cgroup_mutex and will be used to improve cgroup file handling.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>

b0ca5a84

cgroup: move cgroup_clear_directory() call out of cgroup_populate_dir() · ff4c8d50

由 Tejun Heo 提交于 4月 01, 2012

cgroup_populate_dir() currently clears all files and then repopulate
the directory; however, the clearing part is only useful when it's
called from cgroup_remount().  Relocate the invocation to
cgroup_remount().

This is to prepare for further cgroup file handling updates.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>

ff4c8d50

cgroup: deprecate remount option changes · 8b5a5a9d

由 Tejun Heo 提交于 4月 01, 2012

This patch marks the following features for deprecation.

* Rebinding subsys by remount: Never reached useful state - only works
  on empty hierarchies.

* release_agent update by remount: release_agent itself will be
  replaced with conventional fsnotify notification.

v2: Lennart pointed out that "name=" is necessary for mounts w/o any
    controller attached.  Drop "name=" deprecation.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Lennart Poettering <mzxreary@0pointer.de>

8b5a5a9d

30 3月, 2012 1 次提交

cgroup: cgroup_attach_task() could return -errno after success · 8f121918

由 Tejun Heo 提交于 3月 29, 2012

61d1d219 "cgroup: remove extra calls to find_existing_css_set" made
cgroup_task_migrate() return void.  An unfortunate side effect was
that cgroup_attach_task() was depending on that function's return
value to clear its @retval on the success path.  On cgroup mounts
without any subsystem with ->can_attach() callback,
cgroup_attach_task() ended up returning @retval without initializing
it on success.

For some reason, gcc failed to warn about it and it didn't cause
cgroup_attach_task() to return non-zero value in many cases, probably
due to difference in register allocation.  When the problem
materializes, systemd fails to populate /systemd cgroup mount and
fails to boot.

Fix it by initializing @retval to zero on declaration.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NJiri Kosina <jkosina@suse.cz>
LKML-Reference: <alpine.LNX.2.00.1203282354440.25526@pobox.suse.cz>
Reviewed-by: NMandeep Singh Baines <msb@chromium.org>
Acked-by: NLi Zefan <lizefan@huawei.com>

8f121918

22 3月, 2012 2 次提交

memcg: let css_get_next() rely upon rcu_read_lock() · ca464d69

由 Hugh Dickins 提交于 3月 21, 2012

Remove lock and unlock around css_get_next()'s call to idr_get_next().
memcg iterators (only users of css_get_next) already did rcu_read_lock(),
and its comment demands that; but add a WARN_ON_ONCE to make sure of it.
Signed-off-by: NHugh Dickins <hughd@google.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ca464d69

cgroup: revert ss_id_lock to spinlock · 42aee6c4

由 Hugh Dickins 提交于 3月 21, 2012

Commit c1e2ee2d ("memcg: replace ss->id_lock with a rwlock") has now
been seen to cause the unfair behavior we should have expected from
converting a spinlock to an rwlock: softlockup in cgroup_mkdir(), whose
get_new_cssid() is waiting for the wlock, while there are 19 tasks using
the rlock in css_get_next() to get on with their memcg workload (in an
artificial test, admittedly).  Yet lib/idr.c was made suitable for RCU
way back: revert that commit, restoring ss->id_lock to a spinlock.
Signed-off-by: NHugh Dickins <hughd@google.com>
Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

42aee6c4

21 3月, 2012 1 次提交
- A
  switch open-coded instances of d_make_root() to new helper · 48fde701
  由 Al Viro 提交于 1月 08, 2012
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  48fde701
22 2月, 2012 2 次提交

cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list · 3ce3230a

由 Frederic Weisbecker 提交于 2月 08, 2012

Walking through the tasklist in cgroup_enable_task_cg_list() inside
an RCU read side critical section is not enough because:

- RCU is not (yet) safe against while_each_thread()

- If we use only RCU, a forking task that has passed cgroup_post_fork()
  without seeing use_task_css_set_links == 1 is not guaranteed to have
  its child immediately visible in the tasklist if we walk through it
  remotely with RCU. In this case it will be missing in its css_set's
  task list.

Thus we need to traverse the list (unfortunately) under the
tasklist_lock. It makes us safe against while_each_thread() and also
make sure we see all forked task that have been added to the tasklist.

As a secondary effect, reading and writing use_task_css_set_links are
now well ordered against tasklist traversing and modification. The new
layout is:

CPU 0                                      CPU 1

use_task_css_set_links = 1                write_lock(tasklist_lock)
read_lock(tasklist_lock)                  add task to tasklist
do_each_thread() {                        write_unlock(tasklist_lock)
	add thread to css set links       if (use_task_css_set_links)
} while_each_thread()                         add thread to css set links
read_unlock(tasklist_lock)

If CPU 0 traverse the list after the task has been added to the tasklist
then it is correctly added to the css set links. OTOH if CPU 0 traverse
the tasklist before the new task had the opportunity to be added to the
tasklist because it was too early in the fork process, then CPU 1
catches up and add the task to the css set links after it added the task
to the tasklist. The right value of use_task_css_set_links is guaranteed
to be visible from CPU 1 due to the LOCK/UNLOCK implicit barrier properties:
the read_unlock on CPU 0 makes the write on use_task_css_set_links happening
and the write_lock on CPU 1 make the read of use_task_css_set_links that comes
afterward to return the correct value.
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

3ce3230a

cgroup: Remove wrong comment on cgroup_enable_task_cg_list() · 9a4b4304

由 Frederic Weisbecker 提交于 2月 08, 2012

Remove the stale comment about RCU protection. Many callers
(all of them?) of cgroup_enable_task_cg_list() don't seem
to be in an RCU read side critical section. Besides, RCU is
not helpful to protect against while_each_thread().
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Acked-by: NLi Zefan <lizf@cn.fujitsu.com>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Mandeep Singh Baines <msb@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

9a4b4304