Commits · 69e943b7d3c2dcca1087e03e556ac6cb0d4433b4 · openanolis / cloud-kernel

08 Feb, 2014 4 commits

cgroup: update locking in cgroup_show_options() · 69e943b7

Committed by Tejun Heo on Feb 08, 2014

cgroup_show_options() grabs cgroup_root_mutex to protect the options
changing while printing; however, holding root_mutex or not doesn't
really make much difference for the function.  subsys_mask can be
atomically tested and most of the options aren't allowed to change
anyway once mounted.

The only field which needs synchronization is ->release_agent_path.
This patch introduces a dedicated spinlock to synchronize accesses to
the field and drops cgroup_root_mutex locking from
cgroup_show_options().  The next patch will remove cgroup_root_mutex.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

69e943b7
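
To illustrate the locking change described in 69e943b7, here is a minimal userspace sketch, not the kernel code. It assumes a hypothetical `struct demo_root` that carries only the release agent path, and it uses a C11 `atomic_flag` as a stand-in for the dedicated kernel spinlock. All `demo_*` names, the buffer size and the output format are assumptions made for this example.

```c
#include <stdatomic.h>
#include <stdio.h>

#define DEMO_PATH_LEN 256	/* hypothetical buffer size */

/* Hypothetical stand-in for the hierarchy root; only the field of interest. */
struct demo_root {
	atomic_flag path_lock;	/* plays the role of the dedicated spinlock */
	char release_agent_path[DEMO_PATH_LEN];
};

#define DEMO_ROOT_INIT { ATOMIC_FLAG_INIT, "" }

static void demo_lock(atomic_flag *lock)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		;
}

static void demo_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Writer: the only place that modifies the path. */
static void demo_set_release_agent(struct demo_root *root, const char *path)
{
	demo_lock(&root->path_lock);
	snprintf(root->release_agent_path, sizeof(root->release_agent_path),
		 "%s", path);
	demo_unlock(&root->path_lock);
}

/* Reader: prints the option while holding only the small lock. */
static int demo_show_options(struct demo_root *root, char *out, size_t len)
{
	int n = 0;

	if (len > 0)
		out[0] = '\0';
	demo_lock(&root->path_lock);
	if (root->release_agent_path[0] != '\0')
		n = snprintf(out, len, ",release_agent=%s",
			     root->release_agent_path);
	demo_unlock(&root->path_lock);
	return n;
}
```

The point of the sketch is that the reader never needs a larger mutex: the one field that can change after mount is covered by its own small lock.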

cgroup: rename cgroup_subsys->subsys_id to ->id · aec25020

Committed by Tejun Heo on Feb 08, 2014

It's no longer referenced outside cgroup core, so renaming is easy.
Let's rename it for consistency & brevity.

This patch is pure rename.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

aec25020

cgroup: clean up cgroup_subsys names and initialization · 073219e9

Committed by Tejun Heo on Feb 08, 2014

cgroup_subsys is a bit messier than it needs to be.

* The name of a subsys can be different from its internal identifier
  defined in cgroup_subsys.h.  Most subsystems use the matching name
  but three - cpu, memory and perf_event - use different ones.

* cgroup_subsys_id enums are postfixed with _subsys_id and each
  cgroup_subsys is postfixed with _subsys.  cgroup.h is widely
  included throughout various subsystems; it doesn't and shouldn't
  have a claim on such generic names which don't have any qualifier
  indicating that they belong to cgroup.

* cgroup_subsys->subsys_id should always equal the matching
  cgroup_subsys_id enum; however, we require each controller to
  initialize it and then BUG if they don't match, which is a bit
  silly.

This patch cleans up cgroup_subsys names and initialization by doing
the following.

* cgroup_subsys_id enums are now postfixed with _cgrp_id, and each
  cgroup_subsys with _cgrp_subsys.

* With the above, renaming subsys identifiers to match the userland
  visible names doesn't cause any naming conflicts.  All non-matching
  identifiers are renamed to match the official names.

  cpu_cgroup -> cpu
  mem_cgroup -> memory
  perf -> perf_event

* controllers no longer need to initialize ->subsys_id and ->name.
  They're generated in cgroup core and set automatically during boot.

* Redundant cgroup_subsys declarations removed.

* While updating BUG_ON()s in cgroup_init_early(), convert them to
  WARN()s.  BUGging that early during boot is stupid - the kernel
  can't print anything, even through serial console and the trap
  handler doesn't even link stack frame properly for back-tracing.

This patch doesn't introduce any behavior changes.

v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
    classid handling into core").

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: "David S. Miller" <davem@davemloft.net>
Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Aristeu Rozanski <aris@redhat.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Serge E. Hallyn <serue@us.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Thomas Graf <tgraf@suug.ch>

073219e9
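
The idea in 073219e9 of generating ids and names in the core from a single list can be sketched with an X-macro. This is a self-contained illustration under assumptions, not the contents of cgroup_subsys.h: `DEMO_SUBSYS_LIST`, `struct demo_subsys` and `demo_init_early()` are hypothetical names, and only three controllers are listed.

```c
#include <stdio.h>

/* Hypothetical stand-in for cgroup_subsys.h: one SUBSYS() per controller. */
#define DEMO_SUBSYS_LIST	\
	SUBSYS(cpu)		\
	SUBSYS(memory)		\
	SUBSYS(perf_event)

/* Expansion 1: enum ids suffixed with _cgrp_id. */
#define SUBSYS(x) x##_cgrp_id,
enum demo_subsys_id {
	DEMO_SUBSYS_LIST
	DEMO_SUBSYS_COUNT
};
#undef SUBSYS

struct demo_subsys {
	int id;
	const char *name;
};

/* Expansion 2: one object per controller; id and name are left unset. */
#define SUBSYS(x) static struct demo_subsys x##_cgrp_subsys;
DEMO_SUBSYS_LIST
#undef SUBSYS

/* Expansion 3: table of controller pointers indexed by id. */
#define SUBSYS(x) [x##_cgrp_id] = &x##_cgrp_subsys,
static struct demo_subsys *demo_subsys[] = {
	DEMO_SUBSYS_LIST
};
#undef SUBSYS

/* Expansion 4: userland-visible names, generated from the same list. */
#define SUBSYS(x) [x##_cgrp_id] = #x,
static const char *demo_subsys_name[] = {
	DEMO_SUBSYS_LIST
};
#undef SUBSYS

/* Core-side init: fills in what controllers used to set by hand. */
static void demo_init_early(void)
{
	for (int i = 0; i < DEMO_SUBSYS_COUNT; i++) {
		demo_subsys[i]->id = i;
		demo_subsys[i]->name = demo_subsys_name[i];
	}
}
```

Because the id and the name both come from the same list entry, they cannot disagree, which is why a consistency check on controller-supplied values becomes unnecessary in this arrangement.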

cgroup: drop module support · 3ed80a62

Committed by Tejun Heo on Feb 08, 2014

With module support dropped from net_prio, no controller is using
cgroup module support.  None of actual resource controllers can be
built as a module and we aren't gonna add new controllers which don't
control resources.  This patch drops module support from cgroup.

* cgroup_[un]load_subsys() and cgroup_subsys->module removed.

* As there's no point in distinguishing IS_BUILTIN() and IS_MODULE(),
  cgroup_subsys.h now uses IS_ENABLED() directly.

* enum cgroup_subsys_id now exactly matches the list of enabled
  controllers as ordered in cgroup_subsys.h.

* cgroup_subsys[] is now a contiguously occupied array.  Size
  specification is no longer necessary and dropped.

* for_each_builtin_subsys() is removed and for_each_subsys() is
  updated to not require any locking.

* module ref handling is removed from rebind_subsystems().

* Module related comments dropped.

v2: Rebased on top of fe1217c4 ("net: net_cls: move cgroupfs
    classid handling into core").

v3: Added {} around the if (need_forkexit_callback) block in
    cgroup_post_fork() for readability as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

3ed80a62

18 Jan, 2014 1 commit

cgroup: trivial style updates · dd4b0a46

Committed by SeongJae Park on Jan 18, 2014

* Place newline before function opening brace in cgroup_kill_sb().

* Insert space before assignment in attach_task_by_pid()

tj: merged two patches into one.

Signed-off-by: SeongJae Park <sj38.park@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

dd4b0a46

17 Dec, 2013 1 commit

cgroup: don't recycle cgroup id until all csses' have been destroyed · c1a71504

Committed by Li Zefan on Dec 17, 2013

Hugh reported this bug:

> CONFIG_MEMCG_SWAP is broken in 3.13-rc.  Try something like this:
>
> mkdir -p /tmp/tmpfs /tmp/memcg
> mount -t tmpfs -o size=1G tmpfs /tmp/tmpfs
> mount -t cgroup -o memory memcg /tmp/memcg
> mkdir /tmp/memcg/old
> echo 512M >/tmp/memcg/old/memory.limit_in_bytes
> echo $$ >/tmp/memcg/old/tasks
> cp /dev/zero /tmp/tmpfs/zero 2>/dev/null
> echo $$ >/tmp/memcg/tasks
> rmdir /tmp/memcg/old
> sleep 1	# let rmdir work complete
> mkdir /tmp/memcg/new
> umount /tmp/tmpfs
> dmesg | grep WARNING
> rmdir /tmp/memcg/new
> umount /tmp/memcg
>
> Shows lots of WARNING: CPU: 1 PID: 1006 at kernel/res_counter.c:91
>                            res_counter_uncharge_locked+0x1f/0x2f()
>
> Breakage comes from 34c00c31 ("memcg: convert to use cgroup id").
>
> The lifetime of a cgroup id is different from the lifetime of the
> css id it replaced: memsw's css_get()s do nothing to hold on to the
> old cgroup id, it soon gets recycled to a new cgroup, which then
> mysteriously inherits the old's swap, without any charge for it.

Instead of removing cgroup id right after all the csses have been
offlined, we should do that after csses have been destroyed.

To make sure an invalid css pointer won't be returned after the css
is destroyed, make sure css_from_id() returns NULL in this case.

tj: Updated comment to note planned changes for cgrp->id.

Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Tejun Heo <tj@kernel.org>

c1a71504
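
The lifetime rule from c1a71504 can be sketched in userspace as follows. This is a simplified model under assumptions, not the kernel implementation: the fixed-size table, `struct demo_group` and the `demo_group_*()` helpers are hypothetical, and a plain counter stands in for css reference counting. The id slot is released only when the last reference is dropped, so a new group cannot inherit the id while something still pins the old one.

```c
#include <stddef.h>

#define DEMO_MAX_IDS 4	/* hypothetical table size */

struct demo_group {
	int id;
	int refs;	/* pins on the group, e.g. leftover swap records */
};

static struct demo_group *demo_id_table[DEMO_MAX_IDS];

/* Allocate the lowest free id; returns -1 if the table is full. */
static int demo_group_register(struct demo_group *g)
{
	for (int i = 0; i < DEMO_MAX_IDS; i++) {
		if (!demo_id_table[i]) {
			demo_id_table[i] = g;
			g->id = i;
			g->refs = 1;	/* base reference */
			return i;
		}
	}
	return -1;
}

/* Lookup by id; returns NULL once the group has been destroyed. */
static struct demo_group *demo_group_from_id(int id)
{
	if (id < 0 || id >= DEMO_MAX_IDS)
		return NULL;
	return demo_id_table[id];
}

static void demo_group_get(struct demo_group *g)
{
	g->refs++;
}

/* Drop a reference; the id is recycled only on the final put. */
static void demo_group_put(struct demo_group *g)
{
	if (--g->refs == 0)
		demo_id_table[g->id] = NULL;
}
```

In the reproducer quoted above, the "old" group would still hold a pin after rmdir, so the "new" group would receive a different id in this model.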

14 Dec, 2013 1 commit

cgroup: fix fail path in cgroup_load_subsys() · 10bf2f7e

Committed by Vladimir Davydov on Dec 12, 2013

Calling cgroup_unload_subsys() from cgroup_load_subsys() after
online_css() failure will result in a NULL ptr dereference on attempt to
offline_css(), because online_css() only assigns css to cgroup on
success. Let's fix that by skipping calls to offline_css() and
css_free() in cgroup_unload_subsys() if there is no css, and freeing css
in cgroup_load_subsys() on online_css() failure.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

10bf2f7e

12 Dec, 2013 1 commit

cgroup: fix missing unlock on error in cgroup_load_subsys() · 0be8669d

Committed by Wei Yongjun on Dec 09, 2013

Add the missing unlock before return from function cgroup_load_subsys()
in the error handling case.

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

0be8669d

07 Dec, 2013 8 commits

cgroup: remove for_each_root_subsys() · b85d2040

Committed by Tejun Heo on Dec 06, 2013

After the previous patch which introduced for_each_css(),
for_each_root_subsys() only has two users left.  This patch replaces
it with for_each_subsys() + explicit subsys_mask testing and remove
for_each_root_subsys() along with cgroupfs_root->subsys_list handling.

This patch doesn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

b85d2040

cgroup: implement for_each_css() · 1c6727af

Committed by Tejun Heo on Dec 06, 2013

There are enough places where css's of a cgroup are iterated, which
currently uses for_each_root_subsys() + explicit cgroup_css().  This
patch implements for_each_css() and replaces the above combination
with it.

This patch doesn't introduce any behavior changes.

v2: Updated to apply cleanly on top of v2 of "cgroup: fix css leaks on
    online_css() failure"

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

1c6727af
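
An iterator of the kind 1c6727af describes might look like the sketch below. It is a userspace approximation under assumptions: `struct demo_cgroup`, `struct demo_css` and `demo_for_each_css()` are hypothetical names, and a direct array access stands in for the cgroup_css() lookup. The macro walks every subsystem slot and runs the loop body only for slots that hold a css.

```c
#include <stddef.h>

#define DEMO_SUBSYS_COUNT 4	/* hypothetical number of controllers */

struct demo_css {
	int ssid;
};

struct demo_cgroup {
	/* NULL where the controller is not attached to this cgroup. */
	struct demo_css *subsys[DEMO_SUBSYS_COUNT];
};

/*
 * Iterate over the non-NULL css's of @cgrp.  The empty "if" branch
 * skips empty slots; the statement following the macro becomes the
 * "else" branch.
 */
#define demo_for_each_css(css, ssid, cgrp)				\
	for ((ssid) = 0; (ssid) < DEMO_SUBSYS_COUNT; (ssid)++)		\
		if (!((css) = (cgrp)->subsys[(ssid)])) { }		\
		else

/* Example user: build a bitmask of the attached controllers. */
static unsigned demo_css_mask(struct demo_cgroup *cgrp)
{
	struct demo_css *css;
	unsigned mask = 0;
	int ssid;

	demo_for_each_css(css, ssid, cgrp)
		mask |= 1u << css->ssid;
	return mask;
}
```

Callers no longer pair a subsystem loop with an explicit lookup and NULL check; the macro folds both into one construct.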

cgroup: factor out cgroup_subsys_state creation into create_css() · c81c925a

Committed by Tejun Heo on Dec 06, 2013

Now that all operations to create a css (cgroup_subsys_state) are
collected into a single loop in cgroup_create(), it's easy to factor
it out into its own function.  Factor out css creation into
create_css().  This makes the code easier to follow and will enable
decoupling css creation from cgroup creation which is necessary for
the planned unified hierarchy.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

c81c925a

cgroup: combine css handling loops in cgroup_create() · 9d403e99

Committed by Tejun Heo on Dec 06, 2013

Now that css operations in cgroup_create() are back-to-back, there
isn't much point in allocating css's in one loop and onlining them in
another.  Merge the two loops so that a css is allocated and onlined
on each iteration.

css_ar[] is no longer necessary and replaced with a single pointer.
This also simplifies the error handling path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

9d403e99

cgroup: reorder operations in cgroup_create() · 0d80255e

Committed by Tejun Heo on Dec 06, 2013

cgroup_create() currently does the following.

1. alloc cgroup
2. alloc css's
3. create the directory and commit to cgroup creation
4. online css's
5. create cgroup and css files

The sequence performs allocations before other operations but it
doesn't buy anything because each of the above steps may fail and
should be unrollable.  Reorganize the sequence such that cgroup
operations are done before css operations.

1. alloc cgroup
2. create the directory and files and commit to cgroup creation
3. alloc css's
4. create files for and online css's

This simplifies the code a bit and enables further simplification and
separating out css creation from cgroup creation which is necessary
for the planned unified hierarchy where css's will be created and
destroyed dynamically across the lifetime of a cgroup.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

0d80255e
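
The reordered sequence from 0d80255e, with cgroup steps first and per-css steps afterwards, can be modelled as below. This is a toy sketch under assumptions: `struct demo_state` merely records which steps have run, `demo_create()` is a hypothetical function, and the `fail_at_css` parameter exists only to inject a failure for the example.

```c
#include <stdbool.h>

/* Records which creation steps are currently in effect. */
struct demo_state {
	bool cgroup_allocated;
	bool dir_created;
	int css_online;
};

/*
 * Create a group with @nr_css controllers.  @fail_at_css is the index
 * of the css whose setup fails, or -1 for no failure.  Returns 0 on
 * success and -1 after fully unrolling on failure.
 */
static int demo_create(struct demo_state *st, int nr_css, int fail_at_css)
{
	int i;

	st->cgroup_allocated = true;	/* 1. alloc cgroup */
	st->dir_created = true;		/* 2. create directory and files */

	for (i = 0; i < nr_css; i++) {	/* 3 and 4. per-css alloc and online */
		if (i == fail_at_css)
			goto err_unroll;
		st->css_online++;
	}
	return 0;

err_unroll:
	/* Undo in reverse order: css's first, then the cgroup itself. */
	while (st->css_online > 0)
		st->css_online--;
	st->dir_created = false;
	st->cgroup_allocated = false;
	return -1;
}
```

With all css work grouped at the end, the error path only has to undo css's that were actually brought online before tearing down the cgroup.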

cgroup: make for_each_subsys() useable under cgroup_root_mutex · 780cd8b3

Committed by Tejun Heo on Dec 06, 2013

We want to use for_each_subsys() in cgroupfs_root handling where only
cgroup_root_mutex is held.  The only way cgroup_subsys[] can change is
through module load/unload, so make cgroup_[un]load_subsys() grab
cgroup_root_mutex too and update the lockdep annotation in
for_each_subsys() to allow either cgroup_mutex or cgroup_root_mutex.

* Lockdep annotation is moved from inner 'if' condition to outer 'for'
  init clause.  There's no reason to execute the assertion every loop.

* Loop index @i is renamed to @ssid.  Indices iterating through subsys
  will be [re]named to @ssid gradually.

v2: cgroup_assert_mutex_or_root_locked() caused build failure if
    !CONFIG_LOCKDEP.  Conditionalize its definition.  The build failure
    was reported by kbuild test bot.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: kbuild test robot <fengguang.wu@intel.com>

780cd8b3

cgroup: css iterations and css_from_dir() are safe under cgroup_mutex · 87fb54f1

Committed by Tejun Heo on Dec 06, 2013

Currently, all css iterations and css_from_dir() require RCU read lock
whether the caller is holding cgroup_mutex or not, which is
unnecessarily restrictive.  They are all safe to use under
cgroup_mutex without holding RCU read lock.

Factor out cgroup_assert_mutex_or_rcu_locked() from css_from_id() and
apply it to all css iteration functions and css_from_dir().

v2: cgroup_assert_mutex_or_rcu_locked() definition doesn't need to be
    inside CONFIG_PROVE_RCU ifdef as rcu_lockdep_assert() is always
    defined and conditionalized.  Move it outside of the ifdef block.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

87fb54f1

cgroup: fix cgroup_create() error handling path · 266ccd50

Committed by Tejun Heo on Dec 06, 2013

ae7f164a ("cgroup: move cgroup->subsys[] assignment to
online_css()") moved cgroup->subsys[] assignments later in
cgroup_create() but didn't update the error handling path accordingly,
leading to the following oops and leaking later css's after an
online_css() failure.  The oops is from cgroup destruction path being
invoked on the partially constructed cgroup which is not ready to
handle empty slots in cgrp->subsys[] array.

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  IP: [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
  PGD a780a067 PUD aadbe067 PMD 0
  Oops: 0000 [#1] SMP
  Modules linked in:
  CPU: 6 PID: 7360 Comm: mkdir Not tainted 3.13.0-rc2+ #69
  Hardware name:
  task: ffff8800b9dbec00 ti: ffff8800a781a000 task.ti: ffff8800a781a000
  RIP: 0010:[<ffffffff810eeaa8>]  [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
  RSP: 0018:ffff8800a781bd98  EFLAGS: 00010282
  RAX: ffff880586903878 RBX: ffff880586903800 RCX: ffff880586903820
  RDX: ffff880586903860 RSI: ffff8800a781bdb0 RDI: ffff880586903820
  RBP: ffff8800a781bde8 R08: ffff88060e0b8048 R09: ffffffff811d7bc1
  R10: 000000000000008c R11: 0000000000000001 R12: ffff8800a72286c0
  R13: 0000000000000000 R14: ffffffff81cf7a40 R15: 0000000000000001
  FS:  00007f60ecda57a0(0000) GS:ffff8806272c0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000008 CR3: 00000000a7a03000 CR4: 00000000000007e0
  Stack:
   ffff880586903860 ffff880586903910 ffff8800a72286c0 ffff880586903820
   ffffffff81cf7a40 ffff880586903800 ffff88060e0b8018 ffffffff81cf7a40
   ffff8800b9dbec00 ffff8800b9dbf098 ffff8800a781bec8 ffffffff810ef5bf
  Call Trace:
   [<ffffffff810ef5bf>] cgroup_mkdir+0x55f/0x5f0
   [<ffffffff811c90ae>] vfs_mkdir+0xee/0x140
   [<ffffffff811cb07e>] SyS_mkdirat+0x6e/0xf0
   [<ffffffff811c6a19>] SyS_mkdir+0x19/0x20
   [<ffffffff8169e569>] system_call_fastpath+0x16/0x1b

This patch moves reference bumping inside online_css() loop, clears
css_ar[] as css's are brought online successfully, and updates
err_destroy path so that either a css is fully online and destroyed by
cgroup_destroy_locked() or the error path frees it.  This creates a
duplicate css free logic in the error path but it will be cleaned up
soon.

v2: Li pointed out that cgroup_destroy_locked() would do NULL-deref if
    invoked with a cgroup which doesn't have all css's populated.
    Update cgroup_destroy_locked() so that it skips NULL css's.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Reported-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: stable@vger.kernel.org # v3.12+

266ccd50

06 Dec, 2013 7 commits

cgroup: unify pidlist and other file handling · 6612f05b

Committed by Tejun Heo on Dec 05, 2013

In preparation of conversion to kernfs, cgroup file handling is
updated so that it can be easily mapped to kernfs.  With the previous
changes, the difference between pidlist and other files is very
small.  Both are served by seq_file in a pretty standard way with the
only difference being !pidlist files use single_open().

This patch adds cftype->seq_start(), ->seq_next() and ->seq_stop() and
implements the matching cgroup_seqfile_start/next/stop() which either
emulate single_open() behavior or invoke cftype->seq_*() operations
if specified.  This allows using a single seq_operations for both
pidlist and other files and makes cgroup_pidlist_operations and
cgroup_pidlist_open() no longer necessary.  As cgroup_pidlist_open()
was the only user of cftype->open(), the method is dropped together.

This brings cftype file interface very close to kernfs interface and
mapping shouldn't be too difficult.  Once converted to kernfs, most of
the plumbing code including cgroup_seqfile_*() will be removed as
kernfs provides those facilities.

This patch does not introduce any behavior changes.

v2: Refreshed on top of the updated "cgroup: introduce struct
    cgroup_pidlist_open_file".

v3: Refreshed on top of the updated "cgroup: attach cgroup_open_file
    to all cgroup files".

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

6612f05b
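
The dispatch described in 6612f05b, where the iterator either delegates to per-file hooks or behaves like a single-record file, can be sketched in userspace as follows. The types and the `demo_*` names are hypothetical, and the fallback is only an approximation of single_open() semantics: one record at position 0 and nothing afterwards.

```c
#include <stddef.h>

typedef long long demo_loff_t;

/* Hypothetical file type with optional iterator hooks. */
struct demo_cftype {
	void *(*seq_start)(void *priv, demo_loff_t *pos);
	void *(*seq_next)(void *priv, void *v, demo_loff_t *pos);
	void (*seq_stop)(void *priv, void *v);
};

static int demo_single_token;	/* non-NULL marker for "the one record" */

static void *demo_seqfile_start(const struct demo_cftype *cft, void *priv,
				demo_loff_t *pos)
{
	if (cft->seq_start)
		return cft->seq_start(priv, pos);
	/* Fallback: exactly one record, located at position 0. */
	return *pos == 0 ? &demo_single_token : NULL;
}

static void *demo_seqfile_next(const struct demo_cftype *cft, void *priv,
			       void *v, demo_loff_t *pos)
{
	if (cft->seq_next)
		return cft->seq_next(priv, v, pos);
	++*pos;
	return NULL;	/* fallback: nothing after the first record */
}

static void demo_seqfile_stop(const struct demo_cftype *cft, void *priv,
			      void *v)
{
	if (cft->seq_stop)
		cft->seq_stop(priv, v);
}

/* Walk a file from the beginning and count the records it yields. */
static int demo_count_records(const struct demo_cftype *cft, void *priv)
{
	demo_loff_t pos = 0;
	int n = 0;
	void *v = demo_seqfile_start(cft, priv, &pos);

	while (v) {
		n++;
		v = demo_seqfile_next(cft, priv, v, &pos);
	}
	demo_seqfile_stop(cft, priv, v);
	return n;
}

/* Hypothetical multi-record file iterating over a 3-element array. */
static int demo_pids[3] = { 10, 20, 30 };

static void *demo_pid_start(void *priv, demo_loff_t *pos)
{
	(void)priv;
	return *pos < 3 ? &demo_pids[*pos] : NULL;
}

static void *demo_pid_next(void *priv, void *v, demo_loff_t *pos)
{
	(void)priv;
	(void)v;
	++*pos;
	return *pos < 3 ? &demo_pids[*pos] : NULL;
}
```

A file type that leaves all three hooks unset yields one record, while one that supplies `demo_pid_start()` and `demo_pid_next()` yields three, both through the same set of wrapper functions.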

cgroup: replace cftype->read_seq_string() with cftype->seq_show() · 2da8ca82

Committed by Tejun Heo on Dec 05, 2013

In preparation of conversion to kernfs, cgroup file handling is
updated so that it can be easily mapped to kernfs.  This patch
replaces cftype->read_seq_string() with cftype->seq_show() which is
not limited to single_open() operation and will map directly to
kernfs seq_file interface.

The conversions are mechanical.  As ->seq_show() doesn't have @css and
@cft, the functions which make use of them are converted to use
seq_css() and seq_cft() respectively.  In several occasions, e.g. if
it has seq_string in its name, the function name is updated to fit the
new method better.

This patch does not introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Aristeu Rozanski <arozansk@redhat.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Neil Horman <nhorman@tuxdriver.com>

2da8ca82

cgroup: attach cgroup_open_file to all cgroup files · 7da11279

Committed by Tejun Heo on Dec 05, 2013

In preparation of conversion to kernfs, cgroup file handling is
updated so that it can be easily mapped to kernfs.  This patch
attaches cgroup_open_file, which used to be attached to pidlist files,
to all cgroup files, introduces seq_css/cft() accessors to determine
the cgroup_subsys_state and cftype associated with a given cgroup
seq_file, and exports them as public interface.

This doesn't cause any behavior changes but unifies cgroup file
handling across different file types and will help converting them to
kernfs seq_show() interface.

v2: Li pointed out that the original patch was using
    single_open_size() incorrectly assuming that the size param is
    private data size.  Fix it by allocating @of separately and
    passing it to single_open() and explicitly freeing it in the
    release path.  This isn't the prettiest but this path is gonna be
    restructured by the following patches pretty soon.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

7da11279

cgroup: generalize cgroup_pidlist_open_file · 5d22444f

Committed by Tejun Heo on Dec 05, 2013

In preparation of conversion to kernfs, cgroup file handling is
updated so that it can be easily mapped to kernfs.  This patch renames
cgroup_pidlist_open_file to cgroup_open_file and updates it so that it
only contains a field to identify the specific file, ->cfe, and an
opaque ->priv pointer.  When cgroup is converted to kernfs, this will
be replaced by kernfs_open_file which contains about the same
information.

As whether the file is "cgroup.procs" or "tasks" should now be
determined from cgroup_open_file->cfe, the cftype->private for the two
files now carry the file type and cgroup_pidlist_start() reads the
type through cfe->type->private.  This makes the distinction between
cgroup_tasks_open() and cgroup_procs_open() unnecessary.
cgroup_pidlist_open() is now directly used as the open method.

This patch doesn't make any behavior changes.

v2: Refreshed on top of the updated "cgroup: introduce struct
    cgroup_pidlist_open_file".

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

5d22444f

cgroup: unify read path so that seq_file is always used · 896f5199

Committed by Tejun Heo on Dec 05, 2013

With the recent removal of cftype->read() and ->read_map(), only three
operations are remaining, ->read_u64(), ->read_s64() and
->read_seq_string().  Currently, the first two are handled directly
while the last is handled through seq_file.

It is trivial to serve the first two through the seq_file path too.
This patch restructures read path so that all operations are served
through cgroup_seqfile_show().  This makes all cgroup files seq_file -
single_open/release() are now used by default,
cgroup_seqfile_operations is dropped, and cgroup_file_operations uses
seq_read() for read.

This simplifies the code and makes the read path easy to convert to
use kernfs.

Note that, while cgroup_file_operations uses seq_read() for read, it
still uses generic_file_llseek() for seeking instead of seq_lseek().
This is different from cgroup_seqfile_operations but shouldn't break
anything and brings the seeking behavior aligned with kernfs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

896f5199
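
The unified read path from 896f5199 can be pictured as one show routine that serves all three read methods. The sketch below is a userspace approximation under assumptions: `struct demo_cftype` is a hypothetical, simplified file type whose callbacks take no arguments, `demo_seqfile_show()` writes into a plain buffer instead of a seq_file, and the two sample readers exist only for the example.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical, simplified file type with the three read methods. */
struct demo_cftype {
	unsigned long long (*read_u64)(void);
	long long (*read_s64)(void);
	int (*read_seq_string)(char *buf, size_t len);
};

/*
 * Single show routine: whichever method the file type implements,
 * the output is produced through the same path.
 */
static int demo_seqfile_show(const struct demo_cftype *cft, char *buf,
			     size_t len)
{
	if (cft->read_seq_string)
		return cft->read_seq_string(buf, len);
	if (cft->read_u64)
		return snprintf(buf, len, "%llu\n", cft->read_u64());
	if (cft->read_s64)
		return snprintf(buf, len, "%lld\n", cft->read_s64());
	return -EINVAL;
}

/* Sample readers, hypothetical values. */
static unsigned long long demo_read_weight(void)
{
	return 500;
}

static long long demo_read_offset(void)
{
	return -7;
}
```

Numeric files then need no read code of their own; they become a formatting special case of the string path.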

cgroup: unify cgroup_write_X64() and cgroup_write_string() · a742c59d

Committed by Tejun Heo on Dec 05, 2013

cgroup_write_X64() and cgroup_write_string() both implement about the
same buffering logic.  Unify the two into cgroup_file_write() which
always allocates dynamic buffer for simplicity and uses kstrto*()
instead of simple_strto*().

This patch doesn't make any visible behavior changes except for
possibly different error values from kstrto*().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

a742c59d
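
The difference in error behavior mentioned in a742c59d comes from stricter parsing. The following userspace sketch approximates that strictness with strtoull(); it is not the kernel's kstrto*() implementation. `demo_strict_parse_u64()` is a hypothetical helper that rejects leading whitespace, signs and trailing garbage, tolerates one trailing newline, and reports overflow separately.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse @s as an unsigned 64-bit value (decimal, 0x hex or 0 octal).
 * Returns 0 on success, -EINVAL on malformed input, -ERANGE on overflow.
 */
static int demo_strict_parse_u64(const char *s, unsigned long long *res)
{
	unsigned long long val;
	char *end;

	/* strtoull() would skip whitespace and accept a sign; refuse both. */
	if (!isdigit((unsigned char)s[0]))
		return -EINVAL;

	errno = 0;
	val = strtoull(s, &end, 0);
	if (errno == ERANGE)
		return -ERANGE;
	if (end == s)
		return -EINVAL;

	/* A single trailing newline is fine, as produced by "echo". */
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	*res = val;
	return 0;
}
```

A lenient parser would read "42abc" as 42 and silently ignore the rest; the strict variant turns that input into an error, which is the kind of change in returned error values the commit message warns about.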

cgroup: remove cftype->read(), ->read_map() and ->write() · 6e0755b0

Committed by Tejun Heo on Dec 05, 2013

In preparation of conversion to kernfs, cgroup file handling is being
consolidated so that it can be easily mapped to the seq_file based
interface of kernfs.

After recent updates, ->read() and ->read_map() don't have any user
left and ->write() never had any user.  Remove them.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

6e0755b0

29 Nov, 2013 9 commits

cgroup: don't guarantee cgroup.procs is sorted if sane_behavior · afb2bc14

Committed by Tejun Heo on Nov 29, 2013

For some reason, tasks and cgroup.procs guarantee that the result is
sorted.  This is the only reason this whole pidlist logic is necessary
instead of just iterating through sorted member tasks.  We can't do
anything about the existing interface but at least ensure that such
expectation doesn't exist for the new interface so that pidlist logic
may be removed in the distant future.

This patch scrambles the sort order if sane_behavior so that the
output is usually not sorted in the new interface.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

afb2bc14
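
One way to scramble the order deterministically, as afb2bc14 describes, is to sort by a transformed key instead of the pid itself. The sketch below is an assumption about how this could be done, not a claim about the patch's exact code: `demo_fry()` swaps adjacent bit pairs, and `demo_sort_pids()` is a hypothetical helper that picks the comparison function based on a `sane_behavior` flag.

```c
#include <stdlib.h>

/* Swap every pair of adjacent bits; applying it twice restores the input. */
static unsigned demo_fry(unsigned pid)
{
	unsigned a = pid & 0x55555555u;	/* even-position bits */
	unsigned b = pid & 0xAAAAAAAAu;	/* odd-position bits */

	return (a << 1) | (b >> 1);
}

/* Compare by plain numeric value. */
static int demo_cmp_plain(const void *a, const void *b)
{
	unsigned x = *(const unsigned *)a;
	unsigned y = *(const unsigned *)b;

	return (x > y) - (x < y);
}

/* Compare by the scrambled key. */
static int demo_cmp_fried(const void *a, const void *b)
{
	unsigned x = demo_fry(*(const unsigned *)a);
	unsigned y = demo_fry(*(const unsigned *)b);

	return (x > y) - (x < y);
}

/* Sort numerically for the old interface, scrambled for the new one. */
static void demo_sort_pids(unsigned *pids, size_t n, int sane_behavior)
{
	qsort(pids, n, sizeof(*pids),
	      sane_behavior ? demo_cmp_fried : demo_cmp_plain);
}
```

The transform is a bijection, so sorting by it still groups duplicates next to each other while no longer promising numeric order to readers.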

cgroup: remove cgroup_pidlist->use_count · 04502365

Committed by Tejun Heo on Nov 29, 2013

After the recent changes, pidlist ref is held only between
cgroup_pidlist_start() and cgroup_pidlist_stop() during which
cgroup->pidlist_mutex is also held.  IOW, the reference count is
redundant now.  While in use, it's always one and pidlist_mutex is
held - holding the mutex has exactly the same protection.

This patch collapses destroy_dwork queueing into cgroup_pidlist_stop()
so that pidlist_mutex is not released in between and drops
pidlist->use_count.

This patch shouldn't introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

04502365

cgroup: load and release pidlists from seq_file start and stop respectively · 4bac00d1

Committed by Tejun Heo on Nov 29, 2013

Currently, pidlists are reference counted from file open and release
methods.  This means that holding onto an open file may waste memory
and reads may return data which is very stale.  Both aren't critical
because pidlists are keyed and shared per namespace and, well, the
user isn't supposed to have large delay between open and reads.

cgroup is planned to be converted to use kernfs and it'd be best if we
can stick to just the seq_file operations - start, next, stop and
show.  This can be achieved by loading pidlist on demand from start
and release with time delay from stop, so that consecutive reads don't
end up reloading the pidlist on each iteration.  This would remove the
need for hooking into open and release while also avoiding issues with
holding onto pidlist for too long.

The previous patches implemented delayed release and restructured
pidlist handling so that pidlists can be loaded and released from
seq_file start / stop.  This patch actually moves pidlist load to
start and release to stop.

This means that pidlist is pinned only between start and stop and may
go away between two consecutive read calls if the two calls are apart
by more than CGROUP_PIDLIST_DESTROY_DELAY.  cgroup_pidlist_start()
thus can't re-use the stored cgroup_pidlist_open_file->pidlist
directly.  During start, it's only used as a hint indicating whether
this is the first start after open or not and pidlist is always looked
up or created.

pidlist_mutex locking and reference counting are moved out of
pidlist_array_load() so that pidlist_array_load() can perform lookup
and creation atomically.  While this enlarges the area covered by
pidlist_mutex, given how the lock is used, it's highly unlikely to be
noticeable.

v2: Refreshed on top of the updated "cgroup: introduce struct
    cgroup_pidlist_open_file".

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

4bac00d1

cgroup: remove cgroup_pidlist->rwsem · 069df3b7

Committed by Tejun Heo on Nov 29, 2013

cgroup_pidlist locking is needlessly complicated.  It has outer
cgroup->pidlist_mutex to protect the list of pidlists associated with
a cgroup and then each pidlist has rwsem to synchronize updates and
reads.  Given that the only read access is from seq_file operations
which are always invoked back-to-back, the rwsem is a giant overkill.
All it does is add unnecessary complexity.

This patch removes cgroup_pidlist->rwsem and protects all accesses to
pidlists belonging to a cgroup with cgroup->pidlist_mutex.
pidlist->rwsem locking is removed if it's nested inside
cgroup->pidlist_mutex; otherwise, it's replaced with
cgroup->pidlist_mutex locking.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

069df3b7

cgroup: refactor cgroup_pidlist_find() · e6b81710

Committed by Tejun Heo on Nov 29, 2013

Rename cgroup_pidlist_find() to cgroup_pidlist_find_create() and
separate out finding proper to cgroup_pidlist_find().  Also, move
locking to the caller.

This patch is preparation for pidlist restructure and doesn't
introduce any behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

e6b81710

cgroup: introduce struct cgroup_pidlist_open_file · 62236858

Committed by Tejun Heo on Nov 29, 2013

For pidlist files, seq_file->private pointed to the loaded
cgroup_pidlist; however, pidlist loading is planned to be moved to
cgroup_pidlist_start() for kernfs conversion and seq_file->private
needs to carry more information from open to allow that.

This patch introduces struct cgroup_pidlist_open_file which contains
type, cgrp and pidlist and updates pidlist seq_file->private to point
to it using seq_open_private() and seq_release_private().  Note that
this eventually will be replaced by kernfs_open_file.

While this patch makes more information available to seq_file
operations, they don't use it yet and this patch doesn't introduce any
behavior changes except for allocation of the extra private struct.

v2: use __seq_open_private() instead of seq_open_private() for brevity
    as suggested by Li.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

62236858

cgroup: implement delayed destruction for cgroup_pidlist · b1a21367

Committed by Tejun Heo on Nov 29, 2013

Currently, pidlists are reference counted from file open and release
methods. This means that holding onto an open file may waste memory
and reads may return data which is very stale. Both aren't critical
because pidlists are keyed and shared per namespace and, well, the
user isn't supposed to have large delay between open and reads.

cgroup is planned to be converted to use kernfs and it'd be best if we
can stick to just the seq_file operations - start, next, stop and
show. This can be achieved by loading pidlist on demand from start
and release with time delay from stop, so that consecutive reads don't
end up reloading the pidlist on each iteration. This would remove the
need for hooking into open and release while also avoiding issues with
holding onto pidlist for too long.

This patch implements delayed release of pidlist. As pidlists could
be lingering on cgroup removal waiting for the timer to expire, cgroup
free path needs to queue the destruction work item immediately and
flush. As those work items are self-destroying, each work item can't
be flushed directly. A new workqueue - cgroup_pidlist_destroy_wq - is
added to serve as flush domain.

Note that this patch just adds delayed release on top of the current
implementation and doesn't change where pidlist is loaded and
released. Following patches will make those changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

b1a21367

cgroup: remove cftype->release() · b9f3ceca

Committed by Tejun Heo on Nov 29, 2013

Now that pidlist files don't use cftype->release(), it doesn't have
any user left.  Remove it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

b9f3ceca

cgroup: don't skip seq_open on write only opens on pidlist files · ac1e69aa

Committed by Tejun Heo on Nov 29, 2013

Currently, cgroup_pidlist_open() skips seq_open() and pidlist loading
if the file is opened write-only, which is a sensible optimization as
pidlist loading can be costly and there often are occasions where
tasks or cgroup.procs is opened write-only.  However, pidlist init and
release are planned to be moved to cgroup_pidlist_start/stop()
respectively which would make this optimization unnecessary.

This patch removes the optimization and always fully initializes
pidlist files regardless of open mode.  This will help moving pidlist
handling to start/stop by unifying rw paths and removes the need for
specifying cftype->release() in addition to .release in
cgroup_pidlist_operations as file->f_op is now always overridden.  As
pidlist files were the only user of cftype->release(), the next patch
will remove the method.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>

ac1e69aa

28 Nov, 2013 1 commit

cgroup: fix cgroup_subsys_state leak for seq_files · e605b365

Committed by Tejun Heo on Nov 27, 2013

If a cgroup file implements either read_map() or read_seq_string(),
such file is served using seq_file by overriding file->f_op to
cgroup_seqfile_operations, which also overrides the release method to
single_release() from cgroup_file_release().

Because cgroup_file_open() didn't use to acquire any resources, this
used to be fine, but since f7d58818 ("cgroup: pin
cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
pins the css (cgroup_subsys_state) which is put by
cgroup_file_release().  The patch forgot to update the release path
for seq_files and each open/release cycle leaks a css reference.

Fix it by updating cgroup_file_release() to also handle seq_files and
using it for seq_file release path too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v3.12

e605b365
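
The leak described above can be modelled in userspace as a reference count that open takes and only one of two release paths drops. This is a minimal sketch, not the kernel code: `css_model`, `file_model`, `seq_only_release()` and `unified_release()` are hypothetical names standing in for the css, the open file, the old seq_file release path and the updated cgroup_file_release().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a css: only the reference count matters here. */
struct css_model {
    int refcnt;
};

/* Hypothetical stand-in for an open cgroup file. */
struct file_model {
    struct css_model *css;
    bool seq_open;                          /* seq_file state is live */
    void (*release)(struct file_model *f);  /* models file->f_op->release */
};

/* Old seq_file path: tears down seq state but never puts the css. */
static void seq_only_release(struct file_model *f)
{
    f->seq_open = false;
}

/* Fixed path: one release handles seq state and drops the css pin. */
static void unified_release(struct file_model *f)
{
    if (f->seq_open)
        f->seq_open = false;
    assert(f->css->refcnt > 0);
    f->css->refcnt--;
}

/* Open always pins the css; `fixed` selects which release gets installed. */
static void model_open(struct file_model *f, struct css_model *css,
                       bool is_seq, bool fixed)
{
    css->refcnt++;
    f->css = css;
    f->seq_open = is_seq;
    f->release = (is_seq && !fixed) ? seq_only_release : unified_release;
}

/* Run `cycles` open/release pairs on a seq file, return the final count. */
static int refs_after_cycles(struct css_model *css, int cycles, bool fixed)
{
    struct file_model f;

    for (int i = 0; i < cycles; i++) {
        model_open(&f, css, true, fixed);
        f.release(&f);
    }
    return css->refcnt;
}
```

In this model, every open/release cycle through the old path leaves the count one higher, while the unified path returns it to its starting value.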

23 Nov, 2013 4 commits

cgroup: unexport cgroup_css() and remove __file_cft() · b36824c7

Committed by Tejun Heo on Nov 22, 2013

Now that cgroup_event is made memcg specific, the temporarily exported
functions are no longer necessary.  Unexport cgroup_css() and remove
__file_cft() which doesn't have any user left.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

b36824c7

cgroup, memcg: move cgroup->event_list[_lock] and event callbacks into memcg · fba94807

Committed by Tejun Heo on Nov 22, 2013

cgroup_event is being moved from cgroup core to memcg and the
implementation is already moved by the previous patch.  This patch
moves the data fields and callbacks.

* cgroup->event_list[_lock] are moved to mem_cgroup.

* cftype->[un]register_event() are moved to cgroup_event.  This makes
  it impossible for individual cftype definitions to specify their
  event callbacks.  This is worked around by simply hard-coding
  filename to event callback mapping in cgroup_write_event_control().
  This is awkward and inflexible, which is actually desirable given
  that we don't want to grow more usages of this feature.

* eventfd_ctx declaration is removed from cgroup.h, which makes
  vmpressure.h miss eventfd_ctx declaration.  Include eventfd.h from
  vmpressure.h.

v2: Use file name from dentry instead of cftype.  This will allow
    removing all cftype handling in the function.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>

fba94807
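
The hard-coded mapping from file name to event callback mentioned above could look roughly like the following userspace sketch. All function names here are hypothetical, and the three file names are only examples of memcg control files; the kernel's actual list and callback signatures are not shown in this log.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type; the kernel's real signature differs. */
typedef int (*register_event_fn)(const char *args);

/* Placeholder callbacks that only return a tag so callers can tell them apart. */
static int usage_register_event(const char *args)    { (void)args; return 1; }
static int oom_register_event(const char *args)      { (void)args; return 2; }
static int pressure_register_event(const char *args) { (void)args; return 3; }

/*
 * Map a control file name (taken from the dentry in the commit above)
 * to its event callback. Unknown names get NULL, so new users cannot
 * opt in without editing this function.
 */
static register_event_fn lookup_register_event(const char *name)
{
    if (!strcmp(name, "memory.usage_in_bytes"))
        return usage_register_event;
    if (!strcmp(name, "memory.oom_control"))
        return oom_register_event;
    if (!strcmp(name, "memory.pressure_level"))
        return pressure_register_event;
    return NULL;
}
```

A fixed if-chain like this is deliberately inflexible: adding an event source means editing the lookup itself, which matches the stated goal of not growing more users of the feature.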

cgroup, memcg: move cgroup_event implementation to memcg · 79bd9814

Committed by Tejun Heo on Nov 22, 2013

cgroup_event is way over-designed and tries to build a generic
flexible event mechanism into cgroup - fully customizable event
specification for each user of the interface.  This is utterly
unnecessary and overboard especially in the light of the planned
unified hierarchy as there's gonna be a single agent.  Simply generating
events at fixed points, or if that's too restrictive, configurable
cadence or single set of configurable points should be enough.

Thankfully, memcg is the only user and gets to keep it.  Replacing it
with something simpler on sane_behavior is strongly recommended.

This patch moves cgroup_event and "cgroup.event_control"
implementation to mm/memcontrol.c.  Clearing of events on cgroup
destruction is moved from cgroup_destroy_locked() to
mem_cgroup_css_offline(), which shouldn't make any noticeable
difference.

cgroup_css() and __file_cft() are exported to enable the move;
however, this will soon be reverted once the event code is updated to
be memcg specific.

Note that "cgroup.event_control" will now exist only on the hierarchy
with memcg attached to it.  While this change is visible to userland,
it is unlikely to be noticeable as the file has never been meaningful
outside memcg.

Aside from the above change, this is pure code relocation.

v2: Per Li Zefan's comments, init/Kconfig updated accordingly and
    poll.h inclusion moved from cgroup.c to memcontrol.c.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <bsingharora@gmail.com>

79bd9814

cgroup: use a dedicated workqueue for cgroup destruction · e5fca243

Committed by Tejun Heo on Nov 22, 2013

Since be445626 ("cgroup: remove synchronize_rcu() from
cgroup_diput()"), cgroup destruction path makes use of workqueue.  css
freeing is performed from a work item from that point on and a later
commit, ea15f8cc ("cgroup: split cgroup destruction into two
steps"), moves css offlining to workqueue too.

As cgroup destruction isn't depended upon for memory reclaim, the
destruction work items were put on the system_wq; unfortunately, some
controller may block in the destruction path for considerable duration
while holding cgroup_mutex.  As large part of destruction path is
synchronized through cgroup_mutex, when combined with high rate of
cgroup removals, this has potential to fill up system_wq's max_active
of 256.

Also, it turns out that memcg's css destruction path ends up queueing
and waiting for work items on system_wq through work_on_cpu().  If
such operation happens while system_wq is fully occupied by cgroup
destruction work items, work_on_cpu() can't make forward progress
because system_wq is full and other destruction work items on
system_wq can't make forward progress because the work item waiting
for work_on_cpu() is holding cgroup_mutex, leading to deadlock.

This can be fixed by queueing destruction work items on a separate
workqueue.  This patch creates a dedicated workqueue -
cgroup_destroy_wq - for this purpose.  As these work items shouldn't
have inter-dependencies and mostly serialized by cgroup_mutex anyway,
giving high concurrency level doesn't buy anything and the workqueue's
@max_active is set to 1 so that destruction work items are executed
one by one on each CPU.

Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
separate core_initcall().  In the future, we probably want to reorder
so that workqueue init happens before cgroup_init().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Hugh Dickins <hughd@google.com>
Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
Cc: stable@vger.kernel.org # v3.9+

e5fca243

16 Nov, 2013 1 commit

consolidate simple ->d_delete() instances · b26d4cd3

Committed by Al Viro on Oct 25, 2013

Rename simple_delete_dentry() to always_delete_dentry() and export it.
Export simple_dentry_operations, while we are at it, and get rid of
their duplicates.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

b26d4cd3

14 Oct, 2013 1 commit

cgroup: fix to break the while loop in cgroup_attach_task() correctly · ea84753c

Committed by Anjana V Kumar on Oct 12, 2013

Both Anjana and Eunki reported a stall in the while_each_thread loop
in cgroup_attach_task().

It's because, when we attach a single thread to a cgroup, if the task
is exiting or is already in that cgroup, we won't break the loop.

If the task is already in the cgroup, the bug can lead to another thread
being attached to the cgroup unexpectedly:

  # echo 5207 > tasks
  # cat tasks
  5207
  # echo 5207 > tasks
  # cat tasks
  5207
  5215

What's worse, if the task to be attached isn't the leader of the thread
group, we might never exit the loop, hence the cpu stall. Thanks to Oleg
for the analysis.

This bug was introduced by commit 081aa458
("cgroup: consolidate cgroup_attach_task() and cgroup_attach_proc()").

[ lizf: - fixed the first continue, pointed out by Oleg,
        - rewrote changelog. ]

Cc: <stable@vger.kernel.org> # 3.9+
Reported-by: Eunki Kim <eunki_kim@samsung.com>
Reported-by: Anjana V Kumar <anjanavk12@gmail.com>
Signed-off-by: Anjana V Kumar <anjanavk12@gmail.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

ea84753c
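
The unexpected attach shown in the transcript above can be reproduced with a small userspace model of the loop. This is a sketch under assumptions, not the kernel function: `struct task`, `attach_buggy()` and `attach_fixed()` are hypothetical, and a circular `next` pointer stands in for the thread-group iteration. The point it illustrates is that `continue` inside a do-while jumps straight to the loop condition, so the single-thread `break` is skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for one thread of a thread group. */
struct task {
    int pid;
    bool in_cgroup;     /* already a member of the destination cgroup */
    struct task *next;  /* circular list of the group's threads */
};

/* Buggy shape: `continue` skips the single-thread check at the bottom. */
static int attach_buggy(struct task *leader, bool threadgroup)
{
    struct task *tsk = leader;
    int attached = 0;

    assert(leader != NULL);
    do {
        if (tsk->in_cgroup)
            continue;   /* jumps to the while condition, not to the break */
        tsk->in_cgroup = true;
        attached++;
        if (!threadgroup)
            break;
    } while ((tsk = tsk->next) != leader);

    return attached;
}

/* Fixed shape: every path reaches the single-thread check. */
static int attach_fixed(struct task *leader, bool threadgroup)
{
    struct task *tsk = leader;
    int attached = 0;

    assert(leader != NULL);
    do {
        if (tsk->in_cgroup)
            goto next;
        tsk->in_cgroup = true;
        attached++;
next:
        if (!threadgroup)
            break;
    } while ((tsk = tsk->next) != leader);

    return attached;
}
```

With two threads 5207 and 5215 linked in a ring and 5207 already in the cgroup, a single-thread attach of 5207 through the buggy loop attaches 5215 instead, while the fixed loop attaches nothing.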

24 Sep, 2013 1 commit

cgroup: kill css_id · 2ff2a7d0

Committed by Li Zefan on Sep 23, 2013

The only user of css_id was memcg, and it has been converted to use
cgroup->id, so kill css_id.
Signed-off-by: Li Zefan <lizefan@huwei.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>

2ff2a7d0

openanolis / cloud-kernel synced successfully more than 1 year ago