提交 · 8cfd8147df67e741d93b8783a3ea8f3c74f93a0e · openanolis / cloud-kernel

21 7月, 2017 2 次提交

cgroup: implement cgroup v2 thread support · 8cfd8147

由 Tejun Heo 提交于 7月 21, 2017

This patch implements cgroup v2 thread support.  The goal of the
thread mode is supporting hierarchical accounting and control at
thread granularity while staying inside the resource domain model
which allows coordination across different resource controllers and
handling of anonymous resource consumptions.

A cgroup is always created as a domain and can be made threaded by
writing to the "cgroup.type" file.  When a cgroup becomes threaded, it
becomes a member of a threaded subtree which is anchored at the
closest ancestor which isn't threaded.

The threads of the processes which are in a threaded subtree can be
placed anywhere without being restricted by process granularity or
no-internal-process constraint.  Note that the threads aren't allowed
to escape to a different threaded subtree.  To be used inside a
threaded subtree, a controller should explicitly support threaded mode
and be able to handle internal competition in the way which is
appropriate for the resource.

The root of a threaded subtree, the nearest ancestor which isn't
threaded, is called the threaded domain and serves as the resource
domain for the whole subtree.  This is the last cgroup where domain
controllers are operational and where all the domain-level resource
consumptions in the subtree are accounted.  This allows threaded
controllers to operate at thread granularity when requested while
staying inside the scope of system-level resource distribution.

As the root cgroup is exempt from the no-internal-process constraint,
it can serve as both a threaded domain and a parent to normal cgroups,
so, unlike non-root cgroups, the root cgroup can have both domain and
threaded children.

Internally, in a threaded subtree, each css_set has its ->dom_cset
pointing to a matching css_set which belongs to the threaded domain.
This ensures that thread root level cgroup_subsys_state for all
threaded controllers are readily accessible for domain-level
operations.

This patch enables threaded mode for the pids and perf_events
controllers.  Neither has to worry about domain-level resource
consumptions and it's enough to simply set the flag.

For more details on the interface and behavior of the thread mode,
please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
by this patch.

v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
      Spotted by Waiman.
    - Documentation updated as suggested by Waiman.
    - cgroup.type content slightly reformatted.
    - Mark the debug controller threaded.

v4: - Updated to the general idea of marking specific cgroups
      domain/threaded as suggested by PeterZ.

v3: - Dropped "join" and always make mixed children join the parent's
      threaded subtree.

v2: - After discussions with Waiman, support for mixed thread mode is
      added.  This should address the issue that Peter pointed out
      where any nesting should be avoided for thread subtrees while
      coexisting with other domain cgroups.
    - Enabling / disabling thread mode now piggy backs on the existing
      control mask update mechanism.
    - Bug fixes and cleanup.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>

8cfd8147

cgroup: introduce cgroup->dom_cgrp and threaded css_set handling · 454000ad

由 Tejun Heo 提交于 5月 15, 2017

cgroup v2 is in the process of growing thread granularity support.  A
threaded subtree is composed of a thread root and threaded cgroups
which are proper members of the subtree.

The root cgroup of the subtree serves as the domain cgroup to which
the processes (as opposed to threads / tasks) of the subtree
conceptually belong and domain-level resource consumptions not tied to
any specific task are charged.  Inside the subtree, threads won't be
subject to process granularity or no-internal-task constraint and can
be distributed arbitrarily across the subtree.

This patch introduces cgroup->dom_cgrp along with threaded css_set
handling.

* cgroup->dom_cgrp points to self for normal and thread roots.  For
  proper thread subtree members, points to the dom_cgrp (the thread
  root).

* css_set->dom_cset points to self if for normal and thread roots.  If
  threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
  The dom_cgrp serves as the resource domain and keeps the matching
  csses available.  The dom_cset holds those csses and makes them
  easily accessible.

* All threaded csets are linked on their dom_csets to enable iteration
  of all threaded tasks.

* cgroup->nr_threaded_children keeps track of the number of threaded
  children.

This patch adds the above but doesn't actually use them yet.  The
following patches will build on top.

v4: ->nr_threaded_children added.

v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset.  Updated for the new
    enable-threaded-per-cgroup behavior.

v2: Added cgroup_is_threaded() helper.
Signed-off-by: NTejun Heo <tj@kernel.org>

454000ad

17 7月, 2017 1 次提交

cgroup: distinguish local and children populated states · 788b950c

由 Tejun Heo 提交于 7月 16, 2017

cgrp->populated_cnt counts both local (the cgroup's populated
css_sets) and subtree proper (populated children) so that it's only
zero when the whole subtree, including self, is empty.

This patch splits the counter into two so that local and children
populated states are tracked separately.  It allows finer-grained
tests on the state of the hierarchy which will be used to replace
css_set walking local populated test.
Signed-off-by: NTejun Heo <tj@kernel.org>

788b950c

29 6月, 2017 1 次提交

cgroup: implement "nsdelegate" mount option · 5136f636

由 Tejun Heo 提交于 6月 27, 2017

Currently, cgroup only supports delegation to !root users and cgroup
namespaces don't get any special treatments.  This limits the
usefulness of cgroup namespaces as they by themselves can't be safe
delegation boundaries.  A process inside a cgroup can change the
resource control knobs of the parent in the namespace root and may
move processes in and out of the namespace if cgroups outside its
namespace are visible somehow.

This patch adds a new mount option "nsdelegate" which makes cgroup
namespaces delegation boundaries.  If set, cgroup behaves as if write
permission based delegation took place at namespace boundaries -
writes to the resource control knobs from the namespace root are
denied and migration crossing the namespace boundary aren't allowed
from inside the namespace.

This allows cgroup namespace to function as a delegation boundary by
itself.

v2: Silently ignore nsdelegate specified on !init mounts.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Aravind Anbudurai <aru7@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Eric Biederman <ebiederm@xmission.com>

5136f636

15 6月, 2017 1 次提交

cgroup: Keep accurate count of tasks in each css_set · 73a7242a

由 Waiman Long 提交于 6月 13, 2017

The reference count in the css_set data structure was used as a
proxy of the number of tasks attached to that css_set. However, that
count is actually not an accurate measure especially with thread mode
support. So a new variable nr_tasks is added to the css_set to keep
track of the actual task count. This new variable is protected by
the css_set_lock. Functions that require the actual task count are
updated to use the new variable.

tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps.
    Refreshed on top of cgroup/for-v4.13 which dropped on
    css_set_populated() -> nr_tasks conversion.
Signed-off-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

73a7242a

18 5月, 2017 1 次提交

cgroup: Prevent kill_css() from being called more than once · 33c35aa4

由 Waiman Long 提交于 5月 15, 2017

The kill_css() function may be called more than once under the condition
that the css was killed but not physically removed yet followed by the
removal of the cgroup that is hosting the css. This patch prevents any
harmm from being done when that happens.
Signed-off-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.5+

33c35aa4

11 4月, 2017 1 次提交

cgroup: move cgroup_subsys_state parent field for cache locality · b8b1a2e5

由 Todd Poynor 提交于 4月 06, 2017

Various structures embed a struct cgroup_subsys_state, typically at
the top of the containing structure.  It is common for code that
accesses the structures to perform operations that iterate over the
chain of parent css pointers, also accessing data in each containing
structure.  In particular, struct cpuacct is used by fairly hot code
paths in the scheduler such as cpuacct_charge().

Move the parent css pointer field to the end of the structure to
increase the chances of residing in the same cache line as the data
from the containing structure.
Signed-off-by: NTodd Poynor <toddpoynor@google.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

b8b1a2e5

09 3月, 2017 1 次提交

kernel: convert css_set.refcount from atomic_t to refcount_t · 4b9502e6

由 Elena Reshetova 提交于 3月 08, 2017

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows to avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: NElena Reshetova <elena.reshetova@intel.com>
Signed-off-by: NHans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: NKees Cook <keescook@chromium.org>
Signed-off-by: NDavid Windsor <dwindsor@gmail.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

4b9502e6

02 3月, 2017 1 次提交

sched/headers, cgroups: Remove the threadgroup_change_*() wrappery · 780de9dd

由 Ingo Molnar 提交于 2月 02, 2017

threadgroup_change_begin()/end() is a pointless wrapper around
cgroup_threadgroup_change_begin()/end(), minus a might_sleep()
in the !CONFIG_CGROUPS=y case.

Remove the wrappery, move the might_sleep() (the down_read()
already has a might_sleep() check).

This debloats <linux/sched.h> a bit and simplifies this API.

Update all call sites.

No change in functionality.
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: NIngo Molnar <mingo@kernel.org>

780de9dd

28 12月, 2016 2 次提交

cgroup: reorder css_set fields · 5f617ebb

由 Tejun Heo 提交于 12月 27, 2016

Reorder css_set fields so that they're roughly in the order of how hot
they are.  The rough order is

1. the actual csses
2. reference counter and the default cgroup pointer.
3. task lists and iterations
4. fields used during merge including css_set lookup
5. the rest
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NAcked-by: Zefan Li <lizefan@huawei.com>

5f617ebb

cgroup add cftype->open/release() callbacks · e90cbebc

由 Tejun Heo 提交于 12月 27, 2016

Pipe the newly added kernfs->open/release() callbacks through cftype.
While at it, as cleanup operations now can be performed from
->release() instead of ->seq_stop(), make the latter optional.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NAcked-by: Zefan Li <lizefan@huawei.com>

e90cbebc

26 11月, 2016 1 次提交

cgroup: add support for eBPF programs · 30070984

由 Daniel Mack 提交于 11月 23, 2016

This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
        \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.
Signed-off-by: NDaniel Mack <daniel@zonque.org>
Acked-by: NAlexei Starovoitov <ast@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

30070984

26 4月, 2016 1 次提交

cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback · 5cf1cacb

由 Tejun Heo 提交于 4月 21, 2016

Since e93ad19d ("cpuset: make mm migration asynchronous"), cpuset
kicks off asynchronous NUMA node migration if necessary during task
migration and flushes it from cpuset_post_attach_flush() which is
called at the end of __cgroup_procs_write().  This is to avoid
performing migration with cgroup_threadgroup_rwsem write-locked which
can lead to deadlock through dependency on kworker creation.

memcg has a similar issue with charge moving, so let's convert it to
an official callback rather than the current one-off cpuset specific
function.  This patch adds cgroup_subsys->post_attach callback and
makes cpuset register cpuset_post_attach_flush() as its ->post_attach.

The conversion is mostly one-to-one except that the new callback is
called under cgroup_mutex.  This is to guarantee that no other
migration operations are started before ->post_attach callbacks are
finished.  cgroup_mutex is one of the outermost mutex in the system
and has never been and shouldn't be a problem.  We can add specialized
synchronization around __cgroup_procs_write() but I don't think
there's any noticeable benefit.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch

5cf1cacb

17 3月, 2016 1 次提交

cgroup: ignore css_sets associated with dead cgroups during migration · 2b021cbf

由 Tejun Heo 提交于 3月 15, 2016

Before 2e91fa7f ("cgroup: keep zombies associated with their
original cgroups"), all dead tasks were associated with init_css_set.
If a zombie task is requested for migration, while migration prep
operations would still be performed on init_css_set, the actual
migration would ignore zombie tasks.  As init_css_set is always valid,
this worked fine.

However, after 2e91fa7f, zombie tasks stay with the css_set it was
associated with at the time of death.  Let's say a task T associated
with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2.  After T
becomes a zombie, it would still remain associated with A and B.  If A
only contains zombie tasks, it can be removed.  On removal, A gets
marked offline but stays pinned until all zombies are drained.  At
this point, if migration is initiated on T to a cgroup C on hierarchy
H-2, migration path would try to prepare T's css_set for migration and
trigger the following.

 WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
 CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
 ...
 Call Trace:
  [<ffffffff8127e63c>] dump_stack+0x4e/0x82
  [<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
  [<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
  [<ffffffff810c33e1>] cgroup_get+0x121/0x160
  [<ffffffff810c349b>] link_css_set+0x7b/0x90
  [<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
  [<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
  [<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
  [<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
  [<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
  [<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
  [<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
  [<ffffffff81151673>] __vfs_write+0x23/0xe0
  [<ffffffff81152494>] vfs_write+0xa4/0x1a0
  [<ffffffff811532d4>] SyS_write+0x44/0xa0
  [<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f

It doesn't make sense to prepare migration for css_sets pointing to
dead cgroups as they are guaranteed to contain only zombies which are
ignored later during migration.  This patch makes cgroup destruction
path mark all affected css_sets as dead and updates the migration path
to ignore them during preparation.
Signed-off-by: NTejun Heo <tj@kernel.org>
Fixes: 2e91fa7f ("cgroup: keep zombies associated with their original cgroups")
Cc: stable@vger.kernel.org # v4.4+

2b021cbf

09 3月, 2016 2 次提交

cgroup: implement cgroup_subsys->implicit_on_dfl · f6d635ad

由 Tejun Heo 提交于 3月 08, 2016

Some controllers, perf_event for now and possibly freezer in the
future, don't really make sense to control explicitly through
"cgroup.subtree_control".  For example, the primary role of perf_event
is identifying the cgroups of tasks; however, because the controller
also keeps a small amount of state per cgroup, it can't be replaced
with simple cgroup membership tests.

This patch implements cgroup_subsys->implicit_on_dfl flag.  When set,
the controller is implicitly enabled on all cgroups on the v2
hierarchy so that utility type controllers such as perf_event can be
enabled and function transparently.

An implicit controller doesn't show up in "cgroup.controllers" or
"cgroup.subtree_control", is exempt from no internal process rule and
can be stolen from the default hierarchy even if there are non-root
csses.

v2: Reimplemented on top of the recent updates to css handling and
    subsystem rebinding.  Rebinding implicit subsystems is now a
    simple matter of exempting it from the busy subsystem check.
Signed-off-by: NTejun Heo <tj@kernel.org>

f6d635ad

cgroup: use css_set->mg_dst_cgrp for the migration target cgroup · e4857982

由 Tejun Heo 提交于 3月 08, 2016

Migration can be multi-target on the default hierarchy when a
controller is enabled - processes belonging to each child cgroup have
to be moved to the child cgroup itself to refresh css association.

This isn't a problem for cgroup_migrate_add_src() as each source
css_set still maps to single source and target cgroups; however,
cgroup_migrate_prepare_dst() is called once after all source css_sets
are added and thus might not have a single destination cgroup.  This
is currently worked around by specifying NULL for @dst_cgrp and using
the source's default cgroup as destination as the only multi-target
migration in use is self-targetting.  While this works, it's subtle
and clunky.

As all taget cgroups are already specified while preparing the source
css_sets, this clunkiness can easily be removed by recording the
target cgroup in each source css_set.  This patch adds
css_set->mg_dst_cgrp which is recorded on cgroup_migrate_src() and
used by cgroup_migrate_prepare_dst().  This also makes migration code
ready for arbitrary multi-target migration.
Signed-off-by: NTejun Heo <tj@kernel.org>

e4857982

03 3月, 2016 2 次提交

cgroup: introduce cgroup_{save|propagate|restore}_control() · 15a27c36

由 Tejun Heo 提交于 3月 03, 2016

While controllers are being enabled and disabled in
cgroup_subtree_control_write(), the original subsystem masks are
stashed in local variables so that they can be restored if the
operation fails in the middle.

This patch adds dedicated fields to struct cgroup to be used instead
of the local variables and implements functions to stash the current
values, propagate the changes and restore them recursively.  Combined
with the previous changes, this makes subsystem management operations
fully recursive and modularlized.  This will be used to expand cgroup
core functionalities.

While at it, remove now unused @css_enable and @css_disable from
cgroup_subtree_control_write().
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NZefan Li <lizefan@huawei.com>

15a27c36

cgroup: explicitly track whether a cgroup_subsys_state is visible to userland · 88cb04b9

由 Tejun Heo 提交于 3月 03, 2016

Currently, whether a css (cgroup_subsys_state) has its interface files
created is not tracked and assumed to change together with the owning
cgroup's lifecycle.  cgroup directory and interface creation is being
separated out from internal object creation to help refactoring and
eventually allow cgroups which are not visible through cgroupfs.

This patch adds CSS_VISIBLE to track whether a css has its interface
files created and perform management operations only when necessary
which helps decoupling interface file handling from internal object
lifecycle.  After this patch, all css interface file management
functions can be called regardless of the current state and will
achieve the expected result.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NZefan Li <lizefan@huawei.com>

88cb04b9

23 2月, 2016 4 次提交

cgroup: convert cgroup_subsys flag fields to bool bitfields · b38e42e9

由 Tejun Heo 提交于 2月 23, 2016

Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>

b38e42e9

cgroup: make cgroup subsystem masks u16 · 6e5c8307

由 Tejun Heo 提交于 2月 22, 2016

After the recent do_each_subsys_mask() conversion, there's no reason
to use ulong for subsystem masks.  We'll be adding more subsystem
masks to persistent data structures, let's reduce its size to u16
which should be enough for now and the foreseeable future.

This doesn't create any noticeable behavior differences.

v2: Johannes spotted that the initial patch missed cgroup_no_v1_mask.
    Converted.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>

6e5c8307

cgroup: s/child_subsys_mask/subtree_ss_mask/ · 8699b776

由 Tejun Heo 提交于 2月 22, 2016

For consistency with cgroup->subtree_control.

* cgroup->child_subsys_mask -> cgroup->subtree_ss_mask
* cgroup_calc_child_subsys_mask() -> cgroup_calc_subtree_ss_mask()
* cgroup_refresh_child_subsys_mask() -> cgroup_refresh_subtree_ss_mask()

No functional changes.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>

8699b776

Revert "cgroup: add cgroup_subsys->css_e_css_changed()" · 5eb385cc

由 Tejun Heo 提交于 2月 22, 2016

This reverts commit 56c807ba.

cgroup_subsys->css_e_css_changed() was supposed to be used by cgroup
writeback support; however, the change to per-inode cgroup association
made it unnecessary and the callback doesn't have any user.  Remove
it.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>

5eb385cc

22 1月, 2016 1 次提交

cgroup: make sure a parent css isn't offlined before its children · aa226ff4

由 Tejun Heo 提交于 1月 21, 2016

There are three subsystem callbacks in css shutdown path -
css_offline(), css_released() and css_free().  Except for
css_released(), cgroup core didn't guarantee the order of invocation.
css_offline() or css_free() could be called on a parent css before its
children.  This behavior is unexpected and led to bugs in cpu and
memory controller.

This patch updates offline path so that a parent css is never offlined
before its children.  Each css keeps online_cnt which reaches zero iff
itself and all its children are offline and offline_css() is invoked
only after online_cnt reaches zero.

This fixes the memory controller bug and allows the fix for cpu
controller.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-and-tested-by: NChristian Borntraeger <borntraeger@de.ibm.com>
Reported-by: NBrian Christiansen <brian.o.christiansen@gmail.com>
Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org

aa226ff4

10 12月, 2015 1 次提交

cgroup: fix sock_cgroup_data initialization on earlier compilers · ad2c8c73

由 Tejun Heo 提交于 12月 09, 2015

sock_cgroup_data is a struct containing an anonymous union.
sock_cgroup_set_prioidx() and sock_cgroup_set_classid() were
initializing a field inside the anonymous union as follows.

 struct sock_ccgroup_data skcd_buf = { .val = VAL };

While this is fine on more recent compilers, gcc-4.4.7 triggers the
following errors.

 include/linux/cgroup-defs.h: In function ‘sock_cgroup_set_prioidx’:
 include/linux/cgroup-defs.h:619: error: unknown field ‘val’ specified in initializer
 include/linux/cgroup-defs.h:619: warning: missing braces around initializer
 include/linux/cgroup-defs.h:619: warning: (near initialization for ‘skcd_buf.<anonymous>’)

This is because .val belongs to the anonymous union nested inside the
struct but the initializer is missing the nesting.  Fix it by adding
an extra pair of braces.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NAlaa Hleihel <alaa@dev.mellanox.co.il>
Fixes: bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ad2c8c73

09 12月, 2015 2 次提交

sock, cgroup: add sock->sk_cgroup · bd1060a1

由 Tejun Heo 提交于 12月 07, 2015

In cgroup v1, dealing with cgroup membership was difficult because the
number of membership associations was unbound.  As a result, cgroup v1
grew several controllers whose primary purpose is either tagging
membership or pull in configuration knobs from other subsystems so
that cgroup membership test can be avoided.

net_cls and net_prio controllers are examples of the latter.  They
allow configuring network-specific attributes from cgroup side so that
network subsystem can avoid testing cgroup membership; unfortunately,
these are not only cumbersome but also problematic.

Both net_cls and net_prio aren't properly hierarchical.  Both inherit
configuration from the parent on creation but there's no interaction
afterwards.  An ancestor doesn't restrict the behavior in its subtree
in anyway and configuration changes aren't propagated downwards.
Especially when combined with cgroup delegation, this is problematic
because delegatees can mess up whatever network configuration
implemented at the system level.  net_prio would allow the delegatees
to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
the same for classid.

While it is possible to solve these issues from controller side by
implementing hierarchical allowable ranges in both controllers, it
would involve quite a bit of complexity in the controllers and further
obfuscate network configuration as it becomes even more difficult to
tell what's actually being configured looking from the network side.
While not much can be done for v1 at this point, as membership
handling is sane on cgroup v2, it'd be better to make cgroup matching
behave like other network matches and classifiers than introducing
further complications.

In preparation, this patch updates sock->sk_cgrp_data handling so that
it points to the v2 cgroup that sock was created in until either
net_prio or net_cls is used.  Once either of the two is used,
sock->sk_cgrp_data reverts to its previous role of carrying prioidx
and classid.  This is to avoid adding yet another cgroup related field
to struct sock.

As the mode switching can happen at most once per boot, the switching
mechanism is aimed at lowering hot path overhead.  It may leak a
finite, likely small, number of cgroup refs and report spurious
prioidx or classid on switching; however, dynamic updates of prioidx
and classid have always been racy and lossy - socks between creation
and fd installation are never updated, config changes don't update
existing sockets at all, and prioidx may index with dead and recycled
cgroup IDs.  Non-critical inaccuracies from small race windows won't
make any noticeable difference.

This patch doesn't make use of the pointer yet.  The following patch
will implement netfilter match for cgroup2 membership.

v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
    cgroup specific field.

v3: Add comments explaining why sock_data_prioidx() and
    sock_data_classid() use different fallback values.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
CC: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bd1060a1

net: wrap sock->sk_cgrp_prioidx and ->sk_classid inside a struct · 2a56a1fe

由 Tejun Heo 提交于 12月 07, 2015

Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
->sk_cgroup_prioidx and ->sk_classid are moved into it.  The struct
and its accessors are defined in cgroup-defs.h.  This is to prepare
for overloading the fields with a cgroup pointer.

This patch mostly performs equivalent conversions but the followings
are noteworthy.

* Equality test before updating classid is removed from
  sock_update_classid().  This shouldn't make any noticeable
  difference and a similar test will be implemented on the helper side
  later.

* sock_update_netprioidx() now takes struct sock_cgroup_data and can
  be moved to netprio_cgroup.h without causing include dependency
  loop.  Moved.

* The dummy version of sock_update_netprioidx() converted to a static
  inline function while at it.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2a56a1fe

03 12月, 2015 2 次提交

cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends · b53202e6

由 Oleg Nesterov 提交于 12月 03, 2015

Now that nobody use the "priv" arg passed to can_fork/cancel_fork/fork we can
kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc and cgrp_ss_priv[] in copy_process().
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

b53202e6

cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5

由 Tejun Heo 提交于 12月 03, 2015

Consider the following v2 hierarchy.

  P0 (+memory) --- P1 (-memory) --- A
                                 \- B
       
P0 has memory enabled in its subtree_control while P1 doesn't.  If
both A and B contain processes, they would belong to the memory css of
P1.  Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes the latter.  IOW, enabling controllers
can cause atomic migrations into different csses.

The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated which is the parent of the
target csses.  pids controller depends on the migration methods to
move charges and this made the controller attribute charges to the
wrong csses often triggering the following warning by driving a
counter negative.

 WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
 Modules linked in:
 CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
 ...
  ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
  ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
  ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
 Call Trace:
  [<ffffffff81551ffc>] dump_stack+0x4e/0x82
  [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
  [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
  [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
  [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
  [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
  [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
  [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
  [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
  [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
  [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
  [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
  [<ffffffff81265f88>] __vfs_write+0x28/0xe0
  [<ffffffff812666fc>] vfs_write+0xac/0x1a0
  [<ffffffff81267019>] SyS_write+0x49/0xb0
  [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76

This patch fixes the bug by removing @css parameter from the three
migration methods, ->can_attach, ->cancel_attach() and ->attach() and
updating cgroup_taskset iteration helpers also return the destination
css in addition to the task being migrated.  All controllers are
updated accordingly.

* Controllers which don't care whether there are one or multiple
  target csses can be converted trivially.  cpu, io, freezer, perf,
  netclassid and netprio fall in this category.

* cpuset's current implementation assumes that there's single source
  and destination and thus doesn't support v2 hierarchy already.  The
  only change made by this patchset is how that single destination css
  is obtained.

* memory migration path already doesn't do anything on v2.  How the
  single destination css is obtained is updated and the prep stage of
  mem_cgroup_can_attach() is reordered to accomodate the change.

* pids is the only controller which was affected by this bug.  It now
  correctly handles multi-destination migrations and no longer causes
  counter underflow from incorrect accounting.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-and-tested-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
Cc: Aleksa Sarai <cyphar@cyphar.com>

1f7dd3e5

21 11月, 2015 1 次提交

cgroup: record ancestor IDs and reimplement cgroup_is_descendant() using it · b11cfb58

由 Tejun Heo 提交于 11月 20, 2015

cgroup_is_descendant() currently walks up the hierarchy and compares
each ancestor to the cgroup in question. While enough for cgroup core
usages, this can't be used in hot paths to test cgroup membership.
This patch adds cgroup->ancestor_ids[] which records the IDs of all
ancestors including self and cgroup->level for the nesting level.

This allows testing whether a given cgroup is a descendant of another
in three finite steps - testing whether the two belong to the same
hierarchy, whether the descendant candidate is at the same or a higher
level than the ancestor and comparing the recorded ancestor_id at the
matching level. cgroup_is_descendant() is accordingly reimplmented
and made inline.
Signed-off-by: NTejun Heo <tj@kernel.org>

b11cfb58

17 11月, 2015 1 次提交

cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type · 67e9c74b

由 Tejun Heo 提交于 11月 16, 2015

With major controllers - cpu, memory and io - shaping up for the
unified hierarchy, cgroup2 is about ready to be, gradually, released
into the wild.  Replace __DEVEL__sane_behavior flag which was used to
select the unified hierarchy with a separate filesystem type "cgroup2"
so that unified hierarchy can be mounted as follows.

  mount -t cgroup2 none $MOUNT_POINT

The cgroup2 fs has its own magic number - 0x63677270 ("cgrp").

v2: Assign a different magic number to cgroup2 fs.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NLi Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

67e9c74b

16 11月, 2015 1 次提交

cgroup: fix cftype->file_offset handling · 34c06254

由 Tejun Heo 提交于 11月 05, 2015

6f60eade ("cgroup: generalize obtaining the handles of and
notifying cgroup files") introduced cftype->file_offset so that the
handles for per-css file instances can be recorded.  These handles
then can be used, for example, to generate file modified
notifications.

Unfortunately, it made the wrong assumption that files are created
once for a given css and removed on its destruction.  Due to the
dependencies among subsystems, a css may be hidden from userland and
then later shown again.  This is implemented by removing and
re-creating the affected files, so the associated kernfs_node for a
given cgroup file may change over time.  This incorrect assumption led
to the corruption of css->files lists.

Reimplement cftype->file_offset handling so that cgroup_file->kn is
protected by a lock and updated as files are created and destroyed.
This also makes keeping them on per-cgroup list unnecessary.
Signed-off-by: NTejun Heo <tj@kernel.org>
Reported-by: NJames Sedgwick <jsedgwick@fb.com>
Fixes: 6f60eade ("cgroup: generalize obtaining the handles of and notifying cgroup files")
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NZefan Li <lizefan@huawei.com>

34c06254

16 10月, 2015 4 次提交

cgroup: add cgroup_subsys->free() method and use it to fix pids controller · afcf6c8b

由 Tejun Heo 提交于 10月 15, 2015

pids controller is completely broken in that it uncharges when a task
exits allowing zombies to escape resource control.  With the recent
updates, cgroup core now maintains cgroup association till task free
and pids controller can be fixed by uncharging on free instead of
exit.

This patch adds cgroup_subsys->free() method and update pids
controller to use it instead of ->exit() for uncharging.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>

afcf6c8b

cgroup: keep zombies associated with their original cgroups · 2e91fa7f

由 Tejun Heo 提交于 10月 15, 2015

cgroup_exit() is called when a task exits and disassociates the
exiting task from its cgroups and half-attach it to the root cgroup.
This is unnecessary and undesirable.

No controller actually needs an exiting task to be disassociated with
non-root cgroups.  Both cpu and perf_event controllers update the
association to the root cgroup from their exit callbacks just to keep
consistent with the cgroup core behavior.

Also, this disassociation makes it difficult to track resources held
by zombies or determine where the zombies came from.  Currently, pids
controller is completely broken as it uncharges on exit and zombies
always escape the resource restriction.  With cgroup association being
reset on exit, fixing it is pretty painful.

There's no reason to reset cgroup membership on exit.  The zombie can
be removed from its css_set so that it doesn't show up on
"cgroup.procs" and thus can't be migrated or interfere with cgroup
removal.  It can still pin and point to the css_set so that its cgroup
membership is maintained.  This patch makes cgroup core keep zombies
associated with their cgroups at the time of exit.

* Previous patches decoupled populated_cnt tracking from css_set
  lifetime, so a dying task can be simply unlinked from its css_set
  while pinning and pointing to the css_set.  This keeps css_set
  association from task side alive while hiding it from "cgroup.procs"
  and populated_cnt tracking.  The css_set reference is dropped when
  the task_struct is freed.

* ->exit() callback no longer needs the css arguments as the
  associated css never changes once PF_EXITING is set.  Removed.

* cpu and perf_events controllers no longer need ->exit() callbacks.
  There's no reason to explicitly switch away on exit.  The final
  schedule out is enough.  The callbacks are removed.

* On traditional hierarchies, nothing changes.  "/proc/PID/cgroup"
  still reports "/" for all zombies.  On the default hierarchy,
  "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
  to at the time of exit.  If the cgroup gets removed before the task
  is reaped, " (deleted)" is appended.

v2: Build brekage due to missing dummy cgroup_free() when
    !CONFIG_CGROUP fixed.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>

2e91fa7f

cgroup: don't hold css_set_rwsem across css task iteration · ed27b9f7

由 Tejun Heo 提交于 10月 15, 2015

css_sets are synchronized through css_set_rwsem but the locking scheme
is kinda bizarre.  The hot paths - fork and exit - have to write lock
the rwsem making the rw part pointless; furthermore, many readers
already hold cgroup_mutex.

One of the readers is css task iteration.  It read locks the rwsem
over the entire duration of iteration.  This leads to silly locking
behavior.  When cpuset tries to migrate processes of a cgroup to a
different NUMA node, css_set_rwsem is held across the entire migration
attempt which can take a long time locking out forking, exiting and
other cgroup operations.

This patch updates css task iteration so that it locks css_set_rwsem
only while the iterator is being advanced.  css task iteration
involves two levels - css_set and task iteration.  As css_sets in use
are practically immutable, simply pinning the current one is enough
for resuming iteration afterwards.  Task iteration is tricky as tasks
may leave their css_set while iteration is in progress.  This is
solved by keeping track of active iterators and advancing them if
their next task leaves its css_set.

v2: put_task_struct() in css_task_iter_next() moved outside
    css_set_rwsem.  A later patch will add cgroup operations to
    task_struct free path which may grab the same lock and this avoids
    deadlock possibilities.

    css_set_move_task() updated to use list_for_each_entry_safe() when
    walking task_iters and advancing them.  This is necessary as
    advancing an iter may remove it from the list.
Signed-off-by: NTejun Heo <tj@kernel.org>

ed27b9f7

cgroup: make cgroup->nr_populated count the number of populated css_sets · 0de0942d

由 Tejun Heo 提交于 10月 15, 2015

Currently, cgroup->nr_populated counts whether the cgroup has any
css_sets linked to it and the number of children which has non-zero
->nr_populated.  This works because a css_set's refcnt converges with
the number of tasks linked to it and thus there's no css_set linked to
a cgroup if it doesn't have any live tasks.

To help tracking resource usage of zombie tasks, putting the ref of
css_set will be separated from disassociating the task from the
css_set which means that a cgroup may have css_sets linked to it even
when it doesn't have any live tasks.

This patch updates cgroup->nr_populated so that for the cgroup itself
it counts the number of css_sets which have tasks associated with them
so that empty css_sets don't skew the populated test.
Signed-off-by: NTejun Heo <tj@kernel.org>

0de0942d

19 9月, 2015 3 次提交

cgroup: generalize obtaining the handles of and notifying cgroup files · 6f60eade

由 Tejun Heo 提交于 9月 18, 2015

cgroup core handles creations and removals of cgroup interface files
as described by cftypes.  There are cases where the handle for a given
file instance is necessary, for example, to generate a file modified
event.  Currently, this is handled by explicitly matching the callback
method pointer and storing the file handle manually in
cgroup_add_file().  While this simple approach works for cgroup core
files, it can't for controller interface files.

This patch generalizes cgroup interface file handle handling.  struct
cgroup_file is defined and each cftype can optionally tell cgroup core
to store the file handle by setting ->file_offset.  A file handle
remains accessible as long as the containing css is accessible.

Both "cgroup.procs" and "cgroup.events" are converted to use the new
generic mechanism instead of hooking directly into cgroup_add_file().
Also, cgroup_file_notify() which takes a struct cgroup_file and
generates a file modified event on it is added and replaces explicit
kernfs_notify() invocations.

This generalizes cgroup file handle handling and allows controllers to
generate file modified notifications.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

6f60eade

cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE · 7dbdb199

由 Tejun Heo 提交于 9月 18, 2015

cftype->mode allows controllers to give arbitrary permissions to
interface knobs.  Except for "cgroup.event_control", the existing uses
are spurious.

* Some explicitly specify S_IRUGO | S_IWUSR even though that's the
  default.

* "cpuset.memory_pressure" specifies S_IRUGO while also setting a
  write callback which returns -EACCES.  All it needs to do is simply
  not setting a write callback.

"cgroup.event_control" uses cftype->mode to make the file
world-writable.  It's a misdesigned interface and we don't want
controllers to be tweaking interface file permissions in general.
This patch removes cftype->mode and all its spurious uses and
implements CFTYPE_WORLD_WRITABLE for "cgroup.event_control" which is
marked as compatibility-only.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

7dbdb199

cgroup: replace "cgroup.populated" with "cgroup.events" · 4a07c222

由 Tejun Heo 提交于 9月 18, 2015

memcg already uses "memory.events" for event reporting and other
controllers may need event reporting too.  Let's standardize on
"$SUBSYS.events" interface file for reporting events which don't
happen too frequently and thus can share event notification.

"cgroup.populated" is replaced with "populated" field in
"cgroup.events" and documentation is updated accordingly.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>

4a07c222

18 9月, 2015 1 次提交

cgroup: replace cgroup_subsys->disabled tests with cgroup_subsys_enabled() · fc5ed1e9

由 Tejun Heo 提交于 9月 18, 2015

Replace cgroup_subsys->disabled tests in controllers with
cgroup_subsys_enabled().  cgroup_subsys_enabled() requires literal
subsys name as its parameter and thus can't be used for cgroup core
which iterates through controllers.  For cgroup core, introduce and
use cgroup_ssid_enabled() which uses slower static_key_enabled() test
and can be indexed by subsys ID.

This leaves cgroup_subsys->disabled unused.  Removed.
Signed-off-by: NTejun Heo <tj@kernel.org>
Acked-by: NZefan Li <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>

fc5ed1e9

17 9月, 2015 1 次提交

sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem · 1ed13287

由 Tejun Heo 提交于 9月 16, 2015

Note: This commit was originally committed as d59cfc09 but got
      reverted by 0c986253 due to the performance regression from
      the percpu_rwsem write down/up operations added to cgroup task
      migration path.  percpu_rwsem changes which alleviate the
      performance issue are pending for v4.4-rc1 merge window.
      Re-apply.

The cgroup side of threadgroup locking uses signal_struct->group_rwsem
to synchronize against threadgroup changes.  This per-process rwsem
adds small overhead to thread creation, exit and exec paths, forces
cgroup code paths to do lock-verify-unlock-retry dance in a couple
places and makes it impossible to atomically perform operations across
multiple processes.

This patch replaces signal_struct->group_rwsem with a global
percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
side and contained in cgroups proper.  This patch converts one-to-one.

This does make writer side heavier and lower the granularity; however,
cgroup process migration is a fairly cold path, we do want to optimize
thread operations over it and cgroup migration operations don't take
enough time for the lower granularity to matter.
Signed-off-by: NTejun Heo <tj@kernel.org>
Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>

1ed13287

openanolis / cloud-kernel 11 个月 前同步成功

openanolis / cloud-kernel
11 个月前同步成功