Commits · bc2fb7ed089ffd16d26e1d95b898a37d2b37d201 · openeuler / raspberrypi-kernel

21 Jul, 2017 2 commits

cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS · bc2fb7ed

Authored by Tejun Heo on May 15, 2017

css_task_iter currently always walks all tasks.  With the scheduled
cgroup v2 thread support, the iterator would need to handle multiple
types of iteration.  As a preparation, add @flags to
css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
is not specified, it walks all tasks as before.  When asserted, the
iterator only walks the group leaders.

For now, the only user of the flag is cgroup v2 "cgroup.procs" file
which no longer needs to skip non-leader tasks in cgroup_procs_next().
Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
cgroup" but "list all thread group id's with any threads in the
cgroup".

While at it, update cgroup_procs_show() to use task_pid_vnr() instead
of task_tgid_vnr().  As the iteration guarantees that the function
only sees group leaders, this doesn't change the output and will allow
sharing the function for thread iteration.
Signed-off-by: Tejun Heo <tj@kernel.org>

bc2fb7ed
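The flag-driven walk described in this commit can be illustrated with a small userspace sketch. It is not the kernel implementation: `toy_task`, `toy_iter`, `TOY_ITER_PROCS` and the helper names are hypothetical stand-ins, and a task is treated as a group leader simply when its pid equals its tgid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag; the real CSS_TASK_ITER_PROCS is defined by the kernel. */
#define TOY_ITER_PROCS (1U << 0)

struct toy_task {
    int pid;   /* thread id */
    int tgid;  /* id of the thread group leader */
};

struct toy_iter {
    const struct toy_task *tasks;
    size_t nr;
    size_t pos;
    unsigned int flags;
};

/* Begin a walk; flags == 0 keeps the old "all tasks" behavior. */
void toy_iter_start(struct toy_iter *it, const struct toy_task *tasks,
                    size_t nr, unsigned int flags)
{
    assert(it != NULL);
    it->tasks = tasks;
    it->nr = nr;
    it->pos = 0;
    it->flags = flags;
}

/* Return the next task, or NULL when the walk is finished. */
const struct toy_task *toy_iter_next(struct toy_iter *it)
{
    while (it->pos < it->nr) {
        const struct toy_task *t = &it->tasks[it->pos++];

        /* With TOY_ITER_PROCS set, skip every non-leader thread. */
        if ((it->flags & TOY_ITER_PROCS) && t->pid != t->tgid)
            continue;
        return t;
    }
    return NULL;
}

/* Count how many tasks a walk with the given flags visits. */
size_t toy_iter_count(const struct toy_task *tasks, size_t nr,
                      unsigned int flags)
{
    struct toy_iter it;
    size_t n = 0;

    toy_iter_start(&it, tasks, nr, flags);
    while (toy_iter_next(&it))
        n++;
    return n;
}
```

Because the filtering lives in the iterator, a caller that only wants processes no longer needs its own "skip non-leaders" check, which is the simplification the message describes for cgroup_procs_next().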

cgroup: reorganize cgroup.procs / task write path · 715c809d

Authored by Tejun Heo on May 15, 2017

Currently, writes to "cgroup.procs" and "cgroup.tasks" files are all
handled by __cgroup_procs_write() on both v1 and v2.  This patch
reorganizes the write path so that there are common helper functions
that different write paths use.

While this somewhat increases LOC, the different paths are no longer
intertwined and each path has more flexibility to implement different
behaviors which will be necessary for the planned v2 thread support.

v3: - Restructured so that cgroup_procs_write_permission() takes
      @src_cgrp and @dst_cgrp.

v2: - Rolled in Waiman's task reference count fix.
    - Updated on top of nsdelegate changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>

715c809d

17 Jul, 2017 3 commits

cgroup: replace css_set walking populated test with testing cgrp->nr_populated_csets · 27f26753

Authored by Tejun Heo on Jul 16, 2017

Implement trivial cgroup_has_tasks() which tests whether
cgrp->nr_populated_csets is zero and replace the explicit local
populated test in cgroup_subtree_control().  This simplifies the code
and cgroup_has_tasks() will be used in more places later.
Signed-off-by: Tejun Heo <tj@kernel.org>

27f26753

cgroup: distinguish local and children populated states · 788b950c

Authored by Tejun Heo on Jul 16, 2017

cgrp->populated_cnt counts both local (the cgroup's populated
css_sets) and subtree proper (populated children) so that it's only
zero when the whole subtree, including self, is empty.

This patch splits the counter into two so that local and children
populated states are tracked separately.  It allows finer-grained
tests on the state of the hierarchy which will be used to replace
css_set walking local populated test.
Signed-off-by: Tejun Heo <tj@kernel.org>

788b950c
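The split between local and children populated state, together with the trivial cgroup_has_tasks() test from the commit above, can be sketched in userspace as follows. This is only an illustration under assumptions: `toy_cgroup` and the `toy_*` helpers are hypothetical, and the second counter name `nr_populated_children` is chosen here for illustration rather than taken from the messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cgroup {
    struct toy_cgroup *parent;
    int nr_populated_csets;     /* local: populated css_sets of this cgroup */
    int nr_populated_children;  /* assumed name: children with populated subtrees */
};

/* Local test only: does this cgroup itself hold tasks? */
bool toy_cgroup_has_tasks(const struct toy_cgroup *cgrp)
{
    return cgrp->nr_populated_csets > 0;
}

/* Whole-subtree test: zero only when self and all descendants are empty. */
bool toy_cgroup_is_populated(const struct toy_cgroup *cgrp)
{
    return cgrp->nr_populated_csets + cgrp->nr_populated_children > 0;
}

/*
 * One css_set of @cgrp became populated (true) or empty (false).
 * The change is charged to the local counter of @cgrp and, only while
 * the subtree state actually flips, to the children counter of each
 * ancestor.
 */
void toy_update_populated(struct toy_cgroup *cgrp, bool populated)
{
    struct toy_cgroup *child = NULL;
    int adj = populated ? 1 : -1;

    while (cgrp) {
        bool was_populated = toy_cgroup_is_populated(cgrp);

        if (!child)
            cgrp->nr_populated_csets += adj;
        else
            cgrp->nr_populated_children += adj;

        /* Stop as soon as the subtree state of this level is unchanged. */
        if (was_populated == toy_cgroup_is_populated(cgrp))
            break;

        child = cgrp;
        cgrp = cgrp->parent;
    }
}
```

With two counters, a check such as the one in cgroup_subtree_control() can ask "does this cgroup itself have tasks" without walking css_sets, while the subtree-wide emptiness test remains available.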

cgroup: remove now unused list_head @pending in cgroup_apply_cftypes() · 88e033e3

Authored by Tejun Heo on Jul 16, 2017

Signed-off-by: Tejun Heo <tj@kernel.org>

88e033e3

29 Jun, 2017 2 commits

cgroup: implement "nsdelegate" mount option · 5136f636

Authored by Tejun Heo on Jun 27, 2017

Currently, cgroup only supports delegation to !root users and cgroup
namespaces don't get any special treatments.  This limits the
usefulness of cgroup namespaces as they by themselves can't be safe
delegation boundaries.  A process inside a cgroup can change the
resource control knobs of the parent in the namespace root and may
move processes in and out of the namespace if cgroups outside its
namespace are visible somehow.

This patch adds a new mount option "nsdelegate" which makes cgroup
namespaces delegation boundaries.  If set, cgroup behaves as if write
permission based delegation took place at namespace boundaries -
writes to the resource control knobs from the namespace root are
denied and migrations crossing the namespace boundary aren't allowed
from inside the namespace.

This allows cgroup namespace to function as a delegation boundary by
itself.

v2: Silently ignore nsdelegate specified on !init mounts.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Aravind Anbudurai <aru7@fb.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Eric Biederman <ebiederm@xmission.com>

5136f636

cgroup: restructure cgroup_procs_write_permission() · 824ecbe0

Authored by Tejun Heo on Jun 25, 2017

Restructure cgroup_procs_write_permission() to make extending
permission logic easier.

This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>

824ecbe0

15 Jun, 2017 1 commit

cgroup: Keep accurate count of tasks in each css_set · 73a7242a

Authored by Waiman Long on Jun 13, 2017

The reference count in the css_set data structure was used as a
proxy of the number of tasks attached to that css_set. However, that
count is actually not an accurate measure especially with thread mode
support. So a new variable nr_tasks is added to the css_set to keep
track of the actual task count. This new variable is protected by
the css_set_lock. Functions that require the actual task count are
updated to use the new variable.

tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps.
    Refreshed on top of cgroup/for-v4.13 which dropped on
    css_set_populated() -> nr_tasks conversion.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

73a7242a

18 May, 2017 1 commit

cgroup: Prevent kill_css() from being called more than once · 33c35aa4

Authored by Waiman Long on May 15, 2017

The kill_css() function may be called more than once under the condition
that the css was killed but not physically removed yet followed by the
removal of the cgroup that is hosting the css. This patch prevents any
harm from being done when that happens.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: stable@vger.kernel.org # v4.5+

33c35aa4

02 May, 2017 1 commit

cgroup: mark cgroup_get() with __maybe_unused · 310b4816

Authored by Tejun Heo on May 01, 2017

a590b90d ("cgroup: fix spurious warnings on cgroup_is_dead() from
cgroup_sk_alloc()") converted most cgroup_get() usages to
cgroup_get_live() leaving cgroup_sk_alloc() the sole user of
cgroup_get(). When !CONFIG_SOCK_CGROUP_DATA, this ends up triggering
unused warning for cgroup_get().

Silence the warning by adding __maybe_unused to cgroup_get().
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Link: http://lkml.kernel.org/r/20170501145340.17e8ef86@canb.auug.org.au
Signed-off-by: Tejun Heo <tj@kernel.org>

310b4816

29 Apr, 2017 2 commits

cgroup: avoid attaching a cgroup root to two different superblocks, take 2 · 9732adc5

Authored by Zefan Li on Apr 19, 2017

Commit bfb0b80d ("cgroup: avoid attaching a cgroup root to two
different superblocks") is broken.  Now we try to fix the race by
delaying the initialization of cgroup root refcnt until a superblock
has been allocated.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Andrei Vagin <avagin@virtuozzo.com>
Tested-by: Andrei Vagin <avagin@virtuozzo.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

9732adc5

cgroup: fix spurious warnings on cgroup_is_dead() from cgroup_sk_alloc() · a590b90d

Authored by Tejun Heo on Apr 28, 2017

cgroup_get() is expected to be called only on live cgroups and triggers
a warning on a dead cgroup; however, cgroup_sk_alloc() may be called
while cloning a socket which is left in an empty and removed cgroup
and thus may legitimately duplicate its reference on a dead cgroup.
This currently triggers the following warning spuriously.

 WARNING: CPU: 14 PID: 0 at kernel/cgroup.c:490 cgroup_get+0x55/0x60
 ...
  [<ffffffff8107e123>] __warn+0xd3/0xf0
  [<ffffffff8107e20e>] warn_slowpath_null+0x1e/0x20
  [<ffffffff810ff465>] cgroup_get+0x55/0x60
  [<ffffffff81106061>] cgroup_sk_alloc+0x51/0xe0
  [<ffffffff81761beb>] sk_clone_lock+0x2db/0x390
  [<ffffffff817cce06>] inet_csk_clone_lock+0x16/0xc0
  [<ffffffff817e8173>] tcp_create_openreq_child+0x23/0x4b0
  [<ffffffff818601a1>] tcp_v6_syn_recv_sock+0x91/0x670
  [<ffffffff817e8b16>] tcp_check_req+0x3a6/0x4e0
  [<ffffffff81861ba3>] tcp_v6_rcv+0x693/0xa00
  [<ffffffff81837429>] ip6_input_finish+0x59/0x3e0
  [<ffffffff81837cb2>] ip6_input+0x32/0xb0
  [<ffffffff81837387>] ip6_rcv_finish+0x57/0xa0
  [<ffffffff81837ac8>] ipv6_rcv+0x318/0x4d0
  [<ffffffff817778c7>] __netif_receive_skb_core+0x2d7/0x9a0
  [<ffffffff81777fa6>] __netif_receive_skb+0x16/0x70
  [<ffffffff81778023>] netif_receive_skb_internal+0x23/0x80
  [<ffffffff817787d8>] napi_gro_frags+0x208/0x270
  [<ffffffff8168a9ec>] mlx4_en_process_rx_cq+0x74c/0xf40
  [<ffffffff8168b270>] mlx4_en_poll_rx_cq+0x30/0x90
  [<ffffffff81778b30>] net_rx_action+0x210/0x350
  [<ffffffff8188c426>] __do_softirq+0x106/0x2c7
  [<ffffffff81082bad>] irq_exit+0x9d/0xa0
  [<ffffffff8188c0e4>] do_IRQ+0x54/0xd0
  [<ffffffff8188a63f>] common_interrupt+0x7f/0x7f <EOI>
  [<ffffffff8173d7e7>] cpuidle_enter+0x17/0x20
  [<ffffffff810bdfd9>] cpu_startup_entry+0x2a9/0x2f0
  [<ffffffff8103edd1>] start_secondary+0xf1/0x100

This patch renames the existing cgroup_get() with the dead cgroup
warning to cgroup_get_live() after cgroup_kn_lock_live() and
introduces the new cgroup_get() which doesn't check whether the cgroup
is live or dead.

All existing cgroup_get() users except for cgroup_sk_alloc() are
converted to use cgroup_get_live().

Fixes: d979a39d ("cgroup: duplicate cgroup reference when cloning sockets")
Cc: stable@vger.kernel.org # v4.5+
Cc: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Chris Mason <clm@fb.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

a590b90d
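The difference between the two reference helpers can be shown with a toy model. Everything below is a hypothetical userspace stand-in (`toy_cgroup`, `toy_warn_count`, and both functions); the counter only imitates the point at which the kernel warning would fire.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cgroup {
    int refcnt;
    bool dead;   /* set once the cgroup has been removed */
};

/* Counts how often the "dead cgroup" warning would have fired. */
int toy_warn_count;

/* Strict variant: callers promise the cgroup is still live. */
void toy_cgroup_get_live(struct toy_cgroup *cgrp)
{
    assert(cgrp != NULL);
    if (cgrp->dead)
        toy_warn_count++;   /* stands in for the kernel warning */
    cgrp->refcnt++;
}

/* Relaxed variant: legitimate even on a dead cgroup, e.g. socket cloning. */
void toy_cgroup_get(struct toy_cgroup *cgrp)
{
    assert(cgrp != NULL);
    cgrp->refcnt++;
}
```

In this model only a caller such as the socket clone path would use the relaxed variant, so every other user keeps the sanity check against taking references on removed cgroups.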

17 Mar, 2017 1 commit

cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups · 77f88796

Authored by Tejun Heo on Mar 16, 2017

Creation of a kthread goes through a couple interlocked stages between
the kthread itself and its creator.  Once the new kthread starts
running, it initializes itself and wakes up the creator.  The creator
then can further configure the kthread and then let it start doing its
job by waking it up.

In this configuration-by-creator stage, the creator is the only one
that can wake it up but the kthread is visible to userland.  When
altering the kthread's attributes from userland is allowed, this is
fine; however, for cases where CPU affinity is critical,
kthread_bind() is used to first disable affinity changes from userland
and then set the affinity.  This also prevents the kthread from being
migrated into non-root cgroups as that can affect the CPU affinity and
many other things.

Unfortunately, the cgroup side of protection is racy.  While the
PF_NO_SETAFFINITY flag prevents further migrations, userland can win
the race before the creator sets the flag with kthread_bind() and put
the kthread in a non-root cgroup, which can lead to all sorts of
problems including incorrect CPU affinity and starvation.

This bug got triggered by userland which periodically tries to migrate
all processes in the root cpuset cgroup to a non-root one.  Per-cpu
workqueue workers got caught while being created and ended up with
incorrect CPU affinity, breaking concurrency management and sometimes
stalling workqueue execution.

This patch adds task->no_cgroup_migration which disallows the task to
be migrated by userland.  kthreadd starts with the flag set making
every child kthread start in the root cgroup with migration
disallowed.  The flag is cleared after the kthread finishes
initialization by which time PF_NO_SETAFFINITY is set if the kthread
should stay in the root cgroup.

It'd be better to wait for the initialization instead of failing but I
couldn't think of a way of implementing that without adding either a
new PF flag, or sleeping and retrying from waiting side.  Even if
userland depends on changing cgroup membership of a kthread, it either
has to be synchronized with kthread_create() or periodically repeat,
so it's unlikely that this would break anything.

v2: Switch to a simpler implementation using a new task_struct bit
    field suggested by Oleg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-and-debugged-by: Chris Mason <clm@fb.com>
Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
Signed-off-by: Tejun Heo <tj@kernel.org>

77f88796
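The ordering that closes the race can be modeled in a few lines of userspace C. This is a sketch under assumptions, not the kernel code: `toy_task`, `TOY_PF_NO_SETAFFINITY` and the `toy_*` functions are hypothetical, and the migration check is reduced to the two conditions the message talks about.

```c
#include <assert.h>
#include <errno.h>

#define TOY_PF_NO_SETAFFINITY 0x1u   /* stand-in for PF_NO_SETAFFINITY */

struct toy_task {
    unsigned int flags;
    unsigned int no_cgroup_migration:1;  /* the new bit field */
    int cgroup_id;                       /* 0 means the root cgroup */
};

/* A new kthread inherits the set bit from kthreadd and starts in root. */
void toy_kthread_create(struct toy_task *k)
{
    k->flags = 0;
    k->no_cgroup_migration = 1;
    k->cgroup_id = 0;
}

/* The creator pins CPU affinity, which also forbids later migration. */
void toy_kthread_bind(struct toy_task *k)
{
    k->flags |= TOY_PF_NO_SETAFFINITY;
}

/* Initialization is over; from here on only the PF flag protects the task. */
void toy_kthread_init_done(struct toy_task *k)
{
    k->no_cgroup_migration = 0;
}

/* Userland-triggered migration: refused while either guard is active. */
int toy_migrate_from_userland(struct toy_task *t, int dst_cgroup_id)
{
    if (t->no_cgroup_migration || (t->flags & TOY_PF_NO_SETAFFINITY))
        return -EINVAL;
    t->cgroup_id = dst_cgroup_id;
    return 0;
}
```

The bit covers exactly the window in which the PF flag has not been set yet, so a bound kthread is never observable as migratable, while an unbound one becomes migratable once its initialization has finished.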

10 Mar, 2017 1 commit

scripts/spelling.txt: add "disble(d)" pattern and fix typo instances · 8a1115ff

Authored by Masahiro Yamada on Mar 09, 2017

Fix typos and add the following to the scripts/spelling.txt:

  disble||disable
  disbled||disabled

I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c
untouched.  The macro is not referenced at all, but this commit is
touching only comment blocks just in case.

Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
8a1115ff

09 Mar, 2017 1 commit

kernel: convert css_set.refcount from atomic_t to refcount_t · 4b9502e6

Authored by Elena Reshetova on Mar 08, 2017

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

4b9502e6
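Why a dedicated reference counter type helps against overflow can be seen in a simplified, single-threaded model. The sketch below is not the kernel's refcount_t: `toy_refcount` and its helpers are hypothetical, there is no atomicity, and saturation at UINT_MAX is an assumption made to keep the example small.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

struct toy_refcount {
    unsigned int refs;
};

/*
 * Take a reference. A counter that already dropped to zero is not revived,
 * and a counter at the maximum stays pinned there instead of wrapping to 0.
 */
void toy_refcount_inc(struct toy_refcount *r)
{
    if (r->refs == 0 || r->refs == UINT_MAX)
        return;
    r->refs++;
}

/* Drop a reference; returns true when the last reference went away. */
bool toy_refcount_dec_and_test(struct toy_refcount *r)
{
    if (r->refs == UINT_MAX)
        return false;       /* saturated: the object is leaked, never freed */
    assert(r->refs > 0);
    return --r->refs == 0;
}
```

A plain unsigned counter wraps from UINT_MAX to 0 on the next increment, which looks like "no references left" and can lead to a premature free; the saturating model trades that for a leak, which is the safer failure.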

07 Mar, 2017 1 commit

kernel: convert cgroup_namespace.count from atomic_t to refcount_t · 387ad967

Authored by Elena Reshetova on Feb 20, 2017

refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This helps avoid accidental
refcounter overflows that might lead to use-after-free
situations.
Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David Windsor <dwindsor@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

387ad967

02 Mar, 2017 1 commit

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/task.h> · 29930025

Authored by Ingo Molnar on Feb 08, 2017

We are going to split <linux/sched/task.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/task.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

29930025

03 Feb, 2017 1 commit

cgroup: drop the matching uid requirement on migration for cgroup v2 · 576dd464

Authored by Tejun Heo on Jan 20, 2017

Along with the write access to the cgroup.procs or tasks file, cgroup
has required the writer's euid, unless root, to match [s]uid of the
target process or task. On cgroup v1, this is necessary because
there's nothing preventing a delegatee from pulling in tasks or
processes from all over the system.

If a user has a cgroup subdirectory delegated to it, the user would
have write access to the cgroup.procs or tasks file. If there are no
further checks than file write access check, the user would be able to
pull processes from all over the system into its subhierarchy which is
clearly not the intended behavior. The matching [s]uid requirement
partially prevents this problem by allowing a delegatee to pull in the
processes that belong to it. This isn't a sufficient protection
however, because a user would still be able to jump processes across
two disjoint sub-hierarchies that have been delegated to them.

cgroup v2 resolves the issue by requiring the writer to have access to
the common ancestor of the cgroup.procs file of the source and target
cgroups. This confines each delegatee to their own sub-hierarchy
proper and bases all permission decisions on the cgroup filesystem
rather than having to pull in explicit uid matching.

cgroup v2 has still been applying the matching [s]uid requirement just
for historical reasons. On cgroup2, the requirement doesn't serve any
purpose while unnecessarily complicating the permission model. Let's
drop it.
Signed-off-by: Tejun Heo <tj@kernel.org>

576dd464
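The common-ancestor rule can be illustrated with a small tree model. This is a rough sketch, not the kernel's permission code: `toy_cgroup`, its `procs_owner_uid` field (standing in for "who may write this cgroup's cgroup.procs") and both helpers are hypothetical, and file permission checks are reduced to a uid comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cgroup {
    const struct toy_cgroup *parent;
    int level;                     /* 0 for the root */
    unsigned int procs_owner_uid;  /* assumed: uid allowed to write cgroup.procs */
};

/* Walk both cgroups up to the same level, then up together until they meet. */
const struct toy_cgroup *toy_common_ancestor(const struct toy_cgroup *a,
                                             const struct toy_cgroup *b)
{
    while (a->level > b->level)
        a = a->parent;
    while (b->level > a->level)
        b = b->parent;
    while (a != b) {
        a = a->parent;
        b = b->parent;
    }
    return a;
}

/*
 * The writer needs write access to the destination's cgroup.procs and to
 * the cgroup.procs of the common ancestor of source and destination.
 * No uid matching against the migrated process is involved.
 */
bool toy_may_migrate(unsigned int writer_uid, const struct toy_cgroup *src,
                     const struct toy_cgroup *dst)
{
    const struct toy_cgroup *com = toy_common_ancestor(src, dst);

    if (writer_uid == 0)
        return true;
    return com != NULL && com->procs_owner_uid == writer_uid &&
           dst->procs_owner_uid == writer_uid;
}
```

In this model a delegatee can move processes freely inside one delegated subtree, but not between two disjoint delegated subtrees whose only common ancestor belongs to someone else, which is the case the euid check on v1 could not prevent.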

31 Jan, 2017 1 commit

cgroup: misc cleanups · b807421a

Authored by Tejun Heo on Jan 20, 2017

* cgrp_dfl_implicit_ss_mask is ulong instead of u16 unlike other
  ss_masks.  Make it a u16.

* Move have_canfork_callback together with other callback ss_masks.
Signed-off-by: Tejun Heo <tj@kernel.org>

b807421a

16 Jan, 2017 3 commits

cgroup: call subsys->*attach() only for subsystems which are actually affected by migration · bfc2cf6f

Authored by Tejun Heo on Jan 15, 2017

Currently, subsys->*attach() callbacks are called for all subsystems
which are attached to the hierarchy on which the migration is taking
place.

With cgroup_migrate_prepare_dst() filtering out identity migrations,
v1 hierarchies can avoid spurious ->*attach() callback invocations
where the source and destination csses are identical; however, this
isn't enough on v2 as only a subset of the attached controllers can be
affected on controller enable/disable.

While spurious ->*attach() invocations aren't critically broken,
they're unnecessary overhead and can lead to temporary overcharges on
certain controllers.  Fix it by tracking which subsystems are affected
by a migration and invoking ->*attach() callbacks only on those
subsystems.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

bfc2cf6f
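Tracking which subsystems a migration affects amounts to building a bitmask of subsystems whose css differs between source and destination. The sketch below is a hypothetical userspace model: `toy_css_set`, `TOY_NR_SUBSYS`, the integer css ids and both helpers are assumptions for illustration only.

```c
#include <assert.h>

#define TOY_NR_SUBSYS 4   /* assumed number of subsystems */

struct toy_css_set {
    int css_id[TOY_NR_SUBSYS];  /* which css each subsystem points at */
};

/* Per-subsystem count of attach callbacks invoked so far. */
int toy_attach_calls[TOY_NR_SUBSYS];

/* Bit N is set when subsystem N sees a different css after the migration. */
unsigned int toy_affected_mask(const struct toy_css_set *src,
                               const struct toy_css_set *dst)
{
    unsigned int mask = 0;

    for (int ssid = 0; ssid < TOY_NR_SUBSYS; ssid++)
        if (src->css_id[ssid] != dst->css_id[ssid])
            mask |= 1u << ssid;
    return mask;
}

/* Invoke the attach callback only for subsystems in @mask. */
int toy_run_attach(unsigned int mask)
{
    int calls = 0;

    for (int ssid = 0; ssid < TOY_NR_SUBSYS; ssid++) {
        if (mask & (1u << ssid)) {
            toy_attach_calls[ssid]++;
            calls++;
        }
    }
    return calls;
}
```

An identity migration yields an empty mask, and enabling or disabling a single controller yields a mask with just that bit, so unrelated controllers are left alone.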

cgroup: track migration context in cgroup_mgctx · e595cd70

Authored by Tejun Heo on Jan 15, 2017

cgroup migration is performed in four steps - css_set preloading,
addition of target tasks, actual migration, and clean up.  A list
named preloaded_csets is used to track the preloading.  This is a bit
too restricted and the code is already depending on the subtlety that
all source css_sets appear before destination ones.

Let's create struct cgroup_mgctx which keeps track of everything
during migration.  Currently, it has separate preload lists for source
and destination csets and also embeds cgroup_taskset which is used
during the actual migration.  This moves struct cgroup_taskset
definition to cgroup-internal.h.

This patch doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

e595cd70

cgroup: cosmetic update to cgroup_taskset_add() · d8ebf519

Authored by Tejun Heo on Jan 15, 2017

cgroup_taskset_add() was using list_add_tail() for source csets
but list_move_tail() for destination.  As the operations are gated by
list_empty() test, list_move_tail() is equivalent to list_add_tail()
here.  Use list_add_tail() for destination csets too.

This doesn't cause any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

d8ebf519

28 Dec, 2016 12 commits

cgroup: fix RCU related sparse warnings · e0aed7c7

Authored by Tejun Heo on Dec 27, 2016

kn->priv which is a void * is used as a RCU pointer by cgroup.  When
dereferencing it, it was passing kn->priv to rcu_dereference() without
casting it into a RCU pointer triggering address space mismatch
warning from sparse.  Fix them.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Zefan Li <lizefan@huawei.com>

e0aed7c7

cgroup: move namespace code to kernel/cgroup/namespace.c · dcfe149b

Authored by Tejun Heo on Dec 27, 2016

get/put_css_set() get exposed in cgroup-internal.h in the process.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

dcfe149b

cgroup: rename functions for consistency · d62beb7f

Authored by Tejun Heo on Dec 27, 2016

Now that v1 functions are separated out, rename some functions for
consistency.

 cgroup_dfl_base_files		-> cgroup_base_files
 cgroup_legacy_base_files	-> cgroup1_base_files
 cgroup_ssid_no_v1()		-> cgroup1_ssid_disabled()
 cgroup_pidlist_destroy_all	-> cgroup1_pidlist_destroy_all()
 cgroup_release_agent()		-> cgroup1_release_agent()
 check_for_release()		-> cgroup1_check_for_release()
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

d62beb7f

cgroup: move v1 mount functions to kernel/cgroup/cgroup-v1.c · 1592c9b2

Authored by Tejun Heo on Dec 27, 2016

Now that the v1 mount code is split into separate functions, move them
to kernel/cgroup/cgroup-v1.c along with the mount option handling
code.  As this puts all v1-only kernfs_syscall_ops in cgroup-v1.c,
move cgroup1_kf_syscall_ops to cgroup-v1.c too.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

1592c9b2

cgroup: separate out cgroup1_kf_syscall_ops · fa069904

Authored by Tejun Heo on Dec 27, 2016

Currently, cgroup_kf_syscall_ops is shared by v1 and v2 and the
specific methods test the version and take different actions.  Split
out v1 functions and put them in cgroup1_kf_syscall_ops and remove the
now unnecessary explicit branches in specific methods.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

fa069904

cgroup: refactor mount path and clearly distinguish v1 and v2 paths · 633feee3

Authored by Tejun Heo on Dec 27, 2016

While sharing some mechanisms, the mount paths of v1 and v2 are
substantially different.  Their implementations were mixed in
cgroup_mount().  This patch splits them out so that they're easier to
follow and organize.

This patch causes one functional change - the WARN_ON(new_sb) gets
lost.  This is because the actual mounting gets moved to
cgroup_do_mount() and thus @new_sb is no longer accessible by default
to cgroup1_mount().  While we can add it as an explicit out parameter
to cgroup_do_mount(), this part of code hasn't changed and the warning
hasn't triggered for quite a while.  Dropping it should be fine.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

633feee3

cgroup: move cgroup v1 specific code to kernel/cgroup/cgroup-v1.c · 0a268dbd

Authored by Tejun Heo on Dec 27, 2016

cgroup.c is getting too unwieldy.  Let's move out cgroup v1 specific
code along with the debug controller into kernel/cgroup/cgroup-v1.c.

v2: cgroup_mutex and css_set_lock made available in cgroup-internal.h
    regardless of CONFIG_PROVE_RCU.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

0a268dbd

cgroup: move cgroup files under kernel/cgroup/ · 201af4c0

Authored by Tejun Heo on Dec 27, 2016

They're growing to be too many and planned to get split further.  Move
them under their own directory.

 kernel/cgroup.c		-> kernel/cgroup/cgroup.c
 kernel/cgroup_freezer.c	-> kernel/cgroup/freezer.c
 kernel/cgroup_pids.c		-> kernel/cgroup/pids.c
 kernel/cpuset.c		-> kernel/cgroup/cpuset.c
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

201af4c0

cgroup: reorder css_set fields · 5f617ebb

Authored by Tejun Heo on Dec 27, 2016

Reorder css_set fields so that they're roughly in the order of how hot
they are.  The rough order is

1. the actual csses
2. reference counter and the default cgroup pointer.
3. task lists and iterations
4. fields used during merge including css_set lookup
5. the rest
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

5f617ebb

cgroup: remove cgroup_pid_fry() and friends · 2fae9863

Authored by Tejun Heo on Dec 27, 2016

cgroup_pid_fry() was added to mangle cgroup.procs pid listing order on
v2 to make it clear that the output is not sorted.  Now that v2
uses a separate "cgroup.procs" read method, this is no longer used.
Remove it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

2fae9863

cgroup: reimplement reading "cgroup.procs" on cgroup v2 · b4b90a8e

Authored by Tejun Heo on Dec 27, 2016

On v1, "tasks" and "cgroup.procs" are expected to be sorted which
makes the implementation expensive and unnecessarily complicated
involving result cache management.

v2 doesn't have the sorting requirement, so it can just iterate and
print processes one by one.  seq_files are either read sequentially or
reset to position zero, so the implementation doesn't even need to
worry about seeking.

This keeps the css_task_iter across multiple read(2) calls.  As newly
migrated processes are always appended, the iteration won't miss
processes which are migrated in before each read(2).
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

b4b90a8e

cgroup: add cftype->open/release() callbacks · e90cbebc

Authored by Tejun Heo on Dec 27, 2016

Pipe the newly added kernfs->open/release() callbacks through cftype.
While at it, as cleanup operations now can be performed from
->release() instead of ->seq_stop(), make the latter optional.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Zefan Li <lizefan@huawei.com>

e90cbebc

26 Nov, 2016 1 commit

cgroup: add support for eBPF programs · 30070984

Authored by Daniel Mack on Nov 23, 2016

This patch adds two sets of eBPF program pointers to struct cgroup.
One for such that are directly pinned to a cgroup, and one for such
that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
        \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.
Signed-off-by: Daniel Mack <daniel@zonque.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

30070984
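The inheritance rule from the A/B/C/D/E example can be reproduced with a toy tree. The sketch is hypothetical throughout: `toy_cgroup`, the integer program ids and `toy_effective_prog()` are illustrative, and the effective program is computed lazily by walking up the tree, whereas the commit stores a second set of pointers for the effective programs.

```c
#include <assert.h>
#include <stddef.h>

struct toy_cgroup {
    const struct toy_cgroup *parent;
    int attached_prog;   /* 0 when nothing is attached directly */
};

/*
 * The effective program of a cgroup is its own attached program if it
 * has one, otherwise that of the nearest ancestor which has one.
 * Returns 0 when no program applies.
 */
int toy_effective_prog(const struct toy_cgroup *cgrp)
{
    for (; cgrp != NULL; cgrp = cgrp->parent)
        if (cgrp->attached_prog)
            return cgrp->attached_prog;
    return 0;
}
```

With a program attached only to B, the lookup returns B's program for B, C, D and E; once D attaches its own, D and E resolve to D's program while B and C keep B's.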

29 Sep, 2016 1 commit

cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent() · e0223003

Authored by Tejun Heo on Sep 29, 2016

4c737b41 ("cgroup: make cgroup_path() and friends behave in the
style of strlcpy()") broke error handling in proc_cgroup_show() and
cgroup_release_agent() by not handling negative return values from
cgroup_path_ns_locked().  Fix it.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

e0223003
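The class of bug fixed here, a strlcpy()-style return value that can also be negative, can be sketched generically. The functions below are hypothetical (`toy_path()` and `toy_show_path()` are not kernel APIs), and a NULL path is used only as a stand-in for whatever makes the real path lookup fail.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical strlcpy()-style helper: copies as much of @path as fits,
 * always NUL-terminates, and returns the full length of @path.
 * Unlike strlcpy(), it may also fail with a negative errno.
 */
int toy_path(char *buf, size_t buflen, const char *path)
{
    size_t len;

    if (path == NULL)
        return -ENOENT;   /* assumed failure mode for illustration */

    len = strlen(path);
    if (buflen > 0) {
        size_t n = len < buflen ? len : buflen - 1;

        memcpy(buf, path, n);
        buf[n] = '\0';
    }
    return (int)len;
}

/* A caller has to handle both outcomes: error first, then truncation. */
int toy_show_path(char *buf, size_t buflen, const char *path)
{
    int ret = toy_path(buf, buflen, path);

    if (ret < 0)
        return ret;                 /* propagate the error */
    if ((size_t)ret >= buflen)
        return -ENAMETOOLONG;       /* output was truncated */
    return 0;
}
```

Checking only for truncation would silently treat a negative return as success, and comparing a negative int against an unsigned size without the earlier check would misclassify the error as truncation.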

24 Sep, 2016 1 commit

cgroup: fix invalid controller enable rejections with cgroup namespace · 9157056d

Authored by Tejun Heo on Sep 23, 2016

On the v2 hierarchy, "cgroup.subtree_control" rejects controller
enables if the cgroup has processes in it.  The enforcement of this
logic assumes that the cgroup wouldn't have any css_sets associated
with it if there are no tasks in the cgroup, which is no longer true
since a79a908f ("cgroup: introduce cgroup namespaces").

When a cgroup namespace is created, it pins the css_set of the
creating task to use it as the root css_set of the namespace.  This
extra reference stays as long as the namespace is around and makes
"cgroup.subtree_control" think that the namespace root cgroup is not
empty even when it is and thus reject controller enables.

Fix it by making cgroup_subtree_control() walk and test emptiness of
each css_set instead of testing whether the list_head is empty.

While at it, update the comment of cgroup_task_count() to indicate
that the returned value may be higher than the number of tasks, which
has always been true due to temporary references and doesn't break
anything.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Evgeny Vereshchagin <evvers@ya.ru>
Cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Cc: Aditya Kali <adityakali@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: stable@vger.kernel.org # v4.6+
Fixes: a79a908f ("cgroup: introduce cgroup namespaces")
Link: https://github.com/systemd/systemd/pull/3589#issuecomment-249089541

9157056d

23 Sep, 2016 2 commits

kernel: add a helper to get an owning user namespace for a namespace · bcac25a5

Authored by Andrey Vagin on Sep 06, 2016

Return -EPERM if an owning user namespace is outside of a process
current user namespace.

v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
    This special case was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.
Acked-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Andrei Vagin <avagin@openvz.org>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

bcac25a5

userns: When the per user per user namespace limit is reached return ENOSPC · df75e774

Authored by Eric W. Biederman on Sep 22, 2016

The current error codes returned when the per user per user
namespace limit is hit (EINVAL, EUSERS, and ENFILE) are wrong. I
asked for advice on linux-api and it was made clear that those were
the wrong error codes, but a correct error code was not suggested.

The best general error code I have found for hitting a resource limit
is ENOSPC. It is not perfect but as it is unambiguous it will serve
until someone comes up with a better error code.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

df75e774

20 Sep, 2016 1 commit

cgroup: duplicate cgroup reference when cloning sockets · d979a39d

Authored by Johannes Weiner on Sep 19, 2016

When a socket is cloned, the associated sock_cgroup_data is duplicated
but not its reference on the cgroup.  As a result, the cgroup reference
count will underflow when both sockets are destroyed later on.

Fixes: bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
Link: http://lkml.kernel.org/r/20160914194846.11153-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: <stable@vger.kernel.org>	[4.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d979a39d
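The underflow described in this commit follows from copying a pointer without taking the reference that goes with it. The model below is a hypothetical userspace sketch (`toy_cgroup`, `toy_sock` and all helpers are stand-ins) that contrasts a clone which only copies the pointer with one that also duplicates the reference.

```c
#include <assert.h>
#include <stddef.h>

struct toy_cgroup {
    int refcnt;
};

struct toy_sock {
    struct toy_cgroup *cgrp;
};

/* A fresh socket takes its own reference on the cgroup. */
void toy_sock_init(struct toy_sock *sk, struct toy_cgroup *cgrp)
{
    cgrp->refcnt++;
    sk->cgrp = cgrp;
}

/* Buggy clone: the pointer is copied, the reference is not. */
void toy_sock_clone_buggy(struct toy_sock *dst, const struct toy_sock *src)
{
    *dst = *src;
}

/* Fixed clone: the copy holds a reference of its own. */
void toy_sock_clone_fixed(struct toy_sock *dst, const struct toy_sock *src)
{
    *dst = *src;
    dst->cgrp->refcnt++;
}

/* Every socket drops one reference when it goes away. */
void toy_sock_destroy(struct toy_sock *sk)
{
    assert(sk->cgrp != NULL);
    sk->cgrp->refcnt--;
    sk->cgrp = NULL;
}
```

Destroying both the original and a buggy clone drops two references where only one was taken, while the fixed clone keeps takes and drops balanced.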