1. 06 4月, 2019 1 次提交
    • O
      cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting · d0bc74c5
      Oleg Nesterov 提交于
      [ Upstream commit 51bee5abeab2058ea5813c5615d6197a23dbf041 ]
      
      The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
      needs pids_free() to uncharge the pid.
      
      However, ->free() is called from __put_task_struct()->cgroup_free() and this
      is too late. Even the trivial program which does
      
      	for (;;) {
      		int pid = fork();
      		assert(pid >= 0);
      		if (pid)
      			wait(NULL);
      		else
      			exit(0);
      	}
      
      can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
      implies an RCU gp after the task/pid goes away and before the final put().
      
      Test-case:
      
      	mkdir -p /tmp/CG
      	mount -t cgroup2 none /tmp/CG
      	echo '+pids' > /tmp/CG/cgroup.subtree_control
      
      	mkdir /tmp/CG/PID
      	echo 2 > /tmp/CG/PID/pids.max
      
      	perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
      	echo $! > /tmp/CG/PID/cgroup.procs
      
      Without this patch the forking process fails soon after migration.
      
      Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
      into the new helper, cgroup_release(), called by release_task() which actually
      frees the pid(s).
      Reported-by: NHerton R. Krzesinski <hkrzesin@redhat.com>
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d0bc74c5
  2. 24 3月, 2019 1 次提交
    • A
      fix cgroup_do_mount() handling of failure exits · 7a8b0484
      Al Viro 提交于
      commit 399504e21a10be16dd1408ba0147367d9d82a10c upstream.
      
      same story as with last May fixes in sysfs (7b745a4e
      "unfuck sysfs_mount()"); new_sb is left uninitialized
      in case of early errors in kernfs_mount_ns() and papering
      over it by treating any error from kernfs_mount_ns() as
      equivalent to !new_ns ends up conflating the cases when
      objects had never been transferred to a superblock with
      ones when that has happened and resulting new superblock
      had been dropped.  Easily fixed (same way as in sysfs
      case).  Additionally, there's a superblock leak on
      kernfs_node_dentry() failure *and* a dentry leak inside
      kernfs_node_dentry() itself - the latter on probably
      impossible errors, but the former not impossible to trigger
      (as the matter of fact, injecting allocation failures
      at that point *does* trigger it).
      
      Cc: stable@kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7a8b0484
  3. 13 2月, 2019 1 次提交
    • O
      cgroup: fix parsing empty mount option string · 4b5abffd
      Ondrej Mosnacek 提交于
      [ Upstream commit e250d91d65750a0c0c62483ac4f9f357e7317617 ]
      
      This fixes the case where all mount options specified are consumed by an
      LSM and all that's left is an empty string. In this case cgroupfs should
      accept the string and not fail.
      
      How to reproduce (with SELinux enabled):
      
          # umount /sys/fs/cgroup/unified
          # mount -o context=system_u:object_r:cgroup_t:s0 -t cgroup2 cgroup2 /sys/fs/cgroup/unified
          mount: /sys/fs/cgroup/unified: wrong fs type, bad option, bad superblock on cgroup2, missing codepage or helper program, or other error.
          # dmesg | tail -n 1
          [   31.575952] cgroup: cgroup2: unknown option ""
      
      Fixes: 67e9c74b ("cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type")
      [NOTE: should apply on top of commit 5136f636 ("cgroup: implement "nsdelegate" mount option"), older versions need manual rebase]
      Suggested-by: NStephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: NOndrej Mosnacek <omosnace@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      4b5abffd
  4. 10 1月, 2019 1 次提交
    • T
      cgroup: fix CSS_TASK_ITER_PROCS · 8a2fbdd5
      Tejun Heo 提交于
      commit e9d81a1bc2c48ea9782e3e8b53875f419766ef47 upstream.
      
      CSS_TASK_ITER_PROCS implements process-only iteration by making
      css_task_iter_advance() skip tasks which aren't threadgroup leaders;
      however, when an iteration is started css_task_iter_start() calls the
      inner helper function css_task_iter_advance_css_set() instead of
      css_task_iter_advance().  As the helper doesn't have the skip logic,
      when the first task to visit is a non-leader thread, it doesn't get
      skipped correctly as shown in the following example.
      
        # ps -L 2030
          PID   LWP TTY      STAT   TIME COMMAND
         2030  2030 pts/0    Sl+    0:00 ./test-thread
         2030  2031 pts/0    Sl+    0:00 ./test-thread
        # mkdir -p /sys/fs/cgroup/x/a/b
        # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
        # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
        # echo 2030 > /sys/fs/cgroup/x/a/cgroup.procs
        # cat /sys/fs/cgroup/x/a/cgroup.threads
        2030
        2031
        # cat /sys/fs/cgroup/x/cgroup.procs
        2030
        # echo 2030 > /sys/fs/cgroup/x/a/b/cgroup.threads
        # cat /sys/fs/cgroup/x/cgroup.procs
        2031
        2030
      
      The last read of cgroup.procs is incorrectly showing non-leader 2031
      in cgroup.procs output.
      
      This can be fixed by updating css_task_iter_advance() to handle the
      first advance and css_task_iters_tart() to call
      css_task_iter_advance() instead of the inner helper.  After the fix,
      the same commands result in the following (correct) result:
      
        # ps -L 2062
          PID   LWP TTY      STAT   TIME COMMAND
         2062  2062 pts/0    Sl+    0:00 ./test-thread
         2062  2063 pts/0    Sl+    0:00 ./test-thread
        # mkdir -p /sys/fs/cgroup/x/a/b
        # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
        # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
        # echo 2062 > /sys/fs/cgroup/x/a/cgroup.procs
        # cat /sys/fs/cgroup/x/a/cgroup.threads
        2062
        2063
        # cat /sys/fs/cgroup/x/cgroup.procs
        2062
        # echo 2062 > /sys/fs/cgroup/x/a/b/cgroup.threads
        # cat /sys/fs/cgroup/x/cgroup.procs
        2062
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: N"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Fixes: 8cfd8147 ("cgroup: implement cgroup v2 thread support")
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8a2fbdd5
  5. 05 10月, 2018 1 次提交
    • T
      cgroup: Fix dom_cgrp propagation when enabling threaded mode · 479adb89
      Tejun Heo 提交于
      A cgroup which is already a threaded domain may be converted into a
      threaded cgroup if the prerequisite conditions are met.  When this
      happens, all threaded descendant should also have their ->dom_cgrp
      updated to the new threaded domain cgroup.  Unfortunately, this
      propagation was missing leading to the following failure.
      
        # cd /sys/fs/cgroup/unified
        # cat cgroup.subtree_control    # show that no controllers are enabled
      
        # mkdir -p mycgrp/a/b/c
        # echo threaded > mycgrp/a/b/cgroup.type
      
        At this point, the hierarchy looks as follows:
      
            mycgrp [d]
      	  a [dt]
      	      b [t]
      		  c [inv]
      
        Now let's make node "a" threaded (and thus "mycgrp" s made "domain threaded"):
      
        # echo threaded > mycgrp/a/cgroup.type
      
        By this point, we now have a hierarchy that looks as follows:
      
            mycgrp [dt]
      	  a [t]
      	      b [t]
      		  c [inv]
      
        But, when we try to convert the node "c" from "domain invalid" to
        "threaded", we get ENOTSUP on the write():
      
        # echo threaded > mycgrp/a/b/c/cgroup.type
        sh: echo: write error: Operation not supported
      
      This patch fixes the problem by
      
      * Moving the opencoded ->dom_cgrp save and restoration in
        cgroup_enable_threaded() into cgroup_{save|restore}_control() so
        that mulitple cgroups can be handled.
      
      * Updating all threaded descendants' ->dom_cgrp to point to the new
        dom_cgrp when enabling threaded mode.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: N"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Reported-by: NAmin Jamali <ajamali@pivotal.io>
      Reported-by: NJoao De Almeida Pereira <jpereira@pivotal.io>
      Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
      Fixes: 454000ad ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
      Cc: stable@vger.kernel.org # v4.14+
      479adb89
  6. 21 7月, 2018 1 次提交
  7. 12 7月, 2018 1 次提交
    • S
      cgroup/tracing: Move taking of spin lock out of trace event handlers · e4f8d81c
      Steven Rostedt (VMware) 提交于
      It is unwise to take spin locks from the handlers of trace events.
      Mainly, because they can introduce lockups, because it introduces locks
      in places that are normally not tested. Worse yet, because trace events
      are tucked away in the include/trace/events/ directory, locks that are
      taken there are forgotten about.
      
      As a general rule, I tell people never to take any locks in a trace
      event handler.
      
      Several cgroup trace event handlers call cgroup_path() which eventually
      takes the kernfs_rename_lock spinlock. This injects the spinlock in the
      code without people realizing it. It also can cause issues for the
      PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
      handlers are called with preemption disabled.
      
      By moving the calculation of the cgroup_path() out of the trace event
      handlers and into a macro (surrounded by a
      trace_cgroup_##type##_enabled()), then we could place the cgroup_path
      into a string, and pass that to the trace event. Not only does this
      remove the taking of the spinlock out of the trace event handler, but
      it also means that the cgroup_path() only needs to be called once (it
      is currently called twice, once to get the length to reserver the
      buffer for, and once again to get the path itself. Now it only needs to
      be done once.
      Reported-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      e4f8d81c
  8. 07 6月, 2018 1 次提交
    • K
      treewide: Use struct_size() for kmalloc()-family · acafe7e3
      Kees Cook 提交于
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct foo {
          int stuff;
          void *entry[];
      };
      
      instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
      instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
      
      This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
      uses. It was done via automatic conversion with manual review for the
      "CHECKME" non-standard cases noted below, using the following Coccinelle
      script:
      
      // pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
      //                      sizeof *pkey_cache->table, GFP_KERNEL);
      @@
      identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
      expression GFP;
      identifier VAR, ELEMENT;
      expression COUNT;
      @@
      
      - alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
      + alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
      
      // mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
      @@
      identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
      expression GFP;
      identifier VAR, ELEMENT;
      expression COUNT;
      @@
      
      - alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
      + alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
      
      // Same pattern, but can't trivially locate the trailing element name,
      // or variable name.
      @@
      identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
      expression GFP;
      expression SOMETHING, COUNT, ELEMENT;
      @@
      
      - alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
      + alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)
      Signed-off-by: NKees Cook <keescook@chromium.org>
      acafe7e3
  9. 24 5月, 2018 1 次提交
    • T
      cgroup: css_set_lock should nest inside tasklist_lock · d8742e22
      Tejun Heo 提交于
      cgroup_enable_task_cg_lists() incorrectly nests non-irq-safe
      tasklist_lock inside irq-safe css_set_lock triggering the following
      lockdep warning.
      
        WARNING: possible irq lock inversion dependency detected
        4.17.0-rc1-00027-gb37d049 #6 Not tainted
        --------------------------------------------------------
        systemd/1 just changed the state of lock:
        00000000fe57773b (css_set_lock){..-.}, at: cgroup_free+0xf2/0x12a
        but this lock took another, SOFTIRQ-unsafe lock in the past:
         (tasklist_lock){.+.+}
      
        and interrupts could create inverse lock ordering between them.
      
        other info that might help us debug this:
         Possible interrupt unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(tasklist_lock);
      				 local_irq_disable();
      				 lock(css_set_lock);
      				 lock(tasklist_lock);
          <Interrupt>
            lock(css_set_lock);
      
         *** DEADLOCK ***
      
      The condition is highly unlikely to actually happen especially given
      that the path is executed only once per boot.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NBoqun Feng <boqun.feng@gmail.com>
      d8742e22
  10. 16 5月, 2018 1 次提交
  11. 27 4月, 2018 5 次提交
    • T
      cgroup: Add cgroup_subsys->css_rstat_flush() · 8f53470b
      Tejun Heo 提交于
      This patch adds cgroup_subsys->css_rstat_flush().  If a subsystem has
      this callback, its csses are linked on cgrp->css_rstat_list and rstat
      will call the function whenever the associated cgroup is flushed.
      Flush is also performed when such csses are released so that residual
      counts aren't lost.
      
      Combined with the rstat API previous patches factored out, this allows
      controllers to plug into rstat to manage their statistics in a
      scalable way.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      8f53470b
    • T
      cgroup: Distinguish base resource stat implementation from rstat · d4ff749b
      Tejun Heo 提交于
      Base resource stat accounts universial (not specific to any
      controller) resource consumptions on top of rstat.  Currently, its
      implementation is intermixed with rstat implementation making the code
      confusing to follow.
      
      This patch clarifies the distintion by doing the followings.
      
      * Encapsulate base resource stat counters, currently only cputime, in
        struct cgroup_base_stat.
      
      * Move prev_cputime into struct cgroup and initialize it with cgroup.
      
      * Rename the related functions so that they start with cgroup_base_stat.
      
      * Prefix the related variables and field names with b.
      
      This patch doesn't make any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      d4ff749b
    • T
      cgroup: Rename stat to rstat · c58632b3
      Tejun Heo 提交于
      stat is too generic a name and ends up causing subtle confusions.
      It'll be made generic so that controllers can plug into it, which will
      make the problem worse.  Let's rename it to something more specific -
      cgroup_rstat for cgroup recursive stat.
      
      This patch does the following renames.  No other changes.
      
      * cpu_stat	-> rstat_cpu
      * stat		-> rstat
      * ?cstat	-> ?rstatc
      
      Note that the renames are selective.  The unrenamed are the ones which
      implement basic resource statistics on top of rstat.  This will be
      further cleaned up in the following patches.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      c58632b3
    • T
      cgroup: Limit event generation frequency · b12e3583
      Tejun Heo 提交于
      ".events" files generate file modified event to notify userland of
      possible new events.  Some of the events can be quite bursty
      (e.g. memory high event) and generating notification each time is
      costly and pointless.
      
      This patch implements a event rate limit mechanism.  If a new
      notification is requested before 10ms has passed since the previous
      notification, the new notification is delayed till then.
      
      As this only delays from the second notification on in a given close
      cluster of notifications, userland reactions to notifications
      shouldn't be delayed at all in most cases while avoiding notification
      storms.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b12e3583
    • T
      cgroup: Explicitly remove core interface files · 5faaf05f
      Tejun Heo 提交于
      The "cgroup." core interface files bypass the usual interface removal
      path and get removed recursively along with the cgroup itself.  While
      this works now, the subtle discrepancy gets in the way of implementing
      common mechanisms.
      
      This patch updates cgroup core interface file handling so that it's
      consistent with controller interface files.  When added, the css is
      marked CSS_VISIBLE and they're explicitly removed before the cgroup is
      destroyed.
      
      This doesn't cause user-visible behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      5faaf05f
  12. 20 3月, 2018 1 次提交
  13. 22 2月, 2018 1 次提交
    • T
      cgroup: fix rule checking for threaded mode switching · d1897c95
      Tejun Heo 提交于
      A domain cgroup isn't allowed to be turned threaded if its subtree is
      populated or domain controllers are enabled.  cgroup_enable_threaded()
      depended on cgroup_can_be_thread_root() test to enforce this rule.  A
      parent which has populated domain descendants or have domain
      controllers enabled can't become a thread root, so the above rules are
      enforced automatically.
      
      However, for the root cgroup which can host mixed domain and threaded
      children, cgroup_can_be_thread_root() doesn't check any of those
      conditions and thus first level cgroups ends up escaping those rules.
      
      This patch fixes the bug by adding explicit checks for those rules in
      cgroup_enable_threaded().
      Reported-by: NMichael Kerrisk (man-pages) <mtk.manpages@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Fixes: 8cfd8147 ("cgroup: implement cgroup v2 thread support")
      Cc: stable@vger.kernel.org # v4.14+
      d1897c95
  14. 20 1月, 2018 1 次提交
    • T
      string: drop __must_check from strscpy() and restore strscpy() usages in cgroup · 08a77676
      Tejun Heo 提交于
      e7fd37ba ("cgroup: avoid copying strings longer than the buffers")
      converted possibly unsafe strncpy() usages in cgroup to strscpy().
      However, although the callsites are completely fine with truncated
      copied, because strscpy() is marked __must_check, it led to the
      following warnings.
      
        kernel/cgroup/cgroup.c: In function ‘cgroup_file_name’:
        kernel/cgroup/cgroup.c:1400:10: warning: ignoring return value of ‘strscpy’, declared with attribute warn_unused_result [-Wunused-result]
           strscpy(buf, cft->name, CGROUP_FILE_NAME_MAX);
      	       ^
      
      To avoid the warnings, 50034ed4 ("cgroup: use strlcpy() instead of
      strscpy() to avoid spurious warning") switched them to strlcpy().
      
      strlcpy() is worse than strlcpy() because it unconditionally runs
      strlen() on the source string, and the only reason we switched to
      strlcpy() here was because it was lacking __must_check, which doesn't
      reflect any material differences between the two function.  It's just
      that someone added __must_check to strscpy() and not to strlcpy().
      
      These basic string copy operations are used in variety of ways, and
      one of not-so-uncommon use cases is safely handling truncated copies,
      where the caller naturally doesn't care about the return value.  The
      __must_check doesn't match the actual use cases and forces users to
      opt for inferior variants which lack __must_check by happenstance or
      spread ugly (void) casts.
      
      Remove __must_check from strscpy() and restore strscpy() usages in
      cgroup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      08a77676
  15. 11 1月, 2018 1 次提交
  16. 20 12月, 2017 1 次提交
    • T
      cgroup: fix css_task_iter crash on CSS_TASK_ITER_PROC · 74d0833c
      Tejun Heo 提交于
      While teaching css_task_iter to handle skipping over tasks which
      aren't group leaders, bc2fb7ed ("cgroup: add @flags to
      css_task_iter_start() and implement CSS_TASK_ITER_PROCS") introduced a
      silly bug.
      
      CSS_TASK_ITER_PROCS is implemented by repeating
      css_task_iter_advance() while the advanced cursor is pointing to a
      non-leader thread.  However, the cursor variable, @l, wasn't updated
      when the iteration has to advance to the next css_set and the
      following repetition would operate on the terminal @l from the
      previous iteration which isn't pointing to a valid task leading to
      oopses like the following or infinite looping.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000254
        IP: __task_pid_nr_ns+0xc7/0xf0
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP
        ...
        CPU: 2 PID: 1 Comm: systemd Not tainted 4.14.4-200.fc26.x86_64 #1
        Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 3203 11/09/2017
        task: ffff88c4baee8000 task.stack: ffff96d5c3158000
        RIP: 0010:__task_pid_nr_ns+0xc7/0xf0
        RSP: 0018:ffff96d5c315bd50 EFLAGS: 00010206
        RAX: 0000000000000000 RBX: ffff88c4b68c6000 RCX: 0000000000000250
        RDX: ffffffffa5e47960 RSI: 0000000000000000 RDI: ffff88c490f6ab00
        RBP: ffff96d5c315bd50 R08: 0000000000001000 R09: 0000000000000005
        R10: ffff88c4be006b80 R11: ffff88c42f1b8004 R12: ffff96d5c315bf18
        R13: ffff88c42d7dd200 R14: ffff88c490f6a510 R15: ffff88c4b68c6000
        FS:  00007f9446f8ea00(0000) GS:ffff88c4be680000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000254 CR3: 00000007f956f000 CR4: 00000000003406e0
        Call Trace:
         cgroup_procs_show+0x19/0x30
         cgroup_seqfile_show+0x4c/0xb0
         kernfs_seq_show+0x21/0x30
         seq_read+0x2ec/0x3f0
         kernfs_fop_read+0x134/0x180
         __vfs_read+0x37/0x160
         ? security_file_permission+0x9b/0xc0
         vfs_read+0x8e/0x130
         SyS_read+0x55/0xc0
         entry_SYSCALL_64_fastpath+0x1a/0xa5
        RIP: 0033:0x7f94455f942d
        RSP: 002b:00007ffe81ba2d00 EFLAGS: 00000293 ORIG_RAX: 0000000000000000
        RAX: ffffffffffffffda RBX: 00005574e2233f00 RCX: 00007f94455f942d
        RDX: 0000000000001000 RSI: 00005574e2321a90 RDI: 000000000000002b
        RBP: 0000000000000000 R08: 00005574e2321a90 R09: 00005574e231de60
        R10: 00007f94458c8b38 R11: 0000000000000293 R12: 00007f94458c8ae0
        R13: 00007ffe81ba3800 R14: 0000000000000000 R15: 00005574e2116560
        Code: 04 74 0e 89 f6 48 8d 04 76 48 8d 04 c5 f0 05 00 00 48 8b bf b8 05 00 00 48 01 c7 31 c0 48 8b 0f 48 85 c9 74 18 8b b2 30 08 00 00 <3b> 71 04 77 0d 48 c1 e6 05 48 01 f1 48 3b 51 38 74 09 5d c3 8b
        RIP: __task_pid_nr_ns+0xc7/0xf0 RSP: ffff96d5c315bd50
      
      Fix it by moving the initialization of the cursor below the repeat
      label.  While at it, rename it to @next for readability.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Fixes: bc2fb7ed ("cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS")
      Cc: stable@vger.kernel.org # v4.14+
      Reported-by: NLaura Abbott <labbott@redhat.com>
      Reported-by: NBronek Kozicki <brok@incorrekt.com>
      Reported-by: NGeorge Amanakis <gamanakis@gmail.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      74d0833c
  17. 15 12月, 2017 1 次提交
  18. 12 12月, 2017 1 次提交
  19. 07 11月, 2017 2 次提交
    • R
      cgroup: export list of cgroups v2 features using sysfs · 5f2e6734
      Roman Gushchin 提交于
      The active development of cgroups v2 sometimes leads to a creation
      of interfaces, which are not turned on by default (to provide
      backward compatibility). It's handy to know from userspace, which
      cgroup v2 features are supported without calculating it based
      on the kernel version. So, let's export the list of such features
      using /sys/kernel/cgroup/features pseudo-file.
      
      The list is hardcoded and has to be extended when new functionality
      is added. Each feature is printed on a new line.
      
      Example:
        $ cat /sys/kernel/cgroup/features
        nsdelegate
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
      Signed-off-by: NTejun Heo <tj@kernel.org>
      5f2e6734
    • R
      cgroup: export list of delegatable control files using sysfs · 01ee6cfb
      Roman Gushchin 提交于
      Delegatable cgroup v2 control files may require special handling
      (e.g. chowning), and the exact list of such files varies between
      kernel versions (and likely to be extended in the future).
      
      To guarantee correctness of this list and simplify the life
      of userspace (systemd, first of all), let's export the list
      via /sys/kernel/cgroup/delegate pseudo-file.
      
      Format is siple: each control file name is printed on a new line.
      Example:
        $ cat /sys/kernel/cgroup/delegate
        cgroup.procs
        cgroup.subtree_control
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
      Signed-off-by: NTejun Heo <tj@kernel.org>
      01ee6cfb
  20. 30 10月, 2017 1 次提交
  21. 27 10月, 2017 1 次提交
    • T
      cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat · d41bf8c9
      Tejun Heo 提交于
      The basic cpu stat is currently shown with "cpu." prefix in
      cgroup.stat, and the same information is duplicated in cpu.stat when
      cpu controller is enabled.  This is ugly and not very scalable as we
      want to expand the coverage of stat information which is always
      available.
      
      This patch makes cgroup core always create "cpu.stat" file and show
      the basic cpu stat there and calls the cpu controller to show the
      extra stats when enabled.  This ensures that the same information
      isn't presented in multiple places and makes future expansion of basic
      stats easier.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      d41bf8c9
  22. 11 10月, 2017 1 次提交
  23. 10 10月, 2017 1 次提交
  24. 05 10月, 2017 2 次提交
    • A
      bpf: introduce BPF_PROG_QUERY command · 468e2f64
      Alexei Starovoitov 提交于
      introduce BPF_PROG_QUERY command to retrieve a set of either
      attached programs to given cgroup or a set of effective programs
      that will execute for events within a cgroup
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      468e2f64
    • A
      bpf: multi program support for cgroup+bpf · 324bda9e
      Alexei Starovoitov 提交于
      introduce BPF_F_ALLOW_MULTI flag that can be used to attach multiple
      bpf programs to a cgroup.
      
      The difference between three possible flags for BPF_PROG_ATTACH command:
      - NONE(default): No further bpf programs allowed in the subtree.
      - BPF_F_ALLOW_OVERRIDE: If a sub-cgroup installs some bpf program,
        the program in this cgroup yields to sub-cgroup program.
      - BPF_F_ALLOW_MULTI: If a sub-cgroup installs some bpf program,
        that cgroup program gets run in addition to the program in this cgroup.
      
      NONE and BPF_F_ALLOW_OVERRIDE existed before. This patch doesn't
      change their behavior. It only clarifies the semantics in relation
      to new flag.
      
      Only one program is allowed to be attached to a cgroup with
      NONE or BPF_F_ALLOW_OVERRIDE flag.
      Multiple programs are allowed to be attached to a cgroup with
      BPF_F_ALLOW_MULTI flag. They are executed in FIFO order
      (those that were attached first, run first)
      The programs of sub-cgroup are executed first, then programs of
      this cgroup and then programs of parent cgroup.
      All eligible programs are executed regardless of return code from
      earlier programs.
      
      To allow efficient execution of multiple programs attached to a cgroup
      and to avoid penalizing cgroups without any programs attached
      introduce 'struct bpf_prog_array' which is RCU protected array
      of pointers to bpf programs.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      for cgroup bits
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      324bda9e
  25. 26 9月, 2017 1 次提交
    • T
      cgroup: statically initialize init_css_set->dfl_cgrp · 38683148
      Tejun Heo 提交于
      Like other csets, init_css_set's dfl_cgrp is initialized when the cset
      gets linked.  init_css_set gets linked in cgroup_init().  This has
      been fine till now but the recently added basic CPU usage accounting
      may end up accessing dfl_cgrp of init before cgroup_init() leading to
      the following oops.
      
        SELinux:  Initializing.
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
        IP: account_system_index_time+0x60/0x90
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-rc2-00003-g041cd640 #10
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
        +1.9.3-20161025_171302-gandalf 04/01/2014
        task: ffffffff81e10480 task.stack: ffffffff81e00000
        RIP: 0010:account_system_index_time+0x60/0x90
        RSP: 0000:ffff880011e03cb8 EFLAGS: 00010002
        RAX: ffffffff81ef8800 RBX: ffffffff81e10480 RCX: 0000000000000003
        RDX: 0000000000000000 RSI: 00000000000f4240 RDI: 0000000000000000
        RBP: ffff880011e03cc0 R08: 0000000000010000 R09: 0000000000000000
        R10: 0000000000000020 R11: 0000003b9aca0000 R12: 000000000001c100
        R13: 0000000000000000 R14: ffffffff81e10480 R15: ffffffff81e03cd8
        FS:  0000000000000000(0000) GS:ffff880011e00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00000000000000b0 CR3: 0000000001e09000 CR4: 00000000000006b0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         <IRQ>
         account_system_time+0x45/0x60
         account_process_tick+0x5a/0x140
         update_process_times+0x22/0x60
         tick_periodic+0x2b/0x90
         tick_handle_periodic+0x25/0x70
         timer_interrupt+0x15/0x20
         __handle_irq_event_percpu+0x7e/0x1b0
         handle_irq_event_percpu+0x23/0x60
         handle_irq_event+0x42/0x70
         handle_level_irq+0x83/0x100
         handle_irq+0x6f/0x110
         do_IRQ+0x46/0xd0
         common_interrupt+0x9d/0x9d
      
      Fix it by statically initializing init_css_set.dfl_cgrp so that init's
      default cgroup is accessible from the get-go.
      
      Fixes: 041cd640 ("cgroup: Implement cgroup2 basic CPU usage accounting")
      Reported-by: N“kbuild-all@01.org” <kbuild-all@01.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      38683148
  26. 25 9月, 2017 1 次提交
    • T
      cgroup: Implement cgroup2 basic CPU usage accounting · 041cd640
      Tejun Heo 提交于
      In cgroup1, while cpuacct isn't actually controlling any resources, it
      is a separate controller due to combination of two factors -
      1. enabling cpu controller has significant side effects, and 2. we
      have to pick one of the hierarchies to account CPU usages on.  cpuacct
      controller is effectively used to designate a hierarchy to track CPU
      usages on.
      
      cgroup2's unified hierarchy removes the second reason and we can
      account basic CPU usages by default.  While we can use cpuacct for
      this purpose, both its interface and implementation leave a lot to be
      desired - it collects and exposes two sources of truth which don't
      agree with each other and some of the exposed statistics don't make
      much sense.  Also, it propagates all the way up the hierarchy on each
      accounting event which is unnecessary.
      
      This patch adds basic resource accounting mechanism to cgroup2's
      unified hierarchy and accounts CPU usages using it.
      
      * All accountings are done per-cpu and don't propagate immediately.
        It just bumps the per-cgroup per-cpu counters and links to the
        parent's updated list if not already on it.
      
      * On a read, the per-cpu counters are collected into the global ones
        and then propagated upwards.  Only the per-cpu counters which have
        changed since the last read are propagated.
      
      * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
        prefix.  Total usage is collected from scheduling events.  User/sys
        breakdown is sourced from tick sampling and adjusted to the usage
        using cputime_adjust().
      
      This keeps the accounting side hot path O(1) and per-cpu and the read
      side O(nr_updated_since_last_read).
      
      v2: Minor changes and documentation updates as suggested by Waiman and
          Roman.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      041cd640
  27. 22 9月, 2017 1 次提交
    • W
      cgroup: Reinit cgroup_taskset structure before cgroup_migrate_execute() returns · c4fa6c43
      Waiman Long 提交于
      The cgroup_taskset structure within the larger cgroup_mgctx structure
      is supposed to be used once and then discarded. That is not really the
      case in the hotplug code path:
      
      cpuset_hotplug_workfn()
       - cgroup_transfer_tasks()
         - cgroup_migrate()
           - cgroup_migrate_add_task()
           - cgroup_migrate_execute()
      
      In this case, the cgroup_migrate() function is called multiple time
      with the same cgroup_mgctx structure to transfer the tasks from
      one cgroup to another one-by-one. The second time cgroup_migrate()
      is called, the cgroup_taskset will be in an incorrect state and so
      may cause the system to panic. For example,
      
        [  150.888410] Faulting instruction address: 0xc0000000001db648
        [  150.888414] Oops: Kernel access of bad area, sig: 11 [#1]
        [  150.888417] SMP NR_CPUS=2048
        [  150.888417] NUMA
        [  150.888419] pSeries
          :
        [  150.888545] NIP [c0000000001db648] cpuset_can_attach+0x58/0x1b0
        [  150.888548] LR [c0000000001db638] cpuset_can_attach+0x48/0x1b0
        [  150.888551] Call Trace:
        [  150.888554] [c0000005f65cb940] [c0000000001db638] cpuset_can_attach+0x48/0x1b 0 (unreliable)
        [  150.888559] [c0000005f65cb9a0] [c0000000001cff04] cgroup_migrate_execute+0xc4/0x4b0
        [  150.888563] [c0000005f65cba20] [c0000000001d7d14] cgroup_transfer_tasks+0x1d4/0x370
        [  150.888568] [c0000005f65cbb70] [c0000000001ddcb0] cpuset_hotplug_workfn+0x710/0x8f0
        [  150.888572] [c0000005f65cbc80] [c00000000012032c] process_one_work+0x1ac/0x4d0
        [  150.888576] [c0000005f65cbd20] [c0000000001206f8] worker_thread+0xa8/0x5b0
        [  150.888580] [c0000005f65cbdc0] [c0000000001293f8] kthread+0x168/0x1b0
        [  150.888584] [c0000005f65cbe30] [c00000000000b368] ret_from_kernel_thread+0x5c/0x74
      
      To allow reuse of the cgroup_mgctx structure, some fields in that
      structure are now re-initialized at the end of cgroup_migrate_execute()
      function call so that the structure can be reused again in a later
      iteration without causing problem.
      
      This bug was introduced in the commit e595cd70 ("group: track
      migration context in cgroup_mgctx") in 4.11. This commit moves the
      cgroup_taskset initialization out of cgroup_migrate(). The commit
      10467270fb3 ("cgroup: don't call migration methods if there are no
      tasks to migrate") helped, but did not completely resolve the problem.
      
      Fixes: e595cd70 ("group: track migration context in cgroup_mgctx")
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v4.11+
      c4fa6c43
  28. 07 9月, 2017 1 次提交
  29. 12 8月, 2017 1 次提交
  30. 11 8月, 2017 1 次提交
    • T
      cgroup: misc changes · 3e48930c
      Tejun Heo 提交于
      Misc trivial changes to prepare for future changes.  No functional
      difference.
      
      * Expose cgroup_get(), cgroup_tryget() and cgroup_parent().
      
      * Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp.
      
      * Rename cgroup_stats_show() to cgroup_stat_show() for consistency
        with the file name.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      3e48930c
  31. 03 8月, 2017 4 次提交
    • T
      cgroup: short-circuit cset_cgroup_from_root() on the default hierarchy · 13d82fb7
      Tejun Heo 提交于
      Each css_set directly points to the default cgroup it belongs to, so
      there's no reason to walk the cgrp_links list on the default
      hierarchy.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      13d82fb7
    • R
      cgroup: re-use the parent pointer in cgroup_destroy_locked() · 5a621e6c
      Roman Gushchin 提交于
      As we already have a pointer to the parent cgroup in
      cgroup_destroy_locked(), we don't need to calculate it again
      to pass as an argument for cgroup1_check_for_release().
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: linux-kernel@vger.kernel.org
      5a621e6c
    • R
      cgroup: add cgroup.stat interface with basic hierarchy stats · ec39225c
      Roman Gushchin 提交于
      A cgroup can consume resources even after being deleted by a user.
      For example, writing back dirty pages should be accounted and
      limited, despite the corresponding cgroup might contain no processes
      and being deleted by a user.
      
      In the current implementation a cgroup can remain in such "dying" state
      for an undefined amount of time. For instance, if a memory cgroup
      contains a pge, mlocked by a process belonging to an other cgroup.
      
      Although the lifecycle of a dying cgroup is out of user's control,
      it's important to have some insight of what's going on under the hood.
      
      In particular, it's handy to have a counter which will allow
      to detect css leaks.
      
      To solve this problem, add a cgroup.stat interface to
      the base cgroup control files with the following metrics:
      
      nr_descendants		total number of visible descendant cgroups
      nr_dying_descendants	total number of dying descendant cgroups
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      ec39225c
    • R
      cgroup: implement hierarchy limits · 1a926e0b
      Roman Gushchin 提交于
      Creating cgroup hierearchies of unreasonable size can affect
      overall system performance. A user might want to limit the
      size of cgroup hierarchy. This is especially important if a user
      is delegating some cgroup sub-tree.
      
      To address this issue, introduce an ability to control
      the size of cgroup hierarchy.
      
      The cgroup.max.descendants control file allows to set the maximum
      allowed number of descendant cgroups.
      The cgroup.max.depth file controls the maximum depth of the cgroup
      tree. Both are single value r/w files, with "max" default value.
      
      The control files exist on each hierarchy level (including root).
      When a new cgroup is created, we check the total descendants
      and depth limits on each level, and if none of them are exceeded,
      a new cgroup is created.
      
      Only alive cgroups are counted, removed (dying) cgroups are
      ignored.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      1a926e0b