1. 25 12月, 2019 1 次提交
  2. 07 11月, 2019 2 次提交
  3. 30 10月, 2019 5 次提交
  4. 09 8月, 2019 4 次提交
    • T
      cgroup: Fix css_task_iter_advance_css_set() cset skip condition · ebda41dd
      Tejun Heo 提交于
      commit c596687a008b579c503afb7a64fcacc7270fae9e upstream.
      
      While adding handling for dying task group leaders c03cd7738a83
      ("cgroup: Include dying leaders with live threads in PROCS
      iterations") added an inverted cset skip condition to
      css_task_iter_advance_css_set().  It should skip cset if it's
      completely empty but was incorrectly testing for the inverse condition
      for the dying_tasks list.  Fix it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Fixes: c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS iterations")
      Reported-by: syzbot+d4bba5ccd4f9a2a68681@syzkaller.appspotmail.com
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ebda41dd
    • T
      cgroup: css_task_iter_skip()'d iterators must be advanced before accessed · 0a9abd27
      Tejun Heo 提交于
      commit cee0c33c546a93957a52ae9ab6bebadbee765ec5 upstream.
      
      b636fd38dc40 ("cgroup: Implement css_task_iter_skip()") introduced
      css_task_iter_skip() which is used to fix task iterations skipping
      dying threadgroup leaders with live threads.  Skipping is implemented
      as a subportion of full advancing but css_task_iter_next() forgot to
      fully advance a skipped iterator before determining the next task to
      visit causing it to return invalid task pointers.
      
      Fix it by making css_task_iter_next() fully advance the iterator if it
      has been skipped since the previous iteration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: syzbot
      Link: http://lkml.kernel.org/r/00000000000097025d058a7fd785@google.com
      Fixes: b636fd38dc40 ("cgroup: Implement css_task_iter_skip()")
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0a9abd27
    • T
      cgroup: Include dying leaders with live threads in PROCS iterations · 4340d175
      Tejun Heo 提交于
      commit c03cd7738a83b13739f00546166969342c8ff014 upstream.
      
      CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
      this means that a process with dying leader and live threads will be
      skipped.  IOW, cgroup.procs might be empty while cgroup.threads isn't,
      which is confusing to say the least.
      
      Fix it by making cset track dying tasks and include dying leaders with
      live threads in PROCS iteration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NTopi Miettinen <toiwoton@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4340d175
    • T
      cgroup: Implement css_task_iter_skip() · 370b9e63
      Tejun Heo 提交于
      commit b636fd38dc40113f853337a7d2a6885ad23b8811 upstream.
      
      When a task is moved out of a cset, task iterators pointing to the
      task are advanced using the normal css_task_iter_advance() call.  This
      is fine but we'll be tracking dying tasks on csets and thus moving
      tasks from cset->tasks to (to be added) cset->dying_tasks.  When we
      remove a task from cset->tasks, if we advance the iterators, they may
      move over to the next cset before we had the chance to add the task
      back on the dying list, which can allow the task to escape iteration.
      
      This patch separates out skipping from advancing.  Skipping only moves
      the affected iterators to the next pointer rather than fully advancing
      it and the following advancing will recognize that the cursor has
      already been moved forward and do the rest of advancing.  This ensures
      that when a task moves from one list to another in its cset, as long
      as it moves in the right direction, it's always visible to iteration.
      
      This doesn't cause any visible behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      370b9e63
  5. 10 7月, 2019 1 次提交
    • J
      cpuset: restore sanity to cpuset_cpus_allowed_fallback() · b7747ecb
      Joel Savitz 提交于
      [ Upstream commit d477f8c202d1f0d4791ab1263ca7657bbe5cf79e ]
      
      In the case that a process is constrained by taskset(1) (i.e.
      sched_setaffinity(2)) to a subset of available cpus, and all of those are
      subsequently offlined, the scheduler will set tsk->cpus_allowed to
      the current value of task_cs(tsk)->effective_cpus.
      
      This is done via a call to do_set_cpus_allowed() in the context of
      cpuset_cpus_allowed_fallback() made by the scheduler when this case is
      detected. This is the only call made to cpuset_cpus_allowed_fallback()
      in the latest mainline kernel.
      
      However, this is not sane behavior.
      
      I will demonstrate this on a system running the latest upstream kernel
      with the following initial configuration:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	ffffffff,fffffff
      	Cpus_allowed_list:	0-63
      
      (Where cpus 32-63 are provided via smt.)
      
      If we limit our current shell process to cpu2 only and then offline it
      and reonline it:
      
      	# taskset -p 4 $$
      	pid 2272's current affinity mask: ffffffffffffffff
      	pid 2272's new affinity mask: 4
      
      	# echo off > /sys/devices/system/cpu/cpu2/online
      	# dmesg | tail -3
      	[ 2195.866089] process 2272 (bash) no longer affine to cpu2
      	[ 2195.872700] IRQ 114: no longer affine to CPU2
      	[ 2195.879128] smpboot: CPU 2 is now offline
      
      	# echo on > /sys/devices/system/cpu/cpu2/online
      	# dmesg | tail -1
      	[ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4
      
      We see that our current process now has an affinity mask containing
      every cpu available on the system _except_ the one we originally
      constrained it to:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:   ffffffff,fffffffb
      	Cpus_allowed_list:      0-1,3-63
      
      This is not sane behavior, as the scheduler can now not only place the
      process on previously forbidden cpus, it can't even schedule it on
      the cpu it was originally constrained to!
      
      Other cases result in even more exotic affinity masks. Take for instance
      a process with an affinity mask containing only cpus provided by smt at
      the moment that smt is toggled, in a configuration such as the following:
      
      	# taskset -p f000000000 $$
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	000000f0,00000000
      	Cpus_allowed_list:	36-39
      
      A double toggle of smt results in the following behavior:
      
      	# echo off > /sys/devices/system/cpu/smt/control
      	# echo on > /sys/devices/system/cpu/smt/control
      	# grep -i cpus /proc/$$/status
      	Cpus_allowed:	ffffff00,ffffffff
      	Cpus_allowed_list:	0-31,40-63
      
      This is even less sane than the previous case, as the new affinity mask
      excludes all smt-provided cpus with ids less than those that were
      previously in the affinity mask, as well as those that were actually in
      the mask.
      
      With this patch applied, both of these cases end in the following state:
      
      	# grep -i cpu /proc/$$/status
      	Cpus_allowed:	ffffffff,ffffffff
      	Cpus_allowed_list:	0-63
      
      The original policy is discarded. Though not ideal, it is the simplest way
      to restore sanity to this fallback case without reinventing the cpuset
      wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
      for the previous affinity mask to be restored in this fallback case can use
      that mechanism instead.
      
      This patch modifies scheduler behavior by instead resetting the mask to
      task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
      mode. I tested the cases above on both modes.
      
      Note that the scheduler uses this fallback mechanism if and only if
      _every_ other valid avenue has been traveled, and it is the last resort
      before calling BUG().
      Suggested-by: NWaiman Long <longman@redhat.com>
      Suggested-by: NPhil Auld <pauld@redhat.com>
      Signed-off-by: NJoel Savitz <jsavitz@redhat.com>
      Acked-by: NPhil Auld <pauld@redhat.com>
      Acked-by: NWaiman Long <longman@redhat.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b7747ecb
  6. 31 5月, 2019 1 次提交
    • R
      cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock · 4e4d5cea
      Roman Gushchin 提交于
      [ Upstream commit 4dcabece4c3a9f9522127be12cc12cc120399b2f ]
      
      The number of descendant cgroups and the number of dying
      descendant cgroups are currently synchronized using the cgroup_mutex.
      
      The number of descendant cgroups will be required by the cgroup v2
      freezer, which will use it to determine if a cgroup is frozen
      (depending on total number of descendants and number of frozen
      descendants). It's not always acceptable to grab the cgroup_mutex,
      especially from quite hot paths (e.g. exit()).
      
      To avoid this, let's additionally synchronize these counters using
      the css_set_lock.
      
      So, it's safe to read these counters with either cgroup_mutex or
      css_set_lock locked, and for changing both locks should be acquired.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      4e4d5cea
  7. 06 4月, 2019 2 次提交
    • O
      cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting · d0bc74c5
      Oleg Nesterov 提交于
      [ Upstream commit 51bee5abeab2058ea5813c5615d6197a23dbf041 ]
      
      The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
      needs pids_free() to uncharge the pid.
      
      However, ->free() is called from __put_task_struct()->cgroup_free() and this
      is too late. Even the trivial program which does
      
      	for (;;) {
      		int pid = fork();
      		assert(pid >= 0);
      		if (pid)
      			wait(NULL);
      		else
      			exit(0);
      	}
      
      can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
      implies an RCU gp after the task/pid goes away and before the final put().
      
      Test-case:
      
      	mkdir -p /tmp/CG
      	mount -t cgroup2 none /tmp/CG
      	echo '+pids' > /tmp/CG/cgroup.subtree_control
      
      	mkdir /tmp/CG/PID
      	echo 2 > /tmp/CG/PID/pids.max
      
      	perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
      	echo $! > /tmp/CG/PID/cgroup.procs
      
      Without this patch the forking process fails soon after migration.
      
      Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
      into the new helper, cgroup_release(), called by release_task() which actually
      frees the pid(s).
      Reported-by: NHerton R. Krzesinski <hkrzesin@redhat.com>
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d0bc74c5
    • T
      cgroup, rstat: Don't flush subtree root unless necessary · a74ebf04
      Tejun Heo 提交于
      [ Upstream commit b4ff1b44bcd384d22fcbac6ebaf9cc0d33debe50 ]
      
      cgroup_rstat_cpu_pop_updated() is used to traverse the updated cgroups
      on flush.  While it was only visiting updated ones in the subtree, it
      was visiting @root unconditionally.  We can easily check whether @root
      is updated or not by looking at its ->updated_next just as with the
      cgroups in the subtree.
      
      * Remove the unnecessary cgroup_parent() test.  The system root cgroup
        is never updated and thus its ->updated_next is always NULL.  No
        need to test whether cgroup_parent() exists in addition to
        ->updated_next.
      
      * Terminate traverse if ->updated_next is NULL.  This can only happen
        for subtree @root and there's no reason to visit it if it's not
        marked updated.
      
      This reduces cpu consumption when reading a lot of rstat backed files.
      In a micro benchmark reading stat from ~1600 cgroups, the sys time was
      lowered by >40%.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      a74ebf04
  8. 24 3月, 2019 1 次提交
    • A
      fix cgroup_do_mount() handling of failure exits · 7a8b0484
      Al Viro 提交于
      commit 399504e21a10be16dd1408ba0147367d9d82a10c upstream.
      
      same story as with last May fixes in sysfs (7b745a4e
      "unfuck sysfs_mount()"); new_sb is left uninitialized
      in case of early errors in kernfs_mount_ns() and papering
      over it by treating any error from kernfs_mount_ns() as
      equivalent to !new_ns ends up conflating the cases when
      objects had never been transferred to a superblock with
      ones when that has happened and resulting new superblock
      had been dropped.  Easily fixed (same way as in sysfs
      case).  Additionally, there's a superblock leak on
      kernfs_node_dentry() failure *and* a dentry leak inside
      kernfs_node_dentry() itself - the latter on probably
      impossible errors, but the former not impossible to trigger
      (as the matter of fact, injecting allocation failures
      at that point *does* trigger it).
      
      Cc: stable@kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7a8b0484
  9. 13 2月, 2019 1 次提交
    • O
      cgroup: fix parsing empty mount option string · 4b5abffd
      Ondrej Mosnacek 提交于
      [ Upstream commit e250d91d65750a0c0c62483ac4f9f357e7317617 ]
      
      This fixes the case where all mount options specified are consumed by an
      LSM and all that's left is an empty string. In this case cgroupfs should
      accept the string and not fail.
      
      How to reproduce (with SELinux enabled):
      
          # umount /sys/fs/cgroup/unified
          # mount -o context=system_u:object_r:cgroup_t:s0 -t cgroup2 cgroup2 /sys/fs/cgroup/unified
          mount: /sys/fs/cgroup/unified: wrong fs type, bad option, bad superblock on cgroup2, missing codepage or helper program, or other error.
          # dmesg | tail -n 1
          [   31.575952] cgroup: cgroup2: unknown option ""
      
      Fixes: 67e9c74b ("cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type")
      [NOTE: should apply on top of commit 5136f636 ("cgroup: implement "nsdelegate" mount option"), older versions need manual rebase]
      Suggested-by: NStephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: NOndrej Mosnacek <omosnace@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      4b5abffd
  10. 10 1月, 2019 1 次提交
    • T
      cgroup: fix CSS_TASK_ITER_PROCS · 8a2fbdd5
      Tejun Heo 提交于
      commit e9d81a1bc2c48ea9782e3e8b53875f419766ef47 upstream.
      
      CSS_TASK_ITER_PROCS implements process-only iteration by making
      css_task_iter_advance() skip tasks which aren't threadgroup leaders;
      however, when an iteration is started css_task_iter_start() calls the
      inner helper function css_task_iter_advance_css_set() instead of
      css_task_iter_advance().  As the helper doesn't have the skip logic,
      when the first task to visit is a non-leader thread, it doesn't get
      skipped correctly as shown in the following example.
      
        # ps -L 2030
          PID   LWP TTY      STAT   TIME COMMAND
         2030  2030 pts/0    Sl+    0:00 ./test-thread
         2030  2031 pts/0    Sl+    0:00 ./test-thread
        # mkdir -p /sys/fs/cgroup/x/a/b
        # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
        # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
        # echo 2030 > /sys/fs/cgroup/x/a/cgroup.procs
        # cat /sys/fs/cgroup/x/a/cgroup.threads
        2030
        2031
        # cat /sys/fs/cgroup/x/cgroup.procs
        2030
        # echo 2030 > /sys/fs/cgroup/x/a/b/cgroup.threads
        # cat /sys/fs/cgroup/x/cgroup.procs
        2031
        2030
      
      The last read of cgroup.procs is incorrectly showing non-leader 2031
      in cgroup.procs output.
      
      This can be fixed by updating css_task_iter_advance() to handle the
      first advance and css_task_iters_tart() to call
      css_task_iter_advance() instead of the inner helper.  After the fix,
      the same commands result in the following (correct) result:
      
        # ps -L 2062
          PID   LWP TTY      STAT   TIME COMMAND
         2062  2062 pts/0    Sl+    0:00 ./test-thread
         2062  2063 pts/0    Sl+    0:00 ./test-thread
        # mkdir -p /sys/fs/cgroup/x/a/b
        # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
        # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
        # echo 2062 > /sys/fs/cgroup/x/a/cgroup.procs
        # cat /sys/fs/cgroup/x/a/cgroup.threads
        2062
        2063
        # cat /sys/fs/cgroup/x/cgroup.procs
        2062
        # echo 2062 > /sys/fs/cgroup/x/a/b/cgroup.threads
        # cat /sys/fs/cgroup/x/cgroup.procs
        2062
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: N"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Fixes: 8cfd8147 ("cgroup: implement cgroup v2 thread support")
      Cc: stable@vger.kernel.org # v4.14+
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8a2fbdd5
  11. 05 10月, 2018 1 次提交
    • T
      cgroup: Fix dom_cgrp propagation when enabling threaded mode · 479adb89
      Tejun Heo 提交于
      A cgroup which is already a threaded domain may be converted into a
      threaded cgroup if the prerequisite conditions are met.  When this
      happens, all threaded descendant should also have their ->dom_cgrp
      updated to the new threaded domain cgroup.  Unfortunately, this
      propagation was missing leading to the following failure.
      
        # cd /sys/fs/cgroup/unified
        # cat cgroup.subtree_control    # show that no controllers are enabled
      
        # mkdir -p mycgrp/a/b/c
        # echo threaded > mycgrp/a/b/cgroup.type
      
        At this point, the hierarchy looks as follows:
      
            mycgrp [d]
      	  a [dt]
      	      b [t]
      		  c [inv]
      
        Now let's make node "a" threaded (and thus "mycgrp" s made "domain threaded"):
      
        # echo threaded > mycgrp/a/cgroup.type
      
        By this point, we now have a hierarchy that looks as follows:
      
            mycgrp [dt]
      	  a [t]
      	      b [t]
      		  c [inv]
      
        But, when we try to convert the node "c" from "domain invalid" to
        "threaded", we get ENOTSUP on the write():
      
        # echo threaded > mycgrp/a/b/c/cgroup.type
        sh: echo: write error: Operation not supported
      
      This patch fixes the problem by
      
      * Moving the opencoded ->dom_cgrp save and restoration in
        cgroup_enable_threaded() into cgroup_{save|restore}_control() so
        that mulitple cgroups can be handled.
      
      * Updating all threaded descendants' ->dom_cgrp to point to the new
        dom_cgrp when enabling threaded mode.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: N"Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Reported-by: NAmin Jamali <ajamali@pivotal.io>
      Reported-by: NJoao De Almeida Pereira <jpereira@pivotal.io>
      Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
      Fixes: 454000ad ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
      Cc: stable@vger.kernel.org # v4.14+
      479adb89
  12. 21 7月, 2018 1 次提交
  13. 12 7月, 2018 1 次提交
    • S
      cgroup/tracing: Move taking of spin lock out of trace event handlers · e4f8d81c
      Steven Rostedt (VMware) 提交于
      It is unwise to take spin locks from the handlers of trace events.
      Mainly, because they can introduce lockups, because it introduces locks
      in places that are normally not tested. Worse yet, because trace events
      are tucked away in the include/trace/events/ directory, locks that are
      taken there are forgotten about.
      
      As a general rule, I tell people never to take any locks in a trace
      event handler.
      
      Several cgroup trace event handlers call cgroup_path() which eventually
      takes the kernfs_rename_lock spinlock. This injects the spinlock in the
      code without people realizing it. It also can cause issues for the
      PREEMPT_RT patch, as the spinlock becomes a mutex, and the trace event
      handlers are called with preemption disabled.
      
      By moving the calculation of the cgroup_path() out of the trace event
      handlers and into a macro (surrounded by a
      trace_cgroup_##type##_enabled()), then we could place the cgroup_path
      into a string, and pass that to the trace event. Not only does this
      remove the taking of the spinlock out of the trace event handler, but
      it also means that the cgroup_path() only needs to be called once (it
      is currently called twice, once to get the length to reserver the
      buffer for, and once again to get the path itself. Now it only needs to
      be done once.
      Reported-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      e4f8d81c
  14. 16 6月, 2018 1 次提交
  15. 13 6月, 2018 2 次提交
    • K
      treewide: Use array_size() in vmalloc() · 42bc47b3
      Kees Cook 提交于
      The vmalloc() function has no 2-factor argument form, so multiplication
      factors need to be wrapped in array_size(). This patch replaces cases of:
      
              vmalloc(a * b)
      
      with:
              vmalloc(array_size(a, b))
      
      as well as handling cases of:
      
              vmalloc(a * b * c)
      
      with:
      
              vmalloc(array3_size(a, b, c))
      
      This does, however, attempt to ignore constant size factors like:
      
              vmalloc(4 * 1024)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        vmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        vmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        vmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        vmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT_ID)
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT_ID
      +	array_size(COUNT_ID, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT_CONST)
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT_CONST
      +	array_size(COUNT_CONST, sizeof(THING))
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
        vmalloc(
      -	SIZE * COUNT
      +	array_size(COUNT, SIZE)
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        vmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        vmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        vmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        vmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        vmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        vmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        vmalloc(C1 * C2 * C3, ...)
      |
        vmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants.
      @@
      expression E1, E2;
      constant C1, C2;
      @@
      
      (
        vmalloc(C1 * C2, ...)
      |
        vmalloc(
      -	E1 * E2
      +	array_size(E1, E2)
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      42bc47b3
    • K
      treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Kees Cook 提交于
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a * b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: NKees Cook <keescook@chromium.org>
      6da2ec56
  16. 07 6月, 2018 1 次提交
    • K
      treewide: Use struct_size() for kmalloc()-family · acafe7e3
      Kees Cook 提交于
      One of the more common cases of allocation size calculations is finding
      the size of a structure that has a zero-sized array at the end, along
      with memory for some number of elements for that array. For example:
      
      struct foo {
          int stuff;
          void *entry[];
      };
      
      instance = kmalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
      
      Instead of leaving these open-coded and prone to type mistakes, we can
      now use the new struct_size() helper:
      
      instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);
      
      This patch makes the changes for kmalloc()-family (and kvmalloc()-family)
      uses. It was done via automatic conversion with manual review for the
      "CHECKME" non-standard cases noted below, using the following Coccinelle
      script:
      
      // pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
      //                      sizeof *pkey_cache->table, GFP_KERNEL);
      @@
      identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
      expression GFP;
      identifier VAR, ELEMENT;
      expression COUNT;
      @@
      
      - alloc(sizeof(*VAR) + COUNT * sizeof(*VAR->ELEMENT), GFP)
      + alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
      
      // mr = kzalloc(sizeof(*mr) + m * sizeof(mr->map[0]), GFP_KERNEL);
      @@
      identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
      expression GFP;
      identifier VAR, ELEMENT;
      expression COUNT;
      @@
      
      - alloc(sizeof(*VAR) + COUNT * sizeof(VAR->ELEMENT[0]), GFP)
      + alloc(struct_size(VAR, ELEMENT, COUNT), GFP)
      
      // Same pattern, but can't trivially locate the trailing element name,
      // or variable name.
      @@
      identifier alloc =~ "kmalloc|kzalloc|kvmalloc|kvzalloc";
      expression GFP;
      expression SOMETHING, COUNT, ELEMENT;
      @@
      
      - alloc(sizeof(SOMETHING) + COUNT * sizeof(ELEMENT), GFP)
      + alloc(CHECKME_struct_size(&SOMETHING, ELEMENT, COUNT), GFP)
      Signed-off-by: NKees Cook <keescook@chromium.org>
      acafe7e3
  17. 24 5月, 2018 1 次提交
    • T
      cgroup: css_set_lock should nest inside tasklist_lock · d8742e22
      Tejun Heo 提交于
      cgroup_enable_task_cg_lists() incorrectly nests non-irq-safe
      tasklist_lock inside irq-safe css_set_lock triggering the following
      lockdep warning.
      
        WARNING: possible irq lock inversion dependency detected
        4.17.0-rc1-00027-gb37d049 #6 Not tainted
        --------------------------------------------------------
        systemd/1 just changed the state of lock:
        00000000fe57773b (css_set_lock){..-.}, at: cgroup_free+0xf2/0x12a
        but this lock took another, SOFTIRQ-unsafe lock in the past:
         (tasklist_lock){.+.+}
      
        and interrupts could create inverse lock ordering between them.
      
        other info that might help us debug this:
         Possible interrupt unsafe locking scenario:
      
      	 CPU0                    CPU1
      	 ----                    ----
          lock(tasklist_lock);
      				 local_irq_disable();
      				 lock(css_set_lock);
      				 lock(tasklist_lock);
          <Interrupt>
            lock(css_set_lock);
      
         *** DEADLOCK ***
      
      The condition is highly unlikely to actually happen especially given
      that the path is executed only once per boot.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NBoqun Feng <boqun.feng@gmail.com>
      d8742e22
  18. 16 5月, 2018 1 次提交
  19. 08 5月, 2018 1 次提交
  20. 27 4月, 2018 11 次提交
    • T
      cgroup: Make cgroup_rstat_updated() ready for root cgroup usage · c43c5ea7
      Tejun Heo 提交于
      cgroup_rstat_updated() ensures that the cgroup's rstat is linked to
      the parent.  If there's no parent, it never gets linked and the
      function ends up grabbing and releasing the cgroup_rstat_lock each
      time for no reason which can be expensive.
      
      This hasn't been a problem till now because nobody was calling the
      function for the root cgroup but rstat is gonna be exposed to
      controllers and use cases, so let's get ready.  Make
      cgroup_rstat_updated() an no-op for the root cgroup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      c43c5ea7
    • T
      cgroup: Add memory barriers to plug cgroup_rstat_updated() race window · 9a9e97b2
      Tejun Heo 提交于
      cgroup_rstat_updated() has a small race window where an updated
      signaling can race with flush and could be lost till the next update.
      This wasn't a problem for the existing usages, but we plan to use
      rstat to track counters which need to be accurate.
      
      This patch plugs the race window by synchronizing
      cgroup_rstat_updated() and flush path with memory barriers around
      cgroup_rstat_cpu->updated_next pointer.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      9a9e97b2
    • T
      cgroup: Add cgroup_subsys->css_rstat_flush() · 8f53470b
      Tejun Heo 提交于
      This patch adds cgroup_subsys->css_rstat_flush().  If a subsystem has
      this callback, its csses are linked on cgrp->css_rstat_list and rstat
      will call the function whenever the associated cgroup is flushed.
      Flush is also performed when such csses are released so that residual
      counts aren't lost.
      
      Combined with the rstat API previous patches factored out, this allows
      controllers to plug into rstat to manage their statistics in a
      scalable way.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      8f53470b
    • T
      cgroup: Replace cgroup_rstat_mutex with a spinlock · 0fa294fb
      Tejun Heo 提交于
      Currently, rstat flush path is protected with a mutex which is fine as
      all the existing users are from interface file show path.  However,
      rstat is being generalized for use by controllers and flushing from
      atomic contexts will be necessary.
      
      This patch replaces cgroup_rstat_mutex with a spinlock and adds a
      irq-safe flush function - cgroup_rstat_flush_irqsafe().  Explicit
      yield handling is added to the flush path so that other flush
      functions can yield to other threads and flushers.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      0fa294fb
    • T
      cgroup: Factor out and expose cgroup_rstat_*() interface functions · 6162cef0
      Tejun Heo 提交于
      cgroup_rstat is being generalized so that controllers can use it too.
      This patch factors out and exposes the following interface functions.
      
      * cgroup_rstat_updated(): Renamed from cgroup_rstat_cpu_updated() for
        consistency.
      
      * cgroup_rstat_flush_hold/release(): Factored out from base stat
        implementation.
      
      * cgroup_rstat_flush(): Verbatim expose.
      
      While at it, drop assert on cgroup_rstat_mutex in
      cgroup_base_stat_flush() as it crosses layers and make a minor comment
      update.
      
      v2: Added EXPORT_SYMBOL_GPL(cgroup_rstat_updated) to fix a build bug.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      6162cef0
    • T
      cgroup: Reorganize kernel/cgroup/rstat.c · a17556f8
      Tejun Heo 提交于
      Currently, rstat.c has rstat and base stat implementations intermixed.
      Collect base stat implementation at the end of the file.  Also,
      reorder the prototypes.
      
      This patch doesn't make any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      a17556f8
    • T
      cgroup: Distinguish base resource stat implementation from rstat · d4ff749b
      Tejun Heo 提交于
      Base resource stat accounts universial (not specific to any
      controller) resource consumptions on top of rstat.  Currently, its
      implementation is intermixed with rstat implementation making the code
      confusing to follow.
      
      This patch clarifies the distintion by doing the followings.
      
      * Encapsulate base resource stat counters, currently only cputime, in
        struct cgroup_base_stat.
      
      * Move prev_cputime into struct cgroup and initialize it with cgroup.
      
      * Rename the related functions so that they start with cgroup_base_stat.
      
      * Prefix the related variables and field names with b.
      
      This patch doesn't make any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      d4ff749b
    • T
      cgroup: Rename stat to rstat · c58632b3
      Tejun Heo 提交于
      stat is too generic a name and ends up causing subtle confusions.
      It'll be made generic so that controllers can plug into it, which will
      make the problem worse.  Let's rename it to something more specific -
      cgroup_rstat for cgroup recursive stat.
      
      This patch does the following renames.  No other changes.
      
      * cpu_stat	-> rstat_cpu
      * stat		-> rstat
      * ?cstat	-> ?rstatc
      
      Note that the renames are selective.  The unrenamed are the ones which
      implement basic resource statistics on top of rstat.  This will be
      further cleaned up in the following patches.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      c58632b3
    • T
      cgroup: Rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c · a5c2b93f
      Tejun Heo 提交于
      stat is too generic a name and ends up causing subtle confusions.
      It'll be made generic so that controllers can plug into it, which will
      make the problem worse.  Let's rename it to something more specific -
      cgroup_rstat for cgroup recursive stat.
      
      First, rename kernel/cgroup/stat.c to kernel/cgroup/rstat.c.  No
      content changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      a5c2b93f
    • T
      cgroup: Limit event generation frequency · b12e3583
      Tejun Heo 提交于
      ".events" files generate file modified event to notify userland of
      possible new events.  Some of the events can be quite bursty
      (e.g. memory high event) and generating notification each time is
      costly and pointless.
      
      This patch implements a event rate limit mechanism.  If a new
      notification is requested before 10ms has passed since the previous
      notification, the new notification is delayed till then.
      
      As this only delays from the second notification on in a given close
      cluster of notifications, userland reactions to notifications
      shouldn't be delayed at all in most cases while avoiding notification
      storms.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b12e3583
    • T
      cgroup: Explicitly remove core interface files · 5faaf05f
      Tejun Heo 提交于
      The "cgroup." core interface files bypass the usual interface removal
      path and get removed recursively along with the cgroup itself.  While
      this works now, the subtle discrepancy gets in the way of implementing
      common mechanisms.
      
      This patch updates cgroup core interface file handling so that it's
      consistent with controller interface files.  When added, the css is
      marked CSS_VISIBLE and they're explicitly removed before the cgroup is
      destroyed.
      
      This doesn't cause user-visible behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      5faaf05f