1. 21 7月, 2017 3 次提交
    • T
      cgroup: implement cgroup v2 thread support · 8cfd8147
      Tejun Heo 提交于
      This patch implements cgroup v2 thread support.  The goal of the
      thread mode is supporting hierarchical accounting and control at
      thread granularity while staying inside the resource domain model
      which allows coordination across different resource controllers and
      handling of anonymous resource consumptions.
      
      A cgroup is always created as a domain and can be made threaded by
      writing to the "cgroup.type" file.  When a cgroup becomes threaded, it
      becomes a member of a threaded subtree which is anchored at the
      closest ancestor which isn't threaded.
      
      The threads of the processes which are in a threaded subtree can be
      placed anywhere without being restricted by process granularity or
      no-internal-process constraint.  Note that the threads aren't allowed
      to escape to a different threaded subtree.  To be used inside a
      threaded subtree, a controller should explicitly support threaded mode
      and be able to handle internal competition in the way which is
      appropriate for the resource.
      
      The root of a threaded subtree, the nearest ancestor which isn't
      threaded, is called the threaded domain and serves as the resource
      domain for the whole subtree.  This is the last cgroup where domain
      controllers are operational and where all the domain-level resource
      consumptions in the subtree are accounted.  This allows threaded
      controllers to operate at thread granularity when requested while
      staying inside the scope of system-level resource distribution.
      
      As the root cgroup is exempt from the no-internal-process constraint,
      it can serve as both a threaded domain and a parent to normal cgroups,
      so, unlike non-root cgroups, the root cgroup can have both domain and
      threaded children.
      
      Internally, in a threaded subtree, each css_set has its ->dom_cset
      pointing to a matching css_set which belongs to the threaded domain.
      This ensures that thread root level cgroup_subsys_state for all
      threaded controllers are readily accessible for domain-level
      operations.
      
      This patch enables threaded mode for the pids and perf_events
      controllers.  Neither has to worry about domain-level resource
      consumptions and it's enough to simply set the flag.
      
      For more details on the interface and behavior of the thread mode,
      please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
      by this patch.
      
      v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
            Spotted by Waiman.
          - Documentation updated as suggested by Waiman.
          - cgroup.type content slightly reformatted.
          - Mark the debug controller threaded.
      
      v4: - Updated to the general idea of marking specific cgroups
            domain/threaded as suggested by PeterZ.
      
      v3: - Dropped "join" and always make mixed children join the parent's
            threaded subtree.
      
      v2: - After discussions with Waiman, support for mixed thread mode is
            added.  This should address the issue that Peter pointed out
            where any nesting should be avoided for thread subtrees while
            coexisting with other domain cgroups.
          - Enabling / disabling thread mode now piggy backs on the existing
            control mask update mechanism.
          - Bug fixes and cleanup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      8cfd8147
    • T
      cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS · bc2fb7ed
      Tejun Heo 提交于
      css_task_iter currently always walks all tasks.  With the scheduled
      cgroup v2 thread support, the iterator would need to handle multiple
      types of iteration.  As a preparation, add @flags to
      css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
      is not specified, it walks all tasks as before.  When asserted, the
      iterator only walks the group leaders.
      
      For now, the only user of the flag is cgroup v2 "cgroup.procs" file
      which no longer needs to skip non-leader tasks in cgroup_procs_next().
      Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
      v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
      cgroup" but "list all thread group id's with any threads in the
      cgroup".
      
      While at it, update cgroup_procs_show() to use task_pid_vnr() instead
      of task_tgid_vnr().  As the iteration guarantees that the function
      only sees group leaders, this doesn't change the output and will allow
      sharing the function for thread iteration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      bc2fb7ed
    • T
      cgroup: reorganize cgroup.procs / task write path · 715c809d
      Tejun Heo 提交于
      Currently, writes "cgroup.procs" and "cgroup.tasks" files are all
      handled by __cgroup_procs_write() on both v1 and v2.  This patch
      reoragnizes the write path so that there are common helper functions
      that different write paths use.
      
      While this somewhat increases LOC, the different paths are no longer
      intertwined and each path has more flexibility to implement different
      behaviors which will be necessary for the planned v2 thread support.
      
      v3: - Restructured so that cgroup_procs_write_permission() takes
            @src_cgrp and @dst_cgrp.
      
      v2: - Rolled in Waiman's task reference count fix.
          - Updated on top of nsdelegate changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      715c809d
  2. 15 6月, 2017 2 次提交
    • W
      cgroup: Move debug cgroup to its own file · a28f8f5e
      Waiman Long 提交于
      The debug cgroup currently resides within cgroup-v1.c and is enabled
      only for v1 cgroup. To enable the debug cgroup also for v2, it makes
      sense to put the code into its own file as it will no longer be v1
      specific. There is no change to the debug cgroup specific code.
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      a28f8f5e
    • W
      cgroup: Keep accurate count of tasks in each css_set · 73a7242a
      Waiman Long 提交于
      The reference count in the css_set data structure was used as a
      proxy of the number of tasks attached to that css_set. However, that
      count is actually not an accurate measure especially with thread mode
      support. So a new variable nr_tasks is added to the css_set to keep
      track of the actual task count. This new variable is protected by
      the css_set_lock. Functions that require the actual task count are
      updated to use the new variable.
      
      tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps.
          Refreshed on top of cgroup/for-v4.13 which dropped on
          css_set_populated() -> nr_tasks conversion.
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      73a7242a
  3. 29 4月, 2017 1 次提交
  4. 16 4月, 2017 1 次提交
  5. 11 4月, 2017 1 次提交
    • Z
      cgroup: avoid attaching a cgroup root to two different superblocks · bfb0b80d
      Zefan Li 提交于
      Run this:
      
          touch file0
          for ((; ;))
          {
              mount -t cpuset xxx file0
          }
      
      And this concurrently:
      
          touch file1
          for ((; ;))
          {
              mount -t cpuset xxx file1
          }
      
      We'll trigger a warning like this:
      
       ------------[ cut here ]------------
       WARNING: CPU: 1 PID: 4675 at lib/percpu-refcount.c:317 percpu_ref_kill_and_confirm+0x92/0xb0
       percpu_ref_kill_and_confirm called more than once on css_release!
       CPU: 1 PID: 4675 Comm: mount Not tainted 4.11.0-rc5+ #5
       Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
       Call Trace:
        dump_stack+0x63/0x84
        __warn+0xd1/0xf0
        warn_slowpath_fmt+0x5f/0x80
        percpu_ref_kill_and_confirm+0x92/0xb0
        cgroup_kill_sb+0x95/0xb0
        deactivate_locked_super+0x43/0x70
        deactivate_super+0x46/0x60
       ...
       ---[ end trace a79f61c2a2633700 ]---
      
      Here's a race:
      
        Thread A				Thread B
      
        cgroup1_mount()
          # alloc a new cgroup root
          cgroup_setup_root()
      					cgroup1_mount()
      					  # no sb yet, returns NULL
      					  kernfs_pin_sb()
      
      					  # but succeeds in getting the refcnt,
      					  # so re-use cgroup root
      					  percpu_ref_tryget_live()
          # alloc sb with cgroup root
          cgroup_do_mount()
      
        cgroup_kill_sb()
      					  # alloc another sb with same root
      					  cgroup_do_mount()
      
      					cgroup_kill_sb()
      
      We end up using the same cgroup root for two different superblocks,
      so percpu_ref_kill() will be called twice on the same root when the
      two superblocks are destroyed.
      
      We should fix to make sure the superblock pinning is really successful.
      
      Cc: stable@vger.kernel.org # 3.16+
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NZefan Li <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      bfb0b80d
  6. 09 3月, 2017 1 次提交
  7. 07 3月, 2017 1 次提交
  8. 03 3月, 2017 3 次提交
    • I
      sched/headers: Remove <linux/magic.h> from <linux/sched/task_stack.h> · 50ff9d13
      Ingo Molnar 提交于
      It's not used by any of the scheduler methods, but <linux/sched/task_stack.h>
      needs it to pick up STACK_END_MAGIC.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      50ff9d13
    • I
      sched/headers: Move the task_lock()/unlock() APIs to <linux/sched/task.h> · 56cd6973
      Ingo Molnar 提交于
      The task_lock()/task_unlock() APIs are not realated to core scheduling,
      they are task lifetime APIs, i.e. they belong into <linux/sched/task.h>.
      
      Move them.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      56cd6973
    • I
      sched/headers: Move task_struct::signal and task_struct::sighand types and... · c3edc401
      Ingo Molnar 提交于
      sched/headers: Move task_struct::signal and task_struct::sighand types and accessors into <linux/sched/signal.h>
      
      task_struct::signal and task_struct::sighand are pointers, which would normally make it
      straightforward to not define those types in sched.h.
      
      That is not so, because the types are accompanied by a myriad of APIs (macros and inline
      functions) that dereference them.
      
      Split the types and the APIs out of sched.h and move them into a new header, <linux/sched/signal.h>.
      
      With this change sched.h does not know about 'struct signal' and 'struct sighand' anymore,
      trying to put accessors into sched.h as a test fails the following way:
      
        ./include/linux/sched.h: In function ‘test_signal_types’:
        ./include/linux/sched.h:2461:18: error: dereferencing pointer to incomplete type ‘struct signal_struct’
                          ^
      
      This reduces the size and complexity of sched.h significantly.
      
      Update all headers and .c code that relied on getting the signal handling
      functionality from <linux/sched.h> to include <linux/sched/signal.h>.
      
      The list of affected files in the preparatory patch was partly generated by
      grepping for the APIs, and partly by doing coverage build testing, both
      all[yes|mod|def|no]config builds on 64-bit and 32-bit x86, and an array of
      cross-architecture builds.
      
      Nevertheless some (trivial) build breakage is still expected related to rare
      Kconfig combinations and in-flight patches to various kernel code, but most
      of it should be handled by this patch.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      c3edc401
  9. 16 1月, 2017 2 次提交
    • T
      cgroup: call subsys->*attach() only for subsystems which are actually affected by migration · bfc2cf6f
      Tejun Heo 提交于
      Currently, subsys->*attach() callbacks are called for all subsystems
      which are attached to the hierarchy on which the migration is taking
      place.
      
      With cgroup_migrate_prepare_dst() filtering out identity migrations,
      v1 hierarchies can avoid spurious ->*attach() callback invocations
      where the source and destination csses are identical; however, this
      isn't enough on v2 as only a subset of the attached controllers can be
      affected on controller enable/disable.
      
      While spurious ->*attach() invocations aren't critically broken,
      they're unnecessary overhead and can lead to temporary overcharges on
      certain controllers.  Fix it by tracking which subsystems are affected
      by a migration and invoking ->*attach() callbacks only on those
      subsystems.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      bfc2cf6f
    • T
      cgroup: track migration context in cgroup_mgctx · e595cd70
      Tejun Heo 提交于
      cgroup migration is performed in four steps - css_set preloading,
      addition of target tasks, actual migration, and clean up.  A list
      named preloaded_csets is used to track the preloading.  This is a bit
      too restricted and the code is already depending on the subtlety that
      all source css_sets appear before destination ones.
      
      Let's create struct cgroup_mgctx which keeps track of everything
      during migration.  Currently, it has separate preload lists for source
      and destination csets and also embeds cgroup_taskset which is used
      during the actual migration.  This moves struct cgroup_taskset
      definition to cgroup-internal.h.
      
      This patch doesn't cause any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      e595cd70
  10. 28 12月, 2016 5 次提交