1. 02 1月, 2018 1 次提交
  2. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  3. 27 10月, 2017 1 次提交
    • T
      cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat · d41bf8c9
      Tejun Heo 提交于
      The basic cpu stat is currently shown with "cpu." prefix in
      cgroup.stat, and the same information is duplicated in cpu.stat when
      cpu controller is enabled.  This is ugly and not very scalable as we
      want to expand the coverage of stat information which is always
      available.
      
      This patch makes cgroup core always create "cpu.stat" file and show
      the basic cpu stat there and calls the cpu controller to show the
      extra stats when enabled.  This ensures that the same information
      isn't presented in multiple places and makes future expansion of basic
      stats easier.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      d41bf8c9
  4. 25 9月, 2017 1 次提交
    • T
      cgroup: Implement cgroup2 basic CPU usage accounting · 041cd640
      Tejun Heo 提交于
      In cgroup1, while cpuacct isn't actually controlling any resources, it
      is a separate controller due to combination of two factors -
      1. enabling cpu controller has significant side effects, and 2. we
      have to pick one of the hierarchies to account CPU usages on.  cpuacct
      controller is effectively used to designate a hierarchy to track CPU
      usages on.
      
      cgroup2's unified hierarchy removes the second reason and we can
      account basic CPU usages by default.  While we can use cpuacct for
      this purpose, both its interface and implementation leave a lot to be
      desired - it collects and exposes two sources of truth which don't
      agree with each other and some of the exposed statistics don't make
      much sense.  Also, it propagates all the way up the hierarchy on each
      accounting event which is unnecessary.
      
      This patch adds basic resource accounting mechanism to cgroup2's
      unified hierarchy and accounts CPU usages using it.
      
      * All accountings are done per-cpu and don't propagate immediately.
        It just bumps the per-cgroup per-cpu counters and links to the
        parent's updated list if not already on it.
      
      * On a read, the per-cpu counters are collected into the global ones
        and then propagated upwards.  Only the per-cpu counters which have
        changed since the last read are propagated.
      
      * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
        prefix.  Total usage is collected from scheduling events.  User/sys
        breakdown is sourced from tick sampling and adjusted to the usage
        using cputime_adjust().
      
      This keeps the accounting side hot path O(1) and per-cpu and the read
      side O(nr_updated_since_last_read).
      
      v2: Minor changes and documentation updates as suggested by Waiman and
          Roman.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      041cd640
  5. 18 8月, 2017 1 次提交
  6. 03 8月, 2017 2 次提交
    • R
      cgroup: implement hierarchy limits · 1a926e0b
      Roman Gushchin 提交于
      Creating cgroup hierearchies of unreasonable size can affect
      overall system performance. A user might want to limit the
      size of cgroup hierarchy. This is especially important if a user
      is delegating some cgroup sub-tree.
      
      To address this issue, introduce an ability to control
      the size of cgroup hierarchy.
      
      The cgroup.max.descendants control file allows to set the maximum
      allowed number of descendant cgroups.
      The cgroup.max.depth file controls the maximum depth of the cgroup
      tree. Both are single value r/w files, with "max" default value.
      
      The control files exist on each hierarchy level (including root).
      When a new cgroup is created, we check the total descendants
      and depth limits on each level, and if none of them are exceeded,
      a new cgroup is created.
      
      Only alive cgroups are counted, removed (dying) cgroups are
      ignored.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      1a926e0b
    • R
      cgroup: keep track of number of descent cgroups · 0679dee0
      Roman Gushchin 提交于
      Keep track of the number of online and dying descent cgroups.
      
      This data will be used later to add an ability to control cgroup
      hierarchy (limit the depth and the number of descent cgroups)
      and display hierarchy stats.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan@huawei.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: kernel-team@fb.com
      Cc: cgroups@vger.kernel.org
      Cc: linux-doc@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      0679dee0
  7. 21 7月, 2017 2 次提交
    • T
      cgroup: implement cgroup v2 thread support · 8cfd8147
      Tejun Heo 提交于
      This patch implements cgroup v2 thread support.  The goal of the
      thread mode is supporting hierarchical accounting and control at
      thread granularity while staying inside the resource domain model
      which allows coordination across different resource controllers and
      handling of anonymous resource consumptions.
      
      A cgroup is always created as a domain and can be made threaded by
      writing to the "cgroup.type" file.  When a cgroup becomes threaded, it
      becomes a member of a threaded subtree which is anchored at the
      closest ancestor which isn't threaded.
      
      The threads of the processes which are in a threaded subtree can be
      placed anywhere without being restricted by process granularity or
      no-internal-process constraint.  Note that the threads aren't allowed
      to escape to a different threaded subtree.  To be used inside a
      threaded subtree, a controller should explicitly support threaded mode
      and be able to handle internal competition in the way which is
      appropriate for the resource.
      
      The root of a threaded subtree, the nearest ancestor which isn't
      threaded, is called the threaded domain and serves as the resource
      domain for the whole subtree.  This is the last cgroup where domain
      controllers are operational and where all the domain-level resource
      consumptions in the subtree are accounted.  This allows threaded
      controllers to operate at thread granularity when requested while
      staying inside the scope of system-level resource distribution.
      
      As the root cgroup is exempt from the no-internal-process constraint,
      it can serve as both a threaded domain and a parent to normal cgroups,
      so, unlike non-root cgroups, the root cgroup can have both domain and
      threaded children.
      
      Internally, in a threaded subtree, each css_set has its ->dom_cset
      pointing to a matching css_set which belongs to the threaded domain.
      This ensures that thread root level cgroup_subsys_state for all
      threaded controllers are readily accessible for domain-level
      operations.
      
      This patch enables threaded mode for the pids and perf_events
      controllers.  Neither has to worry about domain-level resource
      consumptions and it's enough to simply set the flag.
      
      For more details on the interface and behavior of the thread mode,
      please refer to the section 2-2-2 in Documentation/cgroup-v2.txt added
      by this patch.
      
      v5: - Dropped silly no-op ->dom_cgrp init from cgroup_create().
            Spotted by Waiman.
          - Documentation updated as suggested by Waiman.
          - cgroup.type content slightly reformatted.
          - Mark the debug controller threaded.
      
      v4: - Updated to the general idea of marking specific cgroups
            domain/threaded as suggested by PeterZ.
      
      v3: - Dropped "join" and always make mixed children join the parent's
            threaded subtree.
      
      v2: - After discussions with Waiman, support for mixed thread mode is
            added.  This should address the issue that Peter pointed out
            where any nesting should be avoided for thread subtrees while
            coexisting with other domain cgroups.
          - Enabling / disabling thread mode now piggy backs on the existing
            control mask update mechanism.
          - Bug fixes and cleanup.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      8cfd8147
    • T
      cgroup: introduce cgroup->dom_cgrp and threaded css_set handling · 454000ad
      Tejun Heo 提交于
      cgroup v2 is in the process of growing thread granularity support.  A
      threaded subtree is composed of a thread root and threaded cgroups
      which are proper members of the subtree.
      
      The root cgroup of the subtree serves as the domain cgroup to which
      the processes (as opposed to threads / tasks) of the subtree
      conceptually belong and domain-level resource consumptions not tied to
      any specific task are charged.  Inside the subtree, threads won't be
      subject to process granularity or no-internal-task constraint and can
      be distributed arbitrarily across the subtree.
      
      This patch introduces cgroup->dom_cgrp along with threaded css_set
      handling.
      
      * cgroup->dom_cgrp points to self for normal and thread roots.  For
        proper thread subtree members, points to the dom_cgrp (the thread
        root).
      
      * css_set->dom_cset points to self if for normal and thread roots.  If
        threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
        The dom_cgrp serves as the resource domain and keeps the matching
        csses available.  The dom_cset holds those csses and makes them
        easily accessible.
      
      * All threaded csets are linked on their dom_csets to enable iteration
        of all threaded tasks.
      
      * cgroup->nr_threaded_children keeps track of the number of threaded
        children.
      
      This patch adds the above but doesn't actually use them yet.  The
      following patches will build on top.
      
      v4: ->nr_threaded_children added.
      
      v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset.  Updated for the new
          enable-threaded-per-cgroup behavior.
      
      v2: Added cgroup_is_threaded() helper.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      454000ad
  8. 17 7月, 2017 1 次提交
    • T
      cgroup: distinguish local and children populated states · 788b950c
      Tejun Heo 提交于
      cgrp->populated_cnt counts both local (the cgroup's populated
      css_sets) and subtree proper (populated children) so that it's only
      zero when the whole subtree, including self, is empty.
      
      This patch splits the counter into two so that local and children
      populated states are tracked separately.  It allows finer-grained
      tests on the state of the hierarchy which will be used to replace
      css_set walking local populated test.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      788b950c
  9. 29 6月, 2017 1 次提交
    • T
      cgroup: implement "nsdelegate" mount option · 5136f636
      Tejun Heo 提交于
      Currently, cgroup only supports delegation to !root users and cgroup
      namespaces don't get any special treatments.  This limits the
      usefulness of cgroup namespaces as they by themselves can't be safe
      delegation boundaries.  A process inside a cgroup can change the
      resource control knobs of the parent in the namespace root and may
      move processes in and out of the namespace if cgroups outside its
      namespace are visible somehow.
      
      This patch adds a new mount option "nsdelegate" which makes cgroup
      namespaces delegation boundaries.  If set, cgroup behaves as if write
      permission based delegation took place at namespace boundaries -
      writes to the resource control knobs from the namespace root are
      denied and migration crossing the namespace boundary aren't allowed
      from inside the namespace.
      
      This allows cgroup namespace to function as a delegation boundary by
      itself.
      
      v2: Silently ignore nsdelegate specified on !init mounts.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Aravind Anbudurai <aru7@fb.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      5136f636
  10. 15 6月, 2017 1 次提交
    • W
      cgroup: Keep accurate count of tasks in each css_set · 73a7242a
      Waiman Long 提交于
      The reference count in the css_set data structure was used as a
      proxy of the number of tasks attached to that css_set. However, that
      count is actually not an accurate measure especially with thread mode
      support. So a new variable nr_tasks is added to the css_set to keep
      track of the actual task count. This new variable is protected by
      the css_set_lock. Functions that require the actual task count are
      updated to use the new variable.
      
      tj: s/task_count/nr_tasks/ for consistency with cgroup_root->nr_cgrps.
          Refreshed on top of cgroup/for-v4.13 which dropped on
          css_set_populated() -> nr_tasks conversion.
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      73a7242a
  11. 18 5月, 2017 1 次提交
  12. 11 4月, 2017 1 次提交
    • T
      cgroup: move cgroup_subsys_state parent field for cache locality · b8b1a2e5
      Todd Poynor 提交于
      Various structures embed a struct cgroup_subsys_state, typically at
      the top of the containing structure.  It is common for code that
      accesses the structures to perform operations that iterate over the
      chain of parent css pointers, also accessing data in each containing
      structure.  In particular, struct cpuacct is used by fairly hot code
      paths in the scheduler such as cpuacct_charge().
      
      Move the parent css pointer field to the end of the structure to
      increase the chances of residing in the same cache line as the data
      from the containing structure.
      Signed-off-by: NTodd Poynor <toddpoynor@google.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b8b1a2e5
  13. 09 3月, 2017 1 次提交
  14. 02 3月, 2017 1 次提交
    • I
      sched/headers, cgroups: Remove the threadgroup_change_*() wrappery · 780de9dd
      Ingo Molnar 提交于
      threadgroup_change_begin()/end() is a pointless wrapper around
      cgroup_threadgroup_change_begin()/end(), minus a might_sleep()
      in the !CONFIG_CGROUPS=y case.
      
      Remove the wrappery, move the might_sleep() (the down_read()
      already has a might_sleep() check).
      
      This debloats <linux/sched.h> a bit and simplifies this API.
      
      Update all call sites.
      
      No change in functionality.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      780de9dd
  15. 28 12月, 2016 2 次提交
  16. 26 11月, 2016 1 次提交
    • D
      cgroup: add support for eBPF programs · 30070984
      Daniel Mack 提交于
      This patch adds two sets of eBPF program pointers to struct cgroup.
      One for such that are directly pinned to a cgroup, and one for such
      that are effective for it.
      
      To illustrate the logic behind that, assume the following example
      cgroup hierarchy.
      
        A - B - C
              \ D - E
      
      If only B has a program attached, it will be effective for B, C, D
      and E. If D then attaches a program itself, that will be effective for
      both D and E, and the program in B will only affect B and C. Only one
      program of a given type is effective for a cgroup.
      
      Attaching and detaching programs will be done through the bpf(2)
      syscall. For now, ingress and egress inet socket filtering are the
      only supported use-cases.
      Signed-off-by: NDaniel Mack <daniel@zonque.org>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30070984
  17. 26 4月, 2016 1 次提交
    • T
      cgroup, cpuset: replace cpuset_post_attach_flush() with cgroup_subsys->post_attach callback · 5cf1cacb
      Tejun Heo 提交于
      Since e93ad19d ("cpuset: make mm migration asynchronous"), cpuset
      kicks off asynchronous NUMA node migration if necessary during task
      migration and flushes it from cpuset_post_attach_flush() which is
      called at the end of __cgroup_procs_write().  This is to avoid
      performing migration with cgroup_threadgroup_rwsem write-locked which
      can lead to deadlock through dependency on kworker creation.
      
      memcg has a similar issue with charge moving, so let's convert it to
      an official callback rather than the current one-off cpuset specific
      function.  This patch adds cgroup_subsys->post_attach callback and
      makes cpuset register cpuset_post_attach_flush() as its ->post_attach.
      
      The conversion is mostly one-to-one except that the new callback is
      called under cgroup_mutex.  This is to guarantee that no other
      migration operations are started before ->post_attach callbacks are
      finished.  cgroup_mutex is one of the outermost mutex in the system
      and has never been and shouldn't be a problem.  We can add specialized
      synchronization around __cgroup_procs_write() but I don't think
      there's any noticeable benefit.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org> # 4.4+ prerequisite for the next patch
      5cf1cacb
  18. 17 3月, 2016 1 次提交
    • T
      cgroup: ignore css_sets associated with dead cgroups during migration · 2b021cbf
      Tejun Heo 提交于
      Before 2e91fa7f ("cgroup: keep zombies associated with their
      original cgroups"), all dead tasks were associated with init_css_set.
      If a zombie task is requested for migration, while migration prep
      operations would still be performed on init_css_set, the actual
      migration would ignore zombie tasks.  As init_css_set is always valid,
      this worked fine.
      
      However, after 2e91fa7f, zombie tasks stay with the css_set it was
      associated with at the time of death.  Let's say a task T associated
      with cgroup A on hierarchy H-1 and cgroup B on hiearchy H-2.  After T
      becomes a zombie, it would still remain associated with A and B.  If A
      only contains zombie tasks, it can be removed.  On removal, A gets
      marked offline but stays pinned until all zombies are drained.  At
      this point, if migration is initiated on T to a cgroup C on hierarchy
      H-2, migration path would try to prepare T's css_set for migration and
      trigger the following.
      
       WARNING: CPU: 0 PID: 1576 at kernel/cgroup.c:474 cgroup_get+0x121/0x160()
       CPU: 0 PID: 1576 Comm: bash Not tainted 4.4.0-work+ #289
       ...
       Call Trace:
        [<ffffffff8127e63c>] dump_stack+0x4e/0x82
        [<ffffffff810445e8>] warn_slowpath_common+0x78/0xb0
        [<ffffffff810446d5>] warn_slowpath_null+0x15/0x20
        [<ffffffff810c33e1>] cgroup_get+0x121/0x160
        [<ffffffff810c349b>] link_css_set+0x7b/0x90
        [<ffffffff810c4fbc>] find_css_set+0x3bc/0x5e0
        [<ffffffff810c5269>] cgroup_migrate_prepare_dst+0x89/0x1f0
        [<ffffffff810c7547>] cgroup_attach_task+0x157/0x230
        [<ffffffff810c7a17>] __cgroup_procs_write+0x2b7/0x470
        [<ffffffff810c7bdc>] cgroup_tasks_write+0xc/0x10
        [<ffffffff810c4790>] cgroup_file_write+0x30/0x1b0
        [<ffffffff811c68fc>] kernfs_fop_write+0x13c/0x180
        [<ffffffff81151673>] __vfs_write+0x23/0xe0
        [<ffffffff81152494>] vfs_write+0xa4/0x1a0
        [<ffffffff811532d4>] SyS_write+0x44/0xa0
        [<ffffffff814af2d7>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      It doesn't make sense to prepare migration for css_sets pointing to
      dead cgroups as they are guaranteed to contain only zombies which are
      ignored later during migration.  This patch makes cgroup destruction
      path mark all affected css_sets as dead and updates the migration path
      to ignore them during preparation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Fixes: 2e91fa7f ("cgroup: keep zombies associated with their original cgroups")
      Cc: stable@vger.kernel.org # v4.4+
      2b021cbf
  19. 09 3月, 2016 2 次提交
    • T
      cgroup: implement cgroup_subsys->implicit_on_dfl · f6d635ad
      Tejun Heo 提交于
      Some controllers, perf_event for now and possibly freezer in the
      future, don't really make sense to control explicitly through
      "cgroup.subtree_control".  For example, the primary role of perf_event
      is identifying the cgroups of tasks; however, because the controller
      also keeps a small amount of state per cgroup, it can't be replaced
      with simple cgroup membership tests.
      
      This patch implements cgroup_subsys->implicit_on_dfl flag.  When set,
      the controller is implicitly enabled on all cgroups on the v2
      hierarchy so that utility type controllers such as perf_event can be
      enabled and function transparently.
      
      An implicit controller doesn't show up in "cgroup.controllers" or
      "cgroup.subtree_control", is exempt from no internal process rule and
      can be stolen from the default hierarchy even if there are non-root
      csses.
      
      v2: Reimplemented on top of the recent updates to css handling and
          subsystem rebinding.  Rebinding implicit subsystems is now a
          simple matter of exempting it from the busy subsystem check.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      f6d635ad
    • T
      cgroup: use css_set->mg_dst_cgrp for the migration target cgroup · e4857982
      Tejun Heo 提交于
      Migration can be multi-target on the default hierarchy when a
      controller is enabled - processes belonging to each child cgroup have
      to be moved to the child cgroup itself to refresh css association.
      
      This isn't a problem for cgroup_migrate_add_src() as each source
      css_set still maps to single source and target cgroups; however,
      cgroup_migrate_prepare_dst() is called once after all source css_sets
      are added and thus might not have a single destination cgroup.  This
      is currently worked around by specifying NULL for @dst_cgrp and using
      the source's default cgroup as destination as the only multi-target
      migration in use is self-targetting.  While this works, it's subtle
      and clunky.
      
      As all taget cgroups are already specified while preparing the source
      css_sets, this clunkiness can easily be removed by recording the
      target cgroup in each source css_set.  This patch adds
      css_set->mg_dst_cgrp which is recorded on cgroup_migrate_src() and
      used by cgroup_migrate_prepare_dst().  This also makes migration code
      ready for arbitrary multi-target migration.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      e4857982
  20. 03 3月, 2016 2 次提交
    • T
      cgroup: introduce cgroup_{save|propagate|restore}_control() · 15a27c36
      Tejun Heo 提交于
      While controllers are being enabled and disabled in
      cgroup_subtree_control_write(), the original subsystem masks are
      stashed in local variables so that they can be restored if the
      operation fails in the middle.
      
      This patch adds dedicated fields to struct cgroup to be used instead
      of the local variables and implements functions to stash the current
      values, propagate the changes and restore them recursively.  Combined
      with the previous changes, this makes subsystem management operations
      fully recursive and modularlized.  This will be used to expand cgroup
      core functionalities.
      
      While at it, remove now unused @css_enable and @css_disable from
      cgroup_subtree_control_write().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      15a27c36
    • T
      cgroup: explicitly track whether a cgroup_subsys_state is visible to userland · 88cb04b9
      Tejun Heo 提交于
      Currently, whether a css (cgroup_subsys_state) has its interface files
      created is not tracked and assumed to change together with the owning
      cgroup's lifecycle.  cgroup directory and interface creation is being
      separated out from internal object creation to help refactoring and
      eventually allow cgroups which are not visible through cgroupfs.
      
      This patch adds CSS_VISIBLE to track whether a css has its interface
      files created and perform management operations only when necessary
      which helps decoupling interface file handling from internal object
      lifecycle.  After this patch, all css interface file management
      functions can be called regardless of the current state and will
      achieve the expected result.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      88cb04b9
  21. 23 2月, 2016 4 次提交
  22. 22 1月, 2016 1 次提交
  23. 10 12月, 2015 1 次提交
    • T
      cgroup: fix sock_cgroup_data initialization on earlier compilers · ad2c8c73
      Tejun Heo 提交于
      sock_cgroup_data is a struct containing an anonymous union.
      sock_cgroup_set_prioidx() and sock_cgroup_set_classid() were
      initializing a field inside the anonymous union as follows.
      
       struct sock_ccgroup_data skcd_buf = { .val = VAL };
      
      While this is fine on more recent compilers, gcc-4.4.7 triggers the
      following errors.
      
       include/linux/cgroup-defs.h: In function ‘sock_cgroup_set_prioidx’:
       include/linux/cgroup-defs.h:619: error: unknown field ‘val’ specified in initializer
       include/linux/cgroup-defs.h:619: warning: missing braces around initializer
       include/linux/cgroup-defs.h:619: warning: (near initialization for ‘skcd_buf.<anonymous>’)
      
      This is because .val belongs to the anonymous union nested inside the
      struct but the initializer is missing the nesting.  Fix it by adding
      an extra pair of braces.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NAlaa Hleihel <alaa@dev.mellanox.co.il>
      Fixes: bd1060a1 ("sock, cgroup: add sock->sk_cgroup")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ad2c8c73
  24. 09 12月, 2015 2 次提交
    • T
      sock, cgroup: add sock->sk_cgroup · bd1060a1
      Tejun Heo 提交于
      In cgroup v1, dealing with cgroup membership was difficult because the
      number of membership associations was unbound.  As a result, cgroup v1
      grew several controllers whose primary purpose is either tagging
      membership or pull in configuration knobs from other subsystems so
      that cgroup membership test can be avoided.
      
      net_cls and net_prio controllers are examples of the latter.  They
      allow configuring network-specific attributes from cgroup side so that
      network subsystem can avoid testing cgroup membership; unfortunately,
      these are not only cumbersome but also problematic.
      
      Both net_cls and net_prio aren't properly hierarchical.  Both inherit
      configuration from the parent on creation but there's no interaction
      afterwards.  An ancestor doesn't restrict the behavior in its subtree
      in anyway and configuration changes aren't propagated downwards.
      Especially when combined with cgroup delegation, this is problematic
      because delegatees can mess up whatever network configuration
      implemented at the system level.  net_prio would allow the delegatees
      to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
      the same for classid.
      
      While it is possible to solve these issues from controller side by
      implementing hierarchical allowable ranges in both controllers, it
      would involve quite a bit of complexity in the controllers and further
      obfuscate network configuration as it becomes even more difficult to
      tell what's actually being configured looking from the network side.
      While not much can be done for v1 at this point, as membership
      handling is sane on cgroup v2, it'd be better to make cgroup matching
      behave like other network matches and classifiers than introducing
      further complications.
      
      In preparation, this patch updates sock->sk_cgrp_data handling so that
      it points to the v2 cgroup that sock was created in until either
      net_prio or net_cls is used.  Once either of the two is used,
      sock->sk_cgrp_data reverts to its previous role of carrying prioidx
      and classid.  This is to avoid adding yet another cgroup related field
      to struct sock.
      
      As the mode switching can happen at most once per boot, the switching
      mechanism is aimed at lowering hot path overhead.  It may leak a
      finite, likely small, number of cgroup refs and report spurious
      prioidx or classid on switching; however, dynamic updates of prioidx
      and classid have always been racy and lossy - socks between creation
      and fd installation are never updated, config changes don't update
      existing sockets at all, and prioidx may index with dead and recycled
      cgroup IDs.  Non-critical inaccuracies from small race windows won't
      make any noticeable difference.
      
      This patch doesn't make use of the pointer yet.  The following patch
      will implement netfilter match for cgroup2 membership.
      
      v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
          cgroup specific field.
      
      v3: Add comments explaining why sock_data_prioidx() and
          sock_data_classid() use different fallback values.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
      CC: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd1060a1
    • T
      net: wrap sock->sk_cgrp_prioidx and ->sk_classid inside a struct · 2a56a1fe
      Tejun Heo 提交于
      Introduce sock->sk_cgrp_data which is a struct sock_cgroup_data.
      ->sk_cgroup_prioidx and ->sk_classid are moved into it.  The struct
      and its accessors are defined in cgroup-defs.h.  This is to prepare
      for overloading the fields with a cgroup pointer.
      
      This patch mostly performs equivalent conversions but the followings
      are noteworthy.
      
      * Equality test before updating classid is removed from
        sock_update_classid().  This shouldn't make any noticeable
        difference and a similar test will be implemented on the helper side
        later.
      
      * sock_update_netprioidx() now takes struct sock_cgroup_data and can
        be moved to netprio_cgroup.h without causing include dependency
        loop.  Moved.
      
      * The dummy version of sock_update_netprioidx() converted to a static
        inline function while at it.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2a56a1fe
  25. 03 12月, 2015 2 次提交
    • O
      cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends · b53202e6
      Oleg Nesterov 提交于
      Now that nobody use the "priv" arg passed to can_fork/cancel_fork/fork we can
      kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc and cgrp_ss_priv[] in copy_process().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b53202e6
    • T
      cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Tejun Heo 提交于
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing @css parameter from the three
      migration methods, ->can_attach, ->cancel_attach() and ->attach() and
      updating cgroup_taskset iteration helpers also return the destination
      css in addition to the task being migrated.  All controllers are
      updated accordingly.
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accomodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      1f7dd3e5
  26. 21 11月, 2015 1 次提交
    • T
      cgroup: record ancestor IDs and reimplement cgroup_is_descendant() using it · b11cfb58
      Tejun Heo 提交于
      cgroup_is_descendant() currently walks up the hierarchy and compares
      each ancestor to the cgroup in question.  While enough for cgroup core
      usages, this can't be used in hot paths to test cgroup membership.
      This patch adds cgroup->ancestor_ids[] which records the IDs of all
      ancestors including self and cgroup->level for the nesting level.
      
      This allows testing whether a given cgroup is a descendant of another
      in three finite steps - testing whether the two belong to the same
      hierarchy, whether the descendant candidate is at the same or a higher
      level than the ancestor and comparing the recorded ancestor_id at the
      matching level.  cgroup_is_descendant() is accordingly reimplmented
      and made inline.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b11cfb58
  27. 17 11月, 2015 1 次提交
    • T
      cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type · 67e9c74b
      Tejun Heo 提交于
      With major controllers - cpu, memory and io - shaping up for the
      unified hierarchy, cgroup2 is about ready to be, gradually, released
      into the wild.  Replace __DEVEL__sane_behavior flag which was used to
      select the unified hierarchy with a separate filesystem type "cgroup2"
      so that unified hierarchy can be mounted as follows.
      
        mount -t cgroup2 none $MOUNT_POINT
      
      The cgroup2 fs has its own magic number - 0x63677270 ("cgrp").
      
      v2: Assign a different magic number to cgroup2 fs.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      67e9c74b
  28. 16 11月, 2015 1 次提交
    • T
      cgroup: fix cftype->file_offset handling · 34c06254
      Tejun Heo 提交于
      6f60eade ("cgroup: generalize obtaining the handles of and
      notifying cgroup files") introduced cftype->file_offset so that the
      handles for per-css file instances can be recorded.  These handles
      then can be used, for example, to generate file modified
      notifications.
      
      Unfortunately, it made the wrong assumption that files are created
      once for a given css and removed on its destruction.  Due to the
      dependencies among subsystems, a css may be hidden from userland and
      then later shown again.  This is implemented by removing and
      re-creating the affected files, so the associated kernfs_node for a
      given cgroup file may change over time.  This incorrect assumption led
      to the corruption of css->files lists.
      
      Reimplement cftype->file_offset handling so that cgroup_file->kn is
      protected by a lock and updated as files are created and destroyed.
      This also makes keeping them on per-cgroup list unnecessary.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJames Sedgwick <jsedgwick@fb.com>
      Fixes: 6f60eade ("cgroup: generalize obtaining the handles of and notifying cgroup files")
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      34c06254
  29. 16 10月, 2015 2 次提交
    • T
      cgroup: add cgroup_subsys->free() method and use it to fix pids controller · afcf6c8b
      Tejun Heo 提交于
      pids controller is completely broken in that it uncharges when a task
      exits allowing zombies to escape resource control.  With the recent
      updates, cgroup core now maintains cgroup association till task free
      and pids controller can be fixed by uncharging on free instead of
      exit.
      
      This patch adds cgroup_subsys->free() method and update pids
      controller to use it instead of ->exit() for uncharging.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      afcf6c8b
    • T
      cgroup: keep zombies associated with their original cgroups · 2e91fa7f
      Tejun Heo 提交于
      cgroup_exit() is called when a task exits and disassociates the
      exiting task from its cgroups and half-attach it to the root cgroup.
      This is unnecessary and undesirable.
      
      No controller actually needs an exiting task to be disassociated with
      non-root cgroups.  Both cpu and perf_event controllers update the
      association to the root cgroup from their exit callbacks just to keep
      consistent with the cgroup core behavior.
      
      Also, this disassociation makes it difficult to track resources held
      by zombies or determine where the zombies came from.  Currently, pids
      controller is completely broken as it uncharges on exit and zombies
      always escape the resource restriction.  With cgroup association being
      reset on exit, fixing it is pretty painful.
      
      There's no reason to reset cgroup membership on exit.  The zombie can
      be removed from its css_set so that it doesn't show up on
      "cgroup.procs" and thus can't be migrated or interfere with cgroup
      removal.  It can still pin and point to the css_set so that its cgroup
      membership is maintained.  This patch makes cgroup core keep zombies
      associated with their cgroups at the time of exit.
      
      * Previous patches decoupled populated_cnt tracking from css_set
        lifetime, so a dying task can be simply unlinked from its css_set
        while pinning and pointing to the css_set.  This keeps css_set
        association from task side alive while hiding it from "cgroup.procs"
        and populated_cnt tracking.  The css_set reference is dropped when
        the task_struct is freed.
      
      * ->exit() callback no longer needs the css arguments as the
        associated css never changes once PF_EXITING is set.  Removed.
      
      * cpu and perf_events controllers no longer need ->exit() callbacks.
        There's no reason to explicitly switch away on exit.  The final
        schedule out is enough.  The callbacks are removed.
      
      * On traditional hierarchies, nothing changes.  "/proc/PID/cgroup"
        still reports "/" for all zombies.  On the default hierarchy,
        "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
        to at the time of exit.  If the cgroup gets removed before the task
        is reaped, " (deleted)" is appended.
      
      v2: Build brekage due to missing dummy cgroup_free() when
          !CONFIG_CGROUP fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      2e91fa7f