1. 03 3月, 2016 18 次提交
    • T
      cgroup: allocate 2x cgrp_cset_links when setting up a new root · 04313591
      Tejun Heo 提交于
      During prep, cgroup_setup_root() allocates cgrp_cset_links matching
      the number of existing css_sets to later link the new root.  This is
      fine for now as the only operation which can happen inbetween is
      rebind_subsystems() and rebinding of empty subsystems doesn't create
      new css_sets.
      
      However, while not yet allowed, with the recent reimplementation,
      rebind_subsystems() can rebind subsystems with descendant csses and
      thus can create new css_sets.  This patch makes cgroup_setup_root()
      allocate 2x of the existing css_sets so that later use of live
      subsystem rebinding doesn't blow up.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      04313591
    • T
      cgroup: make cgroup_calc_subtree_ss_mask() take @this_ss_mask · 5ced2518
      Tejun Heo 提交于
      cgroup_calc_subtree_ss_mask() currently takes @cgrp and
      @subtree_control.  @cgrp is used for two purposes - to decide whether
      it's for default hierarchy and the mask of available subsystems.  The
      former doesn't matter as the results are the same regardless.  The
      latter can be specified directly through a subsystem mask.
      
      This patch makes cgroup_calc_subtree_ss_mask() perform the same
      calculations for both default and legacy hierarchies and take
      @this_ss_mask for available subsystems.  @cgrp is no longer used and
      dropped.  This is to allow using the function in contexts where
      available controllers can't be decided from the cgroup.
      
      v2: cgroup_refres_subtree_ss_mask() is removed by a previous patch.
          Updated accordingly.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      5ced2518
    • T
      cgroup: reimplement rebind_subsystems() using cgroup_apply_control() and friends · 334c3679
      Tejun Heo 提交于
      rebind_subsystem() open codes quite a bit of css and interface file
      manipulations.  It tries to be fail-safe but doesn't quite achieve it.
      It can be greatly simplified by using the new css management helpers.
      This patch reimplements rebind_subsytsems() using
      cgroup_apply_control() and friends.
      
      * The half-baked rollback on file creation failure is dropped.  It is
        an extremely cold path, failure isn't critical, and, aside from
        kernel bugs, the only reason it can fail is memory allocation
        failure which pretty much doesn't happen for small allocations.
      
      * As cgroup_apply_control_disable() is now used to clean up root
        cgroup on rebind, make sure that it doesn't end up killing root
        csses.
      
      * All callers of rebind_subsystems() are updated to use
        cgroup_lock_and_drain_offline() as the apply_control functions
        require drained subtree.
      
      * This leaves cgroup_refresh_subtree_ss_mask() without any user.
        Removed.
      
      * css_populate_dir() and css_clear_dir() no longer needs
        @cgrp_override parameter.  Dropped.
      
      * While at it, add WARN_ON() to rebind_subsystem() calls which are
        expected to always succeed just in case.
      
      While the rules visible to userland aren't changed, this
      reimplementation not only simplifies rebind_subsystems() but also
      allows it to disable and enable csses recursively.  This can be used
      to implement more flexible rebinding.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      334c3679
    • T
      cgroup: use cgroup_apply_enable_control() in cgroup creation path · 03970d3c
      Tejun Heo 提交于
      cgroup_create() manually updates control masks and creates child csses
      which cgroup_mkdir() then manually populates.  Both can be simplified
      by using cgroup_apply_enable_control() and friends.  The only catch is
      that it calls css_populate_dir() with NULL cgroup->kn during
      cgroup_create().  This is worked around by making the function noop on
      NULL kn.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      03970d3c
    • T
      cgroup: combine cgroup_mutex locking and offline css draining · 945ba199
      Tejun Heo 提交于
      cgroup_drain_offline() is used to wait for csses being offlined to
      uninstall itself from cgroup->subsys[] array so that new csses can be
      installed.  The function's only user, cgroup_subtree_control_write(),
      calls it after performing some checks and restarts the whole process
      via restart_syscall() if draining has to release cgroup_mutex to wait.
      
      This can be simplified by draining before other synchronized
      operations so that there's nothing to restart.  This patch converts
      cgroup_drain_offline() to cgroup_lock_and_drain_offline() which
      performs both locking and draining and updates cgroup_kn_lock_live()
      use it instead of cgroup_mutex() if requested.  This combined locking
      and draining operations are easier to use and less error-prone.
      
      While at it, add WARNs in control_apply functions which triggers if
      the subtree isn't properly drained.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      945ba199
    • T
      cgroup: factor out cgroup_{apply|finalize}_control() from cgroup_subtree_control_write() · f7b2814b
      Tejun Heo 提交于
      Factor out cgroup_{apply|finalize}_control() so that control mask
      update can be done in several simple steps.  This patch doesn't
      introduce behavior changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      f7b2814b
    • T
      cgroup: introduce cgroup_{save|propagate|restore}_control() · 15a27c36
      Tejun Heo 提交于
      While controllers are being enabled and disabled in
      cgroup_subtree_control_write(), the original subsystem masks are
      stashed in local variables so that they can be restored if the
      operation fails in the middle.
      
      This patch adds dedicated fields to struct cgroup to be used instead
      of the local variables and implements functions to stash the current
      values, propagate the changes and restore them recursively.  Combined
      with the previous changes, this makes subsystem management operations
      fully recursive and modularlized.  This will be used to expand cgroup
      core functionalities.
      
      While at it, remove now unused @css_enable and @css_disable from
      cgroup_subtree_control_write().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      15a27c36
    • T
      cgroup: make cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() recursive · ce3f1d9d
      Tejun Heo 提交于
      The three factored out css management operations -
      cgroup_drain_offline() and cgroup_apply_control_{disable|enable}() -
      only depend on the current state of the target cgroups and idempotent
      and thus can be easily made to operate on the subtree instead of the
      immediate children.
      
      This patch introduces the iterators which walk live subtree and
      converts the three functions to operate on the subtree including self
      instead of the children.  While this leads to spurious walking and be
      slightly more expensive, it will allow them to be used for wider scope
      of operations.
      
      Note that cgroup_drain_offline() now tests for whether a css is dying
      before trying to drain it.  This is to avoid trying to drain live
      csses as there can be mix of live and dying csses in a subtree unlike
      children of the same parent.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      ce3f1d9d
    • T
      cgroup: factor out cgroup_apply_control_enable() from cgroup_subtree_control_write() · bdb53bd7
      Tejun Heo 提交于
      Factor out css enabling and showing into cgroup_apply_control_enable().
      
      * Nest subsystem walk inside child walk.  The child walk will later be
        converted to subtree walk which is a bit more expensive.
      
      * Instead of operating on the differential masks @css_enable, simply
        enable or show csses according to the current cgroup_control() and
        cgroup_ss_mask().  This leads to the same result and is simpler and
        more robust.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      bdb53bd7
    • T
      cgroup: factor out cgroup_apply_control_disable() from cgroup_subtree_control_write() · 12b3bb6a
      Tejun Heo 提交于
      Factor out css disabling and hiding into cgroup_apply_control_disable().
      
      * Nest subsystem walk inside child walk.  The child walk will later be
        converted to subtree walk which is a bit more expensive.
      
      * Instead of operating on the differential masks @css_enable and
        @css_disable, simply disable or hide csses according to the current
        cgroup_control() and cgroup_ss_mask().  This leads to the same
        result and is simpler and more robust.
      
      * This allows error handling path to share the same code.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      12b3bb6a
    • T
      cgroup: factor out cgroup_drain_offline() from cgroup_subtree_control_write() · 1b9b96a1
      Tejun Heo 提交于
      Factor out async css offline draining into cgroup_drain_offline().
      
      * Nest subsystem walk inside child walk.  The child walk will later be
        converted to subtree walk which is a bit more expensive.
      
      * Relocate the draining above subsystem mask preparation, which
        doesn't create any behavior differences but helps further
        refactoring.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      1b9b96a1
    • T
      cgroup: introduce cgroup_control() and cgroup_ss_mask() · 5531dc91
      Tejun Heo 提交于
      When a controller is enabled and visible on a non-root cgroup is
      determined by subtree_control and subtree_ss_mask of the parent
      cgroup.  For a root cgroup, by the type of the hierarchy and which
      controllers are attached to it.  Deciding the above on each usage is
      fragile and unnecessarily complicates the users.
      
      This patch introduces cgroup_control() and cgroup_ss_mask() which
      calculate and return the [visibly] enabled subsyste mask for the
      specified cgroup and conver the existing usages.
      
      * cgroup_e_css() is restructured for simplicity.
      
      * cgroup_calc_subtree_ss_mask() and cgroup_subtree_control_write() no
        longer need to distinguish root and non-root cases.
      
      * With cgroup_control(), cgroup_controllers_show() can now handle both
        root and non-root cases.  cgroup_root_controllers_show() is removed.
      
      v2: cgroup_control() updated to yield the correct result on v1
          hierarchies too.  cgroup_subtree_control_write() converted.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      5531dc91
    • T
      cgroup: factor out cgroup_create() out of cgroup_mkdir() · a5bca215
      Tejun Heo 提交于
      We're in the process of refactoring cgroup and css management paths to
      separate them out to eventually allow cgroups which aren't visible
      through cgroup fs.  This patch factors out cgroup_create() out of
      cgroup_mkdir().  cgroup_create() contains all internal object creation
      and initialization.  cgroup_mkdir() uses cgroup_create() to create the
      internal cgroup and adds interface directory and file creation.
      
      This patch doesn't cause any behavior differences.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      a5bca215
    • T
      cgroup: reorder operations in cgroup_mkdir() · 195e9b6c
      Tejun Heo 提交于
      Currently, operations to initialize internal objects and create
      interface directory and files are intermixed in cgroup_mkdir().  We're
      in the process of refactoring cgroup and css management paths to
      separate them out to eventually allow cgroups which aren't visible
      through cgroup fs.
      
      This patch reorders operations inside cgroup_mkdir() so that interface
      directory and file handling comes after internal object
      initialization.  This will enable further refactoring.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      195e9b6c
    • T
      cgroup: explicitly track whether a cgroup_subsys_state is visible to userland · 88cb04b9
      Tejun Heo 提交于
      Currently, whether a css (cgroup_subsys_state) has its interface files
      created is not tracked and assumed to change together with the owning
      cgroup's lifecycle.  cgroup directory and interface creation is being
      separated out from internal object creation to help refactoring and
      eventually allow cgroups which are not visible through cgroupfs.
      
      This patch adds CSS_VISIBLE to track whether a css has its interface
      files created and perform management operations only when necessary
      which helps decoupling interface file handling from internal object
      lifecycle.  After this patch, all css interface file management
      functions can be called regardless of the current state and will
      achieve the expected result.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      88cb04b9
    • T
      cgroup: separate out interface file creation from css creation · 6cd0f5bb
      Tejun Heo 提交于
      Currently, interface files are created when a css is created depending
      on whether @visible is set.  This patch separates out the two into
      separate steps to help code refactoring and eventually allow cgroups
      which aren't visible through cgroup fs.
      
      Move css_populate_dir() out of create_css() and drop @visible.  While
      at it, rename the function to css_create() for consistency.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      6cd0f5bb
    • T
      cgroup: suppress spurious de-populated events · 20b454a6
      Tejun Heo 提交于
      During task migration, tasks may transfer between two css_sets which
      are associated with the same cgroup.  If those tasks are the only
      tasks in the cgroup, this currently triggers a spurious de-populated
      event on the cgroup.
      
      Fix it by bumping up populated count before bumping it down during
      migration to ensure that it doesn't reach zero spuriously.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      20b454a6
    • T
      cgroup: re-hash init_css_set after subsystems are initialized · 2378d8b8
      Tejun Heo 提交于
      css_sets are hashed by their subsys[] contents and in cgroup_init()
      init_css_set is hashed early, before subsystem inits, when all entries
      in its subsys[] are NULL, so that cgroup_dfl_root initialization can
      find and link to it.  As subsystems are initialized,
      init_css_set.subsys[] is filled up but the hashing is never updated
      making init_css_set hashed in the wrong place.  While incorrect, this
      doesn't cause a critical failure as css_set management code would
      create an identical css_set dynamically.
      
      Fix it by rehashing init_css_set after subsystems are initialized.
      While at it, drop unnecessary @key local variable.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NZefan Li <lizefan@huawei.com>
      2378d8b8
  2. 02 3月, 2016 1 次提交
    • V
      cgroup: reset css on destruction · fa06235b
      Vladimir Davydov 提交于
      An associated css can be around for quite a while after a cgroup
      directory has been removed. In general, it makes sense to reset it to
      defaults so as not to worry about any remnants. For instance, memory
      cgroup needs to reset memory.low, otherwise pages charged to a dead
      cgroup might never get reclaimed. There's ->css_reset callback, which
      would fit perfectly for the purpose. Currently, it's only called when a
      subsystem is disabled in the unified hierarchy and there are other
      subsystems dependant on it. Let's call it on css destruction as well.
      Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      fa06235b
  3. 27 2月, 2016 1 次提交
  4. 23 2月, 2016 9 次提交
  5. 13 2月, 2016 1 次提交
    • J
      cgroup: provide cgroup_nov1= to disable controllers in v1 mounts · 223ffb29
      Johannes Weiner 提交于
      Testing cgroup2 can be painful with system software automatically
      mounting and populating all cgroup controllers in v1 mode. Sometimes
      they can be unmounted from rc.local, sometimes even that is too late.
      
      Provide a commandline option to disable certain controllers in v1
      mounts, so that they remain available for cgroup2 mounts.
      
      Example use:
      
      cgroup_no_v1=memory,cpu
      cgroup_no_v1=all
      
      Disabling will be confirmed at boot-time as such:
      
      [    0.013770] Disabling cpu control group subsystem in v1 mounts
      [    0.016004] Disabling memory control group subsystem in v1 mounts
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      223ffb29
  6. 22 1月, 2016 3 次提交
    • T
      cgroup: make sure a parent css isn't freed before its children · 8bb5ef79
      Tejun Heo 提交于
      There are three subsystem callbacks in css shutdown path -
      css_offline(), css_released() and css_free().  Except for
      css_released(), cgroup core didn't guarantee the order of invocation.
      css_offline() or css_free() could be called on a parent css before its
      children.  This behavior is unexpected and led to bugs in cpu and
      memory controller.
      
      The previous patch updated ordering for css_offline() which fixes the
      cpu controller issue.  While there currently isn't a known bug caused
      by misordering of css_free() invocations, let's fix it too for
      consistency.
      
      css_free() ordering can be trivially fixed by moving putting of the
      parent css below css_free() invocation.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      8bb5ef79
    • T
      cgroup: make sure a parent css isn't offlined before its children · aa226ff4
      Tejun Heo 提交于
      There are three subsystem callbacks in css shutdown path -
      css_offline(), css_released() and css_free().  Except for
      css_released(), cgroup core didn't guarantee the order of invocation.
      css_offline() or css_free() could be called on a parent css before its
      children.  This behavior is unexpected and led to bugs in cpu and
      memory controller.
      
      This patch updates offline path so that a parent css is never offlined
      before its children.  Each css keeps online_cnt which reaches zero iff
      itself and all its children are offline and offline_css() is invoked
      only after online_cnt reaches zero.
      
      This fixes the memory controller bug and allows the fix for cpu
      controller.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Reported-by: NBrian Christiansen <brian.o.christiansen@gmail.com>
      Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
      Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      aa226ff4
    • T
      cpuset: make mm migration asynchronous · e93ad19d
      Tejun Heo 提交于
      If "cpuset.memory_migrate" is set, when a process is moved from one
      cpuset to another with a different memory node mask, pages in used by
      the process are migrated to the new set of nodes.  This was performed
      synchronously in the ->attach() callback, which is synchronized
      against process management.  Recently, the synchronization was changed
      from per-process rwsem to global percpu rwsem for simplicity and
      optimization.
      
      Combined with the synchronous mm migration, this led to deadlocks
      because mm migration could schedule a work item which may in turn try
      to create a new worker blocking on the process management lock held
      from cgroup process migration path.
      
      This heavy an operation shouldn't be performed synchronously from that
      deep inside cgroup migration in the first place.  This patch punts the
      actual migration to an ordered workqueue and updates cgroup process
      migration and cpuset config update paths to flush the workqueue after
      all locks are released.  This way, the operations still seem
      synchronous to userland without entangling mm migration with process
      management synchronization.  CPU hotplug can also invoke mm migration
      but there's no reason for it to wait for mm migrations and thus
      doesn't synchronize against their completions.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: stable@vger.kernel.org # v4.4+
      e93ad19d
  7. 11 1月, 2016 1 次提交
  8. 02 1月, 2016 1 次提交
  9. 15 12月, 2015 1 次提交
  10. 09 12月, 2015 1 次提交
    • T
      sock, cgroup: add sock->sk_cgroup · bd1060a1
      Tejun Heo 提交于
      In cgroup v1, dealing with cgroup membership was difficult because the
      number of membership associations was unbound.  As a result, cgroup v1
      grew several controllers whose primary purpose is either tagging
      membership or pull in configuration knobs from other subsystems so
      that cgroup membership test can be avoided.
      
      net_cls and net_prio controllers are examples of the latter.  They
      allow configuring network-specific attributes from cgroup side so that
      network subsystem can avoid testing cgroup membership; unfortunately,
      these are not only cumbersome but also problematic.
      
      Both net_cls and net_prio aren't properly hierarchical.  Both inherit
      configuration from the parent on creation but there's no interaction
      afterwards.  An ancestor doesn't restrict the behavior in its subtree
      in anyway and configuration changes aren't propagated downwards.
      Especially when combined with cgroup delegation, this is problematic
      because delegatees can mess up whatever network configuration
      implemented at the system level.  net_prio would allow the delegatees
      to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
      the same for classid.
      
      While it is possible to solve these issues from controller side by
      implementing hierarchical allowable ranges in both controllers, it
      would involve quite a bit of complexity in the controllers and further
      obfuscate network configuration as it becomes even more difficult to
      tell what's actually being configured looking from the network side.
      While not much can be done for v1 at this point, as membership
      handling is sane on cgroup v2, it'd be better to make cgroup matching
      behave like other network matches and classifiers than introducing
      further complications.
      
      In preparation, this patch updates sock->sk_cgrp_data handling so that
      it points to the v2 cgroup that sock was created in until either
      net_prio or net_cls is used.  Once either of the two is used,
      sock->sk_cgrp_data reverts to its previous role of carrying prioidx
      and classid.  This is to avoid adding yet another cgroup related field
      to struct sock.
      
      As the mode switching can happen at most once per boot, the switching
      mechanism is aimed at lowering hot path overhead.  It may leak a
      finite, likely small, number of cgroup refs and report spurious
      prioidx or classid on switching; however, dynamic updates of prioidx
      and classid have always been racy and lossy - socks between creation
      and fd installation are never updated, config changes don't update
      existing sockets at all, and prioidx may index with dead and recycled
      cgroup IDs.  Non-critical inaccuracies from small race windows won't
      make any noticeable difference.
      
      This patch doesn't make use of the pointer yet.  The following patch
      will implement netfilter match for cgroup2 membership.
      
      v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
          cgroup specific field.
      
      v3: Add comments explaining why sock_data_prioidx() and
          sock_data_classid() use different fallback values.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
      CC: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd1060a1
  11. 03 12月, 2015 2 次提交
    • O
      cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends · b53202e6
      Oleg Nesterov 提交于
      Now that nobody use the "priv" arg passed to can_fork/cancel_fork/fork we can
      kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc and cgrp_ss_priv[] in copy_process().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b53202e6
    • T
      cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Tejun Heo 提交于
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing @css parameter from the three
      migration methods, ->can_attach, ->cancel_attach() and ->attach() and
      updating cgroup_taskset iteration helpers also return the destination
      css in addition to the task being migrated.  All controllers are
      updated accordingly.
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accomodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-and-tested-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      1f7dd3e5
  12. 30 11月, 2015 1 次提交
    • T
      cgroup: make css_set pin its css's to avoid use-afer-free · 53254f90
      Tejun Heo 提交于
      A css_set represents the relationship between a set of tasks and
      css's.  css_set never pinned the associated css's.  This was okay
      because tasks used to always disassociate immediately (in RCU sense) -
      either a task is moved to a different css_set or exits and never
      accesses css_set again.
      
      Unfortunately, afcf6c8b ("cgroup: add cgroup_subsys->free() method
      and use it to fix pids controller") and patches leading up to it made
      a zombie hold onto its css_set and deref the associated css's on its
      release.  Nothing pins the css's after exit and it might have already
      been freed leading to use-after-free.
      
       general protection fault: 0000 [#1] PREEMPT SMP
       task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
       RIP: 0010:[<ffffffff810fa205>]  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
       ...
       Call Trace:
        <IRQ>
        [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0
        [<ffffffff810f8893>] cgroup_free+0x53/0xe0
        [<ffffffff8104ed62>] __put_task_struct+0x42/0x130
        [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130
        [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820
        [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820
        [<ffffffff81056e54>] __do_softirq+0xd4/0x460
        [<ffffffff81057369>] irq_exit+0x89/0xa0
        [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50
        [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90
        <EOI>
       ...
       Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
       RIP  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
        RSP <ffff88001fc03e20>
       ---[ end trace 89a4a4b916b90c49 ]---
       Kernel panic - not syncing: Fatal exception in interrupt
       Kernel Offset: disabled
       ---[ end Kernel panic - not syncing: Fatal exception in interrupt
      
      Fix it by making css_set pin the associate css's until its release.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Reported-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk
      Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de
      Fixes: afcf6c8b ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")
      53254f90