1. 03 Mar 2016, 7 commits
    • cgroup: introduce cgroup_control() and cgroup_ss_mask() · 5531dc91
      Committed by Tejun Heo
      Whether a controller is enabled and visible on a non-root cgroup is
      determined by the subtree_control and subtree_ss_mask of its parent
      cgroup.  For a root cgroup, it is determined by the type of the
      hierarchy and which controllers are attached to it.  Deciding this at
      each usage site is fragile and unnecessarily complicates the callers.

      This patch introduces cgroup_control() and cgroup_ss_mask(), which
      calculate and return the [visibly] enabled subsystem mask for the
      specified cgroup, and converts the existing usages.
      
      * cgroup_e_css() is restructured for simplicity.
      
      * cgroup_calc_subtree_ss_mask() and cgroup_subtree_control_write() no
        longer need to distinguish root and non-root cases.
      
      * With cgroup_control(), cgroup_controllers_show() can now handle both
        root and non-root cases.  cgroup_root_controllers_show() is removed.
      
      v2: cgroup_control() updated to yield the correct result on v1
          hierarchies too.  cgroup_subtree_control_write() converted.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      5531dc91
    • cgroup: factor out cgroup_create() out of cgroup_mkdir() · a5bca215
      Committed by Tejun Heo
      We're in the process of refactoring cgroup and css management paths to
      separate them out to eventually allow cgroups which aren't visible
      through cgroup fs.  This patch factors cgroup_create() out of
      cgroup_mkdir().  cgroup_create() contains all internal object creation
      and initialization, while cgroup_mkdir() uses cgroup_create() to create
      the internal cgroup and then adds interface directory and file creation
      on top.
      
      This patch doesn't cause any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      a5bca215
    • cgroup: reorder operations in cgroup_mkdir() · 195e9b6c
      Committed by Tejun Heo
      Currently, operations to initialize internal objects and create
      interface directory and files are intermixed in cgroup_mkdir().  We're
      in the process of refactoring cgroup and css management paths to
      separate them out to eventually allow cgroups which aren't visible
      through cgroup fs.
      
      This patch reorders operations inside cgroup_mkdir() so that interface
      directory and file handling comes after internal object
      initialization.  This will enable further refactoring.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      195e9b6c
    • cgroup: explicitly track whether a cgroup_subsys_state is visible to userland · 88cb04b9
      Committed by Tejun Heo
      Currently, whether a css (cgroup_subsys_state) has its interface files
      created is not tracked and assumed to change together with the owning
      cgroup's lifecycle.  cgroup directory and interface creation is being
      separated out from internal object creation to help refactoring and
      eventually allow cgroups which are not visible through cgroupfs.
      
      This patch adds CSS_VISIBLE to track whether a css has its interface
      files created and performs management operations only when necessary,
      which helps decouple interface file handling from the internal object
      lifecycle.  After this patch, all css interface file management
      functions can be called regardless of the current state and will
      achieve the expected result.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      88cb04b9
    • cgroup: separate out interface file creation from css creation · 6cd0f5bb
      Committed by Tejun Heo
      Currently, interface files are created when a css is created depending
      on whether @visible is set.  This patch separates out the two into
      separate steps to help code refactoring and eventually allow cgroups
      which aren't visible through cgroup fs.
      
      Move css_populate_dir() out of create_css() and drop @visible.  While
      at it, rename the function to css_create() for consistency.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      6cd0f5bb
    • cgroup: suppress spurious de-populated events · 20b454a6
      Committed by Tejun Heo
      During task migration, tasks may transfer between two css_sets which
      are associated with the same cgroup.  If those tasks are the only
      tasks in the cgroup, this currently triggers a spurious de-populated
      event on the cgroup.
      
      Fix it by bumping the populated count up before bumping it down during
      migration, ensuring that it never spuriously reaches zero.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      20b454a6
    • cgroup: re-hash init_css_set after subsystems are initialized · 2378d8b8
      Committed by Tejun Heo
      css_sets are hashed by their subsys[] contents and in cgroup_init()
      init_css_set is hashed early, before subsystem inits, when all entries
      in its subsys[] are NULL, so that cgroup_dfl_root initialization can
      find and link to it.  As subsystems are initialized,
      init_css_set.subsys[] is filled up but the hash is never updated,
      leaving init_css_set hashed in the wrong bucket.  While incorrect, this
      doesn't cause a critical failure as css_set management code would
      create an identical css_set dynamically.
      
      Fix it by rehashing init_css_set after subsystems are initialized.
      While at it, drop unnecessary @key local variable.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      2378d8b8
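      A minimal sketch of the rehash described above (its exact placement at
      the end of cgroup_init() is assumed):

        /* init_css_set.subsys[] has been filled in by subsystem init; re-hash */
        hash_del(&init_css_set.hlist);
        hash_add(css_set_table, &init_css_set.hlist,
                 css_set_hash(init_css_set.subsys));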
  2. 02 Mar 2016, 1 commit
    • cgroup: reset css on destruction · fa06235b
      Committed by Vladimir Davydov
      An associated css can be around for quite a while after a cgroup
      directory has been removed. In general, it makes sense to reset it to
      defaults so as not to worry about any remnants. For instance, memory
      cgroup needs to reset memory.low, otherwise pages charged to a dead
      cgroup might never get reclaimed. There's the ->css_reset callback,
      which fits the purpose perfectly. Currently, it's only called when a
      subsystem is disabled in the unified hierarchy and there are other
      subsystems dependent on it. Let's call it on css destruction as well.
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      fa06235b
  3. 27 Feb 2016, 1 commit
  4. 23 Feb 2016, 9 commits
  5. 13 Feb 2016, 1 commit
    • cgroup: provide cgroup_no_v1= to disable controllers in v1 mounts · 223ffb29
      Committed by Johannes Weiner
      Testing cgroup2 can be painful with system software automatically
      mounting and populating all cgroup controllers in v1 mode. Sometimes
      they can be unmounted from rc.local, sometimes even that is too late.
      
      Provide a commandline option to disable certain controllers in v1
      mounts, so that they remain available for cgroup2 mounts.
      
      Example use:
      
      cgroup_no_v1=memory,cpu
      cgroup_no_v1=all
      
      Disabling will be confirmed at boot time as follows:
      
      [    0.013770] Disabling cpu control group subsystem in v1 mounts
      [    0.016004] Disabling memory control group subsystem in v1 mounts
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      223ffb29
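      A hedged sketch of how such a boot parameter is typically wired up; the
      mask name and exact parsing details are assumptions based on the
      description above:

        static unsigned long cgroup_no_v1_mask __read_mostly;

        static int __init cgroup_no_v1(char *str)
        {
                struct cgroup_subsys *ss;
                char *token;
                int i;

                while ((token = strsep(&str, ",")) != NULL) {
                        if (!*token)
                                continue;
                        if (!strcmp(token, "all")) {
                                cgroup_no_v1_mask = ~0UL;
                                break;
                        }
                        for_each_subsys(ss, i)
                                if (!strcmp(token, ss->name) ||
                                    !strcmp(token, ss->legacy_name))
                                        cgroup_no_v1_mask |= 1UL << i;
                }
                return 1;
        }
        __setup("cgroup_no_v1=", cgroup_no_v1);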
  6. 22 Jan 2016, 3 commits
    • cgroup: make sure a parent css isn't freed before its children · 8bb5ef79
      Committed by Tejun Heo
      There are three subsystem callbacks in css shutdown path -
      css_offline(), css_released() and css_free().  Except for
      css_released(), cgroup core didn't guarantee the order of invocation.
      css_offline() or css_free() could be called on a parent css before its
      children.  This behavior is unexpected and led to bugs in cpu and
      memory controller.
      
      The previous patch updated ordering for css_offline() which fixes the
      cpu controller issue.  While there currently isn't a known bug caused
      by misordering of css_free() invocations, let's fix it too for
      consistency.
      
      css_free() ordering can be trivially fixed by moving the put of the
      parent css to after the css_free() invocation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      8bb5ef79
    • cgroup: make sure a parent css isn't offlined before its children · aa226ff4
      Committed by Tejun Heo
      There are three subsystem callbacks in css shutdown path -
      css_offline(), css_released() and css_free().  Except for
      css_released(), cgroup core didn't guarantee the order of invocation.
      css_offline() or css_free() could be called on a parent css before its
      children.  This behavior is unexpected and led to bugs in cpu and
      memory controller.
      
      This patch updates offline path so that a parent css is never offlined
      before its children.  Each css keeps online_cnt which reaches zero iff
      itself and all its children are offline and offline_css() is invoked
      only after online_cnt reaches zero.
      
      This fixes the memory controller bug and allows the fix for cpu
      controller.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reported-by: Brian Christiansen <brian.o.christiansen@gmail.com>
      Link: http://lkml.kernel.org/g/5698A023.9070703@de.ibm.com
      Link: http://lkml.kernel.org/g/CAKB58ikDkzc8REt31WBkD99+hxNzjK4+FBmhkgS+NVrC9vjMSg@mail.gmail.com
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      aa226ff4
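      A simplified, non-authoritative sketch of the counting scheme described
      above (helper names are hypothetical; the real code uses a workqueue to
      perform the actual offlining):

        static void css_note_online(struct cgroup_subsys_state *css)
        {
                atomic_set(&css->online_cnt, 1);              /* count self */
                if (css->parent)
                        atomic_inc(&css->parent->online_cnt); /* keep parent online */
        }

        static void css_note_offline(struct cgroup_subsys_state *css)
        {
                do {
                        if (!atomic_dec_and_test(&css->online_cnt))
                                break;
                        offline_css(css);  /* all children are already offline */
                        css = css->parent; /* dropping our ref may offline it too */
                } while (css);
        }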
    • cpuset: make mm migration asynchronous · e93ad19d
      Committed by Tejun Heo
      If "cpuset.memory_migrate" is set, when a process is moved from one
      cpuset to another with a different memory node mask, pages in used by
      the process are migrated to the new set of nodes.  This was performed
      synchronously in the ->attach() callback, which is synchronized
      against process management.  Recently, the synchronization was changed
      from per-process rwsem to global percpu rwsem for simplicity and
      optimization.
      
      Combined with the synchronous mm migration, this led to deadlocks
      because mm migration could schedule a work item which may in turn try
      to create a new worker blocking on the process management lock held
      from cgroup process migration path.
      
      Such a heavy operation shouldn't be performed synchronously from that
      deep inside cgroup migration in the first place.  This patch punts the
      actual migration to an ordered workqueue and updates cgroup process
      migration and cpuset config update paths to flush the workqueue after
      all locks are released.  This way, the operations still seem
      synchronous to userland without entangling mm migration with process
      management synchronization.  CPU hotplug can also invoke mm migration
      but there's no reason for it to wait for mm migrations and thus
      doesn't synchronize against their completions.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: stable@vger.kernel.org # v4.4+
      e93ad19d
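      A rough sketch of the punting scheme; struct and helper names follow
      the description above but the exact implementation is assumed:

        struct cpuset_migrate_mm_work {
                struct work_struct      work;
                struct mm_struct        *mm;
                nodemask_t              from;
                nodemask_t              to;
        };

        static struct workqueue_struct *cpuset_migrate_mm_wq;

        static void cpuset_migrate_mm_workfn(struct work_struct *work)
        {
                struct cpuset_migrate_mm_work *mwork =
                        container_of(work, struct cpuset_migrate_mm_work, work);

                /* the heavy lifting now runs outside all cgroup locks */
                do_migrate_pages(mwork->mm, &mwork->from, &mwork->to, MPOL_MF_MOVE_ALL);
                mmput(mwork->mm);
                kfree(mwork);
        }

        static void cpuset_migrate_mm(struct mm_struct *mm, const nodemask_t *from,
                                      const nodemask_t *to)
        {
                struct cpuset_migrate_mm_work *mwork = kzalloc(sizeof(*mwork), GFP_KERNEL);

                if (!mwork) {
                        mmput(mm);
                        return;
                }
                mwork->mm = mm;
                mwork->from = *from;
                mwork->to = *to;
                INIT_WORK(&mwork->work, cpuset_migrate_mm_workfn);
                queue_work(cpuset_migrate_mm_wq, &mwork->work);
        }

        /* called from the migration paths after every lock has been released */
        void cpuset_post_attach_flush(void)
        {
                flush_workqueue(cpuset_migrate_mm_wq);
        }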
  7. 11 Jan 2016, 1 commit
  8. 02 Jan 2016, 1 commit
  9. 15 Dec 2015, 1 commit
  10. 09 Dec 2015, 1 commit
    • sock, cgroup: add sock->sk_cgroup · bd1060a1
      Committed by Tejun Heo
      In cgroup v1, dealing with cgroup membership was difficult because the
      number of membership associations was unbounded.  As a result, cgroup
      v1 grew several controllers whose primary purpose is either tagging
      membership or pulling in configuration knobs from other subsystems so
      that cgroup membership tests can be avoided.
      
      net_cls and net_prio controllers are examples of the latter.  They
      allow configuring network-specific attributes from cgroup side so that
      network subsystem can avoid testing cgroup membership; unfortunately,
      these are not only cumbersome but also problematic.
      
      Neither net_cls nor net_prio is properly hierarchical.  Both inherit
      configuration from the parent on creation but there's no interaction
      afterwards.  An ancestor doesn't restrict the behavior in its subtree
      in any way and configuration changes aren't propagated downwards.
      Especially when combined with cgroup delegation, this is problematic
      because delegatees can mess up whatever network configuration is
      implemented at the system level.  net_prio would allow delegatees to
      set any priority value regardless of CAP_NET_ADMIN, and net_cls would
      do the same for classid.
      
      While it is possible to solve these issues from controller side by
      implementing hierarchical allowable ranges in both controllers, it
      would involve quite a bit of complexity in the controllers and further
      obfuscate network configuration as it becomes even more difficult to
      tell what's actually being configured looking from the network side.
      While not much can be done for v1 at this point, as membership
      handling is sane on cgroup v2, it'd be better to make cgroup matching
      behave like other network matches and classifiers than introducing
      further complications.
      
      In preparation, this patch updates sock->sk_cgrp_data handling so that
      it points to the v2 cgroup that sock was created in until either
      net_prio or net_cls is used.  Once either of the two is used,
      sock->sk_cgrp_data reverts to its previous role of carrying prioidx
      and classid.  This is to avoid adding yet another cgroup related field
      to struct sock.
      
      As the mode switching can happen at most once per boot, the switching
      mechanism is aimed at lowering hot path overhead.  It may leak a
      finite, likely small, number of cgroup refs and report spurious
      prioidx or classid on switching; however, dynamic updates of prioidx
      and classid have always been racy and lossy - socks between creation
      and fd installation are never updated, config changes don't update
      existing sockets at all, and prioidx may index with dead and recycled
      cgroup IDs.  Non-critical inaccuracies from small race windows won't
      make any noticeable difference.
      
      This patch doesn't make use of the pointer yet.  The following patch
      will implement netfilter match for cgroup2 membership.
      
      v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
          cgroup specific field.
      
      v3: Add comments explaining why sock_data_prioidx() and
          sock_data_classid() use different fallback values.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
      CC: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bd1060a1
  11. 03 Dec 2015, 2 commits
    • cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends · b53202e6
      Committed by Oleg Nesterov
      Now that nobody uses the "priv" arg passed to can_fork/cancel_fork/fork,
      we can kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc. and cgrp_ss_priv[] in
      copy_process().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b53202e6
    • cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Committed by Tejun Heo
      Consider the following v2 hierarchy.
      
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
             
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B, A's processes should be moved to
      the former, and B's processes to the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
      
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
       ...
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      
      This patch fixes the bug by removing the @css parameter from the three
      migration methods, ->can_attach(), ->cancel_attach() and ->attach(),
      and updating the cgroup_taskset iteration helpers to also return the
      destination css in addition to the task being migrated.  All
      controllers are
      updated accordingly.
      
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      
      * cpuset's current implementation assumes that there's a single source
        and destination and thus doesn't support the v2 hierarchy to begin with.  The
        only change made by this patchset is how that single destination css
        is obtained.
      
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accommodate the change.
      
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      1f7dd3e5
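      A short sketch of the updated iteration pattern, closely modeled on what
      the pids controller does after this change (simplified):

        static int pids_can_attach(struct cgroup_taskset *tset)
        {
                struct task_struct *task;
                struct cgroup_subsys_state *dst_css;

                cgroup_taskset_for_each(task, dst_css, tset) {
                        struct pids_cgroup *pids = css_pids(dst_css);
                        struct pids_cgroup *old_pids =
                                css_pids(task_css(task, pids_cgrp_id));

                        /* charge the css each task actually lands in */
                        pids_charge(pids, 1);
                        pids_uncharge(old_pids, 1);
                }
                return 0;
        }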
  12. 30 Nov 2015, 1 commit
    • cgroup: make css_set pin its css's to avoid use-after-free · 53254f90
      Committed by Tejun Heo
      A css_set represents the relationship between a set of tasks and
      css's.  css_set never pinned the associated css's.  This was okay
      because tasks used to always disassociate immediately (in RCU sense) -
      either a task is moved to a different css_set or exits and never
      accesses css_set again.
      
      Unfortunately, afcf6c8b ("cgroup: add cgroup_subsys->free() method
      and use it to fix pids controller") and patches leading up to it made
      a zombie hold onto its css_set and deref the associated css's on its
      release.  Nothing pins the css's after exit and they might have already
      been freed, leading to use-after-free.
      
       general protection fault: 0000 [#1] PREEMPT SMP
       task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
       RIP: 0010:[<ffffffff810fa205>]  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
       ...
       Call Trace:
        <IRQ>
        [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0
        [<ffffffff810f8893>] cgroup_free+0x53/0xe0
        [<ffffffff8104ed62>] __put_task_struct+0x42/0x130
        [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130
        [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820
        [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820
        [<ffffffff81056e54>] __do_softirq+0xd4/0x460
        [<ffffffff81057369>] irq_exit+0x89/0xa0
        [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50
        [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90
        <EOI>
       ...
       Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
       RIP  [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
        RSP <ffff88001fc03e20>
       ---[ end trace 89a4a4b916b90c49 ]---
       Kernel panic - not syncing: Fatal exception in interrupt
       Kernel Offset: disabled
       ---[ end Kernel panic - not syncing: Fatal exception in interrupt
      
      Fix it by making css_set pin the associated css's until its release.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Dave Jones <davej@codemonkey.org.uk>
      Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Link: http://lkml.kernel.org/g/20151120041836.GA18390@codemonkey.org.uk
      Link: http://lkml.kernel.org/g/5652D448.3080002@bmw-carit.de
      Fixes: afcf6c8b ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")
      53254f90
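      A sketch of the pinning described above; its placement inside
      find_css_set() and put_css_set_locked() is assumed:

        struct cgroup_subsys *ss;
        int ssid;

        /* when a new css_set is created, take a ref on every css it points to */
        for_each_subsys(ss, ssid)
                css_get(cset->subsys[ssid]);

        /* ... and drop the refs only when the css_set itself is released */
        for_each_subsys(ss, ssid)
                css_put(cset->subsys[ssid]);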
  13. 21 Nov 2015, 2 commits
    • cgroup: implement cgroup_get_from_path() and expose cgroup_put() · 16af4396
      Committed by Tejun Heo
      Implement cgroup_get_from_path() using kernfs_walk_and_get() which
      obtains a default hierarchy cgroup from its path.  This will be used
      to allow cgroup path based matching from outside cgroup proper -
      e.g. networking and perf.
      
      v2: Add EXPORT_SYMBOL_GPL(cgroup_get_from_path).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      16af4396
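      A usage sketch of the new helper; the caller and path below are
      hypothetical:

        /* e.g. a networking or perf path wanting to match against a cgroup */
        static int example_set_match_cgroup(const char *path)
        {
                struct cgroup *cgrp;

                cgrp = cgroup_get_from_path(path);  /* path relative to the v2 root */
                if (IS_ERR(cgrp))
                        return PTR_ERR(cgrp);

                /* ... stash cgrp for later membership tests ... */

                cgroup_put(cgrp);
                return 0;
        }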
    • cgroup: record ancestor IDs and reimplement cgroup_is_descendant() using it · b11cfb58
      Committed by Tejun Heo
      cgroup_is_descendant() currently walks up the hierarchy and compares
      each ancestor to the cgroup in question.  While enough for cgroup core
      usages, this can't be used in hot paths to test cgroup membership.
      This patch adds cgroup->ancestor_ids[] which records the IDs of all
      ancestors including self and cgroup->level for the nesting level.
      
      This allows testing whether a given cgroup is a descendant of another
      in three finite steps - testing whether the two belong to the same
      hierarchy, whether the descendant candidate is at the same or a higher
      level than the ancestor and comparing the recorded ancestor_id at the
      matching level.  cgroup_is_descendant() is accordingly reimplemented
      and made inline.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      b11cfb58
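      A sketch of the reimplemented test, essentially the three steps
      described above:

        static inline bool cgroup_is_descendant(struct cgroup *cgrp,
                                                struct cgroup *ancestor)
        {
                /* 1. same hierarchy?  2. deep enough?  3. matching ancestor ID? */
                if (cgrp->root != ancestor->root || cgrp->level < ancestor->level)
                        return false;
                return cgrp->ancestor_ids[ancestor->level] == ancestor->id;
        }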
  14. 17 Nov 2015, 1 commit
    • cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type · 67e9c74b
      Committed by Tejun Heo
      With major controllers - cpu, memory and io - shaping up for the
      unified hierarchy, cgroup2 is about ready to be, gradually, released
      into the wild.  Replace the __DEVEL__sane_behavior flag, which was used
      to select the unified hierarchy, with a separate filesystem type
      "cgroup2" so that the unified hierarchy can be mounted as follows.
      
        mount -t cgroup2 none $MOUNT_POINT
      
      The cgroup2 fs has its own magic number - 0x63677270 ("cgrp").
      
      v2: Assign a different magic number to cgroup2 fs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      67e9c74b
  15. 16 Nov 2015, 1 commit
    • cgroup: fix cftype->file_offset handling · 34c06254
      Committed by Tejun Heo
      6f60eade ("cgroup: generalize obtaining the handles of and
      notifying cgroup files") introduced cftype->file_offset so that the
      handles for per-css file instances can be recorded.  These handles
      then can be used, for example, to generate file modified
      notifications.
      
      Unfortunately, it made the wrong assumption that files are created
      once for a given css and removed on its destruction.  Due to the
      dependencies among subsystems, a css may be hidden from userland and
      then later shown again.  This is implemented by removing and
      re-creating the affected files, so the associated kernfs_node for a
      given cgroup file may change over time.  This incorrect assumption led
      to the corruption of css->files lists.
      
      Reimplement cftype->file_offset handling so that cgroup_file->kn is
      protected by a lock and updated as files are created and destroyed.
      This also makes keeping them on per-cgroup list unnecessary.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: James Sedgwick <jsedgwick@fb.com>
      Fixes: 6f60eade ("cgroup: generalize obtaining the handles of and notifying cgroup files")
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
      34c06254
  16. 07 Nov 2015, 1 commit
    • mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd · d0164adc
      Committed by Mel Gorman
      
      __GFP_WAIT has been used to identify atomic context in callers that hold
      spinlocks or are in interrupts.  They are expected to be high priority and
      have access to one of two watermarks lower than "min" which can be referred
      to as the "atomic reserve".  __GFP_HIGH users get access to the first
      lower watermark and can be called the "high priority reserve".
      
      Over time, callers had a requirement to not block when fallback options
      were available.  Some have abused __GFP_WAIT leading to a situation where
      an optimistic allocation with a fallback option can access atomic
      reserves.
      
      This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
      cannot sleep and have no alternative.  High priority users continue to use
      __GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
      are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
      callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
      redefined as a caller that is willing to enter direct reclaim and wake
      kswapd for background reclaim.
      
      This patch then converts a number of sites
      
      o __GFP_ATOMIC is used by callers that are high priority and have memory
        pools for those requests. GFP_ATOMIC uses this flag.
      
      o Callers that have a limited mempool to guarantee forward progress clear
        __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
        into this category where kswapd will still be woken but atomic reserves
        are not used as there is a one-entry mempool to guarantee progress.
      
      o Callers that are checking if they are non-blocking should use the
        helper gfpflags_allow_blocking() where possible. This is because
        checking for __GFP_WAIT as was done historically now can trigger false
        positives. Some exceptions like dm-crypt.c exist where the code intent
        is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
        flag manipulations.
      
      o Callers that built their own GFP flags instead of starting with GFP_KERNEL
        and friends now also need to specify __GFP_KSWAPD_RECLAIM.
      
      The first key hazard to watch out for is callers that removed __GFP_WAIT
      and were depending on access to atomic reserves for inconspicuous reasons.
      In some cases it may be appropriate for them to use __GFP_HIGH.
      
      The second key hazard is callers that assembled their own combination of
      GFP flags instead of starting with something like GFP_KERNEL.  They may
      now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
      if it's missed in most cases as other activity will wake kswapd.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0164adc
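      Hedged illustrations of the new flag semantics; this is not code from
      the patch itself, just example call sites:

        static void gfp_flag_examples(gfp_t caller_gfp)
        {
                void *obj;

                /* truly atomic context, no fallback: GFP_ATOMIC carries __GFP_ATOMIC */
                obj = kmalloc(64, GFP_ATOMIC);
                kfree(obj);

                /* instead of testing __GFP_WAIT directly, use the helper */
                if (gfpflags_allow_blocking(caller_gfp)) {
                        obj = kmalloc(64, caller_gfp);  /* may enter direct reclaim */
                        kfree(obj);
                }

                /* hand-built combinations should add __GFP_KSWAPD_RECLAIM
                 * explicitly to keep waking kswapd as GFP_KERNEL-derived flags do */
                obj = kmalloc(64, __GFP_HIGH | __GFP_KSWAPD_RECLAIM);
                kfree(obj);
        }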
  17. 29 Oct 2015, 1 commit
    • cgroup: fix race condition around termination check in css_task_iter_next() · d5745675
      Committed by Tejun Heo
      css_task_iter_next() checked @it->cur_task before grabbing
      css_set_lock and assumed that the result wouldn't change afterwards;
      however, tasks could leave the cgroup being iterated, terminating the
      iterator before css_set_lock is acquired.  If this happens,
      css_task_iter_next() tries to calculate the current task from a NULL
      cg_list pointer, leading to the following oops.
      
       BUG: unable to handle kernel paging request at fffffffffffff7d0
       IP: [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80
       ...
       CPU: 4 PID: 6391 Comm: JobQDisp2 Not tainted 4.0.9-22_fbk4_rc3_81616_ge8d9cb6 #1
       Hardware name: Quanta Freedom/Winterfell, BIOS F03_3B08 03/04/2014
       task: ffff880868e46400 ti: ffff88083404c000 task.ti: ffff88083404c000
       RIP: 0010:[<ffffffff810d5f22>]  [<ffffffff810d5f22>] css_task_iter_next+0x42/0x80
       RSP: 0018:ffff88083404fd28  EFLAGS: 00010246
       RAX: 0000000000000000 RBX: ffff88083404fd68 RCX: ffff8804697fb8b0
       RDX: fffffffffffff7c0 RSI: ffff8803b7dff800 RDI: ffffffff822c0278
       RBP: ffff88083404fd38 R08: 0000000000017160 R09: ffff88046f4070c0
       R10: ffffffff810d61f7 R11: 0000000000000293 R12: ffff880863bf8400
       R13: ffff88046b87fd80 R14: 0000000000000000 R15: ffff88083404fe58
       FS:  00007fa0567e2700(0000) GS:ffff88046f900000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: fffffffffffff7d0 CR3: 0000000469568000 CR4: 00000000001406e0
       Stack:
        0000000000000246 0000000000000000 ffff88083404fde8 ffffffff810d6248
        ffff88083404fd68 0000000000000000 ffff8803b7dff800 000001ef000001ee
        0000000000000000 0000000000000000 ffff880863bf8568 0000000000000000
       Call Trace:
        [<ffffffff810d6248>] cgroup_pidlist_start+0x258/0x550
        [<ffffffff810cf66d>] cgroup_seqfile_start+0x1d/0x20
        [<ffffffff8121f8ef>] kernfs_seq_start+0x5f/0xa0
        [<ffffffff811cab76>] seq_read+0x166/0x380
        [<ffffffff812200fd>] kernfs_fop_read+0x11d/0x180
        [<ffffffff811a7398>] __vfs_read+0x18/0x50
        [<ffffffff811a745d>] vfs_read+0x8d/0x150
        [<ffffffff811a756f>] SyS_read+0x4f/0xb0
        [<ffffffff818d4772>] system_call_fastpath+0x12/0x17
      
      Fix it by moving the termination condition check inside css_set_lock.
      @it->cur_task is now cleared after being put and @it->task_pos is
      tested for termination instead of @it->cset_pos as they indicate the
      same condition and @it->task_pos is what's being dereferenced.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Calvin Owens <calvinowens@fb.com>
      Fixes: ed27b9f7 ("cgroup: don't hold css_set_rwsem across css task iteration")
      Acked-by: Zefan Li <lizefan@huawei.com>
      d5745675
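      Roughly how the fixed function looks after moving the check under the
      lock (simplified from the description above; exact details assumed):

        struct task_struct *css_task_iter_next(struct css_task_iter *it)
        {
                if (it->cur_task) {
                        put_task_struct(it->cur_task);
                        it->cur_task = NULL;
                }

                spin_lock_bh(&css_set_lock);

                /* termination is tested on task_pos, under the lock,
                 * because that is what gets dereferenced below */
                if (it->task_pos) {
                        it->cur_task = list_entry(it->task_pos, struct task_struct,
                                                  cg_list);
                        get_task_struct(it->cur_task);
                        css_task_iter_advance(it);
                }

                spin_unlock_bh(&css_set_lock);

                return it->cur_task;
        }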
  18. 16 Oct 2015, 5 commits
    • cgroup: drop cgroup__DEVEL__legacy_files_on_dfl · e4b7037c
      Committed by Tejun Heo
      Now that interfaces for the major three controllers - cpu, memory, io
      - are shaping up, there's no reason to have an option to force legacy
      files to show up on the unified hierarchy for testing.  Drop it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      e4b7037c
    • cgroup: replace error handling in cgroup_init() with WARN_ON()s · 035f4f51
      Committed by Tejun Heo
      The init sequence shouldn't fail short of bugs, and even when it does,
      it's better to continue with the rest of the initialization.  We were
      also silently ignoring /proc/cgroups creation failure.
      
      Drop the explicit error handling and wrap sysfs_create_mount_point(),
      register_filesystem() and proc_create() with WARN_ON()s.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      035f4f51
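      Roughly the resulting pattern in cgroup_init(); the exact arguments are
      assumed:

        WARN_ON(sysfs_create_mount_point(fs_kobj, "cgroup"));
        WARN_ON(register_filesystem(&cgroup_fs_type));
        WARN_ON(!proc_create("cgroups", 0, NULL, &proc_cgroupstats_operations));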
    • cgroup: add cgroup_subsys->free() method and use it to fix pids controller · afcf6c8b
      Committed by Tejun Heo
      The pids controller is completely broken in that it uncharges when a
      task exits, allowing zombies to escape resource control.  With the recent
      updates, cgroup core now maintains cgroup association till task free
      and pids controller can be fixed by uncharging on free instead of
      exit.
      
      This patch adds the cgroup_subsys->free() method and updates the pids
      controller to use it instead of ->exit() for uncharging.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      afcf6c8b
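      A sketch of the new hook as the pids controller uses it (simplified;
      surrounding wiring assumed):

        static void pids_free(struct task_struct *task)
        {
                struct pids_cgroup *pids = css_pids(task_css(task, pids_cgrp_id));

                /* uncharge when the task_struct is actually freed, not at exit */
                pids_uncharge(pids, 1);
        }

        /* hooked up via the new callback:
         *      struct cgroup_subsys pids_cgrp_subsys = { ..., .free = pids_free, ... };
         */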
    • cgroup: keep zombies associated with their original cgroups · 2e91fa7f
      Committed by Tejun Heo
      cgroup_exit() is called when a task exits; it disassociates the
      exiting task from its cgroups and half-attaches it to the root cgroup.
      This is unnecessary and undesirable.
      
      No controller actually needs an exiting task to be disassociated from
      non-root cgroups.  Both the cpu and perf_event controllers update the
      association to the root cgroup from their exit callbacks just to keep
      consistent with the cgroup core behavior.
      
      Also, this disassociation makes it difficult to track resources held
      by zombies or determine where the zombies came from.  Currently, pids
      controller is completely broken as it uncharges on exit and zombies
      always escape the resource restriction.  With cgroup association being
      reset on exit, fixing it is pretty painful.
      
      There's no reason to reset cgroup membership on exit.  The zombie can
      be removed from its css_set so that it doesn't show up on
      "cgroup.procs" and thus can't be migrated or interfere with cgroup
      removal.  It can still pin and point to the css_set so that its cgroup
      membership is maintained.  This patch makes cgroup core keep zombies
      associated with their cgroups at the time of exit.
      
      * Previous patches decoupled populated_cnt tracking from css_set
        lifetime, so a dying task can be simply unlinked from its css_set
        while pinning and pointing to the css_set.  This keeps css_set
        association from task side alive while hiding it from "cgroup.procs"
        and populated_cnt tracking.  The css_set reference is dropped when
        the task_struct is freed.
      
      * ->exit() callback no longer needs the css arguments as the
        associated css never changes once PF_EXITING is set.  Removed.
      
      * cpu and perf_events controllers no longer need ->exit() callbacks.
        There's no reason to explicitly switch away on exit.  The final
        schedule out is enough.  The callbacks are removed.
      
      * On traditional hierarchies, nothing changes.  "/proc/PID/cgroup"
        still reports "/" for all zombies.  On the default hierarchy,
        "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
        to at the time of exit.  If the cgroup gets removed before the task
        is reaped, " (deleted)" is appended.
      
      v2: Build breakage due to missing dummy cgroup_free() when
          !CONFIG_CGROUP fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      2e91fa7f
    • cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock · f0d9a5f1
      Committed by Tejun Heo
      css_set_rwsem is the inner lock protecting css_sets and is accessed
      from hot paths such as fork and exit.  Internally, it has no reason to
      be an rwsem or even a mutex.  There are no internal blocking operations
      while holding it.  It was an rwsem because css task iteration used to
      expose it to external iterator users.  As the previous patch updated
      css task iteration such that the locking is not leaked to its users,
      there's no reason to keep it a rwsem.
      
      This patch converts css_set_rwsem to a spinlock and renames it to
      css_set_lock.  It uses bh-safe operations as a planned usage needs to
      access it from RCU callback context.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      f0d9a5f1
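      The resulting locking pattern, sketched; bh-safe so an RCU callback
      running in softirq context can also take the lock:

        DEFINE_SPINLOCK(css_set_lock);

        /* process-context users disable bottom halves while holding the lock */
        spin_lock_bh(&css_set_lock);
        /* ... link/unlink css_sets, update task->cg_list ... */
        spin_unlock_bh(&css_set_lock);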