1. 07 Dec, 2013 7 commits
    •
      cgroup: factor out cgroup_subsys_state creation into create_css() · c81c925a
      Tejun Heo committed
      Now that all operations to create a css (cgroup_subsys_state) are
      collected into a single loop in cgroup_create(), it's easy to factor
      it out into its own function.  Factor out css creation into
      create_css().  This makes the code easier to follow and will enable
      decoupling css creation from cgroup creation which is necessary for
      the planned unified hierarchy.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      cgroup: combine css handling loops in cgroup_create() · 9d403e99
      Tejun Heo committed
      Now that css operations in cgroup_create() are back-to-back, there
      isn't much point in allocating css's in one loop and onlining them in
      another.  Merge the two loops so that a css is allocated and onlined
      on each iteration.
      
      css_ar[] is no longer necessary and replaced with a single pointer.
      This also simplifies the error handling path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      cgroup: reorder operations in cgroup_create() · 0d80255e
      Tejun Heo committed
      cgroup_create() currently does the following.
      
      1. alloc cgroup
      2. alloc css's
      3. create the directory and commit to cgroup creation
      4. online css's
      5. create cgroup and css files
      
      The sequence performs allocations before other operations but it
      doesn't buy anything because each of the above steps may fail and
      should be unrollable.  Reorganize the sequence such that cgroup
      operations are done before css operations.
      
      1. alloc cgroup
      2. create the directory and files and commit to cgroup creation
      3. alloc css's
      4. create files for and online css's
      
      This simplifies the code a bit and enables further simplification and
      separating out css creation from cgroup creation which is necessary
      for the planned unified hierarchy where css's will be created and
      destroyed dynamically across the lifetime of a cgroup.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
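The reordered sequence can be illustrated with a small userspace sketch. It is not the kernel code: `struct cgrp`, `struct css`, `NR_SUBSYS`, `fail_css_at`, `create_cgrp()` and `destroy_cgrp()` are hypothetical stand-ins, and the directory step is reduced to a flag. It only shows the shape of the ordering (cgroup operations first, then each css allocated and brought online in one loop) with a single unwind path that tolerates empty css slots.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define NR_SUBSYS 3	/* hypothetical number of controllers */

struct css { bool online; };

struct cgrp {
	bool dir_created;
	struct css *subsys[NR_SUBSYS];
};

/* Set to a subsystem index to make its css allocation fail; -1 disables. */
static int fail_css_at = -1;

/* Tear down a fully or partially constructed cgroup. */
static void destroy_cgrp(struct cgrp *cg)
{
	for (int i = 0; i < NR_SUBSYS; i++)
		free(cg->subsys[i]);	/* free(NULL) is a no-op for empty slots */
	free(cg);
}

static struct cgrp *create_cgrp(void)
{
	/* 1. alloc cgroup */
	struct cgrp *cg = calloc(1, sizeof(*cg));

	if (!cg)
		return NULL;

	/* 2. create the directory and files, committing to creation */
	cg->dir_created = true;

	/* 3 + 4. alloc each css and bring it online in the same iteration */
	for (int i = 0; i < NR_SUBSYS; i++) {
		struct css *css = (i == fail_css_at) ? NULL : calloc(1, sizeof(*css));

		if (!css)
			goto err_destroy;
		css->online = true;
		cg->subsys[i] = css;
	}
	return cg;

err_destroy:
	destroy_cgrp(cg);
	return NULL;
}
```

Setting `fail_css_at = 1` before calling `create_cgrp()` exercises the unwind path: the first css is freed by `destroy_cgrp()` and the remaining slots are simply skipped.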
    •
      cgroup: make for_each_subsys() useable under cgroup_root_mutex · 780cd8b3
      Tejun Heo committed
      We want to use for_each_subsys() in cgroupfs_root handling where only
      cgroup_root_mutex is held.  The only way cgroup_subsys[] can change is
      through module load/unload, so make cgroup_[un]load_subsys() grab
      cgroup_root_mutex too and update the lockdep annotation in
      for_each_subsys() to allow either cgroup_mutex or cgroup_root_mutex.
      
      * Lockdep annotation is moved from inner 'if' condition to outer 'for'
        init clause.  There's no reason to execute the assertion every loop.
      
      * Loop index @i is renamed to @ssid.  Indices iterating through subsys
        will be [re]named to @ssid gradually.
      
      v2: cgroup_assert_mutex_or_root_locked() caused build failure if
          !CONFIG_LOCKDEP.  Conditionalize its definition.  The build failure
          was reported by kbuild test bot.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: kbuild test robot <fengguang.wu@intel.com>
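Moving the assertion into the `for` init clause can be sketched in plain C. This is a userspace model under stated assumptions: `subsys_tbl`, the two `*_held` flags, `assert_calls` and `count_loaded_subsys()` are hypothetical, and the lock check is an ordinary `assert()` rather than a lockdep annotation. The point is only that the comma expression in the init clause runs once per loop, not once per iteration.

```c
#include <assert.h>
#include <stddef.h>

#define SUBSYS_COUNT 3

struct subsys { const char *name; };

static struct subsys cpu_ss = { "cpu" }, mem_ss = { "memory" };
/* A NULL slot models a modular subsystem that is not loaded. */
static struct subsys *subsys_tbl[SUBSYS_COUNT] = { &cpu_ss, NULL, &mem_ss };

static int cgroup_mutex_held, cgroup_root_mutex_held;
static int assert_calls;	/* counts how often the check is evaluated */

static void assert_mutex_or_root_locked(void)
{
	assert_calls++;
	assert(cgroup_mutex_held || cgroup_root_mutex_held);
}

/*
 * The assertion sits in the init clause, so it is evaluated once.
 * Empty slots are skipped by the trailing if/else.
 */
#define for_each_subsys(ss, ssid)					\
	for (assert_mutex_or_root_locked(), (ssid) = 0;		\
	     (ssid) < SUBSYS_COUNT; (ssid)++)				\
		if (!((ss) = subsys_tbl[(ssid)])) { } else

static int count_loaded_subsys(void)
{
	struct subsys *ss;
	int ssid, n = 0;

	for_each_subsys(ss, ssid)
		n += (ss->name != NULL);
	return n;
}
```

With either flag set, `count_loaded_subsys()` walks all three slots, counts the two loaded entries, and leaves `assert_calls` at one.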
    •
      cgroup: css iterations and css_from_dir() are safe under cgroup_mutex · 87fb54f1
      Tejun Heo committed
      Currently, all css iterations and css_from_dir() require RCU read lock
      whether the caller is holding cgroup_mutex or not, which is
      unnecessarily restrictive.  They are all safe to use under
      cgroup_mutex without holding RCU read lock.
      
      Factor out cgroup_assert_mutex_or_rcu_locked() from css_from_id() and
      apply it to all css iteration functions and css_from_dir().
      
      v2: cgroup_assert_mutex_or_rcu_locked() definition doesn't need to be
          inside CONFIG_PROVE_RCU ifdef as rcu_lockdep_assert() is always
          defined and conditionalized.  Move it outside of the ifdef block.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      Merge branch 'for-3.13-fixes' into for-3.14 · e58e1ca4
      Tejun Heo committed
      Pulling in as patches depending on 266ccd50 ("cgroup: fix
      cgroup_create() error handling path") are scheduled.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    •
      cgroup: fix cgroup_create() error handling path · 266ccd50
      Tejun Heo committed
      ae7f164a ("cgroup: move cgroup->subsys[] assignment to
      online_css()") moved cgroup->subsys[] assignments later in
      cgroup_create() but didn't update error handling path accordingly
      leading to the following oops and leaking later css's after an
      online_css() failure.  The oops is from cgroup destruction path being
      invoked on the partially constructed cgroup which is not ready to
      handle empty slots in cgrp->subsys[] array.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        IP: [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
        PGD a780a067 PUD aadbe067 PMD 0
        Oops: 0000 [#1] SMP
        Modules linked in:
        CPU: 6 PID: 7360 Comm: mkdir Not tainted 3.13.0-rc2+ #69
        Hardware name:
        task: ffff8800b9dbec00 ti: ffff8800a781a000 task.ti: ffff8800a781a000
        RIP: 0010:[<ffffffff810eeaa8>]  [<ffffffff810eeaa8>] cgroup_destroy_locked+0x118/0x2f0
        RSP: 0018:ffff8800a781bd98  EFLAGS: 00010282
        RAX: ffff880586903878 RBX: ffff880586903800 RCX: ffff880586903820
        RDX: ffff880586903860 RSI: ffff8800a781bdb0 RDI: ffff880586903820
        RBP: ffff8800a781bde8 R08: ffff88060e0b8048 R09: ffffffff811d7bc1
        R10: 000000000000008c R11: 0000000000000001 R12: ffff8800a72286c0
        R13: 0000000000000000 R14: ffffffff81cf7a40 R15: 0000000000000001
        FS:  00007f60ecda57a0(0000) GS:ffff8806272c0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000008 CR3: 00000000a7a03000 CR4: 00000000000007e0
        Stack:
         ffff880586903860 ffff880586903910 ffff8800a72286c0 ffff880586903820
         ffffffff81cf7a40 ffff880586903800 ffff88060e0b8018 ffffffff81cf7a40
         ffff8800b9dbec00 ffff8800b9dbf098 ffff8800a781bec8 ffffffff810ef5bf
        Call Trace:
         [<ffffffff810ef5bf>] cgroup_mkdir+0x55f/0x5f0
         [<ffffffff811c90ae>] vfs_mkdir+0xee/0x140
         [<ffffffff811cb07e>] SyS_mkdirat+0x6e/0xf0
         [<ffffffff811c6a19>] SyS_mkdir+0x19/0x20
         [<ffffffff8169e569>] system_call_fastpath+0x16/0x1b
      
      This patch moves reference bumping inside online_css() loop, clears
      css_ar[] as css's are brought online successfully, and updates
      err_destroy path so that either a css is fully online and destroyed by
      cgroup_destroy_locked() or the error path frees it.  This creates a
      duplicate css free logic in the error path but it will be cleaned up
      soon.
      
      v2: Li pointed out that cgroup_destroy_locked() would do NULL-deref if
          invoked with a cgroup which doesn't have all css's populated.
          Update cgroup_destroy_locked() so that it skips NULL css's.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Reported-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: stable@vger.kernel.org # v3.12+
  2. 06 Dec, 2013 12 commits
    •
      cgroup: unify pidlist and other file handling · 6612f05b
      Tejun Heo committed
      In preparation of conversion to kernfs, cgroup file handling is
      updated so that it can be easily mapped to kernfs.  With the previous
      changes, the differences between pidlist and other files are very
      small.  Both are served by seq_file in a pretty standard way with the
      only difference being !pidlist files use single_open().
      
      This patch adds cftype->seq_start(), ->seq_next and ->seq_stop() and
      implements the matching cgroup_seqfile_start/next/stop() which either
      emulates single_open() behavior or invokes cftype->seq_*() operations
      if specified.  This allows using single seq_operations for both
      pidlist and other files and makes cgroup_pidlist_operations and
      cgroup_pidlist_open() no longer necessary.  As cgroup_pidlist_open()
      was the only user of cftype->open(), the method is dropped together.
      
      This brings cftype file interface very close to kernfs interface and
      mapping shouldn't be too difficult.  Once converted to kernfs, most of
      the plumbing code including cgroup_seqfile_*() will be removed as
      kernfs provides those facilities.
      
      This patch does not introduce any behavior changes.
      
      v2: Refreshed on top of the updated "cgroup: introduce struct
          cgroup_pidlist_open_file".
      
      v3: Refreshed on top of the updated "cgroup: attach cgroup_open_file
          to all cgroup files".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
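The dispatch described in this commit (use the file's own seq operations if present, otherwise behave like a single-record file) can be modelled in userspace. Everything here is a hypothetical stand-in: `struct cft_ops` plays the role of cftype, `pos_t` replaces `loff_t`, and `count_records()` is only a test driver. The fallback branch mirrors the usual single-record convention of returning one non-NULL token at position 0 and NULL afterwards.

```c
#include <assert.h>
#include <stddef.h>

typedef long long pos_t;

struct cft_ops {		/* hypothetical stand-in for cftype */
	void *(*seq_start)(pos_t *pos);
	void *(*seq_next)(void *v, pos_t *pos);
	void (*seq_stop)(void *v);
};

static int single_token;	/* any non-NULL address serves as the one record */

static void *file_seq_start(const struct cft_ops *cft, pos_t *pos)
{
	if (cft->seq_start)
		return cft->seq_start(pos);
	/* emulate a single-record file: exactly one record, at position 0 */
	return *pos == 0 ? &single_token : NULL;
}

static void *file_seq_next(const struct cft_ops *cft, void *v, pos_t *pos)
{
	if (cft->seq_next)
		return cft->seq_next(v, pos);
	++*pos;
	return NULL;		/* nothing follows the single record */
}

static void file_seq_stop(const struct cft_ops *cft, void *v)
{
	if (cft->seq_stop)
		cft->seq_stop(v);
}

/* Example multi-record file: iterates over a fixed array of pids. */
static const int pids[] = { 11, 22, 33 };

static void *pids_start(pos_t *pos)
{
	return *pos < 3 ? (void *)&pids[*pos] : NULL;
}

static void *pids_next(void *v, pos_t *pos)
{
	(void)v;
	++*pos;
	return *pos < 3 ? (void *)&pids[*pos] : NULL;
}

/* Walk one full pass and count how many records would be shown. */
static int count_records(const struct cft_ops *cft)
{
	pos_t pos = 0;
	int n = 0;
	void *v = file_seq_start(cft, &pos);

	while (v) {
		n++;
		v = file_seq_next(cft, v, &pos);
	}
	file_seq_stop(cft, v);
	return n;
}
```

A `struct cft_ops` with all members NULL yields one record, while one wired to `pids_start()` and `pids_next()` yields three, both through the same three entry points.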
    •
      cgroup: replace cftype->read_seq_string() with cftype->seq_show() · 2da8ca82
      Tejun Heo committed
      In preparation of conversion to kernfs, cgroup file handling is
      updated so that it can be easily mapped to kernfs.  This patch
      replaces cftype->read_seq_string() with cftype->seq_show() which is
      not limited to single_open() operation and will map directly to
      kernfs seq_file interface.
      
      The conversions are mechanical.  As ->seq_show() doesn't have @css and
      @cft, the functions which make use of them are converted to use
      seq_css() and seq_cft() respectively.  On several occasions, e.g. if
      it has seq_string in its name, the function name is updated to fit the
      new method better.
      
      This patch does not introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Aristeu Rozanski <arozansk@redhat.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
    •
      cgroup: attach cgroup_open_file to all cgroup files · 7da11279
      Tejun Heo committed
      In preparation of conversion to kernfs, cgroup file handling is
      updated so that it can be easily mapped to kernfs.  This patch
      attaches cgroup_open_file, which used to be attached to pidlist files,
      to all cgroup files, introduces seq_css/cft() accessors to determine
      the cgroup_subsys_state and cftype associated with a given cgroup
      seq_file, and exports them as a public interface.
      
      This doesn't cause any behavior changes but unifies cgroup file
      handling across different file types and will help converting them to
      kernfs seq_show() interface.
      
      v2: Li pointed out that the original patch was using
          single_open_size() incorrectly assuming that the size param is
          private data size.  Fix it by allocating @of separately and
          passing it to single_open() and explicitly freeing it in the
          release path.  This isn't the prettiest but this path is gonna be
          restructured by the following patches pretty soon.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      cgroup: generalize cgroup_pidlist_open_file · 5d22444f
      Tejun Heo committed
      In preparation of conversion to kernfs, cgroup file handling is
      updated so that it can be easily mapped to kernfs.  This patch renames
      cgroup_pidlist_open_file to cgroup_open_file and updates it so that it
      only contains a field to identify the specific file, ->cfe, and an
      opaque ->priv pointer.  When cgroup is converted to kernfs, this will
      be replaced by kernfs_open_file which contains about the same
      information.
      
      As whether the file is "cgroup.procs" or "tasks" should now be
      determined from cgroup_open_file->cfe, the cftype->private for the two
      files now carry the file type and cgroup_pidlist_start() reads the
      type through cfe->type->private.  This makes the distinction between
      cgroup_tasks_open() and cgroup_procs_open() unnecessary.
      cgroup_pidlist_open() is now directly used as the open method.
      
      This patch doesn't make any behavior changes.
      
      v2: Refreshed on top of the updated "cgroup: introduce struct
          cgroup_pidlist_open_file".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      cgroup: unify read path so that seq_file is always used · 896f5199
      Tejun Heo committed
      With the recent removal of cftype->read() and ->read_map(), only three
      operations are remaining, ->read_u64(), ->read_s64() and
      ->read_seq_string().  Currently, the first two are handled directly
      while the last is handled through seq_file.
      
      It is trivial to serve the first two through the seq_file path too.
      This patch restructures read path so that all operations are served
      through cgroup_seqfile_show().  This makes all cgroup files seq_file -
      single_open/release() are now used by default,
      cgroup_seqfile_operations is dropped, and cgroup_file_operations uses
      seq_read() for read.
      
      This simplifies the code and makes the read path easy to convert to
      use kernfs.
      
      Note that, while cgroup_file_operations uses seq_read() for read, it
      still uses generic_file_llseek() for seeking instead of seq_lseek().
      This is different from cgroup_seqfile_operations but shouldn't break
      anything and brings the seeking behavior aligned with kernfs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      cgroup: unify cgroup_write_X64() and cgroup_write_string() · a742c59d
      Tejun Heo committed
      cgroup_write_X64() and cgroup_write_string() both implement about the
      same buffering logic.  Unify the two into cgroup_file_write() which
      always allocates dynamic buffer for simplicity and uses kstrto*()
      instead of simple_strto*().
      
      This patch doesn't make any visible behavior changes except for
      possibly different error value from kstrto*().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
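Why the error value can differ is easier to see with a strict parser next to a lenient one. The following is a userspace sketch, not the kernel's kstrto*() implementation: `parse_u64_strict()` is a hypothetical helper built on the standard `strtoull()`, and it is deliberately stricter than a bare `strtoull()` call by rejecting trailing garbage and reporting overflow, where a simple_strto*()-style parser would stop at the first non-digit and return whatever it had.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical userspace analogue of a strict kstrto*()-style parser. */
static int parse_u64_strict(const char *s, unsigned long long *res)
{
	char *end;
	unsigned long long val;

	/* strtoull() alone would accept leading spaces or a sign; this sketch does not */
	if (!isdigit((unsigned char)s[0]))
		return -EINVAL;

	errno = 0;
	val = strtoull(s, &end, 0);	/* base 0: decimal, octal or hex prefix */
	if (errno == ERANGE)
		return -ERANGE;		/* overflow is an error, not a clamp */
	if (*end == '\n')		/* tolerate one trailing newline, as from echo */
		end++;
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage is rejected */

	*res = val;
	return 0;
}
```

Under this sketch `"42\n"` and `"0x10"` parse successfully, `"12abc"` fails with `-EINVAL`, and a value too large for 64 bits fails with `-ERANGE`.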
    •
      cgroup: remove cftype->read(), ->read_map() and ->write() · 6e0755b0
      Tejun Heo committed
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      After recent updates, ->read() and ->read_map() don't have any user
      left and ->write() never had any user.  Remove them.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      hugetlb_cgroup: convert away from cftype->read() · 716f479d
      Tejun Heo committed
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      All users of cftype->read() can be easily served, usually better, by
      seq_file and other methods.  Update hugetlb_cgroup_read() to return
      u64 instead of printing itself and rename it to
      hugetlb_cgroup_read_u64().
      
      This patch doesn't make any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
    •
      netprio_cgroup: convert away from cftype->read_map() · e92e113c
      Tejun Heo committed
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      cftype->read_map() doesn't add any value and is being replaced with
      ->read_seq_string().  Update read_priomap() to use ->read_seq_string()
      instead.
      
      This patch doesn't make any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      memcg: convert away from cftype->read() and ->read_map() · 791badbd
      Tejun Heo committed
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      cftype->read_map() doesn't add any value and is being replaced with
      ->read_seq_string(), and all users of cftype->read() can be easily
      served, usually better, by seq_file and other methods.
      
      Update mem_cgroup_read() to return u64 instead of printing itself and
      rename it to mem_cgroup_read_u64(), and update
      mem_cgroup_oom_control_read() to use ->read_seq_string() instead of
      ->read_map().
      
      This patch doesn't make any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    •
      cpuset: convert away from cftype->read() · 51ffe411
      Tejun Heo committed
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      All users of cftype->read() can be easily served, usually better, by
      seq_file and other methods.  Rename cpuset_common_file_read() to
      cpuset_common_read_seq_string() and convert it to use
      read_seq_string() interface instead.  This not only simplifies the
      code but also makes it more versatile.  Before, the file couldn't
      output if the result is longer than PAGE_SIZE.  After the conversion,
      seq_file automatically grows the buffer until the output can fit.
      
      This patch doesn't make any visible behavior changes except for being
      able to handle output larger than PAGE_SIZE.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
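The "grow the buffer until the output fits" behaviour can be sketched in userspace as a retry loop. This is a simplified model, not seq_file itself: `struct seqbuf`, `seqbuf_puts()`, `render()` and `show_cpus()` are hypothetical names, and the buffer is doubled on each retry as an assumed growth policy. The show callback never needs to know the final size; it is simply re-run against a larger buffer after an overflow.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct seqbuf {
	char *buf;
	size_t size, count;
	int overflow;
};

/* Append text; on overflow only set a flag, because the caller will retry. */
static void seqbuf_puts(struct seqbuf *m, const char *s)
{
	size_t len = strlen(s);

	if (m->overflow || m->count + len >= m->size) {
		m->overflow = 1;
		return;
	}
	memcpy(m->buf + m->count, s, len + 1);
	m->count += len;
}

typedef void (*show_fn)(struct seqbuf *m, void *arg);

/* Run show() into a buffer, doubling it until the whole output fits. */
static char *render(show_fn show, void *arg, size_t initial_size)
{
	struct seqbuf m = { NULL, initial_size, 0, 0 };

	assert(initial_size > 0);
	for (;;) {
		m.buf = malloc(m.size);
		if (!m.buf)
			return NULL;
		m.buf[0] = '\0';
		m.count = 0;
		m.overflow = 0;
		show(&m, arg);
		if (!m.overflow)
			return m.buf;	/* caller frees */
		free(m.buf);
		m.size *= 2;		/* retry with a larger buffer */
	}
}

/* Example show callback: prints "0,1,...,n-1," for the given n. */
static void show_cpus(struct seqbuf *m, void *arg)
{
	char tmp[32];

	for (int i = 0; i < *(int *)arg; i++) {
		snprintf(tmp, sizeof(tmp), "%d,", i);
		seqbuf_puts(m, tmp);
	}
}
```

Calling `render(show_cpus, &n, 8)` with `n = 100` starts from an 8-byte buffer and still returns the complete string, which is the property a fixed PAGE_SIZE buffer could not offer.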
    •
      cgroup, sched: convert away from cftype->read_map() · 44ffc75b
      Tejun Heo committed
      In preparation of conversion to kernfs, cgroup file handling is being
      consolidated so that it can be easily mapped to the seq_file based
      interface of kernfs.
      
      cftype->read_map() doesn't add any value and is being replaced with
      ->read_seq_string().  Update cpu_stats_show() and cpuacct_stats_show()
      accordingly.
      
      This patch doesn't make any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
  3. 29 Nov, 2013 9 commits
    •
      cgroup: don't guarantee cgroup.procs is sorted if sane_behavior · afb2bc14
      Tejun Heo committed
      For some reason, tasks and cgroup.procs guarantee that the result is
      sorted.  This is the only reason this whole pidlist logic is necessary
      instead of just iterating through sorted member tasks.  We can't do
      anything about the existing interface but at least ensure that such
      expectation doesn't exist for the new interface so that pidlist logic
      may be removed in the distant future.
      
      This patch scrambles the sort order if sane_behavior so that the
      output is usually not sorted in the new interface.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
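One way to scramble a sort order while still producing a deterministic, duplicate-friendly ordering is to sort on a one-to-one transformation of the key. The sketch below swaps every pair of adjacent bits before comparing. The commit message above does not spell out the method, so treat the bit swap and the names `pid_fry()`, `cmp_fried()` and `sort_scrambled()` as illustrative assumptions rather than a description of the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Swap every pair of adjacent bits: a one-to-one, non-identity mapping. */
static unsigned int pid_fry(unsigned int pid)
{
	unsigned int a = pid & 0x55555555u;	/* even-numbered bits */
	unsigned int b = pid & 0xAAAAAAAAu;	/* odd-numbered bits */

	return (a << 1) | (b >> 1);
}

/* qsort() comparator that orders pids by their scrambled value. */
static int cmp_fried(const void *x, const void *y)
{
	unsigned int a = pid_fry(*(const unsigned int *)x);
	unsigned int b = pid_fry(*(const unsigned int *)y);

	return (a > b) - (a < b);
}

static void sort_scrambled(unsigned int *pids, size_t n)
{
	qsort(pids, n, sizeof(*pids), cmp_fried);
}
```

Because the mapping is one-to-one, equal pids still end up adjacent, so duplicate removal after sorting keeps working; only the numeric ordering that readers might have relied on is lost. For example, the input `{1, 2, 3, 4}` comes out as `{2, 1, 3, 4}`.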
    •
      cgroup: remove cgroup_pidlist->use_count · 04502365
      Tejun Heo committed
      After the recent changes, pidlist ref is held only between
      cgroup_pidlist_start() and cgroup_pidlist_stop() during which
      cgroup->pidlist_mutex is also held.  IOW, the reference count is
      redundant now.  While in use, it's always one and pidlist_mutex is
      held - holding the mutex has exactly the same protection.
      
      This patch collapses destroy_dwork queueing into cgroup_pidlist_stop()
      so that pidlist_mutex is not released in between and drops
      pidlist->use_count.
      
      This patch shouldn't introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      cgroup: load and release pidlists from seq_file start and stop respectively · 4bac00d1
      Tejun Heo committed
      Currently, pidlists are reference counted from file open and release
      methods.  This means that holding onto an open file may waste memory
      and reads may return data which is very stale.  Both aren't critical
      because pidlists are keyed and shared per namespace and, well, the
      user isn't supposed to have large delay between open and reads.
      
      cgroup is planned to be converted to use kernfs and it'd be best if we
      can stick to just the seq_file operations - start, next, stop and
      show.  This can be achieved by loading pidlist on demand from start
      and release with time delay from stop, so that consecutive reads don't
      end up reloading the pidlist on each iteration.  This would remove the
      need for hooking into open and release while also avoiding issues with
      holding onto pidlist for too long.
      
      The previous patches implemented delayed release and restructured
      pidlist handling so that pidlists can be loaded and released from
      seq_file start / stop.  This patch actually moves pidlist load to
      start and release to stop.
      
      This means that pidlist is pinned only between start and stop and may
      go away between two consecutive read calls if the two calls are apart
      by more than CGROUP_PIDLIST_DESTROY_DELAY.  cgroup_pidlist_start()
      thus can't re-use the stored cgroup_pidlist_open_file->pidlist
      directly.  During start, it's only used as a hint indicating whether
      this is the first start after open or not and pidlist is always looked
      up or created.
      
      pidlist_mutex locking and reference counting are moved out of
      pidlist_array_load() so that pidlist_array_load() can perform lookup
      and creation atomically.  While this enlarges the area covered by
      pidlist_mutex, given how the lock is used, it's highly unlikely to be
      noticeable.
      
      v2: Refreshed on top of the updated "cgroup: introduce struct
          cgroup_pidlist_open_file".
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      cgroup: remove cgroup_pidlist->rwsem · 069df3b7
      Tejun Heo committed
      cgroup_pidlist locking is needlessly complicated.  It has outer
      cgroup->pidlist_mutex to protect the list of pidlists associated with
      a cgroup and then each pidlist has rwsem to synchronize updates and
      reads.  Given that the only read access is from seq_file operations
      which are always invoked back-to-back, the rwsem is a giant overkill.
      All it does is add unnecessary complexity.
      
      This patch removes cgroup_pidlist->rwsem and protects all accesses to
      pidlists belonging to a cgroup with cgroup->pidlist_mutex.
      pidlist->rwsem locking is removed if it's nested inside
      cgroup->pidlist_mutex; otherwise, it's replaced with
      cgroup->pidlist_mutex locking.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      cgroup: refactor cgroup_pidlist_find() · e6b81710
      Tejun Heo committed
      Rename cgroup_pidlist_find() to cgroup_pidlist_find_create() and
      separate out finding proper to cgroup_pidlist_find().  Also, move
      locking to the caller.
      
      This patch is preparation for pidlist restructure and doesn't
      introduce any behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      cgroup: introduce struct cgroup_pidlist_open_file · 62236858
      Tejun Heo committed
      For pidlist files, seq_file->private pointed to the loaded
      cgroup_pidlist; however, pidlist loading is planned to be moved to
      cgroup_pidlist_start() for kernfs conversion and seq_file->private
      needs to carry more information from open to allow that.
      
      This patch introduces struct cgroup_pidlist_open_file which contains
      type, cgrp and pidlist and updates pidlist seq_file->private to point
      to it using seq_open_private() and seq_release_private().  Note that
      this eventually will be replaced by kernfs_open_file.
      
      While this patch makes more information available to seq_file
      operations, they don't use it yet and this patch doesn't introduce any
      behavior changes except for allocation of the extra private struct.
      
      v2: use __seq_open_private() instead of seq_open_private() for brevity
          as suggested by Li.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      cgroup: implement delayed destruction for cgroup_pidlist · b1a21367
      Tejun Heo committed
      Currently, pidlists are reference counted from file open and release
      methods.  This means that holding onto an open file may waste memory
      and reads may return data which is very stale.  Both aren't critical
      because pidlists are keyed and shared per namespace and, well, the
      user isn't supposed to have large delay between open and reads.
      
      cgroup is planned to be converted to use kernfs and it'd be best if we
      can stick to just the seq_file operations - start, next, stop and
      show.  This can be achieved by loading pidlist on demand from start
      and release with time delay from stop, so that consecutive reads don't
      end up reloading the pidlist on each iteration.  This would remove the
      need for hooking into open and release while also avoiding issues with
      holding onto pidlist for too long.
      
      This patch implements delayed release of pidlist.  As pidlists could
      be lingering on cgroup removal waiting for the timer to expire, cgroup
      free path needs to queue the destruction work item immediately and
      flush.  As those work items are self-destroying, each work item can't
      be flushed directly.  A new workqueue - cgroup_pidlist_destroy_wq - is
      added to serve as flush domain.
      
      Note that this patch just adds delayed release on top of the current
      implementation and doesn't change where pidlist is loaded and
      released.  Following patches will make those changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      cgroup: remove cftype->release() · b9f3ceca
      Tejun Heo committed
      Now that pidlist files don't use cftype->release(), it doesn't have
      any user left.  Remove it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    •
      cgroup: don't skip seq_open on write only opens on pidlist files · ac1e69aa
      Tejun Heo committed
      Currently, cgroup_pidlist_open() skips seq_open() and pidlist loading
      if the file is opened write-only, which is a sensible optimization as
      pidlist loading can be costly and there often are occasions where
      tasks or cgroup.procs is opened write-only.  However, pidlist init and
      release are planned to be moved to cgroup_pidlist_start/stop()
      respectively which would make this optimization unnecessary.
      
      This patch removes the optimization and always fully initializes
      pidlist files regardless of open mode.  This will help moving pidlist
      handling to start/stop by unifying rw paths and removes the need for
      specifying cftype->release() in addition to .release in
      cgroup_pidlist_operations as file->f_op is now always overridden.  As
      pidlist files were the only user of cftype->release(), the next patch
      will remove the method.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  4. 28 Nov, 2013 3 commits
    •
      cgroup: Merge branch 'for-3.13-fixes' into for-3.14 · c729b11e
      Tejun Heo committed
      Pull to receive e605b365 ("cgroup: fix cgroup_subsys_state leak
      for seq_files") as for-3.14 is scheduled to have a lot of changes
      which depend on it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    •
      cgroup: fix cgroup_subsys_state leak for seq_files · e605b365
      Tejun Heo committed
      If a cgroup file implements either read_map() or read_seq_string(),
      such file is served using seq_file by overriding file->f_op to
      cgroup_seqfile_operations, which also overrides the release method to
      single_release() from cgroup_file_release().
      
      Because cgroup_file_open() didn't use to acquire any resources, this
      used to be fine, but since f7d58818 ("cgroup: pin
      cgroup_subsys_state when opening a cgroupfs file"), cgroup_file_open()
      pins the css (cgroup_subsys_state) which is put by
      cgroup_file_release().  The patch forgot to update the release path
      for seq_files and each open/release cycle leaks a css reference.
      
      Fix it by updating cgroup_file_release() to also handle seq_files and
      using it for seq_file release path too.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v3.12
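The shape of this leak and its fix can be shown with a tiny reference-counting model. The types and functions below (`struct css`, `struct open_file`, `file_open()`, `file_release()`) are hypothetical userspace stand-ins, not the kernel structures. The idea is that a reference taken in the common open path must be dropped by every release path, which is simplest when all file types share one release function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct css { int refcnt; };

struct open_file {
	struct css *css;
	bool is_seq_file;
	bool seq_state_allocated;	/* models per-open seq_file state */
};

static void file_open(struct open_file *f, struct css *css, bool is_seq_file)
{
	css->refcnt++;			/* pin the css while the file is open */
	f->css = css;
	f->is_seq_file = is_seq_file;
	f->seq_state_allocated = is_seq_file;
}

/* Single release path used by both plain and seq_file-backed files. */
static void file_release(struct open_file *f)
{
	if (f->is_seq_file)
		f->seq_state_allocated = false;	/* seq_file-specific teardown */
	f->css->refcnt--;		/* always drop the pin taken at open */
	f->css = NULL;
}
```

In the buggy arrangement the seq_file branch would have had its own release function that performed only the teardown line, leaving `refcnt` one higher after every open/release cycle.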
    •
      cpuset: Fix memory allocator deadlock · 0fc0287c
      Peter Zijlstra committed
      Juri hit the below lockdep report:
      
      [    4.303391] ======================================================
      [    4.303392] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
      [    4.303394] 3.12.0-dl-peterz+ #144 Not tainted
      [    4.303395] ------------------------------------------------------
      [    4.303397] kworker/u4:3/689 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      [    4.303399]  (&p->mems_allowed_seq){+.+...}, at: [<ffffffff8114e63c>] new_slab+0x6c/0x290
      [    4.303417]
      [    4.303417] and this task is already holding:
      [    4.303418]  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff812d2dfb>] blk_execute_rq_nowait+0x5b/0x100
      [    4.303431] which would create a new lock dependency:
      [    4.303432]  (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...}
      [    4.303436]
      
      [    4.303898] the dependencies between the lock to be acquired and SOFTIRQ-irq-unsafe lock:
      [    4.303918] -> (&p->mems_allowed_seq){+.+...} ops: 2762 {
      [    4.303922]    HARDIRQ-ON-W at:
      [    4.303923]                     [<ffffffff8108ab9a>] __lock_acquire+0x65a/0x1ff0
      [    4.303926]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303929]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303931]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303933]    SOFTIRQ-ON-W at:
      [    4.303933]                     [<ffffffff8108abcc>] __lock_acquire+0x68c/0x1ff0
      [    4.303935]                     [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303940]                     [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303955]                     [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303959]    INITIAL USE at:
      [    4.303960]                    [<ffffffff8108a884>] __lock_acquire+0x344/0x1ff0
      [    4.303963]                    [<ffffffff8108cbe3>] lock_acquire+0x93/0x140
      [    4.303966]                    [<ffffffff81063dd6>] kthreadd+0x86/0x180
      [    4.303969]                    [<ffffffff816ded6c>] ret_from_fork+0x7c/0xb0
      [    4.303972]  }
      
      Which reports that we take mems_allowed_seq with interrupts enabled. A
      little digging found that this can only be from
      cpuset_change_task_nodemask().
      
      This is an actual deadlock because an interrupt doing an allocation will
      hit get_mems_allowed()->...->__read_seqcount_begin(), which will spin
      forever waiting for the write side to complete.
      
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Reported-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Tested-by: Juri Lelli <juri.lelli@gmail.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org
      0fc0287c
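The deadlock in the entry above comes down to how a sequence counter behaves when a reader interrupts a writer on the same CPU. The following userspace sketch models only that failure. The `seqcount_model` struct and `model_*` helpers are hypothetical, and the bounded spin stands in for a reader that would in reality never give up. The message above does not spell out the chosen fix; keeping interrupts disabled across the write section would be one way to prevent the reader from ever seeing the odd count.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a seqcount: odd means a write is in progress. */
struct seqcount_model {
	unsigned int sequence;
};

static void model_write_begin(struct seqcount_model *s)
{
	s->sequence++;		/* becomes odd */
}

static void model_write_end(struct seqcount_model *s)
{
	s->sequence++;		/* becomes even again */
}

/* A reader may start only while no write is in progress. */
static bool model_read_can_begin(const struct seqcount_model *s)
{
	return (s->sequence & 1) == 0;
}

/*
 * Models a reader running in an interrupt on the writer's CPU.  The writer
 * cannot run until the interrupt returns, so the sequence never changes
 * while we spin.  Returns true if the reader got through within max_spins.
 */
static bool model_interrupt_reader(const struct seqcount_model *s,
				   int max_spins)
{
	for (int i = 0; i < max_spins; i++)
		if (model_read_can_begin(s))
			return true;
	return false;		/* a real reader would keep spinning */
}
```

An interrupt arriving between `model_write_begin()` and `model_write_end()` never gets past `model_interrupt_reader()`, while one arriving after the write section completes passes immediately.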
  5. 23 Nov, 2013 9 commits
    • T
      cgroup: Merge branch 'memcg_event' into for-3.14 · edab9510
      Tejun Heo committed
      Merge v3.12 based patch series to move cgroup_event implementation to
      memcg into for-3.14.  The following two commits cause a conflict in
      kernel/cgroup.c:
      
        2ff2a7d0 ("cgroup: kill css_id")
        79bd9814 ("cgroup, memcg: move cgroup_event implementation to memcg")
      
      Each patch removes a struct definition from kernel/cgroup.c.  As the
      two are adjacent, they cause a context conflict.  Easily resolved by
      removing both structs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      edab9510
    • T
      cgroup: unexport cgroup_css() and remove __file_cft() · b36824c7
      Tejun Heo committed
      Now that cgroup_event is memcg specific, the temporarily exported
      functions are no longer necessary.  Unexport cgroup_css() and remove
      __file_cft(), which no longer has any users.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      b36824c7
    • T
      memcg: rename cgroup_event to mem_cgroup_event · 3bc942f3
      Tejun Heo committed
      cgroup_event is only available in memcg now.  Let's brand it that way.
      While at it, add a comment encouraging deprecation of the feature and
      remove the respective section from cgroup documentation.
      
      This patch is cosmetic.
      
      v3: Typo update as per Li Zefan.
      
      v2: Index in cgroups.txt updated accordingly as suggested by Li Zefan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      3bc942f3
    • T
      memcg: make cgroup_event deal with mem_cgroup instead of cgroup_subsys_state · 59b6f873
      Tejun Heo committed
      cgroup_event is now memcg specific.  Replace cgroup_event->css with
      ->memcg and convert [un]register_event() callbacks to take a
      mem_cgroup pointer instead of a cgroup_subsys_state one.  This
      simplifies the code slightly and makes css_to_vmpressure()
      unnecessary, so it is removed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      59b6f873
    • T
      memcg: remove cgroup_event->cft · 347c4a87
      Tejun Heo committed
      The only use of cgroup_event->cft is distinguishing "usage_in_bytes"
      and "memsw.usage_in_bytes" for mem_cgroup_usage_[un]register_event(),
      which can be done by adding an explicit argument to the function and
      implementing two wrappers so that the two cases can be distinguished
      from the function alone.
      
      Remove cgroup_event->cft and the related code including
      [un]register_events() methods.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      347c4a87
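The refactoring pattern named in the entry above, an explicit argument plus two thin wrappers, can be sketched in userspace as follows. The enum, struct and function names here are hypothetical and chosen for illustration only; they are not the identifiers used in mm/memcontrol.c.

```c
#include <assert.h>

/* Hypothetical selector replacing the lookup that used to go through ->cft. */
enum usage_type_model {
	MODEL_MEM,
	MODEL_MEMSWAP,
};

/* Hypothetical stand-in for the per-memcg event bookkeeping. */
struct memcg_model {
	int mem_events;
	int memsw_events;
};

/* Common implementation: the case is chosen by an explicit argument. */
static int model_usage_register_event(struct memcg_model *memcg,
				      enum usage_type_model type)
{
	switch (type) {
	case MODEL_MEM:
		memcg->mem_events++;
		return 0;
	case MODEL_MEMSWAP:
		memcg->memsw_events++;
		return 0;
	}
	return -1;	/* unknown type */
}

/* Wrapper for the "usage_in_bytes" case. */
static int model_mem_usage_register_event(struct memcg_model *memcg)
{
	return model_usage_register_event(memcg, MODEL_MEM);
}

/* Wrapper for the "memsw.usage_in_bytes" case. */
static int model_memsw_usage_register_event(struct memcg_model *memcg)
{
	return model_usage_register_event(memcg, MODEL_MEMSWAP);
}
```

Because each wrapper fixes the type at compile time, the callback alone identifies which file it serves and no per-event cftype pointer is needed.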
    • T
      cgroup, memcg: move cgroup->event_list[_lock] and event callbacks into memcg · fba94807
      Tejun Heo committed
      cgroup_event is being moved from cgroup core to memcg and the
      implementation is already moved by the previous patch.  This patch
      moves the data fields and callbacks.
      
      * cgroup->event_list[_lock] are moved to mem_cgroup.
      
      * cftype->[un]register_event() are moved to cgroup_event.  This makes
        it impossible for individual cftype definitions to specify their
        event callbacks.  This is worked around by simply hard-coding
        filename to event callback mapping in cgroup_write_event_control().
        This is awkward and inflexible, which is actually desirable given
        that we don't want to grow more usages of this feature.
      
      * The eventfd_ctx declaration is removed from cgroup.h, which leaves
        vmpressure.h without an eventfd_ctx declaration.  Include eventfd.h
        from vmpressure.h.
      
      v2: Use file name from dentry instead of cftype.  This will allow
          removing all cftype handling in the function.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      fba94807
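The hard-coded filename to event callback mapping mentioned in the entry above could look roughly like the sketch below. It is a userspace illustration under stated assumptions: the `model_*` names, the callback type and the particular file names in the table are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type; the real callbacks take more arguments. */
typedef int (*register_event_fn)(void);

/* Hypothetical callbacks standing in for the per-file event handlers. */
static int model_usage_register(void)
{
	return 1;
}

static int model_oom_register(void)
{
	return 2;
}

/*
 * Maps a file name to its event callback with plain string comparisons.
 * Adding a new event source means editing this function, which is the
 * deliberate inflexibility the commit message describes.
 */
static register_event_fn model_lookup_register_event(const char *name)
{
	if (!strcmp(name, "memory.usage_in_bytes"))
		return model_usage_register;
	if (!strcmp(name, "memory.oom_control"))
		return model_oom_register;
	return NULL;	/* file does not support events */
}
```

A lookup for a name outside the table returns `NULL`, which a caller would turn into an error.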
    • T
      memcg: cgroup_write_event_control() now knows @css is for memcg · b5557c4c
      Tejun Heo committed
      @css for cgroup_write_event_control() is now always for memcg and the
      target file should be a memcg file too.  Drop code which assumes @css
      is dummy_css and the target file may belong to different subsystems.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      b5557c4c
    • T
      cgroup, memcg: move cgroup_event implementation to memcg · 79bd9814
      Tejun Heo committed
      cgroup_event is way over-designed and tries to build a generic
      flexible event mechanism into cgroup - fully customizable event
      specification for each user of the interface.  This is utterly
      unnecessary and overboard, especially in light of the planned unified
      hierarchy, as there's gonna be a single agent.  Simply generating
      events at fixed points or, if that's too restrictive, at a configurable
      cadence or a single set of configurable points should be enough.
      
      Thankfully, memcg is the only user and gets to keep it.  Replacing it
      with something simpler on sane_behavior is strongly recommended.
      
      This patch moves cgroup_event and "cgroup.event_control"
      implementation to mm/memcontrol.c.  Clearing of events on cgroup
      destruction is moved from cgroup_destroy_locked() to
      mem_cgroup_css_offline(), which shouldn't make any noticeable
      difference.
      
      cgroup_css() and __file_cft() are exported to enable the move;
      however, this will soon be reverted once the event code is updated to
      be memcg specific.
      
      Note that "cgroup.event_control" will now exist only on the hierarchy
      with memcg attached to it.  While this change is visible to userland,
      it is unlikely to be noticeable as the file has never been meaningful
      outside memcg.
      
      Aside from the above change, this is pure code relocation.
      
      v2: Per Li Zefan's comments, init/Kconfig updated accordingly and
          poll.h inclusion moved from cgroup.c to memcontrol.c.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      79bd9814
    • T
      cgroup: use a dedicated workqueue for cgroup destruction · e5fca243
      Tejun Heo committed
      Since be445626 ("cgroup: remove synchronize_rcu() from
      cgroup_diput()"), the cgroup destruction path makes use of a
      workqueue.  css freeing is performed from a work item from that point
      on and a later commit, ea15f8cc ("cgroup: split cgroup destruction
      into two steps"), moves css offlining to a workqueue too.
      
      As cgroup destruction isn't depended upon for memory reclaim, the
      destruction work items were put on the system_wq; unfortunately, some
      controllers may block in the destruction path for a considerable
      duration while holding cgroup_mutex.  As a large part of the
      destruction path is synchronized through cgroup_mutex, when combined
      with a high rate of cgroup removals, this has the potential to fill up
      system_wq's max_active of 256.
      
      Also, it turns out that memcg's css destruction path ends up queueing
      and waiting for work items on system_wq through work_on_cpu().  If
      such an operation happens while system_wq is fully occupied by cgroup
      destruction work items, work_on_cpu() can't make forward progress
      because system_wq is full, and the other destruction work items on
      system_wq can't make forward progress because the work item waiting
      for work_on_cpu() is holding cgroup_mutex, leading to deadlock.
      
      This can be fixed by queueing destruction work items on a separate
      workqueue.  This patch creates a dedicated workqueue -
      cgroup_destroy_wq - for this purpose.  As these work items shouldn't
      have inter-dependencies and are mostly serialized by cgroup_mutex
      anyway, giving a high concurrency level doesn't buy anything and the
      workqueue's @max_active is set to 1 so that destruction work items are
      executed one by one on each CPU.
      
      Hugh Dickins: Because cgroup_init() is run before init_workqueues(),
      cgroup_destroy_wq can't be allocated from cgroup_init().  Do it from a
      separate core_initcall().  In the future, we probably want to reorder
      so that workqueue init happens before cgroup_init().
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Hugh Dickins <hughd@google.com>
      Reported-by: Shawn Bohrer <shawn.bohrer@gmail.com>
      Link: http://lkml.kernel.org/r/20131111220626.GA7509@sbohrermbp13-local.rgmadvisors.com
      Link: http://lkml.kernel.org/g/alpine.LNX.2.00.1310301606080.2333@eggly.anvils
      Cc: stable@vger.kernel.org # v3.9+
      e5fca243
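The deadlock in the entry above, and why a separate workqueue avoids it, can be modelled in userspace with a simple slot counter. This is a deliberately simplified sketch: `wq_model` and the `model_*` functions are hypothetical, per-CPU behaviour is ignored, and a work item blocked on cgroup_mutex is represented only by the slot it keeps occupied.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a workqueue with a concurrency limit. */
struct wq_model {
	int max_active;
	int active;
};

/* Starts a work item if a slot is free; returns false when the queue is full. */
static bool wq_model_try_start(struct wq_model *wq)
{
	if (wq->active >= wq->max_active)
		return false;
	wq->active++;
	return true;
}

static void wq_model_finish(struct wq_model *wq)
{
	wq->active--;
}

/*
 * Queues `pending` destruction items on destroy_wq.  Every item that starts
 * keeps its slot: one holds the mutex, the rest block on it.  The mutex
 * holder then needs a helper item to run on helper_wq (this models the
 * work_on_cpu() call).  Returns true if that helper can start.
 */
static bool model_destruction_makes_progress(struct wq_model *destroy_wq,
					     struct wq_model *helper_wq,
					     int pending)
{
	int started = 0;
	bool helper_ran;

	for (int i = 0; i < pending; i++)
		if (wq_model_try_start(destroy_wq))
			started++;

	helper_ran = wq_model_try_start(helper_wq);
	if (helper_ran)
		wq_model_finish(helper_wq);

	/* Reset the model so it can be reused. */
	while (started-- > 0)
		wq_model_finish(destroy_wq);

	return helper_ran;
}
```

When both roles share one queue with `max_active` of 256 and 300 removals are pending, the helper cannot start. With destruction items on their own queue with `max_active` of 1, the shared queue stays free and the helper runs.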