1. 09 Aug 2013 (4 commits)
    • cgroup: add css_parent() · 63876986
      Tejun Heo committed
      Currently, controllers have to explicitly follow the cgroup hierarchy
      to find the parent of a given css.  cgroup is moving towards using
      cgroup_subsys_state as the main controller interface construct, so
      let's provide a way to climb the hierarchy using just csses.
      
      This patch implements css_parent() which, given a css, returns its
      parent.  The function is guaranteed to return a valid non-NULL parent
      css as long as the target css is not at the top of the hierarchy.
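
      A minimal sketch of what such a helper could look like, based only on the
      description above (the field names css->cgroup, cgroup->parent and
      css->ss->subsys_id are assumptions about the cgroup internals of that
      era, not taken verbatim from the patch):

        static inline struct cgroup_subsys_state *
        css_parent(struct cgroup_subsys_state *css)
        {
                struct cgroup *parent_cgrp = css->cgroup->parent;

                /* NULL only when @css sits at the top of the hierarchy */
                return parent_cgrp ? parent_cgrp->subsys[css->ss->subsys_id]
                                   : NULL;
        }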
      
      freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
      are converted to use css_parent() instead of accessing cgroup->parent
      directly.
      
      * __parent_ca() is dropped from cpuacct and its usage is replaced with
        parent_ca().  The only difference between the two was NULL test on
        cgroup->parent which is now embedded in css_parent() making the
        distinction moot.  Note that eventually a css->parent field will be
        added to css and the NULL check in css_parent() will go away.
      
      This patch shouldn't cause any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      63876986
    • cgroup: add/update accessors which obtain subsys specific data from css · a7c6d554
      Tejun Heo committed
      css (cgroup_subsys_state) is usually embedded in a subsys specific
      data structure.  Subsystems either use container_of() directly to cast
      from css to such data structure or has an accessor function wrapping
      such cast.  As cgroup as whole is moving towards using css as the main
      interface handle, add and update such accessors to ease dealing with
      css's.
      
      All accessors explicitly handle NULL input and return NULL in those
      cases.  While this looks like an extra branch in the code, as all
      controller-specific data structures have css as the first field, the
      casting doesn't involve any offsetting and the compiler can trivially
      optimize out the branch.
      
      * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
        accessor.  Added.
      
      * memory, hugetlb and devices already had one but didn't explicitly
        handle NULL input.  Updated.
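
      As an illustration, such an accessor typically takes the following shape
      (the freezer names below are illustrative; each controller adds its own
      variant):

        static inline struct freezer *css_freezer(struct cgroup_subsys_state *css)
        {
                /* css is the first field of struct freezer, so the cast is
                 * offset-free and the compiler can fold the NULL branch away */
                return css ? container_of(css, struct freezer, css) : NULL;
        }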
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      a7c6d554
    • cpuset: drop "const" qualifiers from struct cpuset instances · c9710d80
      Tejun Heo committed
      cpuset uses "const" qualifiers on struct cpuset in some functions;
      however, this doesn't work well when a value derived from a returned const
      pointer has to be passed to an accessor.  It's C after all.
      
      Drop the "const" qualifiers except for the trivially leaf ones.  This
      patch doesn't make any functional changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      c9710d80
    • cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/ · 8af01f56
      Tejun Heo committed
      The names of the two struct cgroup_subsys_state accessors -
      cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
      The former clashes with the type name and the latter doesn't even
      indicate it's somehow related to cgroup.
      
      We're about to revamp large portion of cgroup API, so, let's rename
      them so that they're less awkward.  Most per-controller usages of the
      accessors are localized in accessor wrappers and given the amount of
      scheduled changes, this isn't going to add any noticeable headache.
      
      Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
      to task_css().  This patch is pure rename.
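
      For callers the change is a straight substitution, roughly like this (the
      subsys-id argument shown here is an assumption about the accessor
      signatures of the time):

        /* before */
        css = cgroup_subsys_state(cgrp, cpuset_subsys_id);
        css = task_subsys_state(task, cpuset_subsys_id);

        /* after */
        css = cgroup_css(cgrp, cpuset_subsys_id);
        css = task_css(task, cpuset_subsys_id);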
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      8af01f56
  2. 31 Jul 2013 (1 commit)
  3. 30 Jul 2013 (2 commits)
  4. 19 Jun 2013 (1 commit)
  5. 14 Jun 2013 (6 commits)
    • cpuset: rename @cont to @cgrp · c9e5fe66
      Li Zefan committed
      @cont is short for container.  The control group facility was originally
      named "process container", but it was renamed because "container" already
      has a meaning in the Linux kernel.
      
      Clean up the leftover variable name @cont.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      c9e5fe66
    • cpuset: fix to migrate mm correctly in a corner case · f047cecf
      Li Zefan committed
      Before moving tasks out of empty cpusets, update_tasks_nodemask()
      is called, which calls do_migrate_pages(xx, from, to). Then those
      tasks are moved to an ancestor, and do_migrate_pages() is called
      again.
      
      The first time: from = node_to_be_offlined, to = empty.
      The second time: from = empty, to = ancestor's nodemask.
      
      So it looks like no pages will be migrated.
      
      Fix this by:
      
      - Don't call update_tasks_nodemask() on empty cpusets.
      - Pass cs->old_mems_allowed to do_migrate_pages().
      
      v4: added comment in cpuset_hotplug_update_tasks() and rephrased comment
          in cpuset_attach().
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      f047cecf
    • cpuset: allow to move tasks to empty cpusets · 88fa523b
      Li Zefan committed
      Currently some cpuset behaviors are not friendly when cpuset is co-mounted
      with other cgroup controllers.
      
      Now, with this patchset, if cpuset is mounted with the sane_behavior
      option, it behaves differently:
      
      - Tasks will be kept in empty cpusets when hotplug happens and take
        masks of ancestors with non-empty cpus/mems, instead of being moved to
        an ancestor.
      
      - A task can be moved into an empty cpuset, and again it takes masks of
        ancestors, so the user can drop a task into a newly created cgroup without
        having to do anything for it.
      
      As tasks can reside in empty cpusets, here are some rules:
      
      - They can be moved to another cpuset, regardless of whether it's empty
        or not.
      
      - Though they take masks from ancestors, they take other configs from
        the empty cpuset itself.
      
      - If the ancestors' masks are changed, those tasks will also be updated
        to take new masks.
      
      v2: add documentation in include/linux/cgroup.h
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      88fa523b
    • cpuset: allow to keep tasks in empty cpusets · 5c5cc623
      Li Zefan committed
      To achieve this:
      
      - We call update_tasks_cpumask/nodemask() for empty cpusets when
      hotplug happens, instead of moving tasks out of them.
      
      - When a cpuset's masks are changed by writing cpuset.cpus/mems,
      we also update tasks in child cpusets which are empty.
      
      v3:
      - do propagation work in one place for both hotplug and unplug
      
      v2:
      - drop rcu_read_lock before calling update_task_nodemask() and
        update_task_cpumask(), instead of using workqueue.
      - add documentation in include/linux/cgroup.h
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      5c5cc623
    • cpuset: introduce effective_{cpumask|nodemask}_cpuset() · 070b57fc
      Li Zefan committed
      effective_cpumask_cpuset() returns an ancestor cpuset which has
      a non-empty cpumask.
      
      If a cpuset is empty and the tasks in it need to update their
      cpus_allowed, they take on the ancestor cpuset's cpumask.
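
      A sketch of how such a helper can be written, assuming a parent_cs()
      accessor for the parent cpuset (only effective_cpumask_cpuset() itself is
      named by this commit):

        static struct cpuset *effective_cpumask_cpuset(struct cpuset *cs)
        {
                /* walk up until an ancestor with a non-empty cpumask is found;
                 * the top cpuset is never empty, so the loop terminates */
                while (cpumask_empty(cs->cpus_allowed))
                        cs = parent_cs(cs);
                return cs;
        }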
      
      This currently won't change any behavior, but it will later allow us
      to keep tasks in empty cpusets.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      070b57fc
    • cpuset: record old_mems_allowed in struct cpuset · 33ad801d
      Li Zefan committed
      When we update a cpuset's mems_allowed and thus update tasks'
      mems_allowed, it's required to pass the old mems_allowed and new
      mems_allowed to cpuset_migrate_mm().
      
      Currently we save old mems_allowed in a temp local variable before
      changing cpuset->mems_allowed. This patch changes it by saving
      old mems_allowed in cpuset->old_mems_allowed.
      
      This currently won't change any behavior, but it will later allow
      us to keep tasks in empty cpusets.
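
      The call site then becomes roughly the following sketch (only
      cpuset_migrate_mm() and old_mems_allowed are named by the commit text;
      the exact argument spellings are assumptions):

        /* migrate the mm from the previously allowed nodes to the new ones */
        cpuset_migrate_mm(mm, &cs->old_mems_allowed, &cs->mems_allowed);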
      
      v3: restored "cpuset_attach_nodemask_to = cs->mems_allowed"
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      33ad801d
  6. 09 Jun 2013 (2 commits)
  7. 06 Jun 2013 (5 commits)
  8. 02 May 2013 (1 commit)
  9. 30 Apr 2013 (1 commit)
  10. 28 Apr 2013 (1 commit)
  11. 27 Apr 2013 (2 commits)
    • cpuset: fix cpu hotplug vs rebuild_sched_domains() race · 5b16c2a4
      Li Zefan committed
      rebuild_sched_domains() might pass doms with an offlined cpu to
      partition_sched_domains(), which results in an oops:
      
      general protection fault: 0000 [#1] SMP
      ...
      RIP: 0010:[<ffffffff81077a1e>]  [<ffffffff81077a1e>] get_group+0x6e/0x90
      ...
      Call Trace:
       [<ffffffff8107f07c>] build_sched_domains+0x70c/0xcb0
       [<ffffffff8107f2a7>] ? build_sched_domains+0x937/0xcb0
       [<ffffffff81173f64>] ? kfree+0xe4/0x1b0
       [<ffffffff8107f6e0>] ? partition_sched_domains+0xc0/0x470
       [<ffffffff8107f905>] partition_sched_domains+0x2e5/0x470
       [<ffffffff8107f6e0>] ? partition_sched_domains+0xc0/0x470
       [<ffffffff810c9007>] ? generate_sched_domains+0xc7/0x530
       [<ffffffff810c94a8>] rebuild_sched_domains_locked+0x38/0x70
       [<ffffffff810cb4a4>] cpuset_write_resmask+0x1a4/0x500
       [<ffffffff810c8700>] ? cpuset_mount+0xe0/0xe0
       [<ffffffff810c7f50>] ? cpuset_read_u64+0x100/0x100
       [<ffffffff810be890>] ? cgroup_iter_next+0x90/0x90
       [<ffffffff810cb300>] ? cpuset_css_offline+0x70/0x70
       [<ffffffff810c1a73>] cgroup_file_write+0x133/0x2e0
       [<ffffffff8118995b>] vfs_write+0xcb/0x130
       [<ffffffff8118a174>] sys_write+0x64/0xa0
      Reported-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      5b16c2a4
    • cpuset: use rebuild_sched_domains() in cpuset_hotplug_workfn() · e0e80a02
      Li Zhong committed
      In cpuset_hotplug_workfn(), partition_sched_domains() is called without
      the hotplug lock held, which is actually required (as stated in the
      function header of partition_sched_domains()).
      
      This patch tries to use rebuild_sched_domains() to solve the above
      issue, and makes the code look a little simpler.
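
      In sketch form, the hotplug path then simply does the following (the
      cpus_updated flag is an assumed local computed earlier in
      cpuset_hotplug_workfn()):

        /* rebuild sched domains if the cpus of the top cpuset changed */
        if (cpus_updated)
                rebuild_sched_domains();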
      Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      e0e80a02
  12. 08 Apr 2013 (1 commit)
    • cgroup, cpuset: replace move_member_tasks_to_cpuset() with cgroup_transfer_tasks() · 8cc99345
      Tejun Heo committed
      When a cpuset becomes empty (no CPU or memory), its tasks are
      transferred to the nearest ancestor with execution resources.  This
      is implemented using cgroup_scan_tasks() with a callback which grabs
      cgroup_mutex and invokes cgroup_attach_task() on each task.
      
      Both cgroup_mutex and cgroup_attach_task() are scheduled to be
      unexported.  Implement cgroup_transfer_tasks() in cgroup proper which
      is essentially the same as move_member_tasks_to_cpuset() except that
      it takes cgroups instead of cpusets and @to comes before @from like
      normal functions with those arguments, and replace
      move_member_tasks_to_cpuset() with it.
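
      Usage from cpuset then looks roughly like this (the field spelling is an
      assumption; only cgroup_transfer_tasks() and its argument order come from
      the commit text):

        /* move every task of the empty cpuset to a viable ancestor's cgroup */
        cgroup_transfer_tasks(parent->css.cgroup, cs->css.cgroup);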
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      8cc99345
  13. 20 Mar 2013 (2 commits)
    • cgroup: consolidate cgroup_attach_task() and cgroup_attach_proc() · 081aa458
      Li Zefan committed
      These two functions share most of the code.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      081aa458
    • sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITY · 14a40ffc
      Tejun Heo committed
      PF_THREAD_BOUND was originally used to mark kernel threads which were
      bound to a specific CPU using kthread_bind(), and a task with the flag
      set may have its cpus_allowed modified only by itself.  Workqueue is
      currently abusing it to prevent userland from meddling with
      cpus_allowed of workqueue workers.
      
      What we need is a flag to prevent userland from messing with
      cpus_allowed of certain kernel tasks.  In the kernel, anyone can
      (incorrectly) squash the flag, and, for worker-type usages,
      restricting cpus_allowed modification to the task itself doesn't
      provide meaningful extra protection as other tasks can inject work
      items into the task anyway.
      
      This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
      sched_setaffinity() checks the flag and returns -EINVAL if it is set.
      set_cpus_allowed_ptr() is no longer affected by the flag.
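
      The user-visible effect boils down to a check of this shape early in
      sched_setaffinity() (a simplified sketch, not the full function):

        /* refuse affinity changes from userland for flagged kernel tasks */
        if (p->flags & PF_NO_SETAFFINITY)
                return -EINVAL;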
      
      This will allow simplifying workqueue worker CPU affinity management.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      14a40ffc
  14. 13 Mar 2013 (1 commit)
  15. 05 Mar 2013 (1 commit)
  16. 19 Feb 2013 (1 commit)
  17. 16 Jan 2013 (2 commits)
    • cpuset: drop spurious retval assignment in proc_cpuset_show() · d127027b
      Li Zefan committed
      proc_cpuset_show() has a spurious -EINVAL assignment which does
      nothing.  Remove it.
      
      This patch doesn't make any functional difference.
      
      tj: Rewrote patch description.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      d127027b
    • cpuset: fix RCU lockdep splat · 27e89ae5
      Li Zefan committed
      5d21cc2d ("cpuset: replace
      cgroup_mutex locking with cpuset internal locking") incorrectly
      converted proc_cpuset_show() from cgroup_lock() to cpuset_mutex.
      proc_cpuset_show() is accessing the cgroup hierarchy proper to determine
      the cgroup path, which can't be protected by cpuset_mutex.  This triggered
      the following RCU warning.
      
       ===============================
       [ INFO: suspicious RCU usage. ]
       3.8.0-rc3-next-20130114-sasha-00016-ga107525-dirty #262 Tainted: G        W
       -------------------------------
       include/linux/cgroup.h:534 suspicious rcu_dereference_check() usage!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 1, debug_locks = 1
       2 locks held by trinity/7514:
        #0:  (&p->lock){+.+.+.}, at: [<ffffffff812b06aa>] seq_read+0x3a/0x3e0
        #1:  (cpuset_mutex){+.+...}, at: [<ffffffff811abae4>] proc_cpuset_show+0x84/0x190
      
       stack backtrace:
       Pid: 7514, comm: trinity Tainted: G        W   3.8.0-rc3-next-20130114-sasha-00016-ga107525-dirty #262
       Call Trace:
        [<ffffffff81182cab>] lockdep_rcu_suspicious+0x10b/0x120
        [<ffffffff811abb71>] proc_cpuset_show+0x111/0x190
        [<ffffffff812b0827>] seq_read+0x1b7/0x3e0
        [<ffffffff812b0670>] ? seq_lseek+0x110/0x110
        [<ffffffff8128b4fb>] do_loop_readv_writev+0x4b/0x90
        [<ffffffff8128b776>] do_readv_writev+0xf6/0x1d0
        [<ffffffff8128b8ee>] vfs_readv+0x3e/0x60
        [<ffffffff8128b960>] sys_readv+0x50/0xd0
        [<ffffffff83d33d18>] tracesys+0xe1/0xe6
      
      The operation can be performed under RCU read lock.  Replace
      cpuset_mutex locking with RCU read locking.
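
      In sketch form, the path lookup becomes the following (accessor names
      follow the pre-rename API of that period; buffer handling is omitted):

        rcu_read_lock();
        css = task_subsys_state(tsk, cpuset_subsys_id);
        retval = cgroup_path(css->cgroup, buf, PAGE_SIZE);
        rcu_read_unlock();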
      
      tj: Rewrote patch description.
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      27e89ae5
  18. 08 Jan 2013 (6 commits)
    • cpuset: remove cpuset->parent · c431069f
      Tejun Heo committed
      cgroup already tracks the hierarchy.  Follow cgroup->parent to find
      the parent and drop cpuset->parent.
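
      A sketch of the resulting parent lookup, assuming cgroup_cs() is the
      css-to-cpuset accessor (the helper name parent_cs() is an assumption):

        static struct cpuset *parent_cs(const struct cpuset *cs)
        {
                struct cgroup *pcgrp = cs->css.cgroup->parent;

                return pcgrp ? cgroup_cs(pcgrp) : NULL;
        }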
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      c431069f
    • cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre() · fc560a26
      Tejun Heo committed
      Implement cpuset_for_each_descendant_pre() and replace the
      cpuset-specific tree walking using cpuset->stack_list with it.
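
      Typical usage then looks roughly like this (the iterator arguments and
      RCU locking are assumptions based on the cgroup iterators of the time):

        struct cpuset *cp;              /* descendant being visited */
        struct cgroup *pos_cgrp;        /* iterator cursor */

        rcu_read_lock();
        cpuset_for_each_descendant_pre(cp, pos_cgrp, root_cs) {
                /* @cp is visited before any of its children (pre-order), so
                 * anything propagated to @cp is already settled at this point */
        }
        rcu_read_unlock();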
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
      fc560a26
    • cpuset: replace cgroup_mutex locking with cpuset internal locking · 5d21cc2d
      Tejun Heo committed
      Supposedly for historical reasons, cpuset depends on cgroup core for
      locking.  It depends on cgroup_mutex in cgroup callbacks and grabs
      cgroup_mutex from other places where it wants to be synchronized.
      This is majorly messy and highly prone to introducing circular locking
      dependency especially because cgroup_mutex is supposed to be one of
      the outermost locks.
      
      As previous patches already plugged possible races which may happen by
      decoupling from cgroup_mutex, replacing cgroup_mutex with the
      cpuset-specific cpuset_mutex is mostly straightforward.  Introduce
      cpuset_mutex, replace all occurrences of cgroup_mutex with it, and add
      cpuset_mutex locking to places which inherited cgroup_mutex from
      cgroup core.
      
      The only complication is from cpuset wanting to initiate task
      migration when a cpuset loses all cpus or memory nodes.  Task
      migration may go through full cgroup and all subsystem locking and
      should be initiated without holding any cpuset-specific lock; however,
      a previous patch already made hotplug handling asynchronous, and
      moving the task migration part outside other locks is easy.
      cpuset_propagate_hotplug_workfn() now invokes
      remove_tasks_in_empty_cpuset() without holding any lock.
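
      In sketch form, the cpuset callbacks now follow this pattern (the
      css_online signature reflects the cgroup API of that period and is an
      assumption, as is the elided body):

        static DEFINE_MUTEX(cpuset_mutex);

        static int cpuset_css_online(struct cgroup *cgrp)
        {
                mutex_lock(&cpuset_mutex);
                /* configuration work previously serialized by cgroup_mutex */
                mutex_unlock(&cpuset_mutex);
                return 0;
        }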
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      5d21cc2d
    • cpuset: schedule hotplug propagation from cpuset_attach() if the cpuset is empty · 02bb5863
      Tejun Heo committed
      cpuset is scheduled to be decoupled from cgroup_lock which will make
      hotplug handling race with task migration.  cpus or mems will be
      allowed to go offline between ->can_attach() and ->attach().  If
      hotplug takes down all cpus or mems of a cpuset while attach is in
      progress, ->attach() may end up putting tasks into an empty cpuset.
      
      This patch makes ->attach() schedule hotplug propagation if the
      cpuset is empty after attaching is complete.  This will move the tasks
      to the nearest ancestor with execution resources, and the end result
      would be as if hotplug handling happened after the tasks finished
      attaching.
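
      The tail of ->attach() then gains a check along these lines (a sketch;
      the scheduling helper is the one introduced by the asynchronous
      propagation patch in this series):

        /* if hotplug emptied this cpuset while we were attaching, propagate now */
        if (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
                schedule_cpuset_propagate_hotplug(cs);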
      
      cpuset_write_resmask() now also flushes cpuset_propagate_hotplug_wq to
      wait for propagations scheduled directly by cpuset_attach().
      
      This currently doesn't make any functional difference as everything is
      protected by cgroup_mutex but enables decoupling the locking.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      02bb5863
    • cpuset: pin down cpus and mems while a task is being attached · 452477fa
      Tejun Heo committed
      cpuset is scheduled to be decoupled from cgroup_lock which will make
      configuration updates race with task migration.  Any config update
      will be allowed to happen between ->can_attach() and ->attach().  If
      such config update removes either all cpus or mems, by the time
      ->attach() is called, the condition verified by ->can_attach(), that
      the cpuset is capable of hosting the tasks, is no longer true.
      
      This patch adds cpuset->attach_in_progress which is incremented from
      ->can_attach() and decremented when the attach operation finishes
      either successfully or not.  validate_change() treats cpusets w/
      non-zero ->attach_in_progress like cpusets w/ tasks and refuses to
      remove all cpus or mems from them.
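
      In sketch form, the check in validate_change() becomes something like the
      following (the error code and the task-count helper are assumptions):

        /* a cpuset with in-flight attaches is treated like one with tasks */
        if ((cgroup_task_count(cur->css.cgroup) || cur->attach_in_progress) &&
            (cpumask_empty(trial->cpus_allowed) ||
             nodes_empty(trial->mems_allowed)))
                return -ENOSPC;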
      
      This currently doesn't make any functional difference as everything is
      protected by cgroup_mutex but enables decoupling the locking.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      452477fa
    • cpuset: make CPU / memory hotplug propagation asynchronous · 8d033948
      Tejun Heo committed
      cpuset_hotplug_workfn() has been invoking cpuset_propagate_hotplug()
      directly to propagate hotplug updates to !root cpusets; however, this
      has the following problems.
      
      * cpuset locking is scheduled to be decoupled from cgroup_mutex,
        cgroup_mutex will be unexported, and cgroup_attach_task() will do
        cgroup locking internally, so propagation can't synchronously move
        tasks to a parent cgroup while walking the hierarchy.
      
      * We can't use the generic cgroup tree iterator because propagation to
        each cpuset may sleep.  With propagation done asynchronously, we can
        lose the rather ugly cpuset-specific iteration.
      
      Convert cpuset_propagate_hotplug() to
      cpuset_propagate_hotplug_workfn() and execute it from the newly added
      cpuset->hotplug_work.  The work items are run on an ordered workqueue,
      so the propagation order is preserved.  cpuset_hotplug_workfn()
      schedules all propagations while holding cgroup_mutex and waits for
      completion without cgroup_mutex.  Each in-flight propagation holds a
      reference to the cpuset->css.
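
      A sketch of the scheduling side described above (names not given in the
      commit text, such as the workqueue creation call, are assumptions):

        static struct workqueue_struct *cpuset_propagate_hotplug_wq;
        /* created at init with alloc_ordered_workqueue(), so the work items
         * run one at a time and the propagation order is preserved */

        static void schedule_cpuset_propagate_hotplug(struct cpuset *cs)
        {
                /* each in-flight propagation pins the css it operates on */
                if (!css_tryget(&cs->css))
                        return;
                if (!queue_work(cpuset_propagate_hotplug_wq, &cs->hotplug_work))
                        css_put(&cs->css);      /* already queued, drop our ref */
        }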
      
      This patch doesn't cause any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      8d033948