1. 08 Jan, 2013 (9 commits)
    • cpuset: drop async_rebuild_sched_domains() · 699140ba
      Committed by Tejun Heo
      In general, we want to make cgroup_mutex one of the outermost locks
      and be able to use get_online_cpus() and friends from cgroup methods.
      With cpuset hotplug made async, get_online_cpus() can now be nested
      inside cgroup_mutex.
      
      Currently, cpuset avoids nesting get_online_cpus() inside cgroup_mutex
      by bouncing sched_domain rebuilding to a work item.  As such nesting
      is allowed now, remove the workqueue bouncing code and always rebuild
      sched_domains synchronously.  This also nests sched_domains_mutex
      inside cgroup_mutex, which is intended and should be okay.
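
      As a rough illustration of the change in shape (a sketch, not the
      actual patch; do_rebuild_sched_domains() is a made-up helper):

          /* before: bounce to a workqueue so that get_online_cpus() is
           * never taken while cgroup_mutex is held */
          static void rebuild_sched_domains_workfn(struct work_struct *work)
          {
                  do_rebuild_sched_domains();
          }
          static DECLARE_WORK(rebuild_sd_work, rebuild_sched_domains_workfn);

          static void async_rebuild_sched_domains(void)
          {
                  schedule_work(&rebuild_sd_work);
          }

          /* after: the nesting is allowed, so rebuild in place */
          void rebuild_sched_domains(void)
          {
                  get_online_cpus();
                  do_rebuild_sched_domains();
                  put_online_cpus();
          }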
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      699140ba
    • cpuset: don't nest cgroup_mutex inside get_online_cpus() · 3a5a6d0c
      Committed by Tejun Heo
      CPU / memory hotplug path currently grabs cgroup_mutex from hotplug
      event notifications.  We want to separate cpuset locking from cgroup
      core and make cgroup_mutex outer to hotplug synchronization so that,
      among other things, mechanisms which depend on get_online_cpus() can
      be used from cgroup callbacks.  In general, we want to keep
      cgroup_mutex the outermost lock to minimize locking interactions among
      different controllers.
      
      Convert cpuset_handle_hotplug() to cpuset_hotplug_workfn() and
      schedule it from the hotplug notifications.  As the function can
      already handle multiple mixed events without any input, converting it
      to a work function is mostly trivial; however, one complication is
      that cpuset_update_active_cpus() needs to update sched domains
      synchronously to reflect an offlined cpu to avoid confusing the
      scheduler.  This is worked around by synchronously falling back to the
      default single sched domain before scheduling the actual hotplug work.
      This means sched domains get rebuilt twice per CPU hotplug event, but
      the operation isn't that heavy, and most of the second rebuild is a
      noop for systems with a single sched domain, which is the common case.
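
      The resulting split is roughly the following (a sketch; the real code
      differs in detail):

          static void cpuset_hotplug_workfn(struct work_struct *work);
          static DECLARE_WORK(cpuset_hotplug_work, cpuset_hotplug_workfn);

          /* called from the CPU hotplug notifiers */
          void cpuset_update_active_cpus(void)
          {
                  /* synchronous part: drop to the default single sched
                   * domain so the scheduler never keeps a dead CPU in a
                   * custom domain */
                  partition_sched_domains(1, NULL, NULL);

                  /* the full cpuset update runs later from a work item */
                  schedule_work(&cpuset_hotplug_work);
          }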
      
      This decouples cpuset hotplug handling from the notification callbacks
      and there can be an arbitrary delay between the actual event and
      updates to cpusets.  The scheduler and mm can handle it fine, but moving
      tasks out of an empty cpuset may race against writes to the cpuset that
      restore execution resources, which can lead to confusing behavior.  Flush
      the hotplug work item from cpuset_write_resmask() to avoid such
      confusion.
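
      A sketch of that guard (simplified; the real write handler does much
      more, and update_masks_from_buf() is a made-up placeholder):

          static int cpuset_write_resmask(struct cgroup *cgrp, struct cftype *cft,
                                          const char *buf)
          {
                  /* make sure a pending hotplug update has been applied
                   * before acting on the userspace write */
                  flush_work(&cpuset_hotplug_work);

                  return update_masks_from_buf(cgrp, cft, buf);
          }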
      
      v2: Synchronous sched domain rebuilding using the fallback sched
          domain added.  This fixes various issues caused by confused
          scheduler putting tasks on a dead CPU, including the one reported
          by Li Zefan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      3a5a6d0c
    • cpuset: reorganize CPU / memory hotplug handling · deb7aa30
      Committed by Tejun Heo
      Reorganize hotplug path to prepare for async hotplug handling.
      
      * Both CPU and memory hotplug handling is collected into a single
        function - cpuset_handle_hotplug().  It doesn't take any argument
        but compares the current settings of top_cpuset against what's
        actually available to determine what happened.  This function
        directly updates top_cpuset.  If there are CPUs or memory nodes
        which are taken down, cpuset_propagate_hotplug() is invoked on all
        !root cpusets.
      
      * cpuset_propagate_hotplug() is responsible for updating the specified
        cpuset so that it doesn't include any resource which isn't available
        to top_cpuset.  If no CPU or memory is left after update, all tasks
        are moved to the nearest ancestor with both resources.
      
      * update_tasks_cpumask() and update_tasks_nodemask() are now always
        called after cpus or mems masks are updated even if the cpuset
        doesn't have any task.  This is for brevity and not expected to have
        any measurable effect.
      
      * cpu_active_mask and N_HIGH_MEMORY are read exactly once per
        cpuset_handle_hotplug() invocation, all cpusets share the same view
        of what resources are available, and cpuset_handle_hotplug() can
        handle multiple resources going up and down.  These properties will
        allow async operation.
      
      The reorganization, while drastic, is equivalent and shouldn't cause
      any behavior difference.  This will enable making hotplug handling
      async and remove get_online_cpus() -> cgroup_mutex nesting.
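
      A compressed sketch of the resulting structure (simplified;
      propagate_hotplug_to_descendants() is a made-up placeholder for the
      loop that calls cpuset_propagate_hotplug() on every !root cpuset):

          static void cpuset_handle_hotplug(void)
          {
                  static cpumask_t new_cpus;      /* protected by cgroup_mutex */
                  static nodemask_t new_mems;
                  bool cpus_offlined, mems_offlined;

                  /* read the available resources exactly once */
                  cpumask_copy(&new_cpus, cpu_active_mask);
                  new_mems = node_states[N_HIGH_MEMORY];

                  /* did anything go away since the last update? */
                  cpus_offlined = !cpumask_subset(top_cpuset.cpus_allowed, &new_cpus);
                  mems_offlined = !nodes_subset(top_cpuset.mems_allowed, new_mems);

                  /* update top_cpuset to match what is actually available */
                  cpumask_copy(top_cpuset.cpus_allowed, &new_cpus);
                  top_cpuset.mems_allowed = new_mems;

                  /* push removals down to every !root cpuset */
                  if (cpus_offlined || mems_offlined)
                          propagate_hotplug_to_descendants();
          }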
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      deb7aa30
    • cpuset: cleanup cpuset[_can]_attach() · 4e4c9a14
      Committed by Tejun Heo
      cpuset_can_attach() prepares the global variables cpus_attach and
      cpuset_attach_nodemask_{to|from}, which are used by cpuset_attach().
      There is no reason to do this preparation in cpuset_can_attach(); the
      same information can be accessed from cpuset_attach().
      
      Move the preparation logic from cpuset_can_attach() to cpuset_attach()
      and make the global variables static ones inside cpuset_attach().
      
      With this change, there's no reason to keep
      cpuset_attach_nodemask_{from|to} global.  Move them inside
      cpuset_attach().  Unfortunately, we need to keep cpus_attach global as
      it can't be allocated from cpuset_attach().
      
      v2: cpus_attach not converted to cpumask_t as per Li Zefan and Rusty
          Russell.
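
      A sketch of where things end up (simplified; guarantee_online_cpus()
      and guarantee_online_mems() are the existing cpuset.c helpers, and the
      rest of the attach work is elided):

          static cpumask_var_t cpus_attach;  /* stays global: can't allocate here */

          static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
          {
                  static nodemask_t cpuset_attach_nodemask_from; /* old mems (use elided) */
                  static nodemask_t cpuset_attach_nodemask_to;
                  struct cpuset *cs = cgroup_cs(cgrp);

                  /* preparation moved here from cpuset_can_attach(); the
                   * statics are fine because attach runs under cgroup_mutex */
                  if (cs == &top_cpuset)
                          cpumask_copy(cpus_attach, cpu_possible_mask);
                  else
                          guarantee_online_cpus(cs, cpus_attach);
                  guarantee_online_mems(cs, &cpuset_attach_nodemask_to);

                  /* per-task cpumask updates and mm migration (elided) */
          }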
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      4e4c9a14
    • cpuset: introduce cpuset_for_each_child() · ae8086ce
      Committed by Tejun Heo
      Instead of iterating cgroup->children directly, introduce and use
      cpuset_for_each_child() which wraps cgroup_for_each_child() and
      performs an online check.  As it uses the generic iterator, it requires
      RCU read locking too.
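
      The wrapper introduced here is roughly of this shape (a sketch):

          /* walk online children only; callers hold rcu_read_lock() */
          #define cpuset_for_each_child(child_cs, pos_cgrp, parent_cs)           \
                  cgroup_for_each_child((pos_cgrp), (parent_cs)->css.cgroup)     \
                          if (is_cpuset_online((child_cs) = cgroup_cs(pos_cgrp)))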
      
      As cpuset is currently protected by cgroup_mutex, non-online cpusets
      aren't visible to any of the iterations, so this patch currently doesn't
      make any functional difference.  This will be used to de-couple cpuset
      locking from cgroup core.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      ae8086ce
    • cpuset: introduce CS_ONLINE · efeb77b2
      Committed by Tejun Heo
      Add CS_ONLINE which is set from css_online() and cleared from
      css_offline().  This will enable using generic cgroup iterator while
      allowing decoupling cpuset from cgroup internal locking.
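
      A sketch of the flag and its test (assumed shape):

          typedef enum {
                  CS_ONLINE,      /* set in css_online(), cleared in css_offline() */
                  CS_CPU_EXCLUSIVE,
                  CS_MEM_EXCLUSIVE,
          } cpuset_flagbits_t;

          static inline bool is_cpuset_online(const struct cpuset *cs)
          {
                  return test_bit(CS_ONLINE, &cs->flags);
          }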
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      efeb77b2
    • cpuset: introduce ->css_on/offline() · c8f699bb
      Committed by Tejun Heo
      Add cpuset_css_on/offline() and rearrange css init/exit such that,
      
      * Allocation and clearing to the default values happen in css_alloc().
        Allocation now uses kzalloc().
      
      * Config inheritance and registration happen in css_online().
      
      * css_offline() undoes what css_online() did.
      
      * css_free() frees.
      
      This doesn't introduce any visible behavior changes.  This will help
      cleaning up locking.
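
      The division of labor then looks roughly like this (a sketch; the
      bodies of the online/offline callbacks are elided):

          static struct cgroup_subsys_state *cpuset_css_alloc(struct cgroup *cgrp)
          {
                  struct cpuset *cs = kzalloc(sizeof(*cs), GFP_KERNEL);

                  if (!cs)
                          return ERR_PTR(-ENOMEM);
                  return &cs->css;        /* zeroed, i.e. default values */
          }

          static int cpuset_css_online(struct cgroup *cgrp)
          {
                  /* inherit configuration from the parent, register the cpuset */
                  return 0;
          }

          static void cpuset_css_offline(struct cgroup *cgrp)
          {
                  /* undo what css_online() did */
          }

          static void cpuset_css_free(struct cgroup *cgrp)
          {
                  kfree(cgroup_cs(cgrp));
          }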
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      c8f699bb
    • cpuset: remove fast exit path from remove_tasks_in_empty_cpuset() · 0772324a
      Committed by Tejun Heo
      The function isn't that hot, the overhead of missing the fast exit is
      low, the test itself depends heavily on cgroup internals, and it's
      going to be a hindrance when trying to decouple cpuset locking from
      cgroup core.  Remove the fast exit path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      0772324a
    • cpuset: remove unused cpuset_unlock() · 01c889cf
      Committed by Tejun Heo
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      01c889cf
  2. 13 Dec, 2012 (1 commit)
  3. 20 Nov, 2012 (2 commits)
  4. 24 Jul, 2012 (4 commits)
  5. 02 Apr, 2012 (1 commit)
    • cgroup: convert all non-memcg controllers to the new cftype interface · 4baf6e33
      Committed by Tejun Heo
      Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
      net_cls and device controllers to use the new cftype based interface.
      A termination entry is added to the cftype arrays and the populate
      callbacks are replaced with cgroup_subsys->base_cftypes initializations.
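
      For each controller the pattern is roughly the following (a sketch
      using cpuset-flavored names):

          static struct cftype files[] = {
                  {
                          .name = "cpus",
                          .read = cpuset_common_file_read,
                          .write_string = cpuset_write_resmask,
                          .max_write_len = (100U + 6 * NR_CPUS),
                  },
                  { }     /* terminating entry */
          };

          struct cgroup_subsys cpuset_subsys = {
                  .name = "cpuset",
                  .base_cftypes = files,
                  /* ->populate() is no longer needed */
          };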
      
      This is a functionally identical transformation.  There shouldn't be any
      visible behavior change.
      
      memcg is rather special and will be converted separately.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <paul@paulmenage.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      4baf6e33
  6. 29 Mar, 2012 (1 commit)
  7. 28 Mar, 2012 (1 commit)
  8. 27 Mar, 2012 (1 commit)
    • sched: Fix select_fallback_rq() vs cpu_active/cpu_online · 2baab4e9
      Committed by Peter Zijlstra
      Commit 5fbd036b ("sched: Cleanup cpu_active madness"), which was
      supposed to finally sort the cpu_active mess, instead uncovered more.
      
      Since CPU_STARTING is run before setting the cpu online, there's a
      (small) window where the cpu has active,!online.
      
      If during this time there's a wakeup of a task that used to reside on
      that cpu, select_task_rq() will use select_fallback_rq() to compute an
      alternative cpu to run on, since we find !online.
      
      select_fallback_rq(), however, will compute the new cpu against
      cpu_active; this means that it can return the same cpu it started out
      with, the !online one, since that cpu is in fact marked active.
      
      This results in us trying to schedule a task on an offline cpu and
      triggering a WARN in the IPI code.
      
      The solution proposed by Chuansheng Liu of setting cpu_active in
      set_cpu_online() is buggy: firstly, not all archs actually use
      set_cpu_online(); secondly, not all archs call set_cpu_online() with
      IRQs disabled.  This means we would introduce either the same race or
      the race from fd8a7de1 ("x86: cpu-hotplug: Prevent softirq wakeup on
      wrong CPU") -- albeit much narrower.
      
      [ By setting online first and active later we have a window of
        online,!active, fresh and bound kthreads have task_cpu() of 0 and
        since cpu0 isn't in tsk_cpus_allowed() we end up in
        select_fallback_rq() which excludes !active, resulting in a reset
        of ->cpus_allowed and the thread running all over the place. ]
      
      The solution is to re-work select_fallback_rq() to require active
      _and_ online. This makes the active,!online case work as expected,
      OTOH archs running CPU_STARTING after setting online are now
      vulnerable to the issue from fd8a7de1 -- these are alpha and
      blackfin.
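
      The check then looks roughly like this (a sketch capturing only the
      filter; the real function searches node-local CPUs first and has
      further fallbacks):

          static int select_fallback_rq(int cpu, struct task_struct *p)
          {
                  int dest_cpu;

                  /* a candidate must be allowed, online AND active */
                  for_each_cpu(dest_cpu, tsk_cpus_allowed(p)) {
                          if (!cpu_online(dest_cpu) || !cpu_active(dest_cpu))
                                  continue;
                          return dest_cpu;
                  }

                  /* widening to cpuset / possible masks elided */
                  return -1;
          }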
      Reported-by: Chuansheng Liu <chuansheng.liu@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: linux-alpha@vger.kernel.org
      Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2baab4e9
  9. 22 Mar, 2012 (1 commit)
    • cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87
      Committed by Mel Gorman
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") wins a super prize for the largest number of
      memory barriers entered into fast paths for one commit.
      
      [get|put]_mems_allowed is incredibly heavy with pairs of full memory
      barriers inserted into a number of hot paths.  This was detected while
      investigating a large page allocator slowdown introduced some time
      after 2.6.32.  The largest portion of this overhead was shown by
      oprofile to be at an mfence introduced by this commit into the page
      allocator hot path.
      
      For extra style points, the commit introduced the use of yield() in an
      implementation of what looks like a spinning mutex.
      
      This patch replaces the full memory barriers on both read and write
      sides with a sequence counter with just read barriers on the fast path
      side.  This is much cheaper on some architectures, including x86.  The
      main bulk of the patch is the retry logic if the nodemask changes in a
      manner that can cause a false failure.
      
      While updating the nodemask, a check is made to see if a false failure
      is a risk.  If it is, the sequence number gets bumped and parallel
      allocators will briefly stall while the nodemask update takes place.
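
      The resulting pattern is a plain seqcount retry, roughly (a sketch;
      alloc_from_nodemask() is a made-up placeholder, and the kernel wraps
      the sequence reads in get/put_mems_allowed()-style helpers):

          /* reader side, e.g. the page allocator fast path */
          static struct page *alloc_respecting_mems_allowed(gfp_t gfp)
          {
                  struct page *page;
                  unsigned int seq;

                  do {
                          seq = read_seqcount_begin(&current->mems_allowed_seq);
                          page = alloc_from_nodemask(gfp, &current->mems_allowed);
                  } while (!page &&
                           read_seqcount_retry(&current->mems_allowed_seq, seq));
                  return page;
          }

          /* writer side: the real code bumps the count only when a false
           * failure is actually possible */
          static void store_mems_allowed(struct task_struct *tsk, nodemask_t newmems)
          {
                  task_lock(tsk);
                  write_seqcount_begin(&tsk->mems_allowed_seq);
                  tsk->mems_allowed = newmems;
                  write_seqcount_end(&tsk->mems_allowed_seq);
                  task_unlock(tsk);
          }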
      
      In a page fault test microbenchmark, oprofile samples from
      __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
      actual results were
      
                                   3.3.0-rc3          3.3.0-rc3
                                   rc3-vanilla        nobarrier-v2r1
          Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
          Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
          Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
          Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
          Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
          Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
          Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
          Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
          Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
          Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
          Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
          Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
          Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
          Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
          Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
          MMTests Statistics: duration
          Sys Time Running Test (seconds)             135.68    132.17
          User+Sys Time Running Test (seconds)         164.2    160.13
          Total Elapsed Time (seconds)                123.46    120.87
      
      The overall improvement is small but the System CPU time is much
      improved and roughly in correlation to what oprofile reported (these
      performance figures are without profiling so skew is expected).  The
      actual number of page faults is noticeably improved.
      
      For benchmarks like kernel builds, the overall benefit is marginal but
      the system CPU time is slightly reduced.
      
      To test the actual bug the commit fixed I opened two terminals.  The
      first ran within a cpuset and continually ran a small program that
      faulted 100M of anonymous data.  In a second window, the nodemask of the
      cpuset was continually randomised in a loop.
      
      Without the commit, the program would fail every so often (usually
      within 10 seconds) and obviously with the commit everything worked fine.
      With this patch applied, it also worked fine so the fix should be
      functionally equivalent.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc9a6c87
  10. 03 Feb, 2012 (1 commit)
    • cgroup: remove cgroup_subsys argument from callbacks · 761b3ef5
      Committed by Li Zefan
      The argument is not used at all, and it's not necessary, because
      a specific callback handler of course knows which subsys it
      belongs to.
      
      Now only ->populate() takes this argument, because the handlers of
      this callback always call cgroup_add_file()/cgroup_add_files().
      
      So we reduce a few lines of code, though the shrinking of object size
      is minimal.
      
       16 files changed, 113 insertions(+), 162 deletions(-)
      
         text    data     bss     dec     hex filename
      5486240  656987 7039960 13183187         c928d3 vmlinux.o.orig
      5486170  656987 7039960 13183117         c9288d vmlinux.o
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      761b3ef5
  11. 21 Dec, 2011 (1 commit)
    • cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask · b246272e
      Committed by David Rientjes
      Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
      nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
      new set of allowed cpuset nodes where the two nodemasks, as a result of
      the remap, are now disjoint.
      
      c0ff7453 ("cpuset,mm: fix no node to alloc memory when changing
      cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
      nodes from changing for a thread.  This causes any update to a set of
      allowed nodes to stall until put_mems_allowed() is called.
      
      This stall is unnecessary, however, if at least one node remains unchanged
      in the update to the set of allowed nodes.  This was addressed by
      89e8a244 ("cpusets: avoid looping when storing to mems_allowed if one
      node remains set"), but it's still possible that an empty nodemask may be
      read from a mempolicy because the old nodemask may be remapped to the new
      nodemask during rebind.  To prevent this, only avoid the stall if there is
      no mempolicy for the thread being changed.
      
      This is a temporary solution until all reads from mempolicy nodemasks can
      be guaranteed to not be empty without the get_mems_allowed()
      synchronization.
      
      Also moves the check for nodemask intersection inside task_lock() so that
      tsk->mems_allowed cannot change.  This ensures that nothing can set this
      tsk's mems_allowed out from under us and also protects tsk->mempolicy.
      Reported-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <paul@paulmenage.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b246272e
  12. 13 Dec, 2011 (3 commits)
  13. 03 Nov, 2011 (1 commit)
    • cpusets: avoid looping when storing to mems_allowed if one node remains set · 89e8a244
      Committed by David Rientjes
      {get,put}_mems_allowed() exist so that general kernel code may locklessly
      access a task's set of allowable nodes without having the chance that a
      concurrent write will cause the nodemask to be empty on configurations
      where MAX_NUMNODES > BITS_PER_LONG.
      
      This could incur a significant delay, however, especially in low memory
      conditions because the page allocator is blocking and reclaim requires
      get_mems_allowed() itself.  It is not atypical to see writes to
      cpuset.mems take over 2 seconds to complete, for example.  In low memory
      conditions, this is problematic because it's one of the most important
      times to change cpuset.mems in the first place!
      
      The only way a task's set of allowable nodes may change is through
      cpusets, by writing to cpuset.mems or by attaching the task to a
      different cpuset.  The update is done by setting all the new nodes,
      ensuring generic code is not reading the nodemask with get_mems_allowed()
      at the same time, and then clearing all the old nodes.  This prevents the
      possibility that a reader will see an empty nodemask at the same time the
      writer is storing a new nodemask.
      
      If at least one node remains unchanged, though, it's possible to simply
      set all new nodes and then clear all the old nodes.  Changing a task's
      nodemask is protected by cgroup_mutex so it's guaranteed that two threads
      are not changing the same task's nodemask at the same time, so the
      nodemask is guaranteed to be stored before another thread changes it and
      determines whether a node remains set or not.
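
      Conceptually the store becomes (a sketch; the real function also
      rebinds the mempolicy and runs under task_lock()):

          static void store_new_mems_allowed(struct task_struct *tsk,
                                             nodemask_t newmems)
          {
                  /* step 1: make old|new visible; a reader never sees an
                   * empty mask as long as at least one node stays set */
                  nodes_or(tsk->mems_allowed, tsk->mems_allowed, newmems);

                  /* step 2: drop the old nodes */
                  tsk->mems_allowed = newmems;
          }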
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Paul Menage <paul@paulmenage.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89e8a244
  14. 31 Oct, 2011 (1 commit)
  15. 27 Jul, 2011 (2 commits)
  16. 28 May, 2011 (1 commit)
  17. 27 May, 2011 (2 commits)
    • cgroup: remove the ns_cgroup · a77aea92
      Committed by Daniel Lezcano
      The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
      leads to some problems:
      
        * cgroup creation is out-of-control
        * cgroup name can conflict when pids are looping
        * it is not possible to have a single process handling a lot of
          namespaces without falling into exponential creation time
        * we may want to create a namespace without creating a cgroup
      
        The ns_cgroup was replaced by a compatibility flag 'clone_children',
        where a newly created cgroup will copy the parent cgroup values.
        Userspace has to manually create a cgroup and add a task to
        the 'tasks' file.
      
      This patch removes the ns_cgroup as suggested in the following thread:
      
      https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html
      
      The 'cgroup_clone' function is removed because it is no longer used.
      
      This is a userspace-visible change.  Commit 45531757 ("cgroup: notify
      ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
      printk, warning users that the feature is planned for removal.  Since that
      time we have heard from XXX users who were affected by this.
      Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jamal Hadi Salim <hadi@cyberus.ca>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Acked-by: Matt Helsley <matthltc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a77aea92
    • cgroups: add per-thread subsystem callbacks · f780bdb7
      Committed by Ben Blum
      Add cgroup subsystem callbacks for per-thread attachment in atomic contexts
      
      Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
      for the cgroup subsystem interface.  Unlike can_attach and attach, these
      are for per-thread operations, to be called potentially many times when
      attaching an entire threadgroup.
      
      Also, the old "bool threadgroup" interface is removed, having been replaced
      by this.  All subsystems are modified for the new interface; of note is
      cpuset, which requires from/to nodemasks for attach to be globally scoped
      (though per-cpuset would work too) to persist from its pre_attach to
      attach_task and attach.
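
      For a subsystem, the expanded interface looks roughly like this (a
      sketch; the my_* handlers are placeholders):

          struct cgroup_subsys my_subsys = {
                  .name            = "my_subsys",
                  .can_attach      = my_can_attach,       /* whole-threadgroup check */
                  .can_attach_task = my_can_attach_task,  /* per thread, may run atomically */
                  .pre_attach      = my_pre_attach,       /* once, before the per-thread calls */
                  .attach_task     = my_attach_task,      /* once per thread being moved */
                  .attach          = my_attach,           /* once, after all threads are moved */
          };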
      
      This is a pre-patch for cgroup-procs-writable.patch.
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f780bdb7
  18. 11 Apr, 2011 (1 commit)
  19. 24 Mar, 2011 (4 commits)
  20. 05 Mar, 2011 (1 commit)
  21. 29 Oct, 2010 (1 commit)