  1. 30 Apr, 2013 1 commit
  2. 20 Mar, 2013 1 commit
    • sched: replace PF_THREAD_BOUND with PF_NO_SETAFFINITY · 14a40ffc
      Tejun Heo committed
      PF_THREAD_BOUND was originally used to mark kernel threads which were
      bound to a specific CPU using kthread_bind() and a task with the flag
      set allows cpus_allowed modifications only to itself.  Workqueue is
      currently abusing it to prevent userland from meddling with
      cpus_allowed of workqueue workers.
      
      What we need is a flag to prevent userland from messing with
      cpus_allowed of certain kernel tasks.  In kernel, anyone can
      (incorrectly) squash the flag, and, for worker-type usages,
      restricting cpus_allowed modification to the task itself doesn't
      provide meaningful extra protection as other tasks can inject work
      items to the task anyway.
      
      This patch replaces PF_THREAD_BOUND with PF_NO_SETAFFINITY.
      sched_setaffinity() checks the flag and returns -EINVAL if set.
      set_cpus_allowed_ptr() is no longer affected by the flag.
      
      This will allow simplifying workqueue worker CPU affinity management.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
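The behavior split described in 14a40ffc can be illustrated with a small userspace sketch. This is not the kernel code: `struct task_sketch` and both `*_sketch` functions are hypothetical stand-ins, the flag value is illustrative, and the real set_cpus_allowed_ptr() performs additional validation that is omitted here.

```c
#include <errno.h>

/* Illustrative flag value; treat it as an assumption of this sketch. */
#define PF_NO_SETAFFINITY 0x04000000u

/* Hypothetical, heavily reduced stand-in for a task. */
struct task_sketch {
    unsigned int flags;         /* per-task PF_* flags */
    unsigned long cpus_allowed; /* one bit per CPU */
};

/* Userland-facing path: refuses tasks marked PF_NO_SETAFFINITY. */
int sched_setaffinity_sketch(struct task_sketch *p, unsigned long new_mask)
{
    if (p->flags & PF_NO_SETAFFINITY)
        return -EINVAL;
    p->cpus_allowed = new_mask;
    return 0;
}

/* In-kernel path: does not consult the flag at all. */
int set_cpus_allowed_ptr_sketch(struct task_sketch *p, unsigned long new_mask)
{
    p->cpus_allowed = new_mask;
    return 0;
}
```

With a flagged task, the first function leaves `cpus_allowed` untouched and returns -EINVAL, while the second still applies the new mask.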
  3. 19 Feb, 2013 1 commit
  4. 16 Jan, 2013 2 commits
    • cpuset: drop spurious retval assignment in proc_cpuset_show() · d127027b
      Li Zefan committed
      proc_cpuset_show() has a spurious -EINVAL assignment which does
      nothing.  Remove it.
      
      This patch doesn't make any functional difference.
      
      tj: Rewrote patch description.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cpuset: fix RCU lockdep splat · 27e89ae5
      Li Zefan committed
      5d21cc2d ("cpuset: replace cgroup_mutex locking with cpuset internal
      locking") incorrectly converted proc_cpuset_show() from cgroup_lock()
      to cpuset_mutex.
      proc_cpuset_show() is accessing cgroup hierarchy proper to determine
      cgroup path which can't be protected by cpuset_mutex.  This triggered
      the following RCU warning.
      
       ===============================
       [ INFO: suspicious RCU usage. ]
       3.8.0-rc3-next-20130114-sasha-00016-ga107525-dirty #262 Tainted: G        W
       -------------------------------
       include/linux/cgroup.h:534 suspicious rcu_dereference_check() usage!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 1, debug_locks = 1
       2 locks held by trinity/7514:
        #0:  (&p->lock){+.+.+.}, at: [<ffffffff812b06aa>] seq_read+0x3a/0x3e0
        #1:  (cpuset_mutex){+.+...}, at: [<ffffffff811abae4>] proc_cpuset_show+0x84/0x190
      
       stack backtrace:
       Pid: 7514, comm: trinity Tainted: G        W 3.8.0-rc3-next-20130114-sasha-00016-ga107525-dirty #262
       Call Trace:
        [<ffffffff81182cab>] lockdep_rcu_suspicious+0x10b/0x120
        [<ffffffff811abb71>] proc_cpuset_show+0x111/0x190
        [<ffffffff812b0827>] seq_read+0x1b7/0x3e0
        [<ffffffff812b0670>] ? seq_lseek+0x110/0x110
        [<ffffffff8128b4fb>] do_loop_readv_writev+0x4b/0x90
        [<ffffffff8128b776>] do_readv_writev+0xf6/0x1d0
        [<ffffffff8128b8ee>] vfs_readv+0x3e/0x60
        [<ffffffff8128b960>] sys_readv+0x50/0xd0
        [<ffffffff83d33d18>] tracesys+0xe1/0xe6
      
      The operation can be performed under RCU read lock.  Replace
      cpuset_mutex locking with RCU read locking.
      
      tj: Rewrote patch description.
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  5. 08 Jan, 2013 15 commits
    • cpuset: remove cpuset->parent · c431069f
      Tejun Heo committed
      cgroup already tracks the hierarchy.  Follow cgroup->parent to find
      the parent and drop cpuset->parent.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cpuset: replace cpuset->stack_list with cpuset_for_each_descendant_pre() · fc560a26
      Tejun Heo committed
      Implement cpuset_for_each_descendant_pre() and replace the
      cpuset-specific tree walking using cpuset->stack_list with it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cpuset: replace cgroup_mutex locking with cpuset internal locking · 5d21cc2d
      Tejun Heo committed
      Supposedly for historical reasons, cpuset depends on cgroup core for
      locking.  It depends on cgroup_mutex in cgroup callbacks and grabs
      cgroup_mutex from other places where it wants to be synchronized.
      This is majorly messy and highly prone to introducing circular locking
      dependency especially because cgroup_mutex is supposed to be one of
      the outermost locks.
      
      As previous patches already plugged possible races which may happen by
      decoupling from cgroup_mutex, replacing cgroup_mutex with cpuset
      specific cpuset_mutex is mostly straight-forward.  Introduce
      cpuset_mutex, replace all occurrences of cgroup_mutex with it, and add
      cpuset_mutex locking to places which inherited cgroup_mutex from
      cgroup core.
      
      The only complication is from cpuset wanting to initiate task
      migration when a cpuset loses all cpus or memory nodes.  Task
      migration may go through full cgroup and all subsystem locking and
      should be initiated without holding any cpuset specific lock; however,
      a previous patch already made hotplug handled asynchronously and
      moving the task migration part outside other locks is easy.
      cpuset_propagate_hotplug_workfn() now invokes
      remove_tasks_in_empty_cpuset() without holding any lock.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cpuset: schedule hotplug propagation from cpuset_attach() if the cpuset is empty · 02bb5863
      Tejun Heo committed
      cpuset is scheduled to be decoupled from cgroup_lock which will make
      hotplug handling race with task migration.  cpus or mems will be
      allowed to go offline between ->can_attach() and ->attach().  If
      hotplug takes down all cpus or mems of a cpuset while attach is in
      progress, ->attach() may end up putting tasks into an empty cpuset.
      
      This patchset makes ->attach() schedule hotplug propagation if the
      cpuset is empty after attaching is complete.  This will move the tasks
      to the nearest ancestor which can execute and the end result would be
      as if hotplug handling happened after the tasks finished attaching.
      
      cpuset_write_resmask() now also flushes cpuset_propagate_hotplug_wq to
      wait for propagations scheduled directly by cpuset_attach().
      
      This currently doesn't make any functional difference as everything is
      protected by cgroup_mutex but enables decoupling the locking.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cpuset: pin down cpus and mems while a task is being attached · 452477fa
      Tejun Heo committed
      cpuset is scheduled to be decoupled from cgroup_lock which will make
      configuration updates race with task migration.  Any config update
      will be allowed to happen between ->can_attach() and ->attach().  If
      such config update removes either all cpus or mems, by the time
      ->attach() is called, the condition verified by ->can_attach(), that
      the cpuset is capable of hosting the tasks, is no longer true.
      
      This patch adds cpuset->attach_in_progress which is incremented from
      ->can_attach() and decremented when the attach operation finishes
      either successfully or not.  validate_change() treats cpusets w/
      non-zero ->attach_in_progress like cpusets w/ tasks and refuses to
      remove all cpus or mems from it.
      
      This currently doesn't make any functional difference as everything is
      protected by cgroup_mutex but enables decoupling the locking.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
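The counter described in 452477fa can be sketched in userspace as follows. The struct, the function names, and the choice of -ENOSPC as the error code are assumptions made for illustration; locking is omitted entirely.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for struct cpuset. */
struct cpuset_sketch {
    unsigned long cpus;     /* allowed CPUs, one bit each */
    unsigned long mems;     /* allowed memory nodes, one bit each */
    int nr_tasks;           /* tasks currently in the cpuset */
    int attach_in_progress; /* attaches between ->can_attach() and completion */
};

/* ->can_attach(): verify the cpuset can host tasks, then pin it. */
int can_attach_sketch(struct cpuset_sketch *cs)
{
    if (!cs->cpus || !cs->mems)
        return -ENOSPC;
    cs->attach_in_progress++;
    return 0;
}

/* Called when the attach finishes, whether it succeeded or not. */
void attach_done_sketch(struct cpuset_sketch *cs, bool success)
{
    if (success)
        cs->nr_tasks++;
    cs->attach_in_progress--;
}

/* Treat a pending attach like resident tasks: refuse to empty the masks. */
int validate_change_sketch(const struct cpuset_sketch *cur,
                           unsigned long new_cpus, unsigned long new_mems)
{
    if ((cur->nr_tasks || cur->attach_in_progress) &&
        (!new_cpus || !new_mems))
        return -ENOSPC;
    return 0;
}
```

While an attach is pending, a request to clear all cpus is rejected even though the cpuset holds no tasks yet; once the attach is cancelled, the same request passes.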
    • cpuset: make CPU / memory hotplug propagation asynchronous · 8d033948
      Tejun Heo committed
      cpuset_hotplug_workfn() has been invoking cpuset_propagate_hotplug()
      directly to propagate hotplug updates to !root cpusets; however, this
      has the following problems.
      
      * cpuset locking is scheduled to be decoupled from cgroup_mutex,
        cgroup_mutex will be unexported, and cgroup_attach_task() will do
        cgroup locking internally, so propagation can't synchronously move
        tasks to a parent cgroup while walking the hierarchy.
      
      * We can't use cgroup generic tree iterator because propagation to
        each cpuset may sleep.  With propagation done asynchronously, we can
        lose the rather ugly cpuset specific iteration.
      
      Convert cpuset_propagate_hotplug() to
      cpuset_propagate_hotplug_workfn() and execute it from newly added
      cpuset->hotplug_work.  The work items are run on an ordered workqueue,
      so the propagation order is preserved.  cpuset_hotplug_workfn()
      schedules all propagations while holding cgroup_mutex and waits for
      completion without cgroup_mutex.  Each in-flight propagation holds a
      reference to the cpuset->css.
      
      This patch doesn't cause any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cpuset: drop async_rebuild_sched_domains() · 699140ba
      Tejun Heo committed
      In general, we want to make cgroup_mutex one of the outermost locks
      and be able to use get_online_cpus() and friends from cgroup methods.
      With cpuset hotplug made async, get_online_cpus() can now be nested
      inside cgroup_mutex.
      
      Currently, cpuset avoids nesting get_online_cpus() inside cgroup_mutex
      by bouncing sched_domain rebuilding to a work item.  As such nesting
      is allowed now, remove the workqueue bouncing code and always rebuild
      sched_domains synchronously.  This also nests sched_domains_mutex
      inside cgroup_mutex, which is intended and should be okay.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cpuset: don't nest cgroup_mutex inside get_online_cpus() · 3a5a6d0c
      Tejun Heo committed
      CPU / memory hotplug path currently grabs cgroup_mutex from hotplug
      event notifications.  We want to separate cpuset locking from cgroup
      core and make cgroup_mutex outer to hotplug synchronization so that,
      among other things, mechanisms which depend on get_online_cpus() can
      be used from cgroup callbacks.  In general, we want to keep
      cgroup_mutex the outermost lock to minimize locking interactions among
      different controllers.
      
      Convert cpuset_handle_hotplug() to cpuset_hotplug_workfn() and
      schedule it from the hotplug notifications.  As the function can
      already handle multiple mixed events without any input, converting it
      to a work function is mostly trivial; however, one complication is
      that cpuset_update_active_cpus() needs to update sched domains
      synchronously to reflect an offlined cpu to avoid confusing the
      scheduler.  This is worked around by falling back to the default
      single sched domain synchronously before scheduling the actual hotplug
      work.  This makes sched domain rebuilt twice per CPU hotplug event but
      the operation isn't that heavy and a lot of the second operation would
      be noop for systems w/ single sched domain, which is the common case.
      
      This decouples cpuset hotplug handling from the notification callbacks
      and there can be an arbitrary delay between the actual event and
      updates to cpusets.  Scheduler and mm can handle it fine but moving
      tasks out of an empty cpuset may race against writes to the cpuset
      restoring execution resources which can lead to confusing behavior.
      Flush hotplug work item from cpuset_write_resmask() to avoid such
      confusions.
      
      v2: Synchronous sched domain rebuilding using the fallback sched
          domain added.  This fixes various issues caused by confused
          scheduler putting tasks on a dead CPU, including the one reported
          by Li Zefan.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cpuset: reorganize CPU / memory hotplug handling · deb7aa30
      Tejun Heo committed
      Reorganize hotplug path to prepare for async hotplug handling.
      
      * Both CPU and memory hotplug handlings are collected into a single
        function - cpuset_handle_hotplug().  It doesn't take any argument
        but compares the current settings of top_cpuset against what's
        actually available to determine what happened.  This function
        directly updates top_cpuset.  If there are CPUs or memory nodes
        which are taken down, cpuset_propagate_hotplug() is invoked on all
        !root cpusets.
      
      * cpuset_propagate_hotplug() is responsible for updating the specified
        cpuset so that it doesn't include any resource which isn't available
        to top_cpuset.  If no CPU or memory is left after update, all tasks
        are moved to the nearest ancestor with both resources.
      
      * update_tasks_cpumask() and update_tasks_nodemask() are now always
        called after cpus or mems masks are updated even if the cpuset
        doesn't have any task.  This is for brevity and not expected to have
        any measurable effect.
      
      * cpu_active_mask and N_HIGH_MEMORY are read exactly once per
        cpuset_handle_hotplug() invocation, all cpusets share the same view
        of what resources are available, and cpuset_handle_hotplug() can
        handle multiple resources going up and down.  These properties will
        allow async operation.
      
      The reorganization, while drastic, is equivalent and shouldn't cause
      any behavior difference.  This will enable making hotplug handling
      async and remove get_online_cpus() -> cgroup_mutex nesting.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
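The propagation step described in deb7aa30 (trim a cpuset to what top_cpuset still has, then fall back to the nearest ancestor with both resources) might look roughly like this in a userspace model. `struct cs_sketch` and `propagate_hotplug_sketch()` are hypothetical names; actual task migration and all locking are left out.

```c
#include <stddef.h>

/* Hypothetical, reduced cpuset with a parent pointer for the walk. */
struct cs_sketch {
    unsigned long cpus;       /* allowed CPUs, one bit each */
    unsigned long mems;       /* allowed memory nodes, one bit each */
    struct cs_sketch *parent; /* NULL for the top cpuset */
};

/*
 * Drop any resource that the top cpuset no longer has, then return the
 * cpuset in which the tasks should end up: cs itself if it still has both
 * CPUs and memory, otherwise the nearest ancestor that does.
 */
struct cs_sketch *propagate_hotplug_sketch(struct cs_sketch *cs,
                                           const struct cs_sketch *top)
{
    struct cs_sketch *dst = cs;

    cs->cpus &= top->cpus;
    cs->mems &= top->mems;

    while (dst && (!dst->cpus || !dst->mems))
        dst = dst->parent;
    return dst;
}
```

For example, if a leaf cpuset only contained a CPU that has gone offline, the function empties its cpus mask and returns its parent as the destination.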
    • cpuset: cleanup cpuset[_can]_attach() · 4e4c9a14
      Tejun Heo committed
      cpuset_can_attach() prepares global variables cpus_attach and
      cpuset_attach_nodemask_{to|from} which are used by cpuset_attach().
      There is no reason to prepare in cpuset_can_attach().  The same
      information can be accessed from cpuset_attach().
      
      Move the preparation logic from cpuset_can_attach() to cpuset_attach()
      and make the global variables static ones inside cpuset_attach().
      
      With this change, there's no reason to keep
      cpuset_attach_nodemask_{from|to} global.  Move them inside
      cpuset_attach().  Unfortunately, we need to keep cpus_attach global as
      it can't be allocated from cpuset_attach().
      
      v2: cpus_attach not converted to cpumask_t as per Li Zefan and Rusty
          Russell.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
    • cpuset: introduce cpuset_for_each_child() · ae8086ce
      Tejun Heo committed
      Instead of iterating cgroup->children directly, introduce and use
      cpuset_for_each_child() which wraps cgroup_for_each_child() and
      performs online check.  As it uses the generic iterator, it requires
      RCU read locking too.
      
      As cpuset is currently protected by cgroup_mutex, non-online cpusets
      aren't visible to all the iterations and this patch currently doesn't
      make any functional difference.  This will be used to de-couple cpuset
      locking from cgroup core.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
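The idea behind ae8086ce, an iterator that wraps a generic child walk and skips cpusets that are not online, can be sketched like this. The node layout, `CS_ONLINE_BIT`, and the macro name are hypothetical, and the RCU read locking that the real iterator requires is not modeled.

```c
#include <stddef.h>

#define CS_ONLINE_BIT 0x1u /* hypothetical flag bit for this sketch */

struct cs_node {
    unsigned int flags;
    struct cs_node *first_child;
    struct cs_node *next_sibling;
};

/* Advance to the first online node at or after n in the sibling list. */
static struct cs_node *next_online(struct cs_node *n)
{
    while (n && !(n->flags & CS_ONLINE_BIT))
        n = n->next_sibling;
    return n;
}

/* Visit every online child of parent; offline children are skipped. */
#define cs_for_each_online_child(pos, parent)              \
    for ((pos) = next_online((parent)->first_child); (pos); \
         (pos) = next_online((pos)->next_sibling))

int count_online_children(struct cs_node *parent)
{
    struct cs_node *pos;
    int n = 0;

    cs_for_each_online_child(pos, parent)
        n++;
    return n;
}
```

Callers never see offline children, so the online check lives in one place instead of being repeated at every iteration site.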
    • cpuset: introduce CS_ONLINE · efeb77b2
      Tejun Heo committed
      Add CS_ONLINE which is set from css_online() and cleared from
      css_offline().  This will enable using generic cgroup iterator while
      allowing decoupling cpuset from cgroup internal locking.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cpuset: introduce ->css_on/offline() · c8f699bb
      Tejun Heo committed
      Add cpuset_css_on/offline() and rearrange css init/exit such that,
      
      * Allocation and clearing to the default values happen in css_alloc().
        Allocation now uses kzalloc().
      
      * Config inheritance and registration happen in css_online().
      
      * css_offline() undoes what css_online() did.
      
      * css_free() frees.
      
      This doesn't introduce any visible behavior changes.  This will help
      cleaning up locking.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cpuset: remove fast exit path from remove_tasks_in_empty_cpuset() · 0772324a
      Tejun Heo committed
      The function isn't that hot, the overhead of missing the fast exit is
      low, the test itself depends heavily on cgroup internals, and it's
      gonna be a hindrance when trying to decouple cpuset locking from
      cgroup core.  Remove the fast exit path.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cpuset: remove unused cpuset_unlock() · 01c889cf
      Tejun Heo committed
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  6. 13 Dec, 2012 1 commit
  7. 20 Nov, 2012 2 commits
  8. 24 Jul, 2012 4 commits
  9. 02 Apr, 2012 1 commit
    • cgroup: convert all non-memcg controllers to the new cftype interface · 4baf6e33
      Tejun Heo committed
      Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio,
      net_cls and device controllers to use the new cftype based interface.
      Termination entry is added to cftype arrays and populate callbacks are
      replaced with cgroup_subsys->base_cftypes initializations.
      
      This is functionally identical transformation.  There shouldn't be any
      visible behavior change.
      
      memcg is rather special and will be converted separately.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <paul@paulmenage.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Vivek Goyal <vgoyal@redhat.com>
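The "termination entry" mentioned in 4baf6e33 is the usual sentinel-terminated array pattern. A minimal sketch, assuming a reduced `struct cftype_sketch` with an inline name buffer and an empty name as the terminator (the field set and the file names are illustrative, not the kernel's definitions):

```c
#include <stddef.h>

/* Hypothetical, reduced control-file descriptor. */
struct cftype_sketch {
    char name[32];  /* empty name marks the end of the array */
    int private_id; /* illustrative per-file payload */
};

/* A controller declares its files statically, ending with a terminator. */
static const struct cftype_sketch example_files[] = {
    { .name = "cpus", .private_id = 1 },
    { .name = "mems", .private_id = 2 },
    { .name = "" }, /* terminate */
};

/* Core code can walk the array without being told its length. */
size_t count_cftypes_sketch(const struct cftype_sketch *cfts)
{
    size_t n = 0;

    for (const struct cftype_sketch *cft = cfts; cft->name[0] != '\0'; cft++)
        n++;
    return n;
}
```

Because the array carries its own end marker, a controller only has to point at it once instead of registering each file from a populate callback.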
  10. 29 Mar, 2012 1 commit
  11. 28 Mar, 2012 1 commit
  12. 27 Mar, 2012 1 commit
    • sched: Fix select_fallback_rq() vs cpu_active/cpu_online · 2baab4e9
      Peter Zijlstra committed
      Commit 5fbd036b ("sched: Cleanup cpu_active madness"), which was
      supposed to finally sort the cpu_active mess, instead uncovered more.
      
      Since CPU_STARTING is run before setting the cpu online, there's a
      (small) window where the cpu has active,!online.
      
      If during this time there's a wakeup of a task that used to reside on
      that cpu select_task_rq() will use select_fallback_rq() to compute an
      alternative cpu to run on since we find !online.
      
      select_fallback_rq() however will compute the new cpu against
      cpu_active, this means that it can return the same cpu it started out
      with, the !online one, since that cpu is in fact marked active.
      
      This results in us trying to schedule a task on an offline cpu and
      triggering a WARN in the IPI code.
      
      The solution proposed by Chuansheng Liu of setting cpu_active in
      set_cpu_online() is buggy, firstly not all archs actually use
      set_cpu_online(), secondly, not all archs call set_cpu_online() with
      IRQs disabled, this means we would introduce either the same race or
      the race from fd8a7de1 ("x86: cpu-hotplug: Prevent softirq wakeup on
      wrong CPU") -- albeit much narrower.
      
      [ By setting online first and active later we have a window of
        online,!active, fresh and bound kthreads have task_cpu() of 0 and
        since cpu0 isn't in tsk_cpus_allowed() we end up in
        select_fallback_rq() which excludes !active, resulting in a reset
        of ->cpus_allowed and the thread running all over the place. ]
      
      The solution is to re-work select_fallback_rq() to require active
      _and_ online. This makes the active,!online case work as expected,
      OTOH archs running CPU_STARTING after setting online are now
      vulnerable to the issue from fd8a7de1 -- these are alpha and
      blackfin.
      Reported-by: Chuansheng Liu <chuansheng.liu@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: linux-alpha@vger.kernel.org
      Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
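The fix in 2baab4e9 boils down to requiring both conditions when picking a fallback CPU. A userspace sketch with plain bitmasks (the function name and the "lowest set bit wins" policy are assumptions; the real select_fallback_rq() has further fallback stages):

```c
/*
 * Pick the lowest-numbered CPU that the task is allowed on and that is
 * both active and online. Returns -1 if there is none. nr_cpus must not
 * exceed the number of bits in unsigned long.
 */
int pick_fallback_cpu_sketch(unsigned long allowed, unsigned long active,
                             unsigned long online, int nr_cpus)
{
    unsigned long usable = allowed & active & online;

    for (int cpu = 0; cpu < nr_cpus; cpu++)
        if (usable & (1UL << cpu))
            return cpu;
    return -1;
}
```

With CPU 1 in the active,!online window and CPU 2 fully up, a check against the active mask alone could hand back CPU 1, while the combined check skips it and returns CPU 2.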
  13. 22 Mar, 2012 1 commit
    • cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87
      Mel Gorman committed
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") wins a super prize for the largest number of
      memory barriers entered into fast paths for one commit.
      
      [get|put]_mems_allowed is incredibly heavy with pairs of full memory
      barriers inserted into a number of hot paths.  This was detected while
      investigating a large page allocator slowdown introduced some time
      after 2.6.32.  The largest portion of this overhead was shown by
      oprofile to be at an mfence introduced by this commit into the page
      allocator hot path.
      
      For extra style points, the commit introduced the use of yield() in an
      implementation of what looks like a spinning mutex.
      
      This patch replaces the full memory barriers on both read and write
      sides with a sequence counter with just read barriers on the fast path
      side.  This is much cheaper on some architectures, including x86.  The
      main bulk of the patch is the retry logic if the nodemask changes in a
      manner that can cause a false failure.
      
      While updating the nodemask, a check is made to see if a false failure
      is a risk.  If it is, the sequence number gets bumped and parallel
      allocators will briefly stall while the nodemask update takes place.
      
      In a page fault test microbenchmark, oprofile samples from
      __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
      actual results were
      
                                   3.3.0-rc3          3.3.0-rc3
                                   rc3-vanilla        nobarrier-v2r1
          Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
          Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
          Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
          Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
          Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
          Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
          Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
          Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
          Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
          Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
          Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
          Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
          Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
          Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
          Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
          MMTests Statistics: duration
          Sys Time Running Test (seconds)             135.68    132.17
          User+Sys Time Running Test (seconds)         164.2    160.13
          Total Elapsed Time (seconds)                123.46    120.87
      
      The overall improvement is small but the System CPU time is much
      improved and roughly in correlation to what oprofile reported (these
      performance figures are without profiling so skew is expected).  The
      actual number of page faults is noticeably improved.
      
      For benchmarks like kernel builds, the overall benefit is marginal but
      the system CPU time is slightly reduced.
      
      To test the actual bug the commit fixed I opened two terminals.  The
      first ran within a cpuset and continually ran a small program that
      faulted 100M of anonymous data.  In a second window, the nodemask of the
      cpuset was continually randomised in a loop.
      
      Without the commit, the program would fail every so often (usually
      within 10 seconds) and obviously with the commit everything worked fine.
      With this patch applied, it also worked fine so the fix should be
      functionally equivalent.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
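The sequence-counter scheme described in cc9a6c87 can be approximated in userspace with C11 atomics. This is a sketch under stated assumptions: all names are hypothetical, the default sequentially consistent atomics are used for simplicity (stronger than the read barriers the commit talks about), a single writer is assumed, and this writer always bumps the counter whereas the commit only does so when a false failure is a risk.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-task mems_allowed state. */
struct mems_sketch {
    atomic_uint seq;       /* odd while an update is in flight */
    atomic_ulong nodemask; /* allowed memory nodes, one bit each */
};

/* Reader entry: wait out any in-flight update, then remember the count. */
unsigned int get_mems_allowed_sketch(struct mems_sketch *m)
{
    unsigned int s;

    while ((s = atomic_load(&m->seq)) & 1u)
        ; /* writer in progress */
    return s;
}

/* Reader exit: true if no update happened since the matching get. */
bool put_mems_allowed_sketch(struct mems_sketch *m, unsigned int start)
{
    return atomic_load(&m->seq) == start;
}

/* Writer (single writer assumed): bump, store, bump again. */
void update_nodemask_sketch(struct mems_sketch *m, unsigned long new_mask)
{
    atomic_fetch_add(&m->seq, 1u); /* now odd: readers will retry */
    atomic_store(&m->nodemask, new_mask);
    atomic_fetch_add(&m->seq, 1u); /* even again: update is visible */
}

/* Retry loop of the kind an allocator fast path would use. */
unsigned long read_nodemask_sketch(struct mems_sketch *m)
{
    unsigned int s;
    unsigned long mask;

    do {
        s = get_mems_allowed_sketch(m);
        mask = atomic_load(&m->nodemask);
    } while (!put_mems_allowed_sketch(m, s));
    return mask;
}
```

The reader side never writes shared state, which is where the saving over paired full barriers comes from; the cost moves to the occasional retry when an update races with a read.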
  14. 03 Feb, 2012 1 commit
    • cgroup: remove cgroup_subsys argument from callbacks · 761b3ef5
      Li Zefan committed
      The argument is not used at all, and it's not necessary, because
      a specific callback handler of course knows which subsys it
      belongs to.
      
      Now only ->populate() takes this argument, because the handlers of
      this callback always call cgroup_add_file()/cgroup_add_files().
      
      So we reduce a few lines of code, though the shrinking of object size
      is minimal.
      
       16 files changed, 113 insertions(+), 162 deletions(-)
      
         text    data     bss     dec     hex filename
      5486240  656987 7039960 13183187         c928d3 vmlinux.o.orig
      5486170  656987 7039960 13183117         c9288d vmlinux.o
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  15. 21 Dec, 2011 1 commit
    • cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask · b246272e
      David Rientjes committed
      Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
      nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
      new set of allowed cpuset nodes where the two nodemasks, as a result of
      the remap, are now disjoint.
      
      c0ff7453 ("cpuset,mm: fix no node to alloc memory when changing
      cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
      nodes from changing for a thread.  This causes any update to a set of
      allowed nodes to stall until put_mems_allowed() is called.
      
      This stall is unnecessary, however, if at least one node remains unchanged
      in the update to the set of allowed nodes.  This was addressed by
      89e8a244 ("cpusets: avoid looping when storing to mems_allowed if one
      node remains set"), but it's still possible that an empty nodemask may be
      read from a mempolicy because the old nodemask may be remapped to the new
      nodemask during rebind.  To prevent this, only avoid the stall if there is
      no mempolicy for the thread being changed.
      
      This is a temporary solution until all reads from mempolicy nodemasks can
      be guaranteed to not be empty without the get_mems_allowed()
      synchronization.
      
      Also moves the check for nodemask intersection inside task_lock() so that
      tsk->mems_allowed cannot change.  This ensures that nothing can set this
      tsk's mems_allowed out from under us and also protects tsk->mempolicy.
      Reported-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <paul@paulmenage.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
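The condition that b246272e describes (only skip the stall when the thread has no mempolicy and the old and new nodemasks intersect) reduces to a single predicate. A sketch with plain bitmasks; the function name and the boolean parameter are hypothetical:

```c
#include <stdbool.h>

/*
 * Decide whether a nodemask update has to stall for readers.
 * It must stall if the task has a mempolicy (its nodemask may be remapped
 * during rebind) or if the old and new masks share no node.
 */
bool need_stall_sketch(bool has_mempolicy, unsigned long old_mask,
                       unsigned long new_mask)
{
    return has_mempolicy || !(old_mask & new_mask);
}
```

In the commit this check is made under task_lock(), which the sketch does not model.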
  16. 13 Dec, 2011 3 commits
  17. 03 Nov, 2011 1 commit
    • cpusets: avoid looping when storing to mems_allowed if one node remains set · 89e8a244
      David Rientjes committed
      {get,put}_mems_allowed() exist so that general kernel code may locklessly
      access a task's set of allowable nodes without having the chance that a
      concurrent write will cause the nodemask to be empty on configurations
      where MAX_NUMNODES > BITS_PER_LONG.
      
      This could incur a significant delay, however, especially in low memory
      conditions because the page allocator is blocking and reclaim requires
      get_mems_allowed() itself.  It is not atypical to see writes to
      cpuset.mems take over 2 seconds to complete, for example.  In low memory
      conditions, this is problematic because it's one of the most important
      times to change cpuset.mems in the first place!
      
      The only way a task's set of allowable nodes may change is through cpusets
      by writing to cpuset.mems and when attaching a task to a generic code is
      not reading the nodemask with get_mems_allowed() at the same time, and
      then clearing all the old nodes.  This prevents the possibility that a
      reader will see an empty nodemask at the same time the writer is storing a
      new nodemask.
      
      If at least one node remains unchanged, though, it's possible to simply
      set all new nodes and then clear all the old nodes.  Changing a task's
      nodemask is protected by cgroup_mutex so it's guaranteed that two threads
      are not changing the same task's nodemask at the same time, so the
      nodemask is guaranteed to be stored before another thread changes it and
      determines whether a node remains set or not.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Paul Menage <paul@paulmenage.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
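The two-step store that 89e8a244 relies on (set all new nodes first, then clear the old ones) can be shown on a single-word bitmask. This is a simplified model: on a one-word mask the store is atomic anyway, whereas the commit concerns configurations where MAX_NUMNODES > BITS_PER_LONG and the mask spans several words. The function name is hypothetical and the reader-wait step is only marked by a comment.

```c
#include <stdbool.h>

/*
 * Replace *mems_allowed with newmems without ever exposing an empty mask
 * when the old and new masks overlap. Returns true if the masks were
 * disjoint, i.e. the caller would have had to wait for readers.
 */
bool change_mems_allowed_sketch(unsigned long *mems_allowed,
                                unsigned long newmems)
{
    bool need_loop = !(*mems_allowed & newmems);

    /* Step 1: set all new nodes; the mask is now old | new. */
    *mems_allowed |= newmems;

    /* If need_loop, the real code waits here until no reader is active. */

    /* Step 2: clear the old nodes by storing the final mask. */
    *mems_allowed = newmems;
    return need_loop;
}
```

When at least one node is common to both masks, every intermediate state still contains that node, so a lockless reader cannot observe an empty nodemask.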
  18. 31 Oct, 2011 1 commit
  19. 27 Jul, 2011 1 commit