1. 13 Sep 2013 (4 commits)
  2. 10 Sep 2013 (1 commit)
  3. 02 Sep 2013 (9 commits)
  4. 16 Aug 2013 (2 commits)
  5. 14 Aug 2013 (6 commits)
    • vtime: Always debug check snapshot source _before_ updating it · af2350bd
      Committed by Frederic Weisbecker
      The vtime delta update performed by get_vtime_delta() always checks
      that the source of the snapshot is valid.
      
      Meanwhile, the snapshot updaters that rely on get_vtime_delta() also
      set the new snapshot origin. But some of them do this right before
      the call to get_vtime_delta(), making its debug check useless.
      
      This is easily fixable by moving the snapshot origin update after
      the call to get_vtime_delta(). The order doesn't matter there.
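
      Roughly, the reordering looks like this (a sketch only; VTIME_SYS and
      __vtime_account_system() stand in for whichever origin value and
      updater are involved):
      
      	/* Before: moving the origin first defeats the debug check */
      	tsk->vtime_snap_whence = VTIME_SYS;
      	__vtime_account_system(tsk);	/* calls get_vtime_delta() */
      
      	/* After: debug-check the old origin, then move the snapshot */
      	__vtime_account_system(tsk);
      	tsk->vtime_snap_whence = VTIME_SYS;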
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kevin Hilman <khilman@linaro.org>
    • vtime: Always scale generic vtime accounting results · b854fafa
      Committed by Frederic Weisbecker
      The cputime accounting in full dynticks can be a subtle
      mixup of CPUs using tick based accounting and others using
      generic vtime.
      
      As long as the tick has a share in producing these stats, we
      want to scale the result against CFS's precise accounting, since
      the tick can miss tasks hiding between the periodic interrupts.
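
      For illustration, the scaling step is of roughly this shape (a
      sketch, not the kernel's exact scale_stime(); all variables here
      are illustrative u64 cputime values):
      
      	/* Distribute the precisely measured runtime (rtime) in
      	 * proportion to the tick-sampled stime/utime split. */
      	total = stime + utime;
      	stime = div64_u64(stime * rtime, total);
      	utime = rtime - stime;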
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kevin Hilman <khilman@linaro.org>
    • vtime: Optimize full dynticks accounting off case with static keys · b0493406
      Committed by Frederic Weisbecker
      If no CPU is in the full dynticks range, we can avoid the full
      dynticks cputime accounting through generic vtime along with its
      overhead and use the traditional tick based accounting instead.
      
      Let's do this and no-op the off case with static keys.
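
      A minimal sketch of the static-key pattern, assuming a hypothetical
      key name (static_key_false() is the API of this era):
      
      	#include <linux/static_key.h>
      
      	struct static_key vtime_accounting_key = STATIC_KEY_INIT_FALSE;
      
      	static inline bool vtime_accounting_enabled(void)
      	{
      		/* patched to a fall-through no-op while no CPU is in
      		 * the full dynticks range */
      		return static_key_false(&vtime_accounting_key);
      	}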
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kevin Hilman <khilman@linaro.org>
    • vtime: Fix racy cputime delta update · 54461562
      Committed by Frederic Weisbecker
      get_vtime_delta() must be called under the task vtime_seqlock,
      together with the code that flushes the cputime accounting.
      
      Otherwise the cputime reader can be fooled and run into
      a race where it sees the snapshot update but misses the
      cputime flush. As a result it can report a cputime that is
      way too short.
      
      Fix vtime_account_user(), which wasn't complying with that rule.
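
      The fixed function plausibly ends up looking like this (a sketch
      following the description above; the helper names are assumptions):
      
      	void vtime_account_user(struct task_struct *tsk)
      	{
      		cputime_t delta_cpu;
      
      		write_seqlock(&tsk->vtime_seqlock);
      		/* snapshot update and accounting flush on the same
      		 * write side, so readers see both or neither */
      		delta_cpu = get_vtime_delta(tsk);
      		tsk->vtime_snap_whence = VTIME_SYS;
      		account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
      		write_sequnlock(&tsk->vtime_seqlock);
      	}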
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kevin Hilman <khilman@linaro.org>
    • vtime: Remove a few unneeded generic vtime state checks · 7621d1f8
      Committed by Frederic Weisbecker
      Some generic vtime APIs check if the vtime accounting
      is enabled on the local CPU before doing their work.
      
      Some of these checks are not needed because all of their callers
      already take care of that. Let's remove them.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kevin Hilman <khilman@linaro.org>
    • context_tracking: Optimize guest APIs off case with static key · 48d6a816
      Committed by Frederic Weisbecker
      Optimize the guest entry/exit APIs with static keys. This minimizes
      the overhead for those who enable CONFIG_NO_HZ_FULL without
      always using it. Passing no range to nohz_full= should
      result in the probes' overhead being minimized.
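
      For illustration, the guest entry hook is of roughly this shape (a
      sketch; the enabled check is what sits behind the static key):
      
      	static inline void guest_enter(void)
      	{
      		if (vtime_accounting_enabled())	/* static-key backed */
      			vtime_guest_enter(current);
      		else
      			current->flags |= PF_VCPU;
      	}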
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kevin Hilman <khilman@linaro.org>
  6. 13 Aug 2013 (3 commits)
    • sched: fix the theoretical signal_wake_up() vs schedule() race · e0acd0a6
      Committed by Oleg Nesterov
      This is only theoretical, but after try_to_wake_up(p) was changed
      to check p->state under p->pi_lock, code like
      
      	__set_current_state(TASK_INTERRUPTIBLE);
      	schedule();
      
      can miss a signal. This is the special wait-for-condition case:
      it relies on the try_to_wake_up/schedule interaction and thus does
      not need an mb() between __set_current_state() and if(signal_pending).
      
      However, this __set_current_state() can move into the critical
      section protected by rq->lock; now that try_to_wake_up() takes
      another lock, we need to ensure that it can't be reordered with
      the "if (signal_pending(current))" check inside that section.
      
      The patch is actually a one-liner: it simply adds smp_wmb() before
      spin_lock_irq(rq->lock). This is what try_to_wake_up() already
      does, for the same reason.
      
      We turn this wmb() into the new helper, smp_mb__before_spinlock(),
      for better documentation and to allow the architectures to change
      the default implementation.
      
      While at it, kill smp_mb__after_lock(); it has no callers.
      
      Perhaps we can also add smp_mb__before/after_spinunlock() for
      prepare_to_wait().
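
      Concretely, the change amounts to something like this (a sketch; the
      default below matches the description, and architectures may
      override it):
      
      	/* include/linux/spinlock.h */
      	#ifndef smp_mb__before_spinlock
      	#define smp_mb__before_spinlock()	smp_wmb()
      	#endif
      
      	/* __schedule(), before taking the runqueue lock */
      	/*
      	 * Make sure the signal_pending_state() check below can't be
      	 * reordered with the caller's __set_current_state().
      	 */
      	smp_mb__before_spinlock();
      	raw_spin_lock_irq(&rq->lock);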
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vtime: Update a few comments · 5b206d48
      Committed by Frederic Weisbecker
      Update a stale comment from the old vtime era and document some
      locking that might be non-obvious.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kevin Hilman <khilman@linaro.org>
    • sched: Consolidate open coded preemptible() checks · fbb00b56
      Committed by Frederic Weisbecker
      preempt_schedule() and preempt_schedule_context() open-code
      their preemptibility checks.
      
      Use the standard API instead, for consolidation.
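
      The consolidation is essentially this substitution (sketched;
      preemptible() expands to a preempt_count()/irqs_disabled() test):
      
      	/* open-coded form being removed: */
      	if (likely(ti->preempt_count || irqs_disabled()))
      		return;
      
      	/* standard API: */
      	if (likely(!preemptible()))
      		return;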
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Alex Shi <alex.shi@intel.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
  7. 09 Aug 2013 (6 commits)
    • cgroup: make cgroup_taskset deal with cgroup_subsys_state instead of cgroup · d99c8727
      Committed by Tejun Heo
      cgroup is in the process of converting to css (cgroup_subsys_state)
      from cgroup as the principal subsystem interface handle.  This is
      mostly to prepare for the unified hierarchy support, where css's will
      be created and destroyed dynamically, but it also helps clean up
      subsystem implementations, as css is usually what they are interested
      in anyway.
      
      cgroup_taskset which is used by the subsystem attach methods is the
      last cgroup subsystem API which isn't using css as the handle.  Update
      cgroup_taskset_cur_cgroup() to cgroup_taskset_cur_css() and
      cgroup_taskset_for_each() to take @skip_css instead of @skip_cgrp.
      
      The conversions are pretty mechanical.  One exception is
      cpuset::cgroup_cs(), which lost its last user and got removed.
      
      This patch shouldn't introduce any functional changes.
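
      Usage after the conversion looks roughly like this (a sketch; the
      second argument is now a css to skip rather than a cgroup, with
      NULL skipping nothing):
      
      	struct task_struct *task;
      
      	cgroup_taskset_for_each(task, NULL, tset) {
      		/* visit each task being migrated; pass a css here
      		 * instead of NULL to skip tasks already under it */
      	}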
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
    • cgroup: pass around cgroup_subsys_state instead of cgroup in file methods · 182446d0
      Committed by Tejun Heo
      cgroup is currently in the process of transitioning to using struct
      cgroup_subsys_state * as the primary handle instead of struct cgroup.
      Please see the previous commit which converts the subsystem methods
      for rationale.
      
      This patch converts all cftype file operations to take @css instead
      of @cgroup; the signature change is sketched after the list below.
      cftypes for the cgroup core files don't have their subsystem pointer
      set.  These will automatically use the dummy_css added by the
      previous patch and can be converted the same way.
      
      Most subsystem conversions are straightforward, but there are some
      interesting ones.
      
      * freezer: update_if_frozen() is also converted to take @css instead
        of @cgroup for consistency.  This will make the code look simpler
        too once iterators are converted to use css.
      
      * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
        vmpressure while mem_cgroup_from_cont() can be made static.
        Updated accordingly.
      
      * cpu: cgroup_tg() doesn't have any user left.  Removed.
      
      * cpuacct: cgroup_ca() doesn't have any user left.  Removed.
      
      * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
        Removed.
      
      * net_cls: cgrp_cls_state() doesn't have any user left.  Removed.
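
      For a read_u64 handler, the conversion has this shape (the handler
      name is hypothetical):
      
      	/* before: */
      	static u64 my_read_u64(struct cgroup *cgrp, struct cftype *cft);
      
      	/* after: */
      	static u64 my_read_u64(struct cgroup_subsys_state *css,
      			       struct cftype *cft);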
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Aristeu Rozanski <aris@redhat.com>
      Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
    • cgroup: pass around cgroup_subsys_state instead of cgroup in subsystem methods · eb95419b
      Committed by Tejun Heo
      cgroup is currently in the process of transitioning to using struct
      cgroup_subsys_state * as the primary handle instead of struct cgroup *
      in subsystem implementations for the following reasons.
      
      * With unified hierarchy, subsystems will be dynamically bound and
        unbound from cgroups and thus css's (cgroup_subsys_state) may be
        created and destroyed dynamically over the lifetime of a cgroup,
        which is different from the current state where all css's are
        allocated and destroyed together with the associated cgroup.  This
        in turn means that cgroup_css() should be synchronized and may
        return NULL, making it more cumbersome to use.
      
      * Differing levels of per-subsystem granularity in the unified
        hierarchy means that the task and descendant iterators should behave
        differently depending on the specific subsystem the iteration is
        being performed for.
      
      * In the majority of cases, subsystems only care about their part of
        the cgroup hierarchy - i.e. the hierarchy of css's.  Subsystem
        methods often obtain the matching css pointer from the cgroup and
        don't bother with the cgroup pointer itself.  Passing around css
        fits much better.
      
      This patch converts all cgroup_subsys methods to take @css instead of
      @cgroup.  The conversions are mostly straightforward.  A few
      noteworthy changes are
      
      * ->css_alloc() now takes the css of the parent cgroup rather than
        the pointer to the new cgroup, as the css for the new cgroup
        doesn't exist yet (see the signature sketch below).  Knowing the
        parent css is enough for all the existing subsystems.
      
      * In kernel/cgroup.c::offline_css(), unnecessary open coded css
        dereference is replaced with local variable access.
      
      This patch shouldn't cause any behavior differences.
      
      v2: Unnecessary explicit cgrp->subsys[] deref in css_online() replaced
          with local variable @css as suggested by Li Zefan.
      
          Rebased on top of new for-3.12 which includes for-3.11-fixes so
          that ->css_free() invocation added by da0a12ca ("cgroup: fix a
          leak when percpu_ref_init() fails") is converted too.  Suggested
          by Li Zefan.
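
      The ->css_alloc() signature change sketched (per the bullet above):
      
      	/* before: */
      	struct cgroup_subsys_state *(*css_alloc)(struct cgroup *cgrp);
      
      	/* after: the new cgroup's css doesn't exist yet, so the
      	 * parent's css is passed in instead */
      	struct cgroup_subsys_state *(*css_alloc)(struct cgroup_subsys_state *parent_css);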
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Aristeu Rozanski <aris@redhat.com>
      Acked-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
    • cgroup: add css_parent() · 63876986
      Committed by Tejun Heo
      Currently, controllers have to explicitly follow the cgroup hierarchy
      to find the parent of a given css.  cgroup is moving towards using
      cgroup_subsys_state as the main controller interface construct, so
      let's provide a way to climb the hierarchy using just csses.
      
      This patch implements css_parent() which, given a css, returns its
      parent.  The function is guaranteed to return a valid non-NULL parent
      css as long as the target css is not at the top of the hierarchy.
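
      A sketch of what such a helper plausibly looks like at this stage
      (cgroup_css() and the subsys_id lookup are assumptions based on the
      surrounding commits):
      
      	static inline struct cgroup_subsys_state *
      	css_parent(struct cgroup_subsys_state *css)
      	{
      		struct cgroup *parent_cgrp = css->cgroup->parent;
      
      		if (!parent_cgrp)
      			return NULL;	/* top of the hierarchy */
      
      		return cgroup_css(parent_cgrp, css->ss->subsys_id);
      	}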
      
      freezer, cpuset, cpu, cpuacct, hugetlb, memory, net_cls and devices
      are converted to use css_parent() instead of accessing cgroup->parent
      directly.
      
      * __parent_ca() is dropped from cpuacct and its usage is replaced
        with parent_ca().  The only difference between the two was the NULL
        test on cgroup->parent, which is now embedded in css_parent(),
        making the distinction moot.  Note that eventually a css->parent
        field will be added to css and the NULL check in css_parent() will
        go away.
      
      This patch shouldn't cause any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: add/update accessors which obtain subsys specific data from css · a7c6d554
      Committed by Tejun Heo
      css (cgroup_subsys_state) is usually embedded in a subsys specific
      data structure.  Subsystems either use container_of() directly to
      cast from css to such a data structure or have an accessor function
      wrapping such a cast.  As cgroup as a whole is moving towards using
      css as the main interface handle, add and update such accessors to
      ease dealing with css's.
      
      All accessors explicitly handle NULL input and return NULL in those
      cases.  While this looks like an extra branch in the code, since all
      controller-specific data structures have css as their first field,
      the cast doesn't involve any offsetting and the compiler can
      trivially optimize out the branch.
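
      The pattern, using the cpu controller's accessor as an example
      (illustrative; css_tg() is the kind of accessor being added):
      
      	static inline struct task_group *css_tg(struct cgroup_subsys_state *css)
      	{
      		/* css is the first field of struct task_group, so the
      		 * cast is offset-free and the NULL branch folds away */
      		return css ? container_of(css, struct task_group, css) : NULL;
      	}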
      
      * blkio, freezer, cpuset, cpu, cpuacct and net_cls didn't have such
        an accessor.  Added.
      
      * memory, hugetlb and devices already had one but didn't explicitly
        handle NULL input.  Updated.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/ · 8af01f56
      Committed by Tejun Heo
      The names of the two struct cgroup_subsys_state accessors -
      cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
      The former clashes with the type name and the latter doesn't even
      indicate it's somehow related to cgroup.
      
      We're about to revamp a large portion of the cgroup API, so let's
      rename them so that they're less awkward.  Most per-controller usages
      of the accessors are localized in accessor wrappers and, given the
      amount of scheduled changes, this isn't gonna add any noticeable
      headache.
      
      Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
      to task_css().  This patch is pure rename.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  8. 01 Aug 2013 (1 commit)
  9. 31 Jul 2013 (1 commit)
  10. 23 Jul 2013 (3 commits)
    • sched: Micro-optimize the smart wake-affine logic · 7d9ffa89
      Committed by Peter Zijlstra
      Smart wake-affine currently uses the node size as its factor, but
      the overhead of the mask operation is high.
      
      Thus, this patch introduces the 'sd_llc_size' percpu variable, which
      records the size of the highest cache-sharing domain, and makes it
      the new factor, in order to reduce the overhead and make the
      heuristic more reasonable.
      Tested-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Tested-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Link: http://lkml.kernel.org/r/51D5008E.6030102@linux.vnet.ibm.com
      [ Tidied up the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Implement smarter wake-affine logic · 62470419
      Committed by Michael Wang
      The wake-affine scheduler feature is currently always trying to pull
      the wakee close to the waker. In theory this should be beneficial if
      the waker's CPU caches hot data for the wakee, and it's also beneficial
      in the extreme ping-pong high context switch rate case.
      
      Testing shows it can benefit hackbench up to 15%.
      
      However, the feature is somewhat blind, and some workloads such as
      pgbench suffer from it. It's also algorithmically time-consuming.
      
      Testing shows it can damage pgbench up to 50% - far more than the
      benefit it brings in the best case.
      
      So wake-affine should be smarter and it should realize when to
      stop its thankless effort at trying to find a suitable CPU to wake on.
      
      This patch introduces 'wakee_flips', which will be increased each
      time the task flips (switches) its wakee target.
      
      So a high 'wakee_flips' value means the task has more than one
      wakee, and the bigger the number, the higher the wakeup frequency.
      
      Now when deciding whether or not to pull, pay attention to a wakee
      with a high 'wakee_flips': pulling such a task may benefit the
      wakee, but it also implies that the waker will face fierce
      competition later - how fierce depends on the story behind
      'wakee_flips' - so the waker suffers.
      
      Furthermore, if the waker also has a high 'wakee_flips', multiple
      tasks rely on it, and the waker's higher latency will damage all of
      them, so pulling the wakee seems to be a bad deal.
      
      Thus, as 'waker->wakee_flips / wakee->wakee_flips' grows, the cost
      of pulling gets worse and worse.
      
      The patch therefore helps the wake-affine feature to stop its pulling
      work when:
      
      	wakee->wakee_flips > factor &&
      	waker->wakee_flips > (factor * wakee->wakee_flips)
      
      The 'factor' here is the number of CPUs in the current CPU's NUMA
      node, so a bigger node leads to more pulling, since the threshold
      becomes harder to exceed.
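
      Condensed into code, the stop condition reads roughly like this (a
      sketch of the formula above; the follow-up patch replaces the
      node-size factor with 'sd_llc_size'):
      
      	static int wake_wide(struct task_struct *p)
      	{
      		int factor = cpumask_weight(cpumask_of_node(
      					cpu_to_node(smp_processor_id())));
      
      		/* both waker and wakee flip targets frequently:
      		 * give up on pulling */
      		if (p->wakee_flips > factor &&
      		    current->wakee_flips > factor * p->wakee_flips)
      			return 1;
      
      		return 0;
      	}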
      
      After applying the patch, pgbench shows up to 40% improvements and no regressions.
      
      Tested on a 12-CPU x86 server with tip 3.10.0-rc7.
      
      The percentages in the final column highlight the areas with the
      biggest wins; all other areas improved as well:
      
      	pgbench		    base	smart
      
      	| db_size | clients |  tps  |	|  tps  |
      	+---------+---------+-------+   +-------+
      	| 22 MB   |       1 | 10598 |   | 10796 |
      	| 22 MB   |       2 | 21257 |   | 21336 |
      	| 22 MB   |       4 | 41386 |   | 41622 |
      	| 22 MB   |       8 | 51253 |   | 57932 |
      	| 22 MB   |      12 | 48570 |   | 54000 |
      	| 22 MB   |      16 | 46748 |   | 55982 | +19.75%
      	| 22 MB   |      24 | 44346 |   | 55847 | +25.93%
      	| 22 MB   |      32 | 43460 |   | 54614 | +25.66%
      	| 7484 MB |       1 |  8951 |   |  9193 |
      	| 7484 MB |       2 | 19233 |   | 19240 |
      	| 7484 MB |       4 | 37239 |   | 37302 |
      	| 7484 MB |       8 | 46087 |   | 50018 |
      	| 7484 MB |      12 | 42054 |   | 48763 |
      	| 7484 MB |      16 | 40765 |   | 51633 | +26.66%
      	| 7484 MB |      24 | 37651 |   | 52377 | +39.11%
      	| 7484 MB |      32 | 37056 |   | 51108 | +37.92%
      	| 15 GB   |       1 |  8845 |   |  9104 |
      	| 15 GB   |       2 | 19094 |   | 19162 |
      	| 15 GB   |       4 | 36979 |   | 36983 |
      	| 15 GB   |       8 | 46087 |   | 49977 |
      	| 15 GB   |      12 | 41901 |   | 48591 |
      	| 15 GB   |      16 | 40147 |   | 50651 | +26.16%
      	| 15 GB   |      24 | 37250 |   | 52365 | +40.58%
      	| 15 GB   |      32 | 36470 |   | 50015 | +37.14%
      Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/51D50057.9000809@linux.vnet.ibm.com
      [ Improved the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Move h_load calculation to task_h_load() · 68520796
      Committed by Vladimir Davydov
      The bad thing about update_h_load(), which computes the hierarchical
      load factor for task groups, is that it is called for each task group
      in the system before every load balancer run, and since rebalancing
      can be triggered very often, this function can eat a lot of CPU time
      if there are many cpu cgroups in the system.
      
      Although the situation was improved significantly by commit a35b6466
      ('sched, cgroup: Reduce rq->lock hold times for large cgroup
      hierarchies'), the problem can still arise under some kinds of load,
      e.g. when cpus are switching from idle to busy and back very
      frequently.
      
      For instance, when I start 1000 processes that wake up every
      millisecond on my 8-cpu host, 'top' and 'perf top' show:
      
      Cpu(s): 17.8%us, 24.3%sy,  0.0%ni, 57.9%id,  0.0%wa,  0.0%hi,  0.0%si
      Events: 243K cycles
        7.57%  [kernel]               [k] __schedule
        7.08%  [kernel]               [k] timerqueue_add
        6.13%  libc-2.12.so           [.] usleep
      
      Then if I create 10000 *idle* cpu cgroups (no processes in them), cpu
      usage increases significantly although the 'wakers' are still executing
      in the root cpu cgroup:
      
      Cpu(s): 19.1%us, 48.7%sy,  0.0%ni, 31.6%id,  0.0%wa,  0.0%hi,  0.7%si
      Events: 230K cycles
       24.56%  [kernel]            [k] tg_load_down
        5.76%  [kernel]            [k] __schedule
      
      This happens because this particular kind of load triggers 'new idle'
      rebalance very frequently, which requires calling update_h_load(),
      which, in turn, calls tg_load_down() for every *idle* cpu cgroup even
      though it is absolutely useless, because idle cpu cgroups have no tasks
      to pull.
      
      This patch tries to improve the situation by making the h_load
      calculation proceed only when h_load is really necessary. To achieve
      this, it substitutes update_h_load() with update_cfs_rq_h_load(),
      which computes h_load only for a given cfs_rq and all its ancestors,
      and makes the load balancer call this function whenever it considers
      whether a task should be pulled, i.e. it moves the h_load
      calculations directly to task_h_load(). For the h_load of the same
      cfs_rq not to be updated multiple times (in case several tasks in the
      same cgroup are considered during the same balance run), the patch
      keeps the time of the last h_load update for each cfs_rq and breaks
      off the calculation when it finds the h_load to be up to date.
      
      The benefit is that h_load is computed only for those cfs_rq's that
      really need it; in particular, all idle task groups are skipped.
      Although this, in fact, moves the h_load calculation under the rq
      lock, it should not affect latency much, because the amount of work
      done under the rq lock while trying to pull tasks is limited by
      sched_nr_migrate.
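
      The caching idea in sketch form (names follow the description above;
      the hierarchy walk itself is elided):
      
      	static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
      	{
      		unsigned long now = jiffies;
      
      		if (cfs_rq->last_h_load_update == now)
      			return;	/* already fresh for this balance run */
      
      		/* refresh h_load for @cfs_rq and all its ancestors,
      		 * walking down from the root */
      
      		cfs_rq->last_h_load_update = now;
      	}
      
      	static unsigned long task_h_load(struct task_struct *p)
      	{
      		struct cfs_rq *cfs_rq = task_cfs_rq(p);
      
      		update_cfs_rq_h_load(cfs_rq);
      		return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
      				cfs_rq->runnable_load_avg + 1);
      	}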
      
      After the patch applied with the setup described above (1000 wakers in
      the root cgroup and 10000 idle cgroups), I get:
      
      Cpu(s): 16.9%us, 24.8%sy,  0.0%ni, 58.4%id,  0.0%wa,  0.0%hi,  0.0%si
      Events: 242K cycles
        7.57%  [kernel]                  [k] __schedule
        6.70%  [kernel]                  [k] timerqueue_add
        5.93%  libc-2.12.so              [.] usleep
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1373896159-1278-1-git-send-email-vdavydov@parallels.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 22 Jul 2013 (1 commit)
  12. 18 Jul 2013 (1 commit)
  13. 15 Jul 2013 (1 commit)
    • kernel: delete __cpuinit usage from all core kernel files · 0db0628d
      Committed by Paul Gortmaker
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  The fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the uses of the __cpuinit macros from C files in
      the core kernel directories (kernel, init, lib, mm, and include)
      that don't really have a specific maintainer.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  14. 12 Jul 2013 (1 commit)