1. 13 Sep 2013 (3 commits)
    • sched/fair: Reduce local_group logic · b72ff13c
      Committed by Peter Zijlstra
      Try to reduce the local_group logic by pulling most of it into
      update_sd_lb_stats.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-mgezl354xgyhiyrte78fdkpd@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Rewrite group_imb trigger · 6263322c
      Committed by Peter Zijlstra
      Change the group_imb detection from the old 'load-spike' detector to
      an actual imbalance detector. We set it from the lower domain balance
      pass when it fails to create a balance in the presence of task
      affinities.
      
      The advantage is that this should no longer generate the false
      positive group_imb conditions generated by transient load spikes from
      the normal balancing/bulk-wakeup etc. behaviour.
      
      While I haven't actually observed those they could happen.
      
      I'm not entirely happy with this patch; it somehow feels a little
      fragile.
      
      Nor does it solve the biggest issue I have with the group_imb code: it
      is still a fragile construct in that once we 'fixed' the imbalance
      we'll not detect the group_imb again and could end up re-creating it.
      
      That said, this patch does seem to preserve behaviour for the
      described degenerate case. In particular on my 2*6*2 wsm-ep:
      
        taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done'
      
      ends up with 9 spinners, each on their own CPU; whereas if you disable
      the group_imb code that typically doesn't happen (you'll get one pair
      sharing a CPU most of the time).
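      The propagation described above can be sketched in plain userspace C.
      This is a minimal model with made-up names; the real code threads this
      state through the load-balance environment and sched_group structures:

      ```c
      #include <assert.h>

      /* Hypothetical model of the new trigger: a lower-level balance pass
       * that fails to move any task because of CPU affinity raises an
       * imbalance hint on its parent group; the next higher-level pass
       * consumes the hint as group_imb. */
      struct group {
          int imbalance;  /* hint set by the child domain's failed balance */
      };

      /* Child-level pass: report whether tasks moved and whether any
       * candidate was pinned by its affinity mask. */
      static void lower_balance(struct group *parent, int moved, int some_pinned)
      {
          if (!moved && some_pinned)
              parent->imbalance = 1;  /* could not balance due to affinities */
      }

      /* Parent-level pass: read and clear the hint. */
      static int upper_balance(struct group *g)
      {
          int imb = g->imbalance;
          g->imbalance = 0;  /* one-shot: consumed by this pass */
          return imb;
      }

      int main(void)
      {
          struct group g = { 0 };

          lower_balance(&g, /*moved=*/1, /*some_pinned=*/0);
          assert(upper_balance(&g) == 0);  /* normal balance: no group_imb */

          lower_balance(&g, /*moved=*/0, /*some_pinned=*/1);
          assert(upper_balance(&g) == 1);  /* pinned failure raises group_imb */
          assert(upper_balance(&g) == 0);  /* and it does not persist */
          return 0;
      }
      ```

      The one-shot clear mirrors the fragility the author notes: once the
      upper pass has acted on the hint, a fresh failure must raise it again.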
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-36fpbgl39dv4u51b6yz2ypz5@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones · 6c9a27f5
      Committed by Daisuke Nishimura
      There is a small race between copy_process() and cgroup_attach_task()
      where child->se.parent,cfs_rq points to invalid (old) ones.
      
              parent doing fork()      | someone moving the parent to another cgroup
        -------------------------------+---------------------------------------------
          copy_process()
            + dup_task_struct()
              -> parent->se is copied to child->se.
                 se.parent,cfs_rq of them point to old ones.
      
                                           cgroup_attach_task()
                                             + cgroup_task_migrate()
                                               -> parent->cgroup is updated.
                                             + cpu_cgroup_attach()
                                               + sched_move_task()
                                                 + task_move_group_fair()
                                                   +- set_task_rq()
                                                      -> se.parent,cfs_rq of parent
                                                         are updated.
      
            + cgroup_fork()
              -> parent->cgroup is copied to child->cgroup. (*1)
            + sched_fork()
              + task_fork_fair()
                -> se.parent,cfs_rq of child are accessed
                   while they point to old ones. (*2)
      
      In the worst case, this bug can lead to a use-after-free and cause a panic,
      because it is the new cgroup's refcount that is incremented at (*1),
      so the old cgroup (and its related data) can be freed before (*2).
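      The stale-copy window itself can be modeled in a few lines of userspace
      C (hypothetical names, no real cgroup machinery; this only illustrates
      why a by-value copy of the parent's pointers goes stale):

      ```c
      #include <assert.h>

      /* Illustrative model: dup copies the parent's pointer fields by value,
       * so a concurrent move that rewrites the parent's pointer leaves the
       * child pointing at the old object. */
      struct cgroup { int id; };
      struct task { struct cgroup *grp; };

      static void dup_task(struct task *child, const struct task *parent)
      {
          *child = *parent;  /* memcpy-style copy, pointers included */
      }

      static void move_task(struct task *t, struct cgroup *to)
      {
          t->grp = to;  /* only this task's pointer is rewritten */
      }

      int main(void)
      {
          struct cgroup old_grp = { .id = 1 }, new_grp = { .id = 2 };
          struct task parent = { .grp = &old_grp }, child;

          dup_task(&child, &parent);       /* copy_process(): child inherits old */
          move_task(&parent, &new_grp);    /* cgroup_attach_task() on the parent */

          /* The child still references the old cgroup; if that object were
           * freed at this point, the child's later access at (*2) would be a
           * use-after-free. */
          assert(parent.grp->id == 2);
          assert(child.grp->id == 1);
          return 0;
      }
      ```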
      
      In fact, a panic caused by this bug was originally caught in RHEL6.4.
      
          BUG: unable to handle kernel NULL pointer dereference at (null)
          IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0
          [...]
          Call Trace:
           [<ffffffff81051f25>] place_entity+0x75/0xa0
           [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160
           [<ffffffff81063c0b>] sched_fork+0x6b/0x140
           [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450
           [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130
           [<ffffffff8106d2f4>] do_fork+0x94/0x460
           [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100
           [<ffffffff81009598>] sys_clone+0x28/0x30
           [<ffffffff8100b393>] stub_clone+0x13/0x20
           [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 10 Sep 2013 (1 commit)
  3. 02 Sep 2013 (8 commits)
  4. 01 Aug 2013 (1 commit)
  5. 31 Jul 2013 (1 commit)
  6. 23 Jul 2013 (3 commits)
    • sched: Micro-optimize the smart wake-affine logic · 7d9ffa89
      Committed by Peter Zijlstra
      Smart wake-affine currently uses the node size as its factor, but the
      overhead of the mask operation is high.

      This patch therefore introduces the 'sd_llc_size' percpu variable, which
      records the size of the highest cache-sharing domain and serves as the
      new factor, in order to reduce the overhead and make the factor more
      reasonable.
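      As a sketch (illustrative userspace C, not the kernel code): the LLC
      size is cached once at topology-update time, so the wakeup fast path
      reads a plain integer instead of weighing a cpumask:

      ```c
      #include <assert.h>

      #define NR_CPUS 12

      /* Stand-in for the per-cpu 'sd_llc_size' variable. */
      static int sd_llc_size[NR_CPUS];

      /* Called when sched domains are (re)built, not on the wakeup path. */
      static void update_top_cache_domain(int cpu, int llc_weight)
      {
          sd_llc_size[cpu] = llc_weight;
      }

      /* Wake-affine factor: previously the node size computed from a mask,
       * now a cached LLC size read in O(1). */
      static int wake_affine_factor(int cpu)
      {
          return sd_llc_size[cpu];
      }

      int main(void)
      {
          for (int cpu = 0; cpu < NR_CPUS; cpu++)
              update_top_cache_domain(cpu, 6);  /* e.g. 6 CPUs share one LLC */
          assert(wake_affine_factor(3) == 6);
          return 0;
      }
      ```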
      Tested-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Tested-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Link: http://lkml.kernel.org/r/51D5008E.6030102@linux.vnet.ibm.com
      [ Tidied up the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Implement smarter wake-affine logic · 62470419
      Committed by Michael Wang
      The wake-affine scheduler feature is currently always trying to pull
      the wakee close to the waker. In theory this should be beneficial if
      the waker's CPU caches hot data for the wakee, and it's also beneficial
      in the extreme ping-pong high context switch rate case.
      
      Testing shows it can benefit hackbench up to 15%.
      
      However, the feature is somewhat blind, and some workloads, such as
      pgbench, suffer from it. It's also algorithmically time-consuming.
      
      Testing shows it can damage pgbench up to 50% - far more than the
      benefit it brings in the best case.
      
      So wake-affine should be smarter and it should realize when to
      stop its thankless effort at trying to find a suitable CPU to wake on.
      
      This patch introduces 'wakee_flips', which will be increased each
      time the task flips (switches) its wakee target.
      
      So a high 'wakee_flips' value means the task has more than one
      wakee, and the bigger the number, the higher the wakeup frequency.
      
      Now, when making the decision whether to pull, pay attention to a wakee
      with a high 'wakee_flips': pulling such a task may benefit the wakee,
      but it also implies that the waker will face heavy competition later
      (how heavy depends on the story behind 'wakee_flips'), so the waker
      suffers.
      
      Furthermore, if waker also has a high 'wakee_flips', that implies that
      multiple tasks rely on it, then waker's higher latency will damage all
      of them, so pulling wakee seems to be a bad deal.
      
      Thus, when 'waker->wakee_flips / wakee->wakee_flips' becomes
      higher and higher, the cost of pulling seems to be worse and worse.
      
      The patch therefore helps the wake-affine feature to stop its pulling
      work when:
      
      	wakee->wakee_flips > factor &&
      	waker->wakee_flips > (factor * wakee->wakee_flips)
      
      The 'factor' here is the number of CPUs in the current CPU's NUMA node,
      so a bigger node leads to more pulling, since the condition for giving
      up becomes harder to meet.
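      The stop condition can be sketched as a small userspace C model
      (illustrative names; the kernel also decays 'wakee_flips' over time,
      which is omitted here):

      ```c
      #include <assert.h>

      struct task {
          unsigned int wakee_flips;   /* how often this task switches wakee */
      };

      /* Return 1 when wake-affine should give up pulling the wakee close
       * to the waker, per the two-part condition above. */
      static int wake_wide(const struct task *waker, const struct task *wakee,
                           unsigned int factor)
      {
          return wakee->wakee_flips > factor &&
                 waker->wakee_flips > factor * wakee->wakee_flips;
      }

      int main(void)
      {
          unsigned int factor = 12;                     /* e.g. a 12-CPU node */
          struct task master = { .wakee_flips = 300 };  /* wakes many tasks */
          struct task worker = { .wakee_flips = 20 };   /* wakes several tasks */
          struct task ping = { .wakee_flips = 1 };      /* 1:1 pair member */
          struct task pong = { .wakee_flips = 1 };

          assert(wake_wide(&master, &worker, factor));  /* both wide: stop pulling */
          assert(!wake_wide(&ping, &pong, factor));     /* ping-pong: keep pulling */
          return 0;
      }
      ```

      The extreme ping-pong case keeps its affine wakeups, while a
      one-waker-many-wakees pattern (the pgbench-like case) stops pulling.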
      
      After applying the patch, pgbench shows up to 40% improvements and no regressions.
      
      Tested on a 12-CPU x86 server with tip 3.10.0-rc7.
      
      The percentages in the final column highlight the areas with the biggest wins,
      all other areas improved as well:
      
      	pgbench		    base	smart
      
      	| db_size | clients |  tps  |	|  tps  |
      	+---------+---------+-------+   +-------+
      	| 22 MB   |       1 | 10598 |   | 10796 |
      	| 22 MB   |       2 | 21257 |   | 21336 |
      	| 22 MB   |       4 | 41386 |   | 41622 |
      	| 22 MB   |       8 | 51253 |   | 57932 |
      	| 22 MB   |      12 | 48570 |   | 54000 |
      	| 22 MB   |      16 | 46748 |   | 55982 | +19.75%
      	| 22 MB   |      24 | 44346 |   | 55847 | +25.93%
      	| 22 MB   |      32 | 43460 |   | 54614 | +25.66%
      	| 7484 MB |       1 |  8951 |   |  9193 |
      	| 7484 MB |       2 | 19233 |   | 19240 |
      	| 7484 MB |       4 | 37239 |   | 37302 |
      	| 7484 MB |       8 | 46087 |   | 50018 |
      	| 7484 MB |      12 | 42054 |   | 48763 |
      	| 7484 MB |      16 | 40765 |   | 51633 | +26.66%
      	| 7484 MB |      24 | 37651 |   | 52377 | +39.11%
      	| 7484 MB |      32 | 37056 |   | 51108 | +37.92%
      	| 15 GB   |       1 |  8845 |   |  9104 |
      	| 15 GB   |       2 | 19094 |   | 19162 |
      	| 15 GB   |       4 | 36979 |   | 36983 |
      	| 15 GB   |       8 | 46087 |   | 49977 |
      	| 15 GB   |      12 | 41901 |   | 48591 |
      	| 15 GB   |      16 | 40147 |   | 50651 | +26.16%
      	| 15 GB   |      24 | 37250 |   | 52365 | +40.58%
      	| 15 GB   |      32 | 36470 |   | 50015 | +37.14%
      Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/51D50057.9000809@linux.vnet.ibm.com
      [ Improved the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Move h_load calculation to task_h_load() · 68520796
      Committed by Vladimir Davydov
      The bad thing about update_h_load(), which computes the hierarchical
      load factor for task groups, is that it is called for each task group
      in the system before every load-balancer run; since rebalancing can be
      triggered very often, this function can eat a lot of CPU time if there
      are many cpu cgroups in the system.
      
      Although the situation was improved significantly by commit a35b6466
      ('sched, cgroup: Reduce rq->lock hold times for large cgroup
      hierarchies'), the problem can still arise under some kinds of loads,
      e.g. when cpus are switching from idle to busy and back very frequently.
      
      For instance, when I start 1000 processes that wake up every
      millisecond on my 8-CPU host, 'top' and 'perf top' show:
      
      Cpu(s): 17.8%us, 24.3%sy,  0.0%ni, 57.9%id,  0.0%wa,  0.0%hi,  0.0%si
      Events: 243K cycles
        7.57%  [kernel]               [k] __schedule
        7.08%  [kernel]               [k] timerqueue_add
        6.13%  libc-2.12.so           [.] usleep
      
      Then if I create 10000 *idle* cpu cgroups (no processes in them), cpu
      usage increases significantly although the 'wakers' are still executing
      in the root cpu cgroup:
      
      Cpu(s): 19.1%us, 48.7%sy,  0.0%ni, 31.6%id,  0.0%wa,  0.0%hi,  0.7%si
      Events: 230K cycles
       24.56%  [kernel]            [k] tg_load_down
        5.76%  [kernel]            [k] __schedule
      
      This happens because this particular kind of load triggers 'new idle'
      rebalance very frequently, which requires calling update_h_load(),
      which, in turn, calls tg_load_down() for every *idle* cpu cgroup even
      though it is absolutely useless, because idle cpu cgroups have no tasks
      to pull.
      
      This patch tries to improve the situation by making the h_load
      calculation proceed only when h_load is really necessary. To achieve
      this, it substitutes update_h_load() with update_cfs_rq_h_load(), which
      computes h_load only for a given cfs_rq and all its ancestors, and makes
      the load balancer call this function whenever it considers whether a
      task should be pulled, i.e. it moves the h_load calculation directly
      into task_h_load(). So that the h_load of a given cfs_rq is not updated
      multiple times (in case several tasks in the same cgroup are considered
      during the same balance run), the patch keeps the time of the last
      h_load update for each cfs_rq and stops the calculation when it finds
      h_load to be up to date.
      
      The benefit of it is that h_load is computed only for those cfs_rq's,
      which really need it, in particular all idle task groups are skipped.
      Although this, in fact, moves h_load calculation under rq lock, it
      should not affect latency much, because the amount of work done under rq
      lock while trying to pull tasks is limited by sched_nr_migrate.
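      The lazy, timestamp-guarded scheme can be modeled in userspace C
      (illustrative names and a simplified load formula; only the caching
      structure matches the description above):

      ```c
      #include <assert.h>

      /* Each cfs_rq remembers when its h_load was last computed; task_h_load()
       * walks up to the root only when the cached value is stale, so idle
       * groups that are never considered for pulling are never visited. */
      struct cfs_rq {
          struct cfs_rq *parent;     /* NULL at the root */
          unsigned long weight;      /* this group's share of its parent, 0..1024 */
          unsigned long h_load;      /* cached hierarchical load factor */
          unsigned long last_update; /* "jiffies" of the last computation */
      };

      static unsigned long jiffies_now;
      static unsigned long computations;  /* instrumented for demonstration */

      static void update_cfs_rq_h_load(struct cfs_rq *rq)
      {
          if (rq->last_update == jiffies_now)
              return;  /* already fresh for this balance pass */
          computations++;
          if (!rq->parent) {
              rq->h_load = 1024;  /* root carries full load */
          } else {
              update_cfs_rq_h_load(rq->parent);  /* refresh ancestors first */
              rq->h_load = rq->parent->h_load * rq->weight / 1024;
          }
          rq->last_update = jiffies_now;
      }

      static unsigned long task_h_load(struct cfs_rq *rq, unsigned long task_weight)
      {
          update_cfs_rq_h_load(rq);  /* computed only on demand */
          return rq->h_load * task_weight / 1024;
      }

      int main(void)
      {
          struct cfs_rq root  = { .parent = 0,     .weight = 1024 };
          struct cfs_rq group = { .parent = &root, .weight = 512  };

          jiffies_now = 1;
          (void)task_h_load(&group, 1024);
          assert(computations == 2);      /* group + root computed once */
          (void)task_h_load(&group, 512); /* same pass: cache hit */
          assert(computations == 2);

          jiffies_now = 2;                /* next balance pass */
          (void)task_h_load(&group, 1024);
          assert(computations == 4);      /* recomputed once per pass */
          return 0;
      }
      ```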
      
      After the patch applied with the setup described above (1000 wakers in
      the root cgroup and 10000 idle cgroups), I get:
      
      Cpu(s): 16.9%us, 24.8%sy,  0.0%ni, 58.4%id,  0.0%wa,  0.0%hi,  0.0%si
      Events: 242K cycles
        7.57%  [kernel]                  [k] __schedule
        6.70%  [kernel]                  [k] timerqueue_add
        5.93%  libc-2.12.so              [.] usleep
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1373896159-1278-1-git-send-email-vdavydov@parallels.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 22 Jul 2013 (1 commit)
  8. 18 Jul 2013 (1 commit)
  9. 15 Jul 2013 (1 commit)
    • kernel: delete __cpuinit usage from all core kernel files · 0db0628d
      Committed by Paul Gortmaker
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the uses of the __cpuinit macros from C files in
      the core kernel directories (kernel, init, lib, mm, and include)
      that don't really have a specific maintainer.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  10. 27 Jun 2013 (9 commits)
  11. 19 Jun 2013 (2 commits)
  12. 31 May 2013 (1 commit)
  13. 28 May 2013 (3 commits)
  14. 10 May 2013 (1 commit)
  15. 07 May 2013 (1 commit)
  16. 26 Apr 2013 (1 commit)
    • sched: Fix init NOHZ_IDLE flag · 25f55d9d
      Committed by Vincent Guittot
      On my SMP platform, which is made of 5 cores in 2 clusters, the
      nr_busy_cpus field of the sched_group_power struct is not zero when
      the platform is fully idle - which makes the scheduler unhappy.
      
      The root cause is:
      
      During the boot sequence, some CPUs reach the idle loop and set
      their NOHZ_IDLE flag while waiting for other CPUs to boot. But
      the nr_busy_cpus field is initialized later with the assumption
      that all CPUs are in the busy state, whereas some CPUs have
      already set their NOHZ_IDLE flag.
      
      More generally, the NOHZ_IDLE flag must be initialized when new
      sched_domains are created in order to ensure that NOHZ_IDLE and
      nr_busy_cpus are aligned.
      
      This condition can be ensured by adding a synchronize_rcu()
      between the destruction of old sched_domains and the creation of
      new ones, so the NOHZ_IDLE flag will not be updated against an old
      sched_domain once it has been initialized. But this solution
      introduces an additional latency into the rebuild sequence that is
      called during cpu hotplug.
      
      As suggested by Frederic Weisbecker, another solution is to have
      the same RCU lifecycle for both NOHZ_IDLE and the sched_domain
      struct. A new nohz_idle field is added to sched_domain so that
      both the status and the sched_domain share the same RCU lifecycle
      and are always synchronized. In addition, there is no longer any
      need to protect nohz_idle against concurrent access, as it is only
      modified by two exclusive functions called by the local cpu.
      
      This solution has been preferred to creating a new struct with an
      extra pointer indirection for sched_domain.
      
      The synchronization is done at the cost of:

       - An additional indirection and an rcu_dereference for accessing nohz_idle.
       - We use only the nohz_idle field of the top sched_domain.
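      The alignment-by-construction idea can be sketched in userspace C
      (illustrative names; RCU publication of the new domain pointer is
      left out):

      ```c
      #include <assert.h>
      #include <stdlib.h>

      /* Model of the fix: the idle flag lives inside the domain object
       * itself, so rebuilding the domain atomically replaces the flag too,
       * and both fields start out consistent with each other. */
      struct domain {
          int nohz_idle;     /* this CPU's idle status, cleared on creation */
          int nr_busy_cpus;  /* initialized assuming every CPU is busy */
      };

      static struct domain *build_domain(int ncpus)
      {
          struct domain *sd = calloc(1, sizeof(*sd));
          sd->nohz_idle = 0;        /* fresh flag... */
          sd->nr_busy_cpus = ncpus; /* ...aligned with "all busy" by construction */
          return sd;
      }

      int main(void)
      {
          /* Boot-time hazard: a CPU marks itself idle in the OLD domain... */
          struct domain *old = build_domain(5);
          old->nohz_idle = 1;
          old->nr_busy_cpus--;

          /* ...then domains are rebuilt: flag and counter are re-aligned,
           * because they live (and die) together in one object. */
          struct domain *sd = build_domain(5);
          assert(sd->nohz_idle == 0 && sd->nr_busy_cpus == 5);
          free(old);
          free(sd);
          return 0;
      }
      ```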
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: linaro-kernel@lists.linaro.org
      Cc: peterz@infradead.org
      Cc: fweisbec@gmail.com
      Cc: pjt@google.com
      Cc: rostedt@goodmis.org
      Cc: efault@gmx.de
      Link: http://lkml.kernel.org/r/1366729142-14662-1-git-send-email-vincent.guittot@linaro.org
      [ Fixed !NO_HZ build bug. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 24 Apr 2013 (2 commits)