1. 22 2月, 2014 2 次提交
  2. 11 2月, 2014 1 次提交
    • P
      sched: Push down pre_schedule() and idle_balance() · 38033c37
      Peter Zijlstra 提交于
      This patch both merged idle_balance() and pre_schedule() and pushes
      both of them into pick_next_task().
      
      Conceptually pre_schedule() and idle_balance() are rather similar,
      both are used to pull more work onto the current CPU.
      
      We cannot however first move idle_balance() into pre_schedule_fair()
      since there is no guarantee the last runnable task is a fair task, and
      thus we would miss newidle balances.
      
      Similarly, the dl and rt pre_schedule calls must be ran before
      idle_balance() since their respective tasks have higher priority and
      it would not do to delay their execution searching for less important
      tasks first.
      
      However, by noticing that pick_next_tasks() already traverses the
      sched_class hierarchy in the right order, we can get the right
      behaviour and do away with both calls.
      
      We must however change the special case optimization to also require
      that prev is of sched_class_fair, otherwise we can miss doing a dl or
      rt pull where we needed one.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/n/tip-a8k6vvaebtn64nie345kx1je@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      38033c37
  3. 10 2月, 2014 3 次提交
  4. 09 2月, 2014 1 次提交
  5. 13 1月, 2014 8 次提交
    • D
      sched: Reduce trigger_load_balance() parameters · 7caff66f
      Daniel Lezcano 提交于
      The cpu information is already stored in the struct rq, so no need to pass it
      as parameter to the trigger_load_balance function.
      
      Cc: linaro-kernel@lists.linaro.org
      Cc: preeti.lkml@gmail.com
      Cc: mingo@redhat.com
      Cc: peterz@infradead.org
      Signed-off-by: NDaniel Lezcano <daniel.lezcano@linaro.org>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1389008085-9069-2-git-send-email-daniel.lezcano@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7caff66f
    • P
      sched/deadline: Remove the sysctl_sched_dl knobs · 1724813d
      Peter Zijlstra 提交于
      Remove the deadline specific sysctls for now. The problem with them is
      that the interaction with the exisiting rt knobs is nearly impossible
      to get right.
      
      The current (as per before this patch) situation is that the rt and dl
      bandwidth is completely separate and we enforce rt+dl < 100%. This is
      undesirable because this means that the rt default of 95% leaves us
      hardly any room, even though dl tasks are saver than rt tasks.
      
      Another proposed solution was (a discarted patch) to have the dl
      bandwidth be a fraction of the rt bandwidth. This is highly
      confusing imo.
      
      Furthermore neither proposal is consistent with the situation we
      actually want; which is rt tasks ran from a dl server. In which case
      the rt bandwidth is a direct subset of dl.
      
      So whichever way we go, the introduction of dl controls at this point
      is painful. Therefore remove them and instead share the rt budget.
      
      This means that for now the rt knobs are used for dl admission control
      and the dl runtime is accounted against the rt runtime. I realise that
      this isn't entirely desirable either; but whatever we do we appear to
      need to change the interface later, so better have a small interface
      for now.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-zpyqbqds1r0vyxtxza1e7rdc@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1724813d
    • J
      sched/deadline: speed up SCHED_DEADLINE pushes with a push-heap · 6bfd6d72
      Juri Lelli 提交于
      Data from tests confirmed that the original active load balancing
      logic didn't scale neither in the number of CPU nor in the number of
      tasks (as sched_rt does).
      
      Here we provide a global data structure to keep track of deadlines
      of the running tasks in the system. The structure is composed by
      a bitmask showing the free CPUs and a max-heap, needed when the system
      is heavily loaded.
      
      The implementation and concurrent access scheme are kept simple by
      design. However, our measurements show that we can compete with sched_rt
      on large multi-CPUs machines [1].
      
      Only the push path is addressed, the extension to use this structure
      also for pull decisions is straightforward. However, we are currently
      evaluating different (in order to decrease/avoid contention) data
      structures to solve possibly both problems. We are also going to re-run
      tests considering recent changes inside cpupri [2].
      
       [1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf
       [2] http://www.spinics.net/lists/linux-rt-users/msg06778.htmlSigned-off-by: NJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-14-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6bfd6d72
    • D
      sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks · 332ac17e
      Dario Faggioli 提交于
      In order of deadline scheduling to be effective and useful, it is
      important that some method of having the allocation of the available
      CPU bandwidth to tasks and task groups under control.
      This is usually called "admission control" and if it is not performed
      at all, no guarantee can be given on the actual scheduling of the
      -deadline tasks.
      
      Since when RT-throttling has been introduced each task group have a
      bandwidth associated to itself, calculated as a certain amount of
      runtime over a period. Moreover, to make it possible to manipulate
      such bandwidth, readable/writable controls have been added to both
      procfs (for system wide settings) and cgroupfs (for per-group
      settings).
      
      Therefore, the same interface is being used for controlling the
      bandwidth distrubution to -deadline tasks and task groups, i.e.,
      new controls but with similar names, equivalent meaning and with
      the same usage paradigm are added.
      
      However, more discussion is needed in order to figure out how
      we want to manage SCHED_DEADLINE bandwidth at the task group level.
      Therefore, this patch adds a less sophisticated, but actually
      very sensible, mechanism to ensure that a certain utilization
      cap is not overcome per each root_domain (the single rq for !SMP
      configurations).
      
      Another main difference between deadline bandwidth management and
      RT-throttling is that -deadline tasks have bandwidth on their own
      (while -rt ones doesn't!), and thus we don't need an higher level
      throttling mechanism to enforce the desired bandwidth.
      
      This patch, therefore:
      
       - adds system wide deadline bandwidth management by means of:
          * /proc/sys/kernel/sched_dl_runtime_us,
          * /proc/sys/kernel/sched_dl_period_us,
         that determine (i.e., runtime / period) the total bandwidth
         available on each CPU of each root_domain for -deadline tasks;
      
       - couples the RT and deadline bandwidth management, i.e., enforces
         that the sum of how much bandwidth is being devoted to -rt
         -deadline tasks to stay below 100%.
      
      This means that, for a root_domain comprising M CPUs, -deadline tasks
      can be created until the sum of their bandwidths stay below:
      
          M * (sched_dl_runtime_us / sched_dl_period_us)
      
      It is also possible to disable this bandwidth management logic, and
      be thus free of oversubscribing the system up to any arbitrary level.
      Signed-off-by: NDario Faggioli <raistlin@linux.it>
      Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      332ac17e
    • D
      sched/deadline: Add SCHED_DEADLINE inheritance logic · 2d3d891d
      Dario Faggioli 提交于
      Some method to deal with rt-mutexes and make sched_dl interact with
      the current PI-coded is needed, raising all but trivial issues, that
      needs (according to us) to be solved with some restructuring of
      the pi-code (i.e., going toward a proxy execution-ish implementation).
      
      This is under development, in the meanwhile, as a temporary solution,
      what this commits does is:
      
       - ensure a pi-lock owner with waiters is never throttled down. Instead,
         when it runs out of runtime, it immediately gets replenished and it's
         deadline is postponed;
      
       - the scheduling parameters (relative deadline and default runtime)
         used for that replenishments --during the whole period it holds the
         pi-lock-- are the ones of the waiting task with earliest deadline.
      
      Acting this way, we provide some kind of boosting to the lock-owner,
      still by using the existing (actually, slightly modified by the previous
      commit) pi-architecture.
      
      We would stress the fact that this is only a surely needed, all but
      clean solution to the problem. In the end it's only a way to re-start
      discussion within the community. So, as always, comments, ideas, rants,
      etc.. are welcome! :-)
      Signed-off-by: NDario Faggioli <raistlin@linux.it>
      Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
      [ Added !RT_MUTEXES build fix. ]
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2d3d891d
    • J
      sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic · 1baca4ce
      Juri Lelli 提交于
      Introduces data structures relevant for implementing dynamic
      migration of -deadline tasks and the logic for checking if
      runqueues are overloaded with -deadline tasks and for choosing
      where a task should migrate, when it is the case.
      
      Adds also dynamic migrations to SCHED_DEADLINE, so that tasks can
      be moved among CPUs when necessary. It is also possible to bind a
      task to a (set of) CPU(s), thus restricting its capability of
      migrating, or forbidding migrations at all.
      
      The very same approach used in sched_rt is utilised:
       - -deadline tasks are kept into CPU-specific runqueues,
       - -deadline tasks are migrated among runqueues to achieve the
         following:
          * on an M-CPU system the M earliest deadline ready tasks
            are always running;
          * affinity/cpusets settings of all the -deadline tasks is
            always respected.
      
      Therefore, this very special form of "load balancing" is done with
      an active method, i.e., the scheduler pushes or pulls tasks between
      runqueues when they are woken up and/or (de)scheduled.
      IOW, every time a preemption occurs, the descheduled task might be sent
      to some other CPU (depending on its deadline) to continue executing
      (push). On the other hand, every time a CPU becomes idle, it might pull
      the second earliest deadline ready task from some other CPU.
      
      To enforce this, a pull operation is always attempted before taking any
      scheduling decision (pre_schedule()), as well as a push one after each
      scheduling decision (post_schedule()). In addition, when a task arrives
      or wakes up, the best CPU where to resume it is selected taking into
      account its affinity mask, the system topology, but also its deadline.
      E.g., from the scheduling point of view, the best CPU where to wake
      up (and also where to push) a task is the one which is running the task
      with the latest deadline among the M executing ones.
      
      In order to facilitate these decisions, per-runqueue "caching" of the
      deadlines of the currently running and of the first ready task is used.
      Queued but not running tasks are also parked in another rb-tree to
      speed-up pushes.
      Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: NDario Faggioli <raistlin@linux.it>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1baca4ce
    • D
      sched/deadline: Add SCHED_DEADLINE structures & implementation · aab03e05
      Dario Faggioli 提交于
      Introduces the data structures, constants and symbols needed for
      SCHED_DEADLINE implementation.
      
      Core data structure of SCHED_DEADLINE are defined, along with their
      initializers. Hooks for checking if a task belong to the new policy
      are also added where they are needed.
      
      Adds a scheduling class, in sched/dl.c and a new policy called
      SCHED_DEADLINE. It is an implementation of the Earliest Deadline
      First (EDF) scheduling algorithm, augmented with a mechanism (called
      Constant Bandwidth Server, CBS) that makes it possible to isolate
      the behaviour of tasks between each other.
      
      The typical -deadline task will be made up of a computation phase
      (instance) which is activated on a periodic or sporadic fashion. The
      expected (maximum) duration of such computation is called the task's
      runtime; the time interval by which each instance need to be completed
      is called the task's relative deadline. The task's absolute deadline
      is dynamically calculated as the time instant a task (better, an
      instance) activates plus the relative deadline.
      
      The EDF algorithms selects the task with the smallest absolute
      deadline as the one to be executed first, while the CBS ensures each
      task to run for at most its runtime every (relative) deadline
      length time interval, avoiding any interference between different
      tasks (bandwidth isolation).
      Thanks to this feature, also tasks that do not strictly comply with
      the computational model sketched above can effectively use the new
      policy.
      
      To summarize, this patch:
       - introduces the data structures, constants and symbols needed;
       - implements the core logic of the scheduling algorithm in the new
         scheduling class file;
       - provides all the glue code between the new scheduling class and
         the core scheduler and refines the interactions between sched/dl
         and the other existing scheduling classes.
      Signed-off-by: NDario Faggioli <raistlin@linux.it>
      Signed-off-by: NMichael Trimarchi <michael@amarulasolutions.com>
      Signed-off-by: NFabio Checconi <fchecconi@gmail.com>
      Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      aab03e05
    • D
      sched: Add new scheduler syscalls to support an extended scheduling parameters ABI · d50dde5a
      Dario Faggioli 提交于
      Add the syscalls needed for supporting scheduling algorithms
      with extended scheduling parameters (e.g., SCHED_DEADLINE).
      
      In general, it makes possible to specify a periodic/sporadic task,
      that executes for a given amount of runtime at each instance, and is
      scheduled according to the urgency of their own timing constraints,
      i.e.:
      
       - a (maximum/typical) instance execution time,
       - a minimum interval between consecutive instances,
       - a time constraint by which each instance must be completed.
      
      Thus, both the data structure that holds the scheduling parameters of
      the tasks and the system calls dealing with it must be extended.
      Unfortunately, modifying the existing struct sched_param would break
      the ABI and result in potentially serious compatibility issues with
      legacy binaries.
      
      For these reasons, this patch:
      
       - defines the new struct sched_attr, containing all the fields
         that are necessary for specifying a task in the computational
         model described above;
      
       - defines and implements the new scheduling related syscalls that
         manipulate it, i.e., sched_setattr() and sched_getattr().
      
      Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
      proof of concept and for developing and testing purposes. Making them
      available on other architectures is straightforward.
      
      Since no "user" for these new parameters is introduced in this patch,
      the implementation of the new system calls is just identical to their
      already existing counterpart. Future patches that implement scheduling
      policies able to exploit the new data structure must also take care of
      modifying the sched_*attr() calls accordingly with their own purposes.
      Signed-off-by: NDario Faggioli <raistlin@linux.it>
      [ Rewrote to use sched_attr. ]
      Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
      [ Removed sched_setscheduler2() for now. ]
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d50dde5a
  6. 27 11月, 2013 1 次提交
  7. 06 11月, 2013 1 次提交
    • P
      sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus · 37dc6b50
      Preeti U Murthy 提交于
      nr_busy_cpus parameter is used by nohz_kick_needed() to find out the
      number of busy cpus in a sched domain which has SD_SHARE_PKG_RESOURCES
      flag set.  Therefore instead of updating nr_busy_cpus at every level
      of sched domain, since it is irrelevant, we can update this parameter
      only at the parent domain of the sd which has this flag set. Introduce
      a per-cpu parameter sd_busy which represents this parent domain.
      
      In nohz_kick_needed() we directly query the nr_busy_cpus parameter
      associated with the groups of sd_busy.
      
      By associating sd_busy with the highest domain which has
      SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains
      which could have this flag set and trigger nohz_idle_balancing if any
      of the levels have more than one busy cpu.
      
      sd_busy is irrelevant for asymmetric load balancing. However sd_asym
      has been introduced to represent the highest sched domain which has
      SD_ASYM_PACKING flag set so that it can be queried directly when
      required.
      
      While we are at it, we might as well change the nohz_idle parameter to
      be updated at the sd_busy domain level alone and not the base domain
      level of a CPU.  This will unify the concept of busy cpus at just one
      level of sched domain where it is currently used.
      
      Signed-off-by: Preeti U Murthy<preeti@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: svaidy@linux.vnet.ibm.com
      Cc: vincent.guittot@linaro.org
      Cc: bitbucket@online.de
      Cc: benh@kernel.crashing.org
      Cc: anton@samba.org
      Cc: Morten.Rasmussen@arm.com
      Cc: pjt@google.com
      Cc: peterz@infradead.org
      Cc: mikey@neuling.org
      Link: http://lkml.kernel.org/r/20131030031252.23426.4417.stgit@preeti.in.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      37dc6b50
  8. 29 10月, 2013 1 次提交
  9. 16 10月, 2013 1 次提交
    • P
      sched: Fix race in migrate_swap_stop() · 74602315
      Peter Zijlstra 提交于
      There is a subtle race in migrate_swap, when task P, on CPU A, decides to swap
      places with task T, on CPU B.
      
      Task P:
        - call migrate_swap
      Task T:
        - go to sleep, removing itself from the runqueue
      Task P:
        - double lock the runqueues on CPU A & B
      Task T:
        - get woken up, place itself on the runqueue of CPU C
      Task P:
        - see that task T is on a runqueue, and pretend to remove it
          from the runqueue on CPU B
      
      Now CPUs B & C both have corrupted scheduler data structures.
      
      This patch fixes it, by holding the pi_lock for both of the tasks
      involved in the migrate swap. This prevents task T from waking up,
      and placing itself onto another runqueue, until after migrate_swap
      has released all locks.
      
      This means that, when migrate_swap checks, task T will be either
      on the runqueue where it was originally seen, or not on any
      runqueue at all. Migrate_swap deals correctly with of those cases.
      Tested-by: NJoe Mario <jmario@redhat.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: hannes@cmpxchg.org
      Cc: aarcange@redhat.com
      Cc: srikar@linux.vnet.ibm.com
      Cc: tglx@linutronix.de
      Cc: hpa@zytor.com
      Link: http://lkml.kernel.org/r/20131010181722.GO13848@laptop.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      74602315
  10. 09 10月, 2013 7 次提交
  11. 20 9月, 2013 1 次提交
  12. 13 9月, 2013 1 次提交
    • P
      sched/fair: Rewrite group_imb trigger · 6263322c
      Peter Zijlstra 提交于
      Change the group_imb detection from the old 'load-spike' detector to
      an actual imbalance detector. We set it from the lower domain balance
      pass when it fails to create a balance in the presence of task
      affinities.
      
      The advantage is that this should no longer generate the false
      positive group_imb conditions generated by transient load spikes from
      the normal balancing/bulk-wakeup etc. behaviour.
      
      While I haven't actually observed those they could happen.
      
      I'm not entirely happy with this patch; it somehow feels a little
      fragile.
      
      Nor does it solve the biggest issue I have with the group_imb code; it
      it still a fragile construct in that once we 'fixed' the imbalance
      we'll not detect the group_imb again and could end up re-creating it.
      
      That said, this patch does seem to preserve behaviour for the
      described degenerate case. In particular on my 2*6*2 wsm-ep:
      
        taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done'
      
      ends up with 9 spinners, each on their own CPU; whereas if you disable
      the group_imb code that typically doesn't happen (you'll get one pair
      sharing a CPU most of the time).
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-36fpbgl39dv4u51b6yz2ypz5@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6263322c
  13. 09 8月, 2013 1 次提交
    • T
      cgroup: s/cgroup_subsys_state/cgroup_css/ s/task_subsys_state/task_css/ · 8af01f56
      Tejun Heo 提交于
      The names of the two struct cgroup_subsys_state accessors -
      cgroup_subsys_state() and task_subsys_state() - are somewhat awkward.
      The former clashes with the type name and the latter doesn't even
      indicate it's somehow related to cgroup.
      
      We're about to revamp large portion of cgroup API, so, let's rename
      them so that they're less awkward.  Most per-controller usages of the
      accessors are localized in accessor wrappers and given the amount of
      scheduled changes, this isn't gonna add any noticeable headache.
      
      Rename cgroup_subsys_state() to cgroup_css() and task_subsys_state()
      to task_css().  This patch is pure rename.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NLi Zefan <lizefan@huawei.com>
      8af01f56
  14. 23 7月, 2013 2 次提交
    • P
      sched: Micro-optimize the smart wake-affine logic · 7d9ffa89
      Peter Zijlstra 提交于
      Smart wake-affine is using node-size as the factor currently, but the overhead
      of the mask operation is high.
      
      Thus, this patch introduce the 'sd_llc_size' percpu variable, which will record
      the highest cache-share domain size, and make it to be the new factor, in order
      to reduce the overhead and make it more reasonable.
      Tested-by: NDavidlohr Bueso <davidlohr.bueso@hp.com>
      Tested-by: NMichael Wang <wangyun@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Acked-by: NMichael Wang <wangyun@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Link: http://lkml.kernel.org/r/51D5008E.6030102@linux.vnet.ibm.com
      [ Tidied up the changelog. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      7d9ffa89
    • V
      sched: Move h_load calculation to task_h_load() · 68520796
      Vladimir Davydov 提交于
      The bad thing about update_h_load(), which computes hierarchical load
      factor for task groups, is that it is called for each task group in the
      system before every load balancer run, and since rebalance can be
      triggered very often, this function can eat really a lot of cpu time if
      there are many cpu cgroups in the system.
      
      Although the situation was improved significantly by commit a35b6466
      ('sched, cgroup: Reduce rq->lock hold times for large cgroup
      hierarchies'), the problem still can arise under some kinds of loads,
      e.g. when cpus are switching from idle to busy and back very frequently.
      
      For instance, when I start 1000 of processes that wake up every
      millisecond on my 8 cpus host, 'top' and 'perf top' show:
      
      Cpu(s): 17.8%us, 24.3%sy,  0.0%ni, 57.9%id,  0.0%wa,  0.0%hi,  0.0%si
      Events: 243K cycles
        7.57%  [kernel]               [k] __schedule
        7.08%  [kernel]               [k] timerqueue_add
        6.13%  libc-2.12.so           [.] usleep
      
      Then if I create 10000 *idle* cpu cgroups (no processes in them), cpu
      usage increases significantly although the 'wakers' are still executing
      in the root cpu cgroup:
      
      Cpu(s): 19.1%us, 48.7%sy,  0.0%ni, 31.6%id,  0.0%wa,  0.0%hi,  0.7%si
      Events: 230K cycles
       24.56%  [kernel]            [k] tg_load_down
        5.76%  [kernel]            [k] __schedule
      
      This happens because this particular kind of load triggers 'new idle'
      rebalance very frequently, which requires calling update_h_load(),
      which, in turn, calls tg_load_down() for every *idle* cpu cgroup even
      though it is absolutely useless, because idle cpu cgroups have no tasks
      to pull.
      
      This patch tries to improve the situation by making h_load calculation
      proceed only when h_load is really necessary. To achieve this, it
      substitutes update_h_load() with update_cfs_rq_h_load(), which computes
      h_load only for a given cfs_rq and all its ascendants, and makes the
      load balancer call this function whenever it considers if a task should
      be pulled, i.e. it moves h_load calculations directly to task_h_load().
      For h_load of the same cfs_rq not to be updated multiple times (in case
      several tasks in the same cgroup are considered during the same balance
      run), the patch keeps the time of the last h_load update for each cfs_rq
      and breaks calculation when it finds h_load to be uptodate.
      
      The benefit of it is that h_load is computed only for those cfs_rq's,
      which really need it, in particular all idle task groups are skipped.
      Although this, in fact, moves h_load calculation under rq lock, it
      should not affect latency much, because the amount of work done under rq
      lock while trying to pull tasks is limited by sched_nr_migrate.
      
      After the patch applied with the setup described above (1000 wakers in
      the root cgroup and 10000 idle cgroups), I get:
      
      Cpu(s): 16.9%us, 24.8%sy,  0.0%ni, 58.4%id,  0.0%wa,  0.0%hi,  0.0%si
      Events: 242K cycles
        7.57%  [kernel]                  [k] __schedule
        6.70%  [kernel]                  [k] timerqueue_add
        5.93%  libc-2.12.so              [.] usleep
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1373896159-1278-1-git-send-email-vdavydov@parallels.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      68520796
  15. 27 6月, 2013 7 次提交
  16. 19 6月, 2013 1 次提交
  17. 28 5月, 2013 1 次提交