Commit ea67821b, authored by Vincent Guittot, committed by Ingo Molnar

sched: Replace capacity_factor by usage

The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
SCHED_CAPACITY_SCALE.

'struct sg_lb_stats:group_capacity_factor' divides the capacity of the group
by SCHED_LOAD_SCALE to estimate how many tasks can run in the group. Then, it
compares this value with the sum of nr_running to decide if the group is
overloaded or not.

But the 'group_capacity_factor' concept hardly works for SMT systems; it
sometimes works for big cores but fails to do the right thing for little cores.

Below are two examples to illustrate the problem that this patch solves:

1 - If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
   (640 as an example), a group of 3 CPUs will have a max capacity_factor of 2
   (div_round_closest(3x640/1024) = 2), which means that it will be seen as
   overloaded even if we have only one task per CPU.

2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
   (1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
   (at most, thanks to the fix [0] for SMT systems that prevents the appearance
   of ghost CPUs). But if one CPU is fully used by rt tasks (and its capacity is
   reduced to nearly nothing), the capacity factor of the group will still be 4
   (div_round_closest(3*1512/1024) = 4, which is within the cap of 4 set by [0]).
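
The arithmetic behind the two examples can be reproduced with a small standalone sketch. It is not the removed sg_capacity_factor(): the helper name capacity_factor() and the simple "cap at the number of CPUs" rule are assumptions made for illustration, and SCHED_CAPACITY_SCALE is taken to be 1024 as in the examples.

```c
#define SCHED_CAPACITY_SCALE 1024UL

/* Rounds x / d to the nearest integer, as DIV_ROUND_CLOSEST() does for unsigned operands. */
unsigned long div_round_closest(unsigned long x, unsigned long d)
{
	return (x + d / 2) / d;
}

/*
 * Hypothetical helper: estimate how many tasks a group can hold from its
 * summed capacity, capped at the number of CPUs in the group.
 */
unsigned long capacity_factor(unsigned long group_capacity,
			      unsigned long nr_cpus)
{
	unsigned long factor = div_round_closest(group_capacity,
						 SCHED_CAPACITY_SCALE);

	return factor < nr_cpus ? factor : nr_cpus;
}
```

With these assumptions, capacity_factor(3 * 640, 3) gives 2 for three little CPUs, so three tasks already look like overload. capacity_factor(3 * 1512, 4) stays at 4 for four big CPUs even though only three of them have meaningful capacity left.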

So, this patch tries to solve this issue by removing capacity_factor and
replacing it with the 2 following metrics:

  - The available CPU's capacity for CFS tasks which is already used by
    load_balance().

  - The usage of the CPU by the CFS tasks. For the latter, utilization_avg_contrib
    has been re-introduced to compute the usage of a CPU by CFS tasks.

'group_capacity_factor' and 'group_has_free_capacity' have been removed and replaced
by 'group_no_capacity'. We compare the number of tasks with the number of CPUs and
we evaluate the level of utilization of the CPUs to determine whether a group is
overloaded or whether it has capacity to handle more tasks.
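
As a rough illustration of the two checks described above, the following sketch evaluates them outside the kernel. The struct group_stats type is a hypothetical, trimmed-down stand-in for struct sg_lb_stats, and imbalance_pct is passed as a plain parameter; the value 125 used in the note below is an assumption, not something this changelog states.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the fields of struct sg_lb_stats used here. */
struct group_stats {
	unsigned long group_capacity;	/* capacity available for CFS tasks */
	unsigned long group_usage;	/* usage of the group by CFS tasks */
	unsigned int sum_nr_running;	/* tasks running in the group */
	unsigned int group_weight;	/* number of CPUs in the group */
};

/* Spare capacity: fewer tasks than CPUs, or usage clearly below capacity. */
bool group_has_capacity(const struct group_stats *s, unsigned int imbalance_pct)
{
	if (s->sum_nr_running < s->group_weight)
		return true;

	return s->group_capacity * 100 > s->group_usage * imbalance_pct;
}

/* Overloaded: more tasks than CPUs and usage clearly above capacity. */
bool group_is_overloaded(const struct group_stats *s, unsigned int imbalance_pct)
{
	if (s->sum_nr_running <= s->group_weight)
		return false;

	return s->group_capacity * 100 < s->group_usage * imbalance_pct;
}
```

For three little CPUs (group_capacity 1920) running one busy task each with a usage of 1800 and an assumed imbalance_pct of 125, both functions return false: the group has no spare capacity but is not overloaded, which is the case the old capacity_factor of 2 misclassified.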

For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
so it will be selected in priority (among the overloaded groups). Since [1],
SD_PREFER_SIBLING is no longer affected by the computation of 'load_above_capacity'
because local is not overloaded.

[1] 9a5d9ba6 ("sched/fair: Allow calculate_imbalance() to move idle cpus")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Morten.Rasmussen@arm.com
Cc: dietmar.eggemann@arm.com
Cc: efault@gmx.de
Cc: kamalesh@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1425052454-25797-9-git-send-email-vincent.guittot@linaro.org
[ Tidied up the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Parent 8bb5b00c
@@ -5936,11 +5936,10 @@ struct sg_lb_stats {
 	unsigned long group_capacity;
 	unsigned long group_usage; /* Total usage of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
-	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_has_free_capacity;
+	int group_no_capacity;
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -6156,28 +6155,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 }

 /*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
+ * Check whether the capacity of the rq has been noticeably reduced by side
+ * activity. The imbalance_pct is used for the threshold.
+ * Return true is the capacity is reduced
  */
 static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
+check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
 {
-	/*
-	 * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
-	 */
-	if (!(sd->flags & SD_SHARE_CPUCAPACITY))
-		return 0;
-
-	/*
-	 * If ~90% of the cpu_capacity is still there, we're good.
-	 */
-	if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
-		return 1;
-
-	return 0;
+	return ((rq->cpu_capacity * sd->imbalance_pct) <
+				(rq->cpu_capacity_orig * 100));
 }

 /*
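
To make the threshold in check_cpu_capacity() concrete, here is a standalone sketch of the same comparison. The function name cpu_capacity_reduced() and the plain integer parameters are hypothetical; the imbalance_pct of 125 used in the note below is an assumed value.

```c
#include <stdbool.h>

/*
 * Hypothetical standalone form of the comparison: true when the capacity
 * left on the CPU is noticeably below its original capacity.
 */
bool cpu_capacity_reduced(unsigned long cpu_capacity,
			  unsigned long cpu_capacity_orig,
			  unsigned int imbalance_pct)
{
	return cpu_capacity * imbalance_pct < cpu_capacity_orig * 100;
}
```

With an assumed imbalance_pct of 125 and an original capacity of 1024, the check fires once the remaining capacity drops below 80% of the original: a remaining capacity of 800 counts as reduced, while 820 does not.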
@@ -6215,37 +6201,56 @@ static inline int sg_imbalanced(struct sched_group *group)
 }

 /*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
+ * group_has_capacity returns true if the group has spare capacity that could
+ * be used by some tasks.
+ * We consider that a group has spare capacity if the * number of task is
+ * smaller than the number of CPUs or if the usage is lower than the available
+ * capacity for CFS tasks.
+ * For the latter, we use a threshold to stabilize the state, to take into
+ * account the variance of the tasks' load and to return true if the available
+ * capacity in meaningful for the load balancer.
+ * As an example, an available capacity of 1% can appear but it doesn't make
+ * any benefit for the load balance.
  */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 {
-	unsigned int capacity_factor, smt, cpus;
-	unsigned int capacity, capacity_orig;
+	if (sgs->sum_nr_running < sgs->group_weight)
+		return true;

-	capacity = group->sgc->capacity;
-	capacity_orig = group->sgc->capacity_orig;
-	cpus = group->group_weight;
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;

-	/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
-	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
-	capacity_factor = cpus / smt; /* cores */
+	return false;
+}

-	capacity_factor = min_t(unsigned,
-		capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
-	if (!capacity_factor)
-		capacity_factor = fix_small_capacity(env->sd, group);
+/*
+ * group_is_overloaded returns true if the group has more tasks than it can
+ * handle.
+ * group_is_overloaded is not equals to !group_has_capacity because a group
+ * with the exact right number of tasks, has no more spare capacity but is not
+ * overloaded so both group_has_capacity and group_is_overloaded return
+ * false.
+ */
+static inline bool
+group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return false;

-	return capacity_factor;
+	if ((sgs->group_capacity * 100) <
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
+
+	return false;
 }

-static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+static enum group_type group_classify(struct lb_env *env,
+		struct sched_group *group,
+		struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running > sgs->group_capacity_factor)
+	if (sgs->group_no_capacity)
 		return group_overloaded;

 	if (sg_imbalanced(group))
@@ -6306,11 +6311,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;

 	sgs->group_weight = group->group_weight;
-	sgs->group_capacity_factor = sg_capacity_factor(env, group);
-	sgs->group_type = group_classify(group, sgs);

-	if (sgs->group_capacity_factor > sgs->sum_nr_running)
-		sgs->group_has_free_capacity = 1;
+	sgs->group_no_capacity = group_is_overloaded(env, sgs);
+	sgs->group_type = group_classify(env, group, sgs);
 }

 /**
@@ -6432,18 +6435,19 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

 		/*
 		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity factor to one so that we'll try
+		 * first, lower the sg capacity so that we'll try
 		 * and move all the excess tasks away. We lower the capacity
 		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks, i.e. nr_running < group_capacity_factor. The
-		 * extra check prevents the case where you always pull from the
-		 * heaviest group when it is already under-utilized (possible
-		 * with a large weight task outweighs the tasks on the system).
+		 * these excess tasks. The extra check prevents the case where
+		 * you always pull from the heaviest group when it is already
+		 * under-utilized (possible with a large weight task outweighs
+		 * the tasks on the system).
 		 */
 		if (prefer_sibling && sds->local &&
-		    sds->local_stat.group_has_free_capacity) {
-			sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
-			sgs->group_type = group_classify(sg, sgs);
+		    group_has_capacity(env, &sds->local_stat) &&
+		    (sgs->sum_nr_running > 1)) {
+			sgs->group_no_capacity = 1;
+			sgs->group_type = group_overloaded;
 		}

 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
@@ -6623,11 +6627,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type == group_overloaded) {
-		load_above_capacity =
-			(busiest->sum_nr_running - busiest->group_capacity_factor);
-
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
-		load_above_capacity /= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running *
+					SCHED_LOAD_SCALE;
+		if (load_above_capacity > busiest->group_capacity)
+			load_above_capacity -= busiest->group_capacity;
+		else
+			load_above_capacity = ~0UL;
 	}

 	/*
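
The new load_above_capacity computation in this hunk can be tried in isolation with the sketch below. The function wrapper and its parameters are hypothetical, and SCHED_LOAD_SCALE is assumed to be 1024, matching the value used in the changelog examples.

```c
#include <limits.h>

#define SCHED_LOAD_SCALE 1024UL

/*
 * Hypothetical wrapper around the new computation: each running task is
 * assumed to contribute SCHED_LOAD_SCALE of load, and whatever exceeds the
 * group capacity is the load above capacity.
 */
unsigned long load_above_capacity(unsigned int sum_nr_running,
				  unsigned long group_capacity)
{
	unsigned long load = sum_nr_running * SCHED_LOAD_SCALE;

	if (load > group_capacity)
		return load - group_capacity;

	return ULONG_MAX;	/* same value as ~0UL in the patch */
}
```

For a group of three little CPUs (group_capacity 1920) running 4 tasks, the assumed load is 4096, so the result is 2176. With a single task the load does not exceed the capacity and the function returns ~0UL.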
@@ -6690,6 +6695,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;

+	/* ASYM feature bypasses nice load balance check */
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
@@ -6710,8 +6716,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;

 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
-	    !busiest->group_has_free_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
+	    busiest->group_no_capacity)
 		goto force_balance;

 	/*
@@ -6770,7 +6776,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;

 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long capacity, capacity_factor, wl;
+		unsigned long capacity, wl;
 		enum fbq_type rt;

 		rq = cpu_rq(i);
@@ -6799,9 +6805,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			continue;

 		capacity = capacity_of(i);
-		capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
-		if (!capacity_factor)
-			capacity_factor = fix_small_capacity(env->sd, group);

 		wl = weighted_cpuload(i);

@@ -6809,7 +6812,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu capacity.
 		 */
-		if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+		if (rq->nr_running == 1 && wl > env->imbalance &&
+		    !check_cpu_capacity(rq, env->sd))
 			continue;

 		/*
......