    sched: fix improper load balance across sched domain · 908a7c1b
    Committed by Ken Chen
    We recently discovered a nasty performance bug in the kernel CPU load
    balancer that hit us with a 50% performance regression.
    
    When tasks are restricted via CPU affinity to a subset of CPUs that
    spans sched_domains (either ccNUMA nodes or the new multi-core
    domain), the kernel fails to balance load properly across these
    domains, because several checks in find_busiest_group() misidentify
    the busiest sched group within a given domain. This leads to
    inadequate load balancing and causes the 50% performance hit.
    
    To give a concrete example: a dual-core, 2-socket NUMA system has
    4 logical CPUs, organized as:
    
    CPU0 attaching sched-domain:
     domain 0: span 0003  groups: 0001 0002
     domain 1: span 000f  groups: 0003 000c
    CPU1 attaching sched-domain:
     domain 0: span 0003  groups: 0002 0001
     domain 1: span 000f  groups: 0003 000c
    CPU2 attaching sched-domain:
     domain 0: span 000c  groups: 0004 0008
     domain 1: span 000f  groups: 000c 0003
    CPU3 attaching sched-domain:
     domain 0: span 000c  groups: 0008 0004
     domain 1: span 000f  groups: 000c 0003
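
    The span and group values above read as CPU bitmasks, with bit n
    standing for logical CPU n. The sketch below is illustrative only:
    mask_weight() and groups_cover_span() are hypothetical userspace
    helpers, not kernel functions. They count the CPUs in a mask and
    check that a domain's groups are disjoint and together cover its
    span.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of CPUs set in a mask (bit n = CPU n). */
int mask_weight(unsigned int mask)
{
    int n = 0;

    for (; mask; mask &= mask - 1)  /* clear the lowest set bit each pass */
        n++;
    return n;
}

/*
 * Hypothetical helper: return 1 if the n group masks are pairwise
 * disjoint and their union equals span, 0 otherwise.
 */
int groups_cover_span(unsigned int span, const unsigned int *groups, size_t n)
{
    unsigned int seen = 0;

    assert(groups != NULL);
    for (size_t i = 0; i < n; i++) {
        if (seen & groups[i])
            return 0;               /* a CPU appears in two groups */
        seen |= groups[i];
    }
    return seen == span;
}
```

    For CPU0's domain 1, the groups 0003 and 000c cover span 000f, so
    groups_cover_span(0x000f, groups, 2) returns 1, and
    mask_weight(0x000f) returns 4 for the four logical CPUs.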
    
    If I run 2 tasks with CPU affinity set to 0x5, there are situations
    where cpu0 has a run queue length of 2 and cpu2 is idle.  The
    kernel load balancer is unable to spread these two tasks over
    cpu0 and cpu2 because at least three checks in find_busiest_group()
    heavily bias load balancing towards power saving mode.  For example,
    while determining the "busiest" variable, the kernel only sets it
    when "sum_nr_running > group_capacity".  This test is flawed because
    "sum_nr_running" is not necessarily the same as the number of tasks
    allowed to run within the sched group.  The end result is that the
    kernel "thinks" everything is balanced, while in reality there is an
    imbalance that leaves one CPU over-subscribed and another idle.
    Two other checks in the same function have a similar effect.  The
    nastiness of this bug is that the kernel is unable to get unstuck
    from this broken state.  From what we've seen in our environment,
    the kernel stays stuck in the imbalanced state for extended periods
    of time, and it is very easy for the kernel to get into that state
    (it's pretty much 100% reproducible for us).
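
    The flawed comparison can be replayed in a small userspace model.
    This is a sketch under stated assumptions, not kernel code: every
    function name is hypothetical, group capacity is assumed to be one
    task per CPU in the group, and all running tasks are assumed to
    share the same affinity mask.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4  /* the four logical CPUs of the example system */

/* Hypothetical helper: number of CPUs set in a mask (bit n = CPU n). */
int mask_weight(unsigned int mask)
{
    int n = 0;

    for (; mask; mask &= mask - 1)
        n++;
    return n;
}

/* Sum of the run queue lengths of the CPUs in a group. */
int sum_nr_running(unsigned int group, const int nr_running[NR_CPUS])
{
    int sum = 0;

    assert(nr_running != NULL);
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (group & (1u << cpu))
            sum += nr_running[cpu];
    return sum;
}

/* The test quoted in the text; capacity is assumed to be 1 task per CPU. */
int group_looks_busy(unsigned int group, const int nr_running[NR_CPUS])
{
    int group_capacity = mask_weight(group);

    return sum_nr_running(group, nr_running) > group_capacity;
}

/*
 * Same comparison, but counting only the CPUs the tasks may run on.
 * Shown to expose the gap, not as the fix applied by this patch.
 */
int group_busy_for_affinity(unsigned int group, unsigned int affinity,
                            const int nr_running[NR_CPUS])
{
    int usable = mask_weight(group & affinity);

    return sum_nr_running(group, nr_running) > usable;
}
```

    With run queue lengths {2, 0, 0, 0} and affinity 0x5, group 0003
    holds 2 tasks against an assumed capacity of 2, so
    group_looks_busy() returns 0 even though only cpu0 in that group is
    usable by the tasks; group_busy_for_affinity() returns 1 for the
    same input.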
    
    The proposed fix: add additional logic in find_busiest_group() to
    detect intrinsic imbalance within the busiest group.  When such a
    condition is detected, load balancing goes into spread mode instead
    of the default grouping mode.
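
    One way such a detection could look is sketched below as a
    userspace model.  The function name and the threshold (run queue
    lengths differing by more than one task) are assumptions made for
    illustration; they are not taken from the patch itself.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define NR_CPUS 4  /* the four logical CPUs of the example system */

/*
 * Hypothetical check: return 1 if the run queue lengths of the CPUs in
 * a group differ by more than one task, i.e. moving a task out of the
 * longest queue would still leave it at least as long as the shortest.
 */
int group_has_internal_imbalance(unsigned int group,
                                 const int nr_running[NR_CPUS])
{
    int max = INT_MIN, min = INT_MAX;

    assert(nr_running != NULL);
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (!(group & (1u << cpu)))
            continue;
        if (nr_running[cpu] > max)
            max = nr_running[cpu];
        if (nr_running[cpu] < min)
            min = nr_running[cpu];
    }
    if (max < min)
        return 0;                   /* no CPU in the group */
    return max - min > 1;
}
```

    For the example above (run queue lengths {2, 0, 0, 0}), group 0003
    is flagged because cpu0 holds 2 tasks while cpu1 holds none.  A
    balancer that sees this flag could then treat the group as a
    candidate for spreading rather than conclude it is balanced.
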
    Signed-off-by: Ken Chen <kenchen@google.com>
    Signed-off-by: Ingo Molnar <mingo@elte.hu>
sched.c 169.8 KB