    sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices · 192fa322
    Committed by Dave Chiluk
    commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.
    
    It has been observed that highly-threaded, non-cpu-bound applications
    running under cpu.cfs_quota_us constraints can hit a high percentage of
    periods throttled while simultaneously not consuming the allocated
    amount of quota. This use case is typical of user-interactive non-cpu
    bound applications, such as those running in kubernetes or mesos when
    run on multiple cpu cores.
    
    This has been root caused to cpu-local run queues being allocated per-cpu
    bandwidth slices and then not fully using those slices within the period,
    at which point the slice and quota expire. This expiration of unused
    slices results in applications not being able to utilize the quota
    they have been allocated.
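
The effect described above can be reproduced with a toy model. The sketch below is not kernel code: `simulate`, `struct sim_result`, and all parameter names are hypothetical, and it assumes each CPU runs one lightly loaded thread per period, pulls a slice from the global pool only when its local runtime is insufficient, and hands back everything above `keep_us` (standing in for min_cfs_rq_runtime) once it has run.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CPUS 256

/* Totals collected over a whole simulation run. */
struct sim_result {
    long used_us;           /* CPU time actually consumed */
    int throttled_periods;  /* periods in which some CPU ran short */
};

/*
 * Toy model of quota distribution (hypothetical, not the kernel algorithm).
 * Each period the global pool is refilled to quota_us. Every CPU wants
 * demand_us of runtime per period. With expire_local set, runtime still
 * cached on a CPU is discarded at the period boundary.
 */
struct sim_result simulate(int nr_cpus, int periods, long quota_us,
                           long slice_us, long keep_us, long demand_us,
                           bool expire_local)
{
    long local[MAX_CPUS] = {0};
    struct sim_result res = {0, 0};

    assert(nr_cpus > 0 && nr_cpus <= MAX_CPUS);

    for (int p = 0; p < periods; p++) {
        long global = quota_us;  /* unused global quota never carries over */
        bool throttled = false;

        for (int cpu = 0; cpu < nr_cpus; cpu++) {
            /* Pull a slice only when local runtime cannot cover demand. */
            if (local[cpu] < demand_us) {
                long grant = slice_us < global ? slice_us : global;
                global -= grant;
                local[cpu] += grant;
            }

            long run = local[cpu] < demand_us ? local[cpu] : demand_us;
            local[cpu] -= run;
            res.used_us += run;
            if (run < demand_us)
                throttled = true;

            /* Return slack to the pool, keeping at most keep_us locally. */
            if (local[cpu] > keep_us) {
                global += local[cpu] - keep_us;
                local[cpu] = keep_us;
            }
        }

        if (throttled)
            res.throttled_periods++;

        if (expire_local)
            for (int cpu = 0; cpu < nr_cpus; cpu++)
                local[cpu] = 0;  /* stranded runtime is lost */
    }
    return res;
}
```

Calling `simulate(20, 10, 10000, 5000, 1000, 200, true)` models 20 CPUs that together want 4000us per period against a 10000us quota. In this model the expiring variant consumes only 1800us per period and is throttled in every period, because the runtime kept on the first few CPUs is thrown away at each boundary. Passing `false` for `expire_local` lets that cached runtime be reused, and the throttling disappears after the first couple of periods.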
    
    The non-expiration of per-cpu slices was recently fixed by
    'commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift
    condition")'. Prior to that it appears that this had been broken since
    at least 'commit 51f2176d ("sched/fair: Fix unlocked reads of some
    cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
    added the following conditional which resulted in slices never being
    expired.
    
    if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
            /* extend local deadline, drift is bounded above by 2 ticks */
            cfs_rq->runtime_expires += TICK_NSEC;
    
    Because this was broken for nearly 5 years, has only recently been fixed,
    and is now being noticed by many users running kubernetes
    (https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion
    that the mechanisms around expiring runtime should be removed
    altogether.
    
    This allows quota already allocated to per-cpu run-queues to live longer
    than the period boundary, so threads on runqueues that do not use much
    CPU can continue to use their remaining slice over a longer period of
    time than cpu.cfs_period_us. This helps prevent the above condition of
    hitting throttling while also not fully utilizing your cpu quota.
    
    This theoretically allows a machine to use slightly more than its
    allotted quota in some periods. This overflow would be bounded by the
    remaining quota left on each per-cpu runqueue. This is typically no
    more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
    change nothing, as they should theoretically fully utilize all of their
    quota in each period. For user-interactive tasks as described above this
    provides a much better user/application experience as their cpu
    utilization will more closely match the amount they requested when they
    hit throttling. This means that cpu limits no longer strictly apply per
    period for non-cpu bound applications, but that they are still accurate
    over longer timeframes.
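
As a back-of-the-envelope check of the bound described above, the sketch below takes the "typically no more than min_cfs_rq_runtime=1ms per cpu" figure at face value. The helper names `max_carryover_us` and `window_usage_bound_us` are hypothetical and do not exist in the kernel. The worst case is treated as the case where every CPU holds the full leftover amount at the start of the window.

```c
#include <assert.h>

/*
 * Upper estimate of runtime that can be carried across a period boundary,
 * assuming each CPU keeps at most min_rt_us of leftover local runtime.
 */
long max_carryover_us(int nr_cpus, long min_rt_us)
{
    assert(nr_cpus > 0 && min_rt_us >= 0);
    return (long)nr_cpus * min_rt_us;
}

/*
 * Upper estimate of CPU time usable over a window of `periods` periods:
 * the quota granted during the window plus whatever was already cached
 * on the per-cpu run queues when the window started.
 */
long window_usage_bound_us(int periods, long quota_us,
                           int nr_cpus, long min_rt_us)
{
    assert(periods > 0 && quota_us >= 0);
    return (long)periods * quota_us + max_carryover_us(nr_cpus, min_rt_us);
}
```

For example, `max_carryover_us(80, 1000)` gives 80000us for an 80-CPU machine. The carryover term is a fixed constant, so its share of `window_usage_bound_us` shrinks as the window grows, which is the sense in which limits stay accurate over longer timeframes.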
    
    This greatly improves performance of high-thread-count, non-cpu bound
    applications with low cfs_quota_us allocation on high-core-count
    machines. In the case of an artificial testcase (10ms/100ms of quota on
    an 80-CPU machine), this commit resulted in an almost 30x performance
    improvement, while still maintaining correct cpu quota restrictions.
    That testcase is available at https://github.com/indeedeng/fibtest.
    
    Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")
    Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Phil Auld <pauld@redhat.com>
    Reviewed-by: Ben Segall <bsegall@google.com>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: John Hammond <jhammond@indeed.com>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Kyle Anderson <kwa@yelp.com>
    Cc: Gabriel Munos <gmunoz@netflix.com>
    Cc: Peter Oskolkov <posk@posk.io>
    Cc: Cong Wang <xiyou.wangcong@gmail.com>
    Cc: Brendan Gregg <bgregg@netflix.com>
    Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
    Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
    Acked-by: Michael Wang <yun.wang@linux.alibaba.com>