1. 24 Apr, 2020 2 commits
  2. 22 Apr, 2020 1 commit
    • sched/fair: Fix race between runtime distribution and assignment · 70a23044
      Committed by Huaixin Chang
      fix #25892693
      
      commit 26a8b12747c975b33b4a82d62e4a307e1c07f31b upstream
      
      Currently, there is a potential race between distribute_cfs_runtime()
      and assign_cfs_rq_runtime(). The race happens when distribute_cfs_runtime()
      reads cfs_b->runtime, distributes it without holding the lock, and then
      finds that there is not enough runtime left to charge the distribution
      against, because assign_cfs_rq_runtime() may be called during the
      distribution and consume cfs_b->runtime at the same time.

      Fibtest is the tool used to test this race. Assume all gcfs_rqs are
      throttled and the cfs period timer runs. Slow threads might then run and
      sleep, returning unused cfs_rq runtime while keeping min_cfs_rq_runtime in
      their local pool. If all this happens sufficiently quickly, cfs_b->runtime
      drops a lot. If the runtime distributed is large too, runtime is over-used.

      A runtime over-use of about 70 percent of quota is seen when we run
      fibtest on a 96-core machine. We run fibtest with 1 fast thread and
      95 slow threads in the test group, configure a 10ms quota for this group,
      and see that the CPU usage of fibtest is 17.0%, which is far more than the
      expected 10%.

      On a smaller machine with 32 cores, we also run fibtest with 96
      threads. CPU usage is more than 12%, which is also more than the expected
      10%. This shows that on similar workloads, this race does affect CPU
      bandwidth control.

      Solve this by holding the lock inside distribute_cfs_runtime().
      
      Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
      Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Link: https://lore.kernel.org/lkml/20200325092602.22471-1-changhuaixin@linux.alibaba.com/
      Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
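The race described in this commit message can be illustrated with a small single-threaded model. This is only a sketch under stated assumptions, not the kernel code: `pool_model`, `take_runtime()`, `distribute_racy()` and `distribute_locked()` are hypothetical names, and the concurrent call to assign_cfs_rq_runtime() is simulated by an explicit call placed in the middle of the distribution.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of the global pool (stands in for cfs_b->runtime). */
struct pool_model {
    uint64_t runtime;    /* runtime left in the global pool, in ns */
    uint64_t handed_out; /* total runtime given to runqueues, in ns */
};

/* Models a locked request: take up to `want` ns from the pool. */
uint64_t take_runtime(struct pool_model *p, uint64_t want)
{
    assert(p != NULL);
    uint64_t got = want < p->runtime ? want : p->runtime;
    p->runtime -= got;
    p->handed_out += got;
    return got;
}

/*
 * Racy variant (hypothetical): the pool is read once, another taker drains
 * part of it, and the distribution still uses the stale snapshot.
 */
void distribute_racy(struct pool_model *p, uint64_t need,
                     uint64_t concurrent_take)
{
    uint64_t snapshot = p->runtime;             /* read without the lock */
    uint64_t give = need < snapshot ? need : snapshot;

    take_runtime(p, concurrent_take);           /* simulated concurrent taker */

    p->handed_out += give;                      /* handed out from stale value */
    p->runtime -= give < p->runtime ? give : p->runtime; /* cannot charge all */
}

/* Locked variant: every read-and-charge step is atomic, so steps serialize. */
void distribute_locked(struct pool_model *p, uint64_t need,
                       uint64_t concurrent_take)
{
    take_runtime(p, need);
    take_runtime(p, concurrent_take);
}
```

With a 10ms pool, a distribution of 8ms and a concurrent request for 6ms, the racy variant hands out 14ms in total, while the locked variant never hands out more than the 10ms that were available.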
  3. 17 Jan, 2020 1 commit
    • alinux: sched/fair: use static load in wake_affine_weight · d2440c99
      Committed by Huaixin Chang
      For a long time, runnable cpu load has been used to select the task rq
      when waking up tasks. Recent tests have shown that for workloads with a
      large number of short-running tasks and almost full cpu utilization,
      static load is more helpful.

      In our e2e tests, the runnable load avg of java threads ranges from less
      than 10 to as large as 362, while these java threads are no different from
      each other and should be treated in the same way. After switching to
      static load, qps improvement has been seen in multiple test cases.

      A new sched feature, WA_STATIC_WEIGHT, is introduced here to control this
      behavior. Echo WA_STATIC_WEIGHT to /sys/kernel/debug/sched_features to turn
      static load in wake_affine_weight on, and NO_WA_STATIC_WEIGHT to turn it
      off. This feature is kept off by default.
      
      Test is done on the following hardware:
      
      4 threads Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
      
      In tests with 120 threads and sql loglevel configured to info:
      
      	NO_WA_STATIC_WEIGHT     WA_STATIC_WEIGHT
      	33170.63                34614.95 (+4.35%)
      
      In tests with 160 threads and sql loglevel configured to info:
      
      	NO_WA_STATIC_WEIGHT     WA_STATIC_WEIGHT
      	35888.71                38247.20 (+6.57%)
      
      In tests with 160 threads and sql loglevel configured to warn:
      
      	NO_WA_STATIC_WEIGHT     WA_STATIC_WEIGHT
      	39118.72                39698.72 (+1.48%)
      Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
      Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
  4. 27 Dec, 2019 1 commit
  5. 13 Dec, 2019 1 commit
    • sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision · 742f2319
      Committed by Xuewei Zhang
      commit 4929a4e6faa0f13289a67cae98139e727f0d4a97 upstream.
      
      The quota/period ratio is used to ensure a child task group won't get
      more bandwidth than the parent task group, and is calculated as:
      
        normalized_cfs_quota() = [(quota_us << 20) / period_us]
      
      If the quota/period ratio was changed during this scaling due to
      precision loss, it will cause inconsistency between parent and child
      task groups.
      
      See below example:
      
      A userspace container manager (kubelet) does three operations:
      
       1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
       2) Create a few children cgroups.
       3) Set quota to 1,000us and period to 10,000us on a child cgroup.
      
      These operations are expected to succeed. However, if the scaling of
      147/128 happens before step 3, quota and period of the parent cgroup
      will be changed:
      
        new_quota: 1148437ns,   1148us
       new_period: 11484375ns, 11484us
      
      And when step 3 comes in, the ratio of the child cgroup will be
      104857, which will be larger than the parent cgroup ratio (104821),
      and will fail.
      
      Scaling them by a factor of 2 will fix the problem.
      Tested-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Xuewei Zhang <xueweiz@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Phil Auld <pauld@redhat.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
      Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
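The arithmetic in the example from this commit message can be reproduced with a few lines of integer math. This is a sketch, not the kernel implementation: the helper names (`normalized_ratio()`, `scale_147_128()`, `scale_2()`, `ns_to_us()`) are hypothetical, and it assumes the scaled nanosecond values are truncated to whole microseconds before the ratio is computed, which is what the us values quoted in the message suggest.

```c
#include <assert.h>
#include <stdint.h>

/* Ratio as given in the commit message: (quota_us << 20) / period_us. */
uint64_t normalized_ratio(uint64_t quota_us, uint64_t period_us)
{
    assert(period_us > 0);
    return (quota_us << 20) / period_us;
}

/* Scale a nanosecond value by 147/128 with truncating integer division. */
uint64_t scale_147_128(uint64_t ns)
{
    return ns * 147 / 128;
}

/* Scale a nanosecond value by 2; exact, so no precision is lost. */
uint64_t scale_2(uint64_t ns)
{
    return ns * 2;
}

/* Truncate nanoseconds to whole microseconds. */
uint64_t ns_to_us(uint64_t ns)
{
    return ns / 1000;
}
```

Starting from a quota of 1,000,000ns and a period of 10,000,000ns, the 147/128 scaling gives 1148437ns and 11484375ns, and the ratio drops from 104857 to 104821. Scaling both values by 2 keeps the ratio at 104857, so a child configured with the original 1,000us/10,000us setting no longer exceeds its parent.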
      
  6. 01 Dec, 2019 1 commit
  7. 13 Nov, 2019 2 commits
    • sched/fair: Fix -Wunused-but-set-variable warnings · e9c0fc4a
      Committed by Qian Cai
      commit 763a9ec06c409dcde2a761aac4bb83ff3938e0b3 upstream.
      
      Commit:
      
         de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
      
      introduced a few compilation warnings:
      
        kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
        kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
        kernel/sched/fair.c: In function 'start_cfs_bandwidth':
        kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]
      
      Also, __refill_cfs_bandwidth_runtime() no longer updates the
      expiration time, so fix the comments accordingly.
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Reviewed-by: Dave Chiluk <chiluk+linux@indeed.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: pauld@redhat.com
      Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
      Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
    • sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices · 502bd151
      Committed by Dave Chiluk
      commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.
      
      It has been observed, that highly-threaded, non-cpu-bound applications
      running under cpu.cfs_quota_us constraints can hit a high percentage of
      periods throttled while simultaneously not consuming the allocated
      amount of quota. This use case is typical of user-interactive non-cpu
      bound applications, such as those running in kubernetes or mesos when
      run on multiple cpu cores.
      
      This has been root-caused to cpu-local run queues being allocated per-cpu
      bandwidth slices and then not fully using those slices within the period,
      at which point the slice and quota expire. This expiration of unused
      slices results in applications not being able to utilize the quota that
      they are allocated.
      
      The non-expiration of per-cpu slices was recently fixed by
      'commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift
      condition")'. Prior to that it appears that this had been broken since
      at least 'commit 51f2176d ("sched/fair: Fix unlocked reads of some
      cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
      added the following conditional which resulted in slices never being
      expired.
      
      if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
      	/* extend local deadline, drift is bounded above by 2 ticks */
      	cfs_rq->runtime_expires += TICK_NSEC;
      
      Because this was broken for nearly 5 years, and has recently been fixed
      and is now being noticed by many users running kubernetes
      (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
      that the mechanisms around expiring runtime should be removed
      altogether.
      
      This allows quota already allocated to per-cpu run-queues to live longer
      than the period boundary. This allows threads on runqueues that do not
      use much CPU to continue to use their remaining slice over a longer
      period of time than cpu.cfs_period_us. However, this helps prevent the
      above condition of hitting throttling while also not fully utilizing
      your cpu quota.
      
      This theoretically allows a machine to use slightly more than its
      allotted quota in some periods. This overflow would be bounded by the
      remaining quota left on each per-cpu runqueue. This is typically no
      more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
      change nothing, as they should theoretically fully utilize all of their
      quota in each period. For user-interactive tasks as described above this
      provides a much better user/application experience as their cpu
      utilization will more closely match the amount they requested when they
      hit throttling. This means that cpu limits no longer strictly apply per
      period for non-cpu bound applications, but that they are still accurate
      over longer timeframes.
      
      This greatly improves performance of high-thread-count, non-cpu bound
      applications with low cfs_quota_us allocation on high-core-count
      machines. In the case of an artificial testcase (10ms/100ms of quota on
      80 CPU machine), this commit resulted in almost 30x performance
      improvement, while still maintaining correct cpu quota restrictions.
      That testcase is available at https://github.com/indeedeng/fibtest.
      
      Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")
      Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: John Hammond <jhammond@indeed.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kyle Anderson <kwa@yelp.com>
      Cc: Gabriel Munos <gmunoz@netflix.com>
      Cc: Peter Oskolkov <posk@posk.io>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
  8. 05 Oct, 2019 2 commits
    • sched/fair: Use rq_lock/unlock in online_fair_sched_group · 9addfbd4
      Committed by Phil Auld
      [ Upstream commit a46d14eca7b75fffe35603aa8b81df654353d80f ]
      
      Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes
      a warning to fire in update_rq_clock. This seems to be caused by onlining
      a new fair sched group not using the rq lock wrappers.
      
        [] rq->clock_update_flags & RQCF_UPDATED
        [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 update_rq_clock+0xec/0x150
      
        [] Call Trace:
        []  online_fair_sched_group+0x53/0x100
        []  cpu_cgroup_css_online+0x16/0x20
        []  online_css+0x1c/0x60
        []  cgroup_apply_control_enable+0x231/0x3b0
        []  cgroup_mkdir+0x41b/0x530
        []  kernfs_iop_mkdir+0x61/0xa0
        []  vfs_mkdir+0x108/0x1a0
        []  do_mkdirat+0x77/0xe0
        []  do_syscall_64+0x55/0x1d0
        []  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Using the wrappers in online_fair_sched_group instead of the raw locking
      removes this warning.
      
      [ tglx: Use rq_*lock_irq() ]
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20190801133749.11033-1-pauld@redhat.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/fair: Fix imbalance due to CPU affinity · 417cf53b
      Committed by Vincent Guittot
      [ Upstream commit f6cad8df6b30a5d2bbbd2e698f74b4cafb9fb82b ]
      
      The load_balance() has a dedicated mechanism to detect when an imbalance
      is due to CPU affinity and must be handled at parent level. In this case,
      the imbalance field of the parent's sched_group is set.
      
      The description of sg_imbalanced() gives a typical example of two groups
      of 4 CPUs each and 4 tasks each with a cpumask covering 1 CPU of the first
      group and 3 CPUs of the second group. Something like:
      
      	{ 0 1 2 3 } { 4 5 6 7 }
      	        *     * * *
      
      But load_balance() fails to fix this use case on my octo-core system
      made of 2 clusters of quad cores.

      Although load_balance() is able to detect that the imbalance is due to
      CPU affinity, it fails to fix it because the imbalance field is cleared
      before the parent level gets a chance to run. In fact, when the imbalance
      is detected, load_balance() reruns without the CPU with pinned tasks. But
      there are no other running tasks in the situation described above and
      everything looks balanced this time, so the imbalance field is immediately
      cleared.
      
      The imbalance field should not be cleared if there is no other task to move
      when the imbalance is detected.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1561996022-28829-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  9. 16 Sep, 2019 1 commit
  10. 04 Aug, 2019 2 commits
  11. 04 Jun, 2019 1 commit
  12. 31 May, 2019 1 commit
  13. 02 May, 2019 1 commit
  14. 27 Apr, 2019 1 commit
    • sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup · c3edd427
      Committed by Phil Auld
      [ Upstream commit 2e8e19226398db8265a8e675fcc0118b9e80c9e8 ]
      
      With extremely short cfs_period_us setting on a parent task group with a large
      number of children the for loop in sched_cfs_period_timer() can run until the
      watchdog fires. There is no guarantee that the call to hrtimer_forward_now()
      will ever return 0.  The large number of children can make
      do_sched_cfs_period_timer() take longer than the period.
      
       NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
       RIP: 0010:tg_nop+0x0/0x10
        <IRQ>
        walk_tg_tree_from+0x29/0xb0
        unthrottle_cfs_rq+0xe0/0x1a0
        distribute_cfs_runtime+0xd3/0xf0
        sched_cfs_period_timer+0xcb/0x160
        ? sched_cfs_slack_timer+0xd0/0xd0
        __hrtimer_run_queues+0xfb/0x270
        hrtimer_interrupt+0x122/0x270
        smp_apic_timer_interrupt+0x6a/0x140
        apic_timer_interrupt+0xf/0x20
        </IRQ>
      
      To prevent this, we add protection to the loop that detects when the loop has run
      too many times and scales the period and quota up, proportionally, so that the timer
      can complete before the next period expires.  This preserves the relative runtime
      quota while preventing the hard lockup.
      
      A warning is issued reporting this state and the new values.
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
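The proportional scaling described in this commit message can be sketched as follows. This is a simplified model rather than the kernel code: `bandwidth_model`, `scale_up()` and `MODEL_MAX_PERIOD_NS` are hypothetical names, the 147/128 factor is the one mentioned by the later "Scale bandwidth quota and period" commit in this list, and the 1s cap on the period is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed upper bound for the period in this sketch: 1 second. */
#define MODEL_MAX_PERIOD_NS 1000000000ULL

struct bandwidth_model {
    uint64_t period_ns;
    uint64_t quota_ns;
};

/*
 * Grow the period by 147/128 (roughly 15%), cap it, then scale the quota by
 * the same new/old factor so the quota/period ratio stays about the same.
 * Assumes quota_ns * period_ns fits in 64 bits.
 */
void scale_up(struct bandwidth_model *b)
{
    assert(b->period_ns > 0);

    uint64_t old = b->period_ns;
    uint64_t scaled = old * 147 / 128;

    if (scaled > MODEL_MAX_PERIOD_NS)
        scaled = MODEL_MAX_PERIOD_NS;

    b->period_ns = scaled;
    b->quota_ns = b->quota_ns * scaled / old;
}
```

For a 100us period with a 10us quota, one scaling step gives a 114843ns period and an 11484ns quota. Because the divisions truncate, the ratio is only approximately preserved, which is the precision problem the later commit addresses.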
  15. 17 Apr, 2019 1 commit
    • sched/fair: Do not re-read ->h_load_next during hierarchical load calculation · cb75a0c5
      Committed by Mel Gorman
      commit 0e9f02450da07fc7b1346c8c32c771555173e397 upstream.
      
      A NULL pointer dereference bug was reported on a distribution kernel but
      the same issue should be present on the mainline kernel. It occurred on
      s390 but should not be arch-specific.  A partial oops looks like:
      
        Unable to handle kernel pointer dereference in virtual kernel address space
        ...
        Call Trace:
          ...
          try_to_wake_up+0xfc/0x450
          vhost_poll_wakeup+0x3a/0x50 [vhost]
          __wake_up_common+0xbc/0x178
          __wake_up_common_lock+0x9e/0x160
          __wake_up_sync_key+0x4e/0x60
          sock_def_readable+0x5e/0x98
      
      The bug hits any time between 1 hour and 3 days. The dereference occurs
      in update_cfs_rq_h_load when accumulating h_load. The problem is that
      cfs_rq->h_load_next is not protected by any locking and can be updated
      by parallel calls to task_h_load. Depending on the compiler, code may be
      generated that re-reads cfs_rq->h_load_next after the check for NULL and
      then oopses when reading se->avg.load_avg. The disassembly showed that it
      was possible to reread h_load_next after the check for NULL.
      
      While this does not appear to be an issue for later compilers, it's still
      an accident if the correct code is generated. Full locking in this path
      would have high overhead so this patch uses READ_ONCE to read h_load_next
      only once and check for NULL before dereferencing. It was confirmed that
      there were no further oops after 10 days of testing.
      
      As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any
      potential problems with store tearing.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Fixes: 68520796 ("sched: Move h_load calculation to task_h_load()")
      Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
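The read-once pattern this commit message relies on can be sketched in plain C. This is a userspace illustration with hypothetical names (`se_model`, `cfs_rq_model`, `read_once_next()`, `write_once_next()`, `next_load_or_zero()`), and it uses a volatile access as a stand-in for the kernel's READ_ONCE() and WRITE_ONCE() macros rather than reproducing them.

```c
#include <assert.h>
#include <stddef.h>

struct se_model {
    unsigned long load_avg;
};

struct cfs_rq_model {
    struct se_model *h_load_next; /* may be changed by another CPU */
};

/* Load the pointer exactly once; the compiler may not re-read the field. */
struct se_model *read_once_next(struct cfs_rq_model *rq)
{
    assert(rq != NULL);
    return *(struct se_model *volatile *)&rq->h_load_next;
}

/* Store the pointer with a single volatile write. */
void write_once_next(struct cfs_rq_model *rq, struct se_model *se)
{
    assert(rq != NULL);
    *(struct se_model *volatile *)&rq->h_load_next = se;
}

/* Check for NULL and dereference the same local copy of the pointer. */
unsigned long next_load_or_zero(struct cfs_rq_model *rq)
{
    struct se_model *se = read_once_next(rq);

    if (se == NULL)
        return 0;
    return se->load_avg; /* uses `se`, never rq->h_load_next again */
}
```

The point of the local variable is that the NULL check and the dereference are guaranteed to see the same pointer value, even if another CPU clears the field in between.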
  16. 13 Feb, 2019 1 commit
    • cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM · 97a7fa90
      Committed by Josh Poimboeuf
      commit b284909abad48b07d3071a9fc9b5692b3e64914b upstream.
      
      With the following commit:
      
        73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      
      ... the hotplug code attempted to detect when SMT was disabled by BIOS,
      in which case it reported SMT as permanently disabled.  However, that
      code broke a virt hotplug scenario, where the guest is booted with only
      primary CPU threads, and a sibling is brought online later.
      
      The problem is that there doesn't seem to be a way to reliably
      distinguish between the HW "SMT disabled by BIOS" case and the virt
      "sibling not yet brought online" case.  So the above-mentioned commit
      was a bit misguided, as it permanently disabled SMT for both cases,
      preventing future virt sibling hotplugs.
      
      Going back and reviewing the original problems which were attempted to
      be solved by that commit, when SMT was disabled in BIOS:
      
        1) /sys/devices/system/cpu/smt/control showed "on" instead of
           "notsupported"; and
      
        2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.
      
      I'd propose that we instead consider #1 above to not actually be a
      problem.  Because, at least in the virt case, it's possible that SMT
      wasn't disabled by BIOS and a sibling thread could be brought online
      later.  So it makes sense to just always default the smt control to "on"
      to allow for that possibility (assuming cpuid indicates that the CPU
      supports SMT).
      
      The real problem is #2, which has a simple fix: change vmx_vm_init() to
      query the actual current SMT state -- i.e., whether any siblings are
      currently online -- instead of looking at the SMT "control" sysfs value.
      
      So fix it by:
      
        a) reverting the original "fix" and its followup fix:
      
           73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
           bc2d8d26 ("cpu/hotplug: Fix SMT supported evaluation")
      
           and
      
        b) changing vmx_vm_init() to query the actual current SMT state --
           instead of the sysfs control value -- to determine whether the L1TF
           warning is needed.  This also requires the 'sched_smt_present'
           variable to be exported, instead of 'cpu_smt_control'.
      
      Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      Reported-by: Igor Mammedov <imammedo@redhat.com>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
  17. 13 Jan, 2019 1 commit
    • sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f654 · dc8408ea
      Committed by Linus Torvalds
      commit c40f7d74c741a907cfaeb73a7697081881c497d0 upstream.
      
      Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
      scheduler under high loads, starting at around the v4.18 time frame,
      and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
      manipulation.
      
      Do a (manual) revert of:
      
        a9e7f654 ("sched/fair: Fix O(nr_cgroups) in load balance path")
      
      It turns out that the list_del_leaf_cfs_rq() introduced by this commit
      is a surprising property that was not considered in followup commits
      such as:
      
        9c2791f9 ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")
      
      As Vincent Guittot explains:
      
       "I think that there is a bigger problem with commit a9e7f654 and
        cfs_rq throttling:
      
        Let take the example of the following topology TG2 --> TG1 --> root:
      
         1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
            cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
            one path because it has never been used and can't be throttled so
            tmp_alone_branch will point to leaf_cfs_rq_list at the end.
      
         2) Then TG1 is throttled
      
         3) and we add TG3 as a new child of TG1.
      
         4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
            cfs_rq and tmp_alone_branch will stay  on rq->leaf_cfs_rq_list.
      
        With commit a9e7f654, we can del a cfs_rq from rq->leaf_cfs_rq_list.
        So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1
        cfs_rq is removed from the list.
        Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
        but tmp_alone_branch still points to TG3 cfs_rq because its throttled
        parent can't be enqueued when the lock is released.
        tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.
      
        So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
        points on another TG cfs_rq, the next TG cfs_rq that will be added,
        will be linked outside rq->leaf_cfs_rq_list - which is bad.
      
        In addition, we can break the ordering of the cfs_rq in
        rq->leaf_cfs_rq_list but this ordering is used to update and
        propagate the update from leaf down to root."
      
      Instead of trying to work through all these cases and trying to reproduce
      the very high loads that produced the lockup to begin with, simplify
      the code temporarily by reverting a9e7f654 - which change was clearly
      not thought through completely.
      
      This (hopefully) gives us a kernel that doesn't lock up so people
      can continue to enjoy their holidays without worrying about regressions. ;-)
      
      [ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]
      Analyzed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Analyzed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
      Reported-by: Sargun Dhillon <sargun@sargun.me>
      Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
      Tested-by: Sargun Dhillon <sargun@sargun.me>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: <stable@vger.kernel.org> # v4.13+
      Cc: Bin Li <huawei.libin@huawei.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: a9e7f654 ("sched/fair: Fix O(nr_cgroups) in load balance path")
      Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  18. 20 Dec, 2018 1 commit
  19. 01 Dec, 2018 1 commit
    • sched/fair: Fix cpu_util_wake() for 'execl' type workloads · 08fbd4e0
      Committed by Patrick Bellasi
      [ Upstream commit c469933e ]
      
      A ~10% regression has been reported for UnixBench's execl throughput
      test by Aaron Lu and Ye Xiaolong:
      
        https://lkml.org/lkml/2018/10/30/765
      
      That test is pretty simple, it does a "recursive" execve() syscall on the
      same binary. Starting from the syscall, this sequence is possible:
      
         do_execve()
           do_execveat_common()
             __do_execve_file()
               sched_exec()
                 select_task_rq_fair()          <==| Task already enqueued
                   find_idlest_cpu()
                     find_idlest_group()
                       capacity_spare_wake()    <==| Functions not called from
                         cpu_util_wake()           | the wakeup path
      
      which means we can end up calling cpu_util_wake() not only from the
      "wakeup path", as its name would suggest. Indeed, the task doing an
      execve() syscall is already enqueued on the CPU we want to get the
      cpu_util_wake() for.
      
      The estimated utilization for a CPU computed in cpu_util_wake() was
      written under the assumption that function can be called only from the
      wakeup path. If instead the task is already enqueued, we end up with a
      utilization which does not remove the current task's contribution from
      the estimated utilization of the CPU.
      This wrongly assumes a reduced spare capacity on the current CPU and
      increases the chances of migrating the task on execve.
      
      The regression is tracked down to:
      
       commit d519329f ("sched/fair: Update util_est only on util_avg updates")
      
      because in that patch we turn on by default the UTIL_EST sched feature.
      However, the real issue is introduced by:
      
       commit f9be3e59 ("sched/fair: Use util_est in LB and WU paths")
      
      Let's fix this by ensuring to always discount the task estimated
      utilization from the CPU's estimated utilization when the task is also
      the current one. The same benchmark of the bug report, executed on a
      dual socket 40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine,
      reports these "Execl Throughput" figures (higher is better):
      
         mainline     : 48136.5 lps
         mainline+fix : 55376.5 lps
      
      which correspond to a 15% speedup.
      
      Moreover, since {cpu_util,capacity_spare}_wake() are not really only
      used from the wakeup path, let's remove this ambiguity by using a better
      matching name: {cpu_util,capacity_spare}_without().
      
      While we are at it, let's also improve the existing documentation.
      Reported-by: Aaron Lu <aaron.lu@intel.com>
      Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
      Tested-by: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: f9be3e59 ("sched/fair: Use util_est in LB and WU paths")
      Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
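The discounting idea in this commit message can be sketched with a simplified model. All names here (`cpu_model`, `task_model`, `sub_clamped()`, `cpu_util_without_model()`) are hypothetical, and the formula is a deliberately reduced version of the idea: remove the task's contribution from the CPU's utilization, and also from the CPU's estimated utilization when the task is already enqueued there or is the current task.

```c
#include <assert.h>
#include <stddef.h>

struct cpu_model {
    unsigned long util_avg;          /* tracked utilization of the CPU */
    unsigned long util_est_enqueued; /* estimated utilization of enqueued tasks */
    unsigned long capacity;          /* upper bound for the result */
};

struct task_model {
    unsigned long util_avg;  /* task's contribution to util_avg */
    unsigned long util_est;  /* task's contribution to util_est_enqueued */
    int queued_or_current;   /* non-zero if already enqueued on this CPU */
};

/* Subtract without wrapping below zero. */
unsigned long sub_clamped(unsigned long a, unsigned long b)
{
    return a > b ? a - b : 0;
}

/* CPU utilization as it would look without the task's contribution. */
unsigned long cpu_util_without_model(const struct cpu_model *cpu,
                                     const struct task_model *p)
{
    assert(cpu != NULL && p != NULL);

    unsigned long util = sub_clamped(cpu->util_avg, p->util_avg);
    unsigned long est = cpu->util_est_enqueued;

    /* Only an enqueued (or current) task is part of the estimate. */
    if (p->queued_or_current)
        est = sub_clamped(est, p->util_est);

    if (est > util)
        util = est;
    return util < cpu->capacity ? util : cpu->capacity;
}
```

If the estimate is not discounted for an already-enqueued task, the CPU looks busier than it would be without the task, which matches the reduced spare capacity and the extra migrations on execve that the message describes.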
      08fbd4e0
  20. 16 Oct, 2018 1 commit
  21. 11 Oct, 2018 1 commit
    • P
      sched/fair: Fix throttle_list starvation with low CFS quota · baa9be4f
      Phil Auld committed
      With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
      distribute_cfs_runtime may not empty the throttled_list before it runs
      out of runtime to distribute. In that case, due to the change from
      c06f04c7 to put throttled entries at the head of the list, later entries
      on the list will starve. Essentially, the same X processes get pulled
      off the list, are given CPU time and then, when their runtime expires, are
      put back at the head of the list, where distribute_cfs_runtime gives runtime
      to the same set of processes and leaves the rest starved.
      
      Fix the issue by setting a bit in struct cfs_bandwidth when
      distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
      decide to put the throttled entry on the tail or the head of the list.  The
      bit is set/cleared by the callers of distribute_cfs_runtime while they hold
      cfs_bandwidth->lock.
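
The head-or-tail decision can be illustrated with a small user-space sketch. It is not the kernel code: the array-backed list, the `_sketch` names and the capacity limit are assumptions for the example, and locking is omitted. The direction (head while a distribution is running, tail otherwise) follows the reasoning above: entries added at the head are not seen by a distribution pass that has already started, while entries added at the tail wait behind earlier ones.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_THROTTLED 16

/* Hypothetical stand-in for the bandwidth pool and its throttled list. */
struct cfs_bandwidth_sketch {
    bool distribute_running;       /* set/cleared by the distributor's callers */
    int throttled[MAX_THROTTLED];  /* index 0 is the head of the list */
    size_t count;
};

/* Add a throttled runqueue id; returns false when the list is full. */
bool throttle_entry_sketch(struct cfs_bandwidth_sketch *b, int rq_id)
{
    size_t i;

    if (b->count >= MAX_THROTTLED)
        return false;

    if (b->distribute_running) {
        /* Distribution in progress: insert at the head. */
        for (i = b->count; i > 0; i--)
            b->throttled[i] = b->throttled[i - 1];
        b->throttled[0] = rq_id;
    } else {
        /* No distribution running: append at the tail so older entries go first. */
        b->throttled[b->count] = rq_id;
    }
    b->count++;
    return true;
}
```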
      
      This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
      the live system. In some cases you can simply look at the throttled list and
      see the later entries are not changing:
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -976050
          2     ffff90b56cb2cc00  -484925
          3     ffff90b56cb2bc00  -658814
          4     ffff90b56cb2ba00  -275365
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -994147
          2     ffff90b56cb2cc00  -306051
          3     ffff90b56cb2bc00  -961321
          4     ffff90b56cb2ba00  -24490
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
      Sometimes it is easier to see by finding a process getting starved and looking
      at the sched_info:
      
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
      Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  22. 02 Oct, 2018 6 commits
    • M
      sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task · 37355bdc
      Mel Gorman committed
      Automatic NUMA Balancing uses a multi-stage pass to decide whether a page
      should migrate to a local node. This filter avoids excessive ping-ponging
      if a page is shared or used by threads that migrate cross-node frequently.
      
      Threads inherit both page tables and the preferred node ID from the
      parent. This means that threads can trigger hinting faults earlier than
      a new task which delays scanning for a number of seconds. As it can be
      load balanced very early in its lifetime there can be an unnecessary delay
      before it starts migrating thread-local data. This patch migrates private
      pages faster early in the lifetime of a thread using the sequence counter
      as an identifier of new tasks.
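
A rough sketch of how a scan sequence counter could identify a new task follows. The struct, the function name and the cut-off value are hypothetical and chosen only for illustration; the commit message states only that the sequence counter is used to recognise new tasks and that private pages are then migrated faster.

```c
#include <stdbool.h>

/* Hypothetical cut-off: a task with this many scan passes or fewer counts as new. */
#define EARLY_SCAN_SEQ_LIMIT 4

/* Hypothetical, simplified view of a task's NUMA balancing state. */
struct numa_task_sketch {
    int preferred_nid;      /* -1 while no preferred node has been selected */
    unsigned int scan_seq;  /* number of completed scan passes */
};

/* Decide whether a page should skip the multi-stage filter and migrate now. */
bool migrate_private_page_early(const struct numa_task_sketch *t,
                                bool page_is_private)
{
    bool task_is_new = t->preferred_nid == -1 ||
                       t->scan_seq <= EARLY_SCAN_SEQ_LIMIT;

    /* Only private pages of a young task take the fast path. */
    return task_is_new && page_is_private;
}
```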
      
      With this patch applied, STREAM performance is the same as 4.17 even though
      processes are not spread cross-node prematurely. Other workloads showed
      a mix of minor gains and losses. This is somewhat expected, as most workloads
      are not very sensitive to the starting conditions of a process.
      
                               4.19.0-rc5             4.19.0-rc5                 4.17.0
                               numab-v1r1       fastmigrate-v1r1                vanilla
      MB/sec copy     43298.52 (   0.00%)    47335.46 (   9.32%)    47219.24 (   9.06%)
      MB/sec scale    30115.06 (   0.00%)    32568.12 (   8.15%)    32527.56 (   8.01%)
      MB/sec add      32825.12 (   0.00%)    36078.94 (   9.91%)    35928.02 (   9.45%)
      MB/sec triad    32549.52 (   0.00%)    35935.94 (  10.40%)    35969.88 (  10.51%)
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux-MM <linux-mm@kvack.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181001100525.29789-3-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Avoid task migration for small NUMA improvement · 6fd98e77
      Srikar Dronamraju committed
      If NUMA improvement from the task migration is going to be very
      minimal, then avoid task migration.
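
One way to express "skip the migration when the gain is minimal" is a threshold check like the sketch below. The threshold value, the names and the comparison against the best candidate so far are assumptions made for illustration; the commit message does not state them.

```c
#include <stdbool.h>

/* Hypothetical threshold below which an improvement is considered noise. */
#define SMALL_IMPROVEMENT 30

/*
 * Return true when a candidate migration is worth taking: its improvement
 * must clear the threshold and must beat the best candidate found so far
 * by a margin, so that near-ties do not trigger extra migrations.
 */
bool worth_migrating(long improvement, long best_so_far)
{
    if (improvement < SMALL_IMPROVEMENT)
        return false;
    return improvement > best_so_far + SMALL_IMPROVEMENT / 2;
}
```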
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     198512  205910   3.72673
      1     313559  318491   1.57291
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev     Current  %Change
      8     74761.9  74935.9  0.232739
      1     214874   226796   5.54837
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     180536  189780   5.12031
      1     210281  205695   -2.18089
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     56511.4  60370    6.828
      1     104899   108100   3.05151
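
The %Change column in the tables above appears to be the relative difference between the two runs. A minimal sketch of that arithmetic, with a hypothetical function name:

```c
/* Relative change from prev to current, in percent. */
double percent_change(double prev, double current)
{
    return (current - prev) / prev * 100.0;
}
```

For the first Haswell row, `percent_change(198512, 205910)` gives about 3.73, and a lower Current value, as in the Power9 single-JVM row, gives a negative result.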
      
      1/7 cases regresses. Looking at the events, migrate_pages seems
      to vary the most, especially in the regressing case. Also, some
      amount of variance is expected between different runs of
      Specjbb2005.
      
      Some event stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,818,546      13,801,554
      migrations                1,149,960       1,151,541
      faults                    385,583         433,246
      cache-misses              55,259,546,768  55,168,691,835
      sched:sched_move_numa     2,257           2,551
      sched:sched_stick_numa    9               24
      sched:sched_swap_numa     512             904
      migrate:mm_migrate_pages  2,225           1,571
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        72692   113682
      numa_hint_faults_local  62270   102163
      numa_hit                238762  240181
      numa_huge_pte_updates   48      36
      numa_interleave         75      64
      numa_local              238676  240103
      numa_other              86      78
      numa_pages_migrated     2225    1564
      numa_pte_updates        98557   134080
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,173,490       3,079,150
      migrations                36,966          31,455
      faults                    108,776         99,081
      cache-misses              12,200,075,320  11,588,126,740
      sched:sched_move_numa     1,264           1
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     0               0
      migrate:mm_migrate_pages  899             36
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        21109   430
      numa_hint_faults_local  17120   77
      numa_hit                72934   71277
      numa_huge_pte_updates   42      0
      numa_interleave         33      22
      numa_local              72866   71218
      numa_other              68      59
      numa_pages_migrated     915     23
      numa_pte_updates        42326   0
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,312,022    8,707,565
      migrations                231,705      171,342
      faults                    310,242      310,820
      cache-misses              402,324,573  136,115,400
      sched:sched_move_numa     193          215
      sched:sched_stick_numa    0            6
      sched:sched_swap_numa     3            24
      migrate:mm_migrate_pages  93           162
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        11838   8985
      numa_hint_faults_local  11216   8154
      numa_hit                90689   93819
      numa_huge_pte_updates   0       0
      numa_interleave         1579    882
      numa_local              89634   93496
      numa_other              1055    323
      numa_pages_migrated     92      169
      numa_pte_updates        12109   9217
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before      After
      cs                        2,170,481   2,152,072
      migrations                10,126      10,704
      faults                    160,962     164,376
      cache-misses              10,834,845  3,818,437
      sched:sched_move_numa     10          16
      sched:sched_stick_numa    0           0
      sched:sched_swap_numa     0           7
      migrate:mm_migrate_pages  2           199
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        403     2248
      numa_hint_faults_local  358     1666
      numa_hit                25898   25704
      numa_huge_pte_updates   0       0
      numa_interleave         207     200
      numa_local              25860   25679
      numa_other              38      25
      numa_pages_migrated     2       197
      numa_pte_updates        400     2234
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        110,339,633      93,330,595
      migrations                4,139,812        4,122,061
      faults                    863,622          865,979
      cache-misses              231,838,045,660  225,395,083,479
      sched:sched_move_numa     2,196            2,372
      sched:sched_stick_numa    33               24
      sched:sched_swap_numa     544              769
      migrate:mm_migrate_pages  2,469            1,677
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        85748   91638
      numa_hint_faults_local  66831   78096
      numa_hit                242213  242225
      numa_huge_pte_updates   0       0
      numa_interleave         0       2
      numa_local              242211  242219
      numa_other              2       6
      numa_pages_migrated     2376    1515
      numa_pte_updates        86233   92274
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        59,331,057      51,487,271
      migrations                552,019         537,170
      faults                    266,586         256,921
      cache-misses              73,796,312,990  70,073,831,187
      sched:sched_move_numa     981             576
      sched:sched_stick_numa    54              24
      sched:sched_swap_numa     286             327
      migrate:mm_migrate_pages  713             726
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        14807   12000
      numa_hint_faults_local  5738    5024
      numa_hit                36230   36470
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              36228   36465
      numa_other              2       5
      numa_pages_migrated     703     726
      numa_pte_updates        14742   11930
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-7-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • M
      sched/numa: Limit the conditions where scan period is reset · 05cbdf4f
      Mel Gorman committed
      migrate_task_rq_fair() resets the scan rate for NUMA balancing on every
      cross-node migration. In the event of excessive load balancing due to
      saturation, this may result in the scan rate being pegged at maximum and
      further overloading the machine.
      
      This patch resets the scan only if NUMA balancing is active, a preferred
      node has been selected and the task is being migrated away from the preferred
      node, as these migrations are the most harmful. For example, a migration to the preferred
      node does not justify a faster scan rate. Similarly, a migration between two
      nodes that are not preferred is probably bouncing due to over-saturation of
      the machine.  In that case, scanning faster and trapping more NUMA faults
      will further overload the machine.
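
The conditions listed above can be written as a single predicate. This is a sketch of the commit message's wording, not of the kernel function: the name and parameters are hypothetical, and node ids are plain integers with -1 meaning "no preferred node selected".

```c
#include <stdbool.h>

/*
 * Return true when a migration from src_nid to dst_nid should reset the
 * NUMA scan period, following the conditions in the commit message.
 */
bool should_reset_scan_period(bool numa_balancing_active, int preferred_nid,
                              int src_nid, int dst_nid)
{
    if (!numa_balancing_active)
        return false;
    if (src_nid == dst_nid)      /* not a cross-node migration */
        return false;
    if (preferred_nid < 0)       /* no preferred node selected yet */
        return false;
    /* Only a move away from the preferred node justifies faster scanning. */
    return src_nid == preferred_nid;
}
```

A move to the preferred node, or between two non-preferred nodes, returns false, which matches the two examples given in the message.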
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     203370  205332   0.964744
      1     328431  319785   -2.63252
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     206070  206585   0.249915
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     188386  189162   0.41192
      1     201566  213760   6.04963
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     59157.4  58736.8  -0.710985
      1     105495   105419   -0.0720413
      
      Some event stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,825,492      14,285,708
      migrations                1,152,509       1,180,621
      faults                    371,948         339,114
      cache-misses              55,654,206,041  55,205,631,894
      sched:sched_move_numa     1,856           843
      sched:sched_stick_numa    4               6
      sched:sched_swap_numa     428             219
      migrate:mm_migrate_pages  898             365
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        57146   26907
      numa_hint_faults_local  51612   24279
      numa_hit                238164  239771
      numa_huge_pte_updates   16      0
      numa_interleave         63      68
      numa_local              238085  239688
      numa_other              79      83
      numa_pages_migrated     883     363
      numa_pte_updates        67540   27415
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,288,525       3,202,779
      migrations                38,652          37,186
      faults                    111,678         106,076
      cache-misses              12,111,197,376  12,024,873,744
      sched:sched_move_numa     900             931
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     5               1
      migrate:mm_migrate_pages  714             637
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        18572   17409
      numa_hint_faults_local  14850   14367
      numa_hit                73197   73953
      numa_huge_pte_updates   11      20
      numa_interleave         25      25
      numa_local              73138   73892
      numa_other              59      61
      numa_pages_migrated     712     668
      numa_pte_updates        24021   27276
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,451,543    8,474,013
      migrations                202,804      254,934
      faults                    310,024      320,506
      cache-misses              253,522,507  110,580,458
      sched:sched_move_numa     213          725
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     2            7
      migrate:mm_migrate_pages  88           145
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        11830   22797
      numa_hint_faults_local  11301   21539
      numa_hit                90038   89308
      numa_huge_pte_updates   0       0
      numa_interleave         855     865
      numa_local              89796   88955
      numa_other              242     353
      numa_pages_migrated     88      149
      numa_pte_updates        12039   22930
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,049,153  2,195,628
      migrations                11,405     11,179
      faults                    162,309    149,656
      cache-misses              7,203,343  8,117,515
      sched:sched_move_numa     22         49
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  1          5
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        1693    3577
      numa_hint_faults_local  1669    3476
      numa_hit                25177   26142
      numa_huge_pte_updates   0       0
      numa_interleave         194     358
      numa_local              24993   26042
      numa_other              184     100
      numa_pages_migrated     1       5
      numa_pte_updates        1577    3587
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        94,515,937       100,602,296
      migrations                4,203,554        4,135,630
      faults                    832,697          789,256
      cache-misses              226,248,698,331  226,160,621,058
      sched:sched_move_numa     1,730            1,366
      sched:sched_stick_numa    14               16
      sched:sched_swap_numa     432              374
      migrate:mm_migrate_pages  1,398            1,350
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        80079   47857
      numa_hint_faults_local  68620   39768
      numa_hit                241187  240165
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              241186  240165
      numa_other              1       0
      numa_pages_migrated     1347    1224
      numa_pte_updates        80729   48354
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        63,704,961      58,515,496
      migrations                573,404         564,845
      faults                    230,878         245,807
      cache-misses              76,568,222,781  73,603,757,976
      sched:sched_move_numa     509             996
      sched:sched_stick_numa    31              10
      sched:sched_swap_numa     182             193
      migrate:mm_migrate_pages  541             646
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        8501    13422
      numa_hint_faults_local  2960    5619
      numa_hit                35526   36118
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              35526   36116
      numa_other              0       2
      numa_pages_migrated     539     616
      numa_pte_updates        8433    13374
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-5-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Reset scan rate whenever task moves across nodes · 3f9672ba
      Srikar Dronamraju committed
      Currently the task scan rate is reset when the NUMA balancer migrates the
      task to a different node. If the NUMA balancer initiates a swap, the reset
      applies only to the task that initiates the swap. Similarly, no scan rate
      reset is done if the task is migrated across nodes by the traditional load
      balancer.
      
      Instead, move the scan reset to migrate_task_rq. This ensures that a
      task moved out of its preferred node either gets back to its preferred
      node quickly or finds a new preferred node. Doing so is fair to
      all tasks migrating across nodes.
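
A minimal sketch of resetting the scan period in a common migration hook, regardless of which balancer moved the task. The fixed CPUs-per-node topology, the struct and all names are assumptions for illustration only.

```c
#include <stdbool.h>

/* Hypothetical topology: CPUs are numbered consecutively, four per node. */
#define CPUS_PER_NODE_SKETCH 4

/* Hypothetical, simplified per-task scan state. */
struct scan_task_sketch {
    unsigned int scan_period_ms;
};

static int cpu_to_node_sketch(int cpu)
{
    return cpu / CPUS_PER_NODE_SKETCH;
}

/*
 * Common migration hook: any cross-node move resets the scan period to its
 * starting value. Returns true when a reset happened.
 */
bool migrate_task_rq_sketch(struct scan_task_sketch *p, int old_cpu,
                            int new_cpu, unsigned int start_period_ms)
{
    if (cpu_to_node_sketch(old_cpu) == cpu_to_node_sketch(new_cpu))
        return false;

    p->scan_period_ms = start_period_ms;
    return true;
}
```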
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     200668  203370   1.3465
      1     321791  328431   2.06345
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     204848  206070   0.59654
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     188098  188386   0.153112
      1     200351  201566   0.606436
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     58145.9  59157.4  1.73959
      1     103798   105495   1.63491
      
      Some event stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,912,183      13,825,492
      migrations                1,155,931       1,152,509
      faults                    367,139         371,948
      cache-misses              54,240,196,814  55,654,206,041
      sched:sched_move_numa     1,571           1,856
      sched:sched_stick_numa    9               4
      sched:sched_swap_numa     463             428
      migrate:mm_migrate_pages  703             898
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        50155   57146
      numa_hint_faults_local  45264   51612
      numa_hit                239652  238164
      numa_huge_pte_updates   36      16
      numa_interleave         68      63
      numa_local              239576  238085
      numa_other              76      79
      numa_pages_migrated     680     883
      numa_pte_updates        71146   67540
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,156,720       3,288,525
      migrations                30,354          38,652
      faults                    97,261          111,678
      cache-misses              12,400,026,826  12,111,197,376
      sched:sched_move_numa     4               900
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     1               5
      migrate:mm_migrate_pages  20              714
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        272     18572
      numa_hint_faults_local  186     14850
      numa_hit                71362   73197
      numa_huge_pte_updates   0       11
      numa_interleave         23      25
      numa_local              71299   73138
      numa_other              63      59
      numa_pages_migrated     2       712
      numa_pte_updates        0       24021
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,606,824    8,451,543
      migrations                155,352      202,804
      faults                    301,409      310,024
      cache-misses              157,759,224  253,522,507
      sched:sched_move_numa     168          213
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     3            2
      migrate:mm_migrate_pages  125          88
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        4650    11830
      numa_hint_faults_local  3946    11301
      numa_hit                90489   90038
      numa_huge_pte_updates   0       0
      numa_interleave         892     855
      numa_local              90034   89796
      numa_other              455     242
      numa_pages_migrated     124     88
      numa_pte_updates        4818    12039
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,113,167  2,049,153
      migrations                10,533     11,405
      faults                    142,727    162,309
      cache-misses              5,594,192  7,203,343
      sched:sched_move_numa     10         22
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  6          1
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        744     1693
      numa_hint_faults_local  584     1669
      numa_hit                25551   25177
      numa_huge_pte_updates   0       0
      numa_interleave         263     194
      numa_local              25302   24993
      numa_other              249     184
      numa_pages_migrated     6       1
      numa_pte_updates        744     1577
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        101,227,352      94,515,937
      migrations                4,151,829        4,203,554
      faults                    745,233          832,697
      cache-misses              224,669,561,766  226,248,698,331
      sched:sched_move_numa     617              1,730
      sched:sched_stick_numa    2                14
      sched:sched_swap_numa     187              432
      migrate:mm_migrate_pages  316              1,398
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        24195   80079
      numa_hint_faults_local  21639   68620
      numa_hit                238331  241187
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              238331  241186
      numa_other              0       1
      numa_pages_migrated     204     1347
      numa_pte_updates        24561   80729
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        62,738,978      63,704,961
      migrations                562,702         573,404
      faults                    228,465         230,878
      cache-misses              75,778,067,952  76,568,222,781
      sched:sched_move_numa     648             509
      sched:sched_stick_numa    13              31
      sched:sched_swap_numa     137             182
      migrate:mm_migrate_pages  733             541
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        10281   8501
      numa_hint_faults_local  3242    2960
      numa_hit                36338   35526
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              36338   35526
      numa_other              0       0
      numa_pages_migrated     706     539
      numa_pte_updates        10176   8433
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-4-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Pass destination CPU as a parameter to migrate_task_rq · 1327237a
      Srikar Dronamraju committed
      This additional parameter (new_cpu) is used later to identify whether a
      task migration crosses nodes.
      
      No functional change.
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     203353  200668   -1.32036
      1     328205  321791   -1.95427
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     214384  204848   -4.44809
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     188553  188098   -0.241311
      1     196273  200351   2.07772
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     57581.2  58145.9  0.980702
      1     103468   103798   0.318939
      
      These results bring out the variance between different specjbb2005 runs.
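
      The %Change column in the tables above appears to be the relative change from Prev to Current, expressed in percent. A minimal sketch of that arithmetic follows; the function name is hypothetical.

```c
#include <assert.h>

/*
 * Relative change from prev to current, in percent.
 * Positive values mean Current is higher than Prev (better for bops).
 */
static double pct_change(double prev, double current)
{
    assert(prev != 0.0);
    return (current - prev) / prev * 100.0;
}
```

      For the Haswell 4-JVM row, pct_change(203353, 200668) comes out at roughly -1.32, in line with the table.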
      
      Some event stats before and after applying the patch:
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,941,377      13,912,183
      migrations                1,157,323       1,155,931
      faults                    382,175         367,139
      cache-misses              54,993,823,500  54,240,196,814
      sched:sched_move_numa     2,005           1,571
      sched:sched_stick_numa    14              9
      sched:sched_swap_numa     529             463
      migrate:mm_migrate_pages  1,573           703
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        67099   50155
      numa_hint_faults_local  58456   45264
      numa_hit                240416  239652
      numa_huge_pte_updates   18      36
      numa_interleave         65      68
      numa_local              240339  239576
      numa_other              77      76
      numa_pages_migrated     1574    680
      numa_pte_updates        77182   71146
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,176,453       3,156,720
      migrations                30,238          30,354
      faults                    87,869          97,261
      cache-misses              12,544,479,391  12,400,026,826
      sched:sched_move_numa     23              4
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     6               1
      migrate:mm_migrate_pages  10              20
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        236     272
      numa_hint_faults_local  201     186
      numa_hit                72293   71362
      numa_huge_pte_updates   0       0
      numa_interleave         26      23
      numa_local              72233   71299
      numa_other              60      63
      numa_pages_migrated     8       2
      numa_pte_updates        0       0
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,478,820    8,606,824
      migrations                171,323      155,352
      faults                    307,499      301,409
      cache-misses              240,353,599  157,759,224
      sched:sched_move_numa     214          168
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     4            3
      migrate:mm_migrate_pages  89           125
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        5301    4650
      numa_hint_faults_local  4745    3946
      numa_hit                92943   90489
      numa_huge_pte_updates   0       0
      numa_interleave         899     892
      numa_local              92345   90034
      numa_other              598     455
      numa_pages_migrated     88      124
      numa_pte_updates        5505    4818
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before      After
      cs                        2,066,172   2,113,167
      migrations                11,076      10,533
      faults                    149,544     142,727
      cache-misses              10,398,067  5,594,192
      sched:sched_move_numa     43          10
      sched:sched_stick_numa    0           0
      sched:sched_swap_numa     0           0
      migrate:mm_migrate_pages  6           6
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        3552    744
      numa_hint_faults_local  3347    584
      numa_hit                25611   25551
      numa_huge_pte_updates   0       0
      numa_interleave         213     263
      numa_local              25583   25302
      numa_other              28      249
      numa_pages_migrated     6       6
      numa_pte_updates        3535    744
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        99,358,136       101,227,352
      migrations                4,041,607        4,151,829
      faults                    749,653          745,233
      cache-misses              225,562,543,251  224,669,561,766
      sched:sched_move_numa     771              617
      sched:sched_stick_numa    14               2
      sched:sched_swap_numa     204              187
      migrate:mm_migrate_pages  1,180            316
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        27409   24195
      numa_hint_faults_local  20677   21639
      numa_hit                239988  238331
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              239983  238331
      numa_other              5       0
      numa_pages_migrated     1016    204
      numa_pte_updates        27916   24561
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        60,899,307      62,738,978
      migrations                544,668         562,702
      faults                    270,834         228,465
      cache-misses              74,543,455,635  75,778,067,952
      sched:sched_move_numa     735             648
      sched:sched_stick_numa    25              13
      sched:sched_swap_numa     174             137
      migrate:mm_migrate_pages  816             733
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        11059   10281
      numa_hint_faults_local  4733    3242
      numa_hit                41384   36338
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              41383   36338
      numa_other              1       0
      numa_pages_migrated     815     706
      numa_pte_updates        11323   10176
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-3-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Stop multiple tasks from moving to the CPU at the same time · a4739eca
      Srikar Dronamraju committed
      Task migration under NUMA balancing can happen in parallel. More than
      one task might choose to migrate to the same CPU at the same time. This
      can result in:
      
      - During task swap, choosing a task that was not part of the evaluation.
      - During task swap, a task that has just moved to its preferred node
        being moved to a completely different node.
      - During task swap, a task that fails to move to its preferred node
        having to wait an extra interval for the next migration opportunity.
      - During task movement, multiple task movements causing load imbalance.
      
      This problem is more likely if there are more cores per node or more
      nodes in the system.
      
      Use a per run-queue variable to check if NUMA-balance is active on the
      run-queue.
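
      One way to picture such a per run-queue flag is an atomic "claim" on the destination: the first migration to set the flag proceeds, and any concurrent migration that finds it already set backs off. The userspace sketch below uses C11 atomics; the struct, field, and function names are assumptions for illustration and are not quoted from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down run-queue with only the flag of interest. */
struct rq_sketch {
    atomic_int numa_migrate_on; /* 1 while a NUMA migration targets this rq */
};

/*
 * Try to claim the destination run-queue for a NUMA migration.
 * Returns true only for the caller that flips the flag from 0 to 1.
 */
static bool numa_claim_rq(struct rq_sketch *rq)
{
    assert(rq != NULL);
    return atomic_exchange(&rq->numa_migrate_on, 1) == 0;
}

/* Release the claim once the migration has completed or been abandoned. */
static void numa_release_rq(struct rq_sketch *rq)
{
    assert(rq != NULL);
    atomic_store(&rq->numa_migrate_on, 0);
}
```

      Because the exchange is atomic, two tasks racing for the same CPU cannot both see the flag as clear, which is the property needed to keep a second task from moving to a CPU that is already a migration target.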
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     200194  203353   1.57797
      1     311331  328205   5.41995
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     197654  214384   8.46429
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     192605  188553   -2.10379
      1     213402  196273   -8.02664
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     52227.1  57581.2  10.2516
      1     102529   103468   0.915838
      
      There is a regression on the Power9 box. Looking at the details,
      that box shows a sudden jump in cache-misses with this patch.
      All other parameters seem to point towards NUMA
      consolidation.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,345,784      13,941,377
      migrations                1,127,820       1,157,323
      faults                    374,736         382,175
      cache-misses              55,132,054,603  54,993,823,500
      sched:sched_move_numa     1,923           2,005
      sched:sched_stick_numa    52              14
      sched:sched_swap_numa     595             529
      migrate:mm_migrate_pages  1,932           1,573
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        60605   67099
      numa_hint_faults_local  51804   58456
      numa_hit                239945  240416
      numa_huge_pte_updates   14      18
      numa_interleave         60      65
      numa_local              239865  240339
      numa_other              80      77
      numa_pages_migrated     1931    1574
      numa_pte_updates        67823   77182
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,016,467       3,176,453
      migrations                37,326          30,238
      faults                    115,342         87,869
      cache-misses              11,692,155,554  12,544,479,391
      sched:sched_move_numa     965             23
      sched:sched_stick_numa    8               0
      sched:sched_swap_numa     35              6
      migrate:mm_migrate_pages  1,168           10
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        16286   236
      numa_hint_faults_local  11863   201
      numa_hit                112482  72293
      numa_huge_pte_updates   33      0
      numa_interleave         20      26
      numa_local              112419  72233
      numa_other              63      60
      numa_pages_migrated     1144    8
      numa_pte_updates        32859   0
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,629,724    8,478,820
      migrations                221,052      171,323
      faults                    308,661      307,499
      cache-misses              135,574,913  240,353,599
      sched:sched_move_numa     147          214
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     2            4
      migrate:mm_migrate_pages  64           89
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        11481   5301
      numa_hint_faults_local  10968   4745
      numa_hit                89773   92943
      numa_huge_pte_updates   0       0
      numa_interleave         1116    899
      numa_local              89220   92345
      numa_other              553     598
      numa_pages_migrated     62      88
      numa_pte_updates        11694   5505
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,272,887  2,066,172
      migrations                12,206     11,076
      faults                    163,704    149,544
      cache-misses              4,801,186  10,398,067
      sched:sched_move_numa     44         43
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  17         6
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        2261    3552
      numa_hint_faults_local  1993    3347
      numa_hit                25726   25611
      numa_huge_pte_updates   0       0
      numa_interleave         239     213
      numa_local              25498   25583
      numa_other              228     28
      numa_pages_migrated     17      6
      numa_pte_updates        2266    3535
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        117,980,962      99,358,136
      migrations                3,950,220        4,041,607
      faults                    736,979          749,653
      cache-misses              224,976,072,879  225,562,543,251
      sched:sched_move_numa     504              771
      sched:sched_stick_numa    50               14
      sched:sched_swap_numa     239              204
      migrate:mm_migrate_pages  1,260            1,180
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        18293   27409
      numa_hint_faults_local  11969   20677
      numa_hit                240854  239988
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              240851  239983
      numa_other              3       5
      numa_pages_migrated     1190    1016
      numa_pte_updates        18106   27916
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        61,053,158      60,899,307
      migrations                551,586         544,668
      faults                    244,174         270,834
      cache-misses              74,326,766,973  74,543,455,635
      sched:sched_move_numa     344             735
      sched:sched_stick_numa    24              25
      sched:sched_swap_numa     140             174
      migrate:mm_migrate_pages  568             816
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        6461    11059
      numa_hint_faults_local  2283    4733
      numa_hit                35661   41384
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              35661   41383
      numa_other              0       1
      numa_pages_migrated     568     815
      numa_pte_updates        6518    11323
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-2-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  23. 10 Sep, 2018 5 commits
  24. 25 Jul, 2018 4 commits
    • S
      sched/numa: Move task_numa_placement() closer to numa_migrate_preferred() · b6a60cf3
      Srikar Dronamraju committed
      numa_migrate_preferred() is called periodically or when the task's
      preferred node changes. Preferred node evaluations happen once per scan
      sequence.
      
      If the scan completes just after the periodic NUMA migration, we try to
      migrate to the preferred node, and then the preferred node might change,
      requiring another node migration.
      
      Avoid this by checking for scan sequence completion only when checking
      for periodic migration.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25862.6     26158.1     1.14258
      1     74357       72725       -2.19482
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     117019      113992      -2.58
      1     179095      174947      -2.31
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      449.46      770.77      615.22      101.70
      numa01.sh       Sys:      132.72      208.17      170.46       24.96
      numa01.sh      User:    39185.26    60290.89    50066.76     6807.84
      numa02.sh      Real:       60.85       61.79       61.28        0.37
      numa02.sh       Sys:       15.34       24.71       21.08        3.61
      numa02.sh      User:     5204.41     5249.85     5231.21       17.60
      numa03.sh      Real:      785.50      916.97      840.77       44.98
      numa03.sh       Sys:      108.08      133.60      119.43        8.82
      numa03.sh      User:    61422.86    70919.75    64720.87     3310.61
      numa04.sh      Real:      429.57      587.37      480.80       57.40
      numa04.sh       Sys:      240.61      321.97      290.84       33.58
      numa04.sh      User:    34597.65    40498.99    37079.48     2060.72
      numa05.sh      Real:      392.09      431.25      414.65       13.82
      numa05.sh       Sys:      229.41      372.48      297.54       53.14
      numa05.sh      User:    33390.86    34697.49    34222.43      556.42
      
      Testcase       Time:         Min         Max         Avg      StdDev 	%Change
      numa01.sh      Real:      424.63      566.18      498.12       59.26 	 23.50%
      numa01.sh       Sys:      160.19      256.53      208.98       37.02 	 -18.4%
      numa01.sh      User:    37320.00    46225.58    42001.57     3482.45 	 19.20%
      numa02.sh      Real:       60.17       62.47       60.91        0.85 	 0.607%
      numa02.sh       Sys:       15.30       22.82       17.04        2.90 	 23.70%
      numa02.sh      User:     5202.13     5255.51     5219.08       20.14 	 0.232%
      numa03.sh      Real:      823.91      844.89      833.86        8.46 	 0.828%
      numa03.sh       Sys:      130.69      148.29      140.47        6.21 	 -14.9%
      numa03.sh      User:    62519.15    64262.20    63613.38      620.05 	 1.740%
      numa04.sh      Real:      515.30      603.74      548.56       30.93 	 -12.3%
      numa04.sh       Sys:      459.73      525.48      489.18       21.63 	 -40.5%
      numa04.sh      User:    40561.96    44919.18    42047.87     1526.85 	 -11.8%
      numa05.sh      Real:      396.58      454.37      421.13       19.71 	 -1.53%
      numa05.sh       Sys:      208.72      422.02      348.90       73.60 	 -14.7%
      numa05.sh      User:    33124.08    36109.35    34846.47     1089.74 	 -1.79%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-20-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Use group_weights to identify if migration degrades locality · f35678b6
      Srikar Dronamraju committed
      On NUMA_BACKPLANE and NUMA_GLUELESS_MESH systems, tasks/memory should be
      consolidated to the closest group of nodes. In such a case, relying on
      the group_fault metric may not always help to consolidate. There can
      always be a case where a node closer to the preferred node has fewer
      faults than a node further away from the preferred node. In such a case,
      moving to the node with more faults might prevent NUMA consolidation.
      
      Using group_weight would help to consolidate task/memory around the
      preferred_node.
      
      While here, to be on the conservative side, don't override the
      "migration degrades locality" logic for CPU_NEWLY_IDLE load balancing.
      
      Note: Similar problems exist with should_numa_migrate_memory() and will
      be dealt with separately.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25645.4     25960       1.22
      1     72142       73550       1.95
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     110199      120071      8.958
      1     176303      176249      -0.03
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      490.04      774.86      596.26       96.46
      numa01.sh       Sys:      151.52      242.88      184.82       31.71
      numa01.sh      User:    41418.41    60844.59    48776.09     6564.27
      numa02.sh      Real:       60.14       62.94       60.98        1.00
      numa02.sh       Sys:       16.11       30.77       21.20        5.28
      numa02.sh      User:     5184.33     5311.09     5228.50       44.24
      numa03.sh      Real:      790.95      856.35      826.41       24.11
      numa03.sh       Sys:      114.93      118.85      117.05        1.63
      numa03.sh      User:    60990.99    64959.28    63470.43     1415.44
      numa04.sh      Real:      434.37      597.92      504.87       59.70
      numa04.sh       Sys:      237.63      397.40      289.74       55.98
      numa04.sh      User:    34854.87    41121.83    38572.52     2615.84
      numa05.sh      Real:      386.77      448.90      417.22       22.79
      numa05.sh       Sys:      149.23      379.95      303.04       79.55
      numa05.sh      User:    32951.76    35959.58    34562.18     1034.05
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      493.19      672.88      597.51       59.38 	 -0.20%
      numa01.sh       Sys:      150.09      245.48      207.76       34.26 	 -11.0%
      numa01.sh      User:    41928.51    53779.17    48747.06     3901.39 	 0.059%
      numa02.sh      Real:       60.63       62.87       61.22        0.83 	 -0.39%
      numa02.sh       Sys:       16.64       27.97       20.25        4.06 	 4.691%
      numa02.sh      User:     5222.92     5309.60     5254.03       29.98 	 -0.48%
      numa03.sh      Real:      821.52      902.15      863.60       32.41 	 -4.30%
      numa03.sh       Sys:      112.04      130.66      118.35        7.08 	 -1.09%
      numa03.sh      User:    62245.16    69165.14    66443.04     2450.32 	 -4.47%
      numa04.sh      Real:      414.53      519.57      476.25       37.00 	 6.009%
      numa04.sh       Sys:      181.84      335.67      280.41       54.07 	 3.327%
      numa04.sh      User:    33924.50    39115.39    37343.78     1934.26 	 3.290%
      numa05.sh      Real:      408.30      441.45      417.90       12.05 	 -0.16%
      numa05.sh       Sys:      233.41      381.60      295.58       57.37 	 2.523%
      numa05.sh      User:    33301.31    35972.50    34335.19      938.94 	 0.661%
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-16-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Update the scan period without holding the numa_group lock · 30619c89
      Srikar Dronamraju committed
      The metrics for updating scan periods are local or task specific.
      Currently this update happens under the numa_group lock, which seems
      unnecessary. Hence move this update outside the lock.
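
      The general pattern can be sketched in userspace as follows: only the shared group counters are touched while the lock is held, and the task-local scan period is adjusted after the lock is dropped. Everything here is an assumption for illustration: the struct layouts, the spinlock built on a C11 atomic_flag, and the placeholder halve/double policy are not taken from the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical shared state, protected by a simple spinlock. */
struct group_sketch {
    atomic_flag lock;
    unsigned long total_faults;
};

/* Hypothetical per-task state: only the owning task updates it. */
struct task_sketch {
    unsigned long local_faults;
    unsigned long remote_faults;
    unsigned int scan_period_ms;
};

static void group_lock(struct group_sketch *g)
{
    while (atomic_flag_test_and_set(&g->lock))
        ; /* spin until the flag was previously clear */
}

static void group_unlock(struct group_sketch *g)
{
    atomic_flag_clear(&g->lock);
}

/* Placeholder policy: scan less often when faults are mostly local. */
static void update_scan_period_sketch(struct task_sketch *t)
{
    if (t->local_faults >= t->remote_faults)
        t->scan_period_ms *= 2;
    else
        t->scan_period_ms /= 2;
}

static void placement_sketch(struct task_sketch *t, struct group_sketch *g)
{
    assert(t != NULL && g != NULL);

    group_lock(g);
    g->total_faults += t->local_faults + t->remote_faults; /* shared data */
    group_unlock(g);

    /* Task-local inputs and outputs only, so no group lock is needed. */
    update_scan_period_sketch(t);
}
```

      Keeping the task-local update outside the critical section shortens the time the group lock is held, which is the motivation given above.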
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25355.9     25645.4     1.141
      1     72812       72142       -0.92
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-15-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Remove numa_has_capacity() · 2d4056fa
      Srikar Dronamraju committed
      task_numa_find_cpu() helps to find the CPU to swap/move the task to.
      It's guarded by numa_has_capacity(). However, a node not having capacity
      shouldn't deter a task swap if the swap helps NUMA placement.
      
      Further, load_too_imbalanced(), which evaluates possibilities of
      move/swap, provides checks similar to those in numa_has_capacity().
      
      Hence remove numa_has_capacity() to enhance possibilities of task
      swapping even if load is imbalanced.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25657.9     25804.1     0.569
      1     74435       73413       -1.37
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@surriel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-13-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>