1. 06 4月, 2019 2 次提交
  2. 06 3月, 2019 1 次提交
    • P
      sched/wake_q: Fix wakeup ordering for wake_q · 653a1dbc
      Peter Zijlstra 提交于
      [ Upstream commit 4c4e3731564c8945ac5ac90fc2a1e1f21cb79c92 ]
      
      Notable cmpxchg() does not provide ordering when it fails, however
      wake_q_add() requires ordering in this specific case too. Without this
      it would be possible for the concurrent wakeup to not observe our
      prior state.
      
      Andrea Parri provided:
      
        C wake_up_q-wake_q_add
      
        {
      	int next = 0;
      	int y = 0;
        }
      
        P0(int *next, int *y)
        {
      	int r0;
      
      	/* in wake_up_q() */
      
      	WRITE_ONCE(*next, 1);   /* node->next = NULL */
      	smp_mb();               /* implied by wake_up_process() */
      	r0 = READ_ONCE(*y);
        }
      
        P1(int *next, int *y)
        {
      	int r1;
      
      	/* in wake_q_add() */
      
      	WRITE_ONCE(*y, 1);      /* wake_cond = true */
      	smp_mb__before_atomic();
      	r1 = cmpxchg_relaxed(next, 1, 2);
        }
      
        exists (0:r0=0 /\ 1:r1=0)
      
        This "exists" clause cannot be satisfied according to the LKMM:
      
        Test wake_up_q-wake_q_add Allowed
        States 3
        0:r0=0; 1:r1=1;
        0:r0=1; 1:r1=0;
        0:r0=1; 1:r1=1;
        No
        Witnesses
        Positive: 0 Negative: 3
        Condition exists (0:r0=0 /\ 1:r1=0)
        Observation wake_up_q-wake_q_add Never 0 3
      Reported-by: NYongji Xie <elohimes@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      653a1dbc
  3. 13 2月, 2019 1 次提交
    • J
      cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM · 97a7fa90
      Josh Poimboeuf 提交于
      commit b284909abad48b07d3071a9fc9b5692b3e64914b upstream.
      
      With the following commit:
      
        73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      
      ... the hotplug code attempted to detect when SMT was disabled by BIOS,
      in which case it reported SMT as permanently disabled.  However, that
      code broke a virt hotplug scenario, where the guest is booted with only
      primary CPU threads, and a sibling is brought online later.
      
      The problem is that there doesn't seem to be a way to reliably
      distinguish between the HW "SMT disabled by BIOS" case and the virt
      "sibling not yet brought online" case.  So the above-mentioned commit
      was a bit misguided, as it permanently disabled SMT for both cases,
      preventing future virt sibling hotplugs.
      
      Going back and reviewing the original problems which were attempted to
      be solved by that commit, when SMT was disabled in BIOS:
      
        1) /sys/devices/system/cpu/smt/control showed "on" instead of
           "notsupported"; and
      
        2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.
      
      I'd propose that we instead consider #1 above to not actually be a
      problem.  Because, at least in the virt case, it's possible that SMT
      wasn't disabled by BIOS and a sibling thread could be brought online
      later.  So it makes sense to just always default the smt control to "on"
      to allow for that possibility (assuming cpuid indicates that the CPU
      supports SMT).
      
      The real problem is #2, which has a simple fix: change vmx_vm_init() to
      query the actual current SMT state -- i.e., whether any siblings are
      currently online -- instead of looking at the SMT "control" sysfs value.
      
      So fix it by:
      
        a) reverting the original "fix" and its followup fix:
      
           73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
           bc2d8d26 ("cpu/hotplug: Fix SMT supported evaluation")
      
           and
      
        b) changing vmx_vm_init() to query the actual current SMT state --
           instead of the sysfs control value -- to determine whether the L1TF
           warning is needed.  This also requires the 'sched_smt_present'
           variable to exported, instead of 'cpu_smt_control'.
      
      Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      Reported-by: NIgor Mammedov <imammedo@redhat.com>
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      97a7fa90
  4. 13 1月, 2019 1 次提交
    • L
      sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f654 · dc8408ea
      Linus Torvalds 提交于
      commit c40f7d74c741a907cfaeb73a7697081881c497d0 upstream.
      
      Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
      scheduler under high loads, starting at around the v4.18 time frame,
      and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
      manipulation.
      
      Do a (manual) revert of:
      
        a9e7f654 ("sched/fair: Fix O(nr_cgroups) in load balance path")
      
      It turns out that the list_del_leaf_cfs_rq() introduced by this commit
      is a surprising property that was not considered in followup commits
      such as:
      
        9c2791f9 ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")
      
      As Vincent Guittot explains:
      
       "I think that there is a bigger problem with commit a9e7f654 and
        cfs_rq throttling:
      
        Let take the example of the following topology TG2 --> TG1 --> root:
      
         1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
            cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
            one path because it has never been used and can't be throttled so
            tmp_alone_branch will point to leaf_cfs_rq_list at the end.
      
         2) Then TG1 is throttled
      
         3) and we add TG3 as a new child of TG1.
      
         4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
            cfs_rq and tmp_alone_branch will stay  on rq->leaf_cfs_rq_list.
      
        With commit a9e7f654, we can del a cfs_rq from rq->leaf_cfs_rq_list.
        So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1
        cfs_rq is removed from the list.
        Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
        but tmp_alone_branch still points to TG3 cfs_rq because its throttled
        parent can't be enqueued when the lock is released.
        tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.
      
        So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
        points on another TG cfs_rq, the next TG cfs_rq that will be added,
        will be linked outside rq->leaf_cfs_rq_list - which is bad.
      
        In addition, we can break the ordering of the cfs_rq in
        rq->leaf_cfs_rq_list but this ordering is used to update and
        propagate the update from leaf down to root."
      
      Instead of trying to work through all these cases and trying to reproduce
      the very high loads that produced the lockup to begin with, simplify
      the code temporarily by reverting a9e7f654 - which change was clearly
      not thought through completely.
      
      This (hopefully) gives us a kernel that doesn't lock up so people
      can continue to enjoy their holidays without worrying about regressions. ;-)
      
      [ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]
      Analyzed-by: NXie XiuQi <xiexiuqi@huawei.com>
      Analyzed-by: NVincent Guittot <vincent.guittot@linaro.org>
      Reported-by: NZhipeng Xie <xiezhipeng1@huawei.com>
      Reported-by: NSargun Dhillon <sargun@sargun.me>
      Reported-by: NXie XiuQi <xiexiuqi@huawei.com>
      Tested-by: NZhipeng Xie <xiezhipeng1@huawei.com>
      Tested-by: NSargun Dhillon <sargun@sargun.me>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NVincent Guittot <vincent.guittot@linaro.org>
      Cc: <stable@vger.kernel.org> # v4.13+
      Cc: Bin Li <huawei.libin@huawei.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: a9e7f654 ("sched/fair: Fix O(nr_cgroups) in load balance path")
      Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      dc8408ea
  5. 20 12月, 2018 1 次提交
  6. 06 12月, 2018 2 次提交
    • T
      sched/smt: Expose sched_smt_present static key · 340693ee
      Thomas Gleixner 提交于
      commit 321a874a7ef85655e93b3206d0f36b4a6097f948 upstream
      
      Make the scheduler's 'sched_smt_present' static key globaly available, so
      it can be used in the x86 speculation control code.
      
      Provide a query function and a stub for the CONFIG_SMP=n case.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      340693ee
    • P
      sched/smt: Make sched_smt_present track topology · a2c09481
      Peter Zijlstra (Intel) 提交于
      commit c5511d03ec090980732e929c318a7a6374b5550e upstream
      
      Currently the 'sched_smt_present' static key is enabled when at CPU bringup
      SMT topology is observed, but it is never disabled. However there is demand
      to also disable the key when the topology changes such that there is no SMT
      present anymore.
      
      Implement this by making the key count the number of cores that have SMT
      enabled.
      
      In particular, the SMT topology bits are set before interrrupts are enabled
      and similarly, are cleared after interrupts are disabled for the last time
      and the CPU dies.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185004.246110444@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      a2c09481
  7. 01 12月, 2018 1 次提交
    • P
      sched/fair: Fix cpu_util_wake() for 'execl' type workloads · 08fbd4e0
      Patrick Bellasi 提交于
      [ Upstream commit c469933e772132aad040bd6a2adc8edf9ad6f825 ]
      
      A ~10% regression has been reported for UnixBench's execl throughput
      test by Aaron Lu and Ye Xiaolong:
      
        https://lkml.org/lkml/2018/10/30/765
      
      That test is pretty simple, it does a "recursive" execve() syscall on the
      same binary. Starting from the syscall, this sequence is possible:
      
         do_execve()
           do_execveat_common()
             __do_execve_file()
               sched_exec()
                 select_task_rq_fair()          <==| Task already enqueued
                   find_idlest_cpu()
                     find_idlest_group()
                       capacity_spare_wake()    <==| Functions not called from
      		   cpu_util_wake()           | the wakeup path
      
      which means we can end up calling cpu_util_wake() not only from the
      "wakeup path", as its name would suggest. Indeed, the task doing an
      execve() syscall is already enqueued on the CPU we want to get the
      cpu_util_wake() for.
      
      The estimated utilization for a CPU computed in cpu_util_wake() was
      written under the assumption that function can be called only from the
      wakeup path. If instead the task is already enqueued, we end up with a
      utilization which does not remove the current task's contribution from
      the estimated utilization of the CPU.
      This will wrongly assume a reduced spare capacity on the current CPU and
      increase the chances to migrate the task on execve.
      
      The regression is tracked down to:
      
       commit d519329f ("sched/fair: Update util_est only on util_avg updates")
      
      because in that patch we turn on by default the UTIL_EST sched feature.
      However, the real issue is introduced by:
      
       commit f9be3e59 ("sched/fair: Use util_est in LB and WU paths")
      
      Let's fix this by ensuring to always discount the task estimated
      utilization from the CPU's estimated utilization when the task is also
      the current one. The same benchmark of the bug report, executed on a
      dual socket 40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine,
      reports these "Execl Throughput" figures (higher the better):
      
         mainline     : 48136.5 lps
         mainline+fix : 55376.5 lps
      
      which correspond to a 15% speedup.
      
      Moreover, since {cpu_util,capacity_spare}_wake() are not really only
      used from the wakeup path, let's remove this ambiguity by using a better
      matching name: {cpu_util,capacity_spare}_without().
      
      Since we are at that, let's also improve the existing documentation.
      Reported-by: NAaron Lu <aaron.lu@intel.com>
      Reported-by: NYe Xiaolong <xiaolong.ye@intel.com>
      Tested-by: NAaron Lu <aaron.lu@intel.com>
      Signed-off-by: NPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: f9be3e59 (sched/fair: Use util_est in LB and WU paths)
      Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      08fbd4e0
  8. 27 11月, 2018 1 次提交
    • V
      sched/core: Take the hotplug lock in sched_init_smp() · b1e814e4
      Valentin Schneider 提交于
      [ Upstream commit 40fa3780bac2b654edf23f6b13f4e2dd550aea10 ]
      
      When running on linux-next (8c60c36d0b8c ("Add linux-next specific files
      for 20181019")) + CONFIG_PROVE_LOCKING=y on a big.LITTLE system (e.g.
      Juno or HiKey960), we get the following report:
      
       [    0.748225] Call trace:
       [    0.750685]  lockdep_assert_cpus_held+0x30/0x40
       [    0.755236]  static_key_enable_cpuslocked+0x20/0xc8
       [    0.760137]  build_sched_domains+0x1034/0x1108
       [    0.764601]  sched_init_domains+0x68/0x90
       [    0.768628]  sched_init_smp+0x30/0x80
       [    0.772309]  kernel_init_freeable+0x278/0x51c
       [    0.776685]  kernel_init+0x10/0x108
       [    0.780190]  ret_from_fork+0x10/0x18
      
      The static_key in question is 'sched_asym_cpucapacity' introduced by
      commit:
      
        df054e8445a4 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")
      
      In this particular case, we enable it because smp_prepare_cpus() will
      end up fetching the capacity-dmips-mhz entry from the devicetree,
      so we already have some asymmetry detected when entering sched_init_smp().
      
      This didn't get detected in tip/sched/core because we were missing:
      
        commit cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")
      
      Calls to build_sched_domains() post sched_init_smp() will hold the
      hotplug lock, it just so happens that this very first call is a
      special case. As stated by a comment in sched_init_smp(), "There's no
      userspace yet to cause hotplug operations" so this is a harmless
      warning.
      
      However, to both respect the semantics of underlying
      callees and make lockdep happy, take the hotplug lock in
      sched_init_smp(). This also satisfies the comment atop
      sched_init_domains() that says "Callers must hold the hotplug lock".
      Reported-by: NSudeep Holla <sudeep.holla@arm.com>
      Tested-by: NSudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: NValentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar.Eggemann@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: morten.rasmussen@arm.com
      Cc: quentin.perret@arm.com
      Link: http://lkml.kernel.org/r/1540301851-3048-1-git-send-email-valentin.schneider@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b1e814e4
  9. 16 10月, 2018 1 次提交
  10. 11 10月, 2018 1 次提交
    • P
      sched/fair: Fix throttle_list starvation with low CFS quota · baa9be4f
      Phil Auld 提交于
      With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
      distribute_cfs_runtime may not empty the throttled_list before it runs
      out of runtime to distribute. In that case, due to the change from
      c06f04c7 to put throttled entries at the head of the list, later entries
      on the list will starve.  Essentially, the same X processes will get pulled
      off the list, given CPU time and then, when expired, get put back on the
      head of the list where distribute_cfs_runtime will give runtime to the same
      set of processes leaving the rest.
      
      Fix the issue by setting a bit in struct cfs_bandwidth when
      distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
      decide to put the throttled entry on the tail or the head of the list.  The
      bit is set/cleared by the callers of distribute_cfs_runtime while they hold
      cfs_bandwidth->lock.
      
      This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
      the live system. In some cases you can simply look at the throttled list and
      see the later entries are not changing:
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -976050
          2     ffff90b56cb2cc00  -484925
          3     ffff90b56cb2bc00  -658814
          4     ffff90b56cb2ba00  -275365
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -994147
          2     ffff90b56cb2cc00  -306051
          3     ffff90b56cb2bc00  -961321
          4     ffff90b56cb2ba00  -24490
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
      Sometimes it is easier to see by finding a process getting starved and looking
      at the sched_info:
      
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
      Signed-off-by: NPhil Auld <pauld@redhat.com>
      Reviewed-by: NBen Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
      Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csbSigned-off-by: NIngo Molnar <mingo@kernel.org>
      baa9be4f
  11. 02 10月, 2018 6 次提交
    • M
      sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task · 37355bdc
      Mel Gorman 提交于
      Automatic NUMA Balancing uses a multi-stage pass to decide whether a page
      should migrate to a local node. This filter avoids excessive ping-ponging
      if a page is shared or used by threads that migrate cross-node frequently.
      
      Threads inherit both page tables and the preferred node ID from the
      parent. This means that threads can trigger hinting faults earlier than
      a new task which delays scanning for a number of seconds. As it can be
      load balanced very early in its lifetime there can be an unnecessary delay
      before it starts migrating thread-local data. This patch migrates private
      pages faster early in the lifetime of a thread using the sequence counter
      as an identifier of new tasks.
      
      With this patch applied, STREAM performance is the same as 4.17 even though
      processes are not spread cross-node prematurely. Other workloads showed
      a mix of minor gains and losses. This is somewhat expected most workloads
      are not very sensitive to the starting conditions of a process.
      
                               4.19.0-rc5             4.19.0-rc5                 4.17.0
                               numab-v1r1       fastmigrate-v1r1                vanilla
      MB/sec copy     43298.52 (   0.00%)    47335.46 (   9.32%)    47219.24 (   9.06%)
      MB/sec scale    30115.06 (   0.00%)    32568.12 (   8.15%)    32527.56 (   8.01%)
      MB/sec add      32825.12 (   0.00%)    36078.94 (   9.91%)    35928.02 (   9.45%)
      MB/sec triad    32549.52 (   0.00%)    35935.94 (  10.40%)    35969.88 (  10.51%)
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux-MM <linux-mm@kvack.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181001100525.29789-3-mgorman@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      37355bdc
    • S
      sched/numa: Avoid task migration for small NUMA improvement · 6fd98e77
      Srikar Dronamraju 提交于
      If NUMA improvement from the task migration is going to be very
      minimal, then avoid task migration.
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     198512  205910   3.72673
      1     313559  318491   1.57291
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev     Current  %Change
      8     74761.9  74935.9  0.232739
      1     214874   226796   5.54837
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     180536  189780   5.12031
      1     210281  205695   -2.18089
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     56511.4  60370    6.828
      1     104899   108100   3.05151
      
      1/7 cases is regressing, if we look at events migrate_pages seem
      to vary the most especially in the regressing case. Also some
      amount of variance is expected between different runs of
      Specjbb2005.
      
      Some events stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,818,546      13,801,554
      migrations                1,149,960       1,151,541
      faults                    385,583         433,246
      cache-misses              55,259,546,768  55,168,691,835
      sched:sched_move_numa     2,257           2,551
      sched:sched_stick_numa    9               24
      sched:sched_swap_numa     512             904
      migrate:mm_migrate_pages  2,225           1,571
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        72692   113682
      numa_hint_faults_local  62270   102163
      numa_hit                238762  240181
      numa_huge_pte_updates   48      36
      numa_interleave         75      64
      numa_local              238676  240103
      numa_other              86      78
      numa_pages_migrated     2225    1564
      numa_pte_updates        98557   134080
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,173,490       3,079,150
      migrations                36,966          31,455
      faults                    108,776         99,081
      cache-misses              12,200,075,320  11,588,126,740
      sched:sched_move_numa     1,264           1
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     0               0
      migrate:mm_migrate_pages  899             36
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        21109   430
      numa_hint_faults_local  17120   77
      numa_hit                72934   71277
      numa_huge_pte_updates   42      0
      numa_interleave         33      22
      numa_local              72866   71218
      numa_other              68      59
      numa_pages_migrated     915     23
      numa_pte_updates        42326   0
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,312,022    8,707,565
      migrations                231,705      171,342
      faults                    310,242      310,820
      cache-misses              402,324,573  136,115,400
      sched:sched_move_numa     193          215
      sched:sched_stick_numa    0            6
      sched:sched_swap_numa     3            24
      migrate:mm_migrate_pages  93           162
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        11838   8985
      numa_hint_faults_local  11216   8154
      numa_hit                90689   93819
      numa_huge_pte_updates   0       0
      numa_interleave         1579    882
      numa_local              89634   93496
      numa_other              1055    323
      numa_pages_migrated     92      169
      numa_pte_updates        12109   9217
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before      After
      cs                        2,170,481   2,152,072
      migrations                10,126      10,704
      faults                    160,962     164,376
      cache-misses              10,834,845  3,818,437
      sched:sched_move_numa     10          16
      sched:sched_stick_numa    0           0
      sched:sched_swap_numa     0           7
      migrate:mm_migrate_pages  2           199
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        403     2248
      numa_hint_faults_local  358     1666
      numa_hit                25898   25704
      numa_huge_pte_updates   0       0
      numa_interleave         207     200
      numa_local              25860   25679
      numa_other              38      25
      numa_pages_migrated     2       197
      numa_pte_updates        400     2234
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        110,339,633      93,330,595
      migrations                4,139,812        4,122,061
      faults                    863,622          865,979
      cache-misses              231,838,045,660  225,395,083,479
      sched:sched_move_numa     2,196            2,372
      sched:sched_stick_numa    33               24
      sched:sched_swap_numa     544              769
      migrate:mm_migrate_pages  2,469            1,677
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        85748   91638
      numa_hint_faults_local  66831   78096
      numa_hit                242213  242225
      numa_huge_pte_updates   0       0
      numa_interleave         0       2
      numa_local              242211  242219
      numa_other              2       6
      numa_pages_migrated     2376    1515
      numa_pte_updates        86233   92274
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        59,331,057      51,487,271
      migrations                552,019         537,170
      faults                    266,586         256,921
      cache-misses              73,796,312,990  70,073,831,187
      sched:sched_move_numa     981             576
      sched:sched_stick_numa    54              24
      sched:sched_swap_numa     286             327
      migrate:mm_migrate_pages  713             726
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        14807   12000
      numa_hint_faults_local  5738    5024
      numa_hit                36230   36470
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              36228   36465
      numa_other              2       5
      numa_pages_migrated     703     726
      numa_pte_updates        14742   11930
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-7-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6fd98e77
    • M
      sched/numa: Limit the conditions where scan period is reset · 05cbdf4f
      Mel Gorman 提交于
      migrate_task_rq_fair() resets the scan rate for NUMA balancing on every
      cross-node migration. In the event of excessive load balancing due to
      saturation, this may result in the scan rate being pegged at maximum and
      further overloading the machine.
      
      This patch only resets the scan if NUMA balancing is active, a preferred
      node has been selected and the task is being migrated from the preferred
      node as these are the most harmful. For example, a migration to the preferred
      node does not justify a faster scan rate. Similarly, a migration between two
      nodes that are not preferred is probably bouncing due to over-saturation of
      the machine.  In that case, scanning faster and trapping more NUMA faults
      will further overload the machine.
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     203370  205332   0.964744
      1     328431  319785   -2.63252
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     206070  206585   0.249915
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     188386  189162   0.41192
      1     201566  213760   6.04963
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     59157.4  58736.8  -0.710985
      1     105495   105419   -0.0720413
      
      Some events stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,825,492      14,285,708
      migrations                1,152,509       1,180,621
      faults                    371,948         339,114
      cache-misses              55,654,206,041  55,205,631,894
      sched:sched_move_numa     1,856           843
      sched:sched_stick_numa    4               6
      sched:sched_swap_numa     428             219
      migrate:mm_migrate_pages  898             365
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        57146   26907
      numa_hint_faults_local  51612   24279
      numa_hit                238164  239771
      numa_huge_pte_updates   16      0
      numa_interleave         63      68
      numa_local              238085  239688
      numa_other              79      83
      numa_pages_migrated     883     363
      numa_pte_updates        67540   27415
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,288,525       3,202,779
      migrations                38,652          37,186
      faults                    111,678         106,076
      cache-misses              12,111,197,376  12,024,873,744
      sched:sched_move_numa     900             931
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     5               1
      migrate:mm_migrate_pages  714             637
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        18572   17409
      numa_hint_faults_local  14850   14367
      numa_hit                73197   73953
      numa_huge_pte_updates   11      20
      numa_interleave         25      25
      numa_local              73138   73892
      numa_other              59      61
      numa_pages_migrated     712     668
      numa_pte_updates        24021   27276
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,451,543    8,474,013
      migrations                202,804      254,934
      faults                    310,024      320,506
      cache-misses              253,522,507  110,580,458
      sched:sched_move_numa     213          725
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     2            7
      migrate:mm_migrate_pages  88           145
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        11830   22797
      numa_hint_faults_local  11301   21539
      numa_hit                90038   89308
      numa_huge_pte_updates   0       0
      numa_interleave         855     865
      numa_local              89796   88955
      numa_other              242     353
      numa_pages_migrated     88      149
      numa_pte_updates        12039   22930
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,049,153  2,195,628
      migrations                11,405     11,179
      faults                    162,309    149,656
      cache-misses              7,203,343  8,117,515
      sched:sched_move_numa     22         49
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  1          5
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        1693    3577
      numa_hint_faults_local  1669    3476
      numa_hit                25177   26142
      numa_huge_pte_updates   0       0
      numa_interleave         194     358
      numa_local              24993   26042
      numa_other              184     100
      numa_pages_migrated     1       5
      numa_pte_updates        1577    3587
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        94,515,937       100,602,296
      migrations                4,203,554        4,135,630
      faults                    832,697          789,256
      cache-misses              226,248,698,331  226,160,621,058
      sched:sched_move_numa     1,730            1,366
      sched:sched_stick_numa    14               16
      sched:sched_swap_numa     432              374
      migrate:mm_migrate_pages  1,398            1,350
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        80079   47857
      numa_hint_faults_local  68620   39768
      numa_hit                241187  240165
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              241186  240165
      numa_other              1       0
      numa_pages_migrated     1347    1224
      numa_pte_updates        80729   48354
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        63,704,961      58,515,496
      migrations                573,404         564,845
      faults                    230,878         245,807
      cache-misses              76,568,222,781  73,603,757,976
      sched:sched_move_numa     509             996
      sched:sched_stick_numa    31              10
      sched:sched_swap_numa     182             193
      migrate:mm_migrate_pages  541             646
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        8501    13422
      numa_hint_faults_local  2960    5619
      numa_hit                35526   36118
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              35526   36116
      numa_other              0       2
      numa_pages_migrated     539     616
      numa_pte_updates        8433    13374
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-5-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      05cbdf4f
    • S
      sched/numa: Reset scan rate whenever task moves across nodes · 3f9672ba
      Srikar Dronamraju 提交于
      Currently task scan rate is reset when NUMA balancer migrates the task
      to a different node. If NUMA balancer initiates a swap, reset is only
      applicable to the task that initiates the swap. Similarly no scan rate
      reset is done if the task is migrated across nodes by traditional load
      balancer.
      
      Instead move the scan reset to the migrate_task_rq. This ensures the
      task moved out of its preferred node, either gets back to its preferred
      node quickly or finds a new preferred node. Doing so, would be fair to
      all tasks migrating across nodes.
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     200668  203370   1.3465
      1     321791  328431   2.06345
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     204848  206070   0.59654
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     188098  188386   0.153112
      1     200351  201566   0.606436
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     58145.9  59157.4  1.73959
      1     103798   105495   1.63491
      
      Some events stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,912,183      13,825,492
      migrations                1,155,931       1,152,509
      faults                    367,139         371,948
      cache-misses              54,240,196,814  55,654,206,041
      sched:sched_move_numa     1,571           1,856
      sched:sched_stick_numa    9               4
      sched:sched_swap_numa     463             428
      migrate:mm_migrate_pages  703             898
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        50155   57146
      numa_hint_faults_local  45264   51612
      numa_hit                239652  238164
      numa_huge_pte_updates   36      16
      numa_interleave         68      63
      numa_local              239576  238085
      numa_other              76      79
      numa_pages_migrated     680     883
      numa_pte_updates        71146   67540
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,156,720       3,288,525
      migrations                30,354          38,652
      faults                    97,261          111,678
      cache-misses              12,400,026,826  12,111,197,376
      sched:sched_move_numa     4               900
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     1               5
      migrate:mm_migrate_pages  20              714
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        272     18572
      numa_hint_faults_local  186     14850
      numa_hit                71362   73197
      numa_huge_pte_updates   0       11
      numa_interleave         23      25
      numa_local              71299   73138
      numa_other              63      59
      numa_pages_migrated     2       712
      numa_pte_updates        0       24021
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,606,824    8,451,543
      migrations                155,352      202,804
      faults                    301,409      310,024
      cache-misses              157,759,224  253,522,507
      sched:sched_move_numa     168          213
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     3            2
      migrate:mm_migrate_pages  125          88
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        4650    11830
      numa_hint_faults_local  3946    11301
      numa_hit                90489   90038
      numa_huge_pte_updates   0       0
      numa_interleave         892     855
      numa_local              90034   89796
      numa_other              455     242
      numa_pages_migrated     124     88
      numa_pte_updates        4818    12039
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,113,167  2,049,153
      migrations                10,533     11,405
      faults                    142,727    162,309
      cache-misses              5,594,192  7,203,343
      sched:sched_move_numa     10         22
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  6          1
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        744     1693
      numa_hint_faults_local  584     1669
      numa_hit                25551   25177
      numa_huge_pte_updates   0       0
      numa_interleave         263     194
      numa_local              25302   24993
      numa_other              249     184
      numa_pages_migrated     6       1
      numa_pte_updates        744     1577
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        101,227,352      94,515,937
      migrations                4,151,829        4,203,554
      faults                    745,233          832,697
      cache-misses              224,669,561,766  226,248,698,331
      sched:sched_move_numa     617              1,730
      sched:sched_stick_numa    2                14
      sched:sched_swap_numa     187              432
      migrate:mm_migrate_pages  316              1,398
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        24195   80079
      numa_hint_faults_local  21639   68620
      numa_hit                238331  241187
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              238331  241186
      numa_other              0       1
      numa_pages_migrated     204     1347
      numa_pte_updates        24561   80729
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        62,738,978      63,704,961
      migrations                562,702         573,404
      faults                    228,465         230,878
      cache-misses              75,778,067,952  76,568,222,781
      sched:sched_move_numa     648             509
      sched:sched_stick_numa    13              31
      sched:sched_swap_numa     137             182
      migrate:mm_migrate_pages  733             541
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        10281   8501
      numa_hint_faults_local  3242    2960
      numa_hit                36338   35526
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              36338   35526
      numa_other              0       0
      numa_pages_migrated     706     539
      numa_pte_updates        10176   8433
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-4-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3f9672ba
    • S
      sched/numa: Pass destination CPU as a parameter to migrate_task_rq · 1327237a
      Srikar Dronamraju 提交于
      This additional parameter (new_cpu) is used later for identifying if
      task migration is across nodes.
      
      No functional change.
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     203353  200668   -1.32036
      1     328205  321791   -1.95427
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     214384  204848   -4.44809
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     188553  188098   -0.241311
      1     196273  200351   2.07772
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     57581.2  58145.9  0.980702
      1     103468   103798   0.318939
      
      Brings out the variance between different specjbb2005 runs.
      
      Some events stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,941,377      13,912,183
      migrations                1,157,323       1,155,931
      faults                    382,175         367,139
      cache-misses              54,993,823,500  54,240,196,814
      sched:sched_move_numa     2,005           1,571
      sched:sched_stick_numa    14              9
      sched:sched_swap_numa     529             463
      migrate:mm_migrate_pages  1,573           703
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        67099   50155
      numa_hint_faults_local  58456   45264
      numa_hit                240416  239652
      numa_huge_pte_updates   18      36
      numa_interleave         65      68
      numa_local              240339  239576
      numa_other              77      76
      numa_pages_migrated     1574    680
      numa_pte_updates        77182   71146
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,176,453       3,156,720
      migrations                30,238          30,354
      faults                    87,869          97,261
      cache-misses              12,544,479,391  12,400,026,826
      sched:sched_move_numa     23              4
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     6               1
      migrate:mm_migrate_pages  10              20
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        236     272
      numa_hint_faults_local  201     186
      numa_hit                72293   71362
      numa_huge_pte_updates   0       0
      numa_interleave         26      23
      numa_local              72233   71299
      numa_other              60      63
      numa_pages_migrated     8       2
      numa_pte_updates        0       0
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,478,820    8,606,824
      migrations                171,323      155,352
      faults                    307,499      301,409
      cache-misses              240,353,599  157,759,224
      sched:sched_move_numa     214          168
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     4            3
      migrate:mm_migrate_pages  89           125
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        5301    4650
      numa_hint_faults_local  4745    3946
      numa_hit                92943   90489
      numa_huge_pte_updates   0       0
      numa_interleave         899     892
      numa_local              92345   90034
      numa_other              598     455
      numa_pages_migrated     88      124
      numa_pte_updates        5505    4818
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before      After
      cs                        2,066,172   2,113,167
      migrations                11,076      10,533
      faults                    149,544     142,727
      cache-misses              10,398,067  5,594,192
      sched:sched_move_numa     43          10
      sched:sched_stick_numa    0           0
      sched:sched_swap_numa     0           0
      migrate:mm_migrate_pages  6           6
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        3552    744
      numa_hint_faults_local  3347    584
      numa_hit                25611   25551
      numa_huge_pte_updates   0       0
      numa_interleave         213     263
      numa_local              25583   25302
      numa_other              28      249
      numa_pages_migrated     6       6
      numa_pte_updates        3535    744
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        99,358,136       101,227,352
      migrations                4,041,607        4,151,829
      faults                    749,653          745,233
      cache-misses              225,562,543,251  224,669,561,766
      sched:sched_move_numa     771              617
      sched:sched_stick_numa    14               2
      sched:sched_swap_numa     204              187
      migrate:mm_migrate_pages  1,180            316
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        27409   24195
      numa_hint_faults_local  20677   21639
      numa_hit                239988  238331
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              239983  238331
      numa_other              5       0
      numa_pages_migrated     1016    204
      numa_pte_updates        27916   24561
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        60,899,307      62,738,978
      migrations                544,668         562,702
      faults                    270,834         228,465
      cache-misses              74,543,455,635  75,778,067,952
      sched:sched_move_numa     735             648
      sched:sched_stick_numa    25              13
      sched:sched_swap_numa     174             137
      migrate:mm_migrate_pages  816             733
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        11059   10281
      numa_hint_faults_local  4733    3242
      numa_hit                41384   36338
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              41383   36338
      numa_other              1       0
      numa_pages_migrated     815     706
      numa_pte_updates        11323   10176
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-3-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1327237a
    • S
      sched/numa: Stop multiple tasks from moving to the CPU at the same time · a4739eca
      Srikar Dronamraju 提交于
      Task migration under NUMA balancing can happen in parallel. More than
      one task might choose to migrate to the same CPU at the same time. This
      can result in:
      
      - During task swap, choosing a task that was not part of the evaluation.
      - During task swap, task which just got moved into its preferred node,
        moving to a completely different node.
      - During task swap, task failing to move to the preferred node, will have
        to wait an extra interval for the next migrate opportunity.
      - During task movement, multiple task movements can cause load imbalance.
      
      This problem is more likely if there are more cores per node or more
      nodes in the system.
      
      Use a per run-queue variable to check if NUMA-balance is active on the
      run-queue.
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     200194  203353   1.57797
      1     311331  328205   5.41995
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     197654  214384   8.46429
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     192605  188553   -2.10379
      1     213402  196273   -8.02664
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     52227.1  57581.2  10.2516
      1     102529   103468   0.915838
      
      There is a regression on power 9 box. If we look at the details,
      that box has a sudden jump in cache-misses with this patch.
      All other parameters seem to be pointing towards NUMA
      consolidation.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,345,784      13,941,377
      migrations                1,127,820       1,157,323
      faults                    374,736         382,175
      cache-misses              55,132,054,603  54,993,823,500
      sched:sched_move_numa     1,923           2,005
      sched:sched_stick_numa    52              14
      sched:sched_swap_numa     595             529
      migrate:mm_migrate_pages  1,932           1,573
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        60605   67099
      numa_hint_faults_local  51804   58456
      numa_hit                239945  240416
      numa_huge_pte_updates   14      18
      numa_interleave         60      65
      numa_local              239865  240339
      numa_other              80      77
      numa_pages_migrated     1931    1574
      numa_pte_updates        67823   77182
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,016,467       3,176,453
      migrations                37,326          30,238
      faults                    115,342         87,869
      cache-misses              11,692,155,554  12,544,479,391
      sched:sched_move_numa     965             23
      sched:sched_stick_numa    8               0
      sched:sched_swap_numa     35              6
      migrate:mm_migrate_pages  1,168           10
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        16286   236
      numa_hint_faults_local  11863   201
      numa_hit                112482  72293
      numa_huge_pte_updates   33      0
      numa_interleave         20      26
      numa_local              112419  72233
      numa_other              63      60
      numa_pages_migrated     1144    8
      numa_pte_updates        32859   0
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,629,724    8,478,820
      migrations                221,052      171,323
      faults                    308,661      307,499
      cache-misses              135,574,913  240,353,599
      sched:sched_move_numa     147          214
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     2            4
      migrate:mm_migrate_pages  64           89
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        11481   5301
      numa_hint_faults_local  10968   4745
      numa_hit                89773   92943
      numa_huge_pte_updates   0       0
      numa_interleave         1116    899
      numa_local              89220   92345
      numa_other              553     598
      numa_pages_migrated     62      88
      numa_pte_updates        11694   5505
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,272,887  2,066,172
      migrations                12,206     11,076
      faults                    163,704    149,544
      cache-misses              4,801,186  10,398,067
      sched:sched_move_numa     44         43
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  17         6
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        2261    3552
      numa_hint_faults_local  1993    3347
      numa_hit                25726   25611
      numa_huge_pte_updates   0       0
      numa_interleave         239     213
      numa_local              25498   25583
      numa_other              228     28
      numa_pages_migrated     17      6
      numa_pte_updates        2266    3535
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        117,980,962      99,358,136
      migrations                3,950,220        4,041,607
      faults                    736,979          749,653
      cache-misses              224,976,072,879  225,562,543,251
      sched:sched_move_numa     504              771
      sched:sched_stick_numa    50               14
      sched:sched_swap_numa     239              204
      migrate:mm_migrate_pages  1,260            1,180
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        18293   27409
      numa_hint_faults_local  11969   20677
      numa_hit                240854  239988
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              240851  239983
      numa_other              3       5
      numa_pages_migrated     1190    1016
      numa_pte_updates        18106   27916
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        61,053,158      60,899,307
      migrations                551,586         544,668
      faults                    244,174         270,834
      cache-misses              74,326,766,973  74,543,455,635
      sched:sched_move_numa     344             735
      sched:sched_stick_numa    24              25
      sched:sched_swap_numa     140             174
      migrate:mm_migrate_pages  568             816
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        6461    11059
      numa_hint_faults_local  2283    4733
      numa_hit                35661   41384
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              35661   41383
      numa_other              0       1
      numa_pages_migrated     568     815
      numa_pte_updates        6518    11323
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-2-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a4739eca
  12. 10 9月, 2018 7 次提交
    • R
      sched/fair: Fix kernel-doc notation warning · 882a78a9
      Randy Dunlap 提交于
      Fix kernel-doc warning for missing 'flags' parameter description:
      
      ../kernel/sched/fair.c:3371: warning: Function parameter or member 'flags' not described in 'attach_entity_load_avg'
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: ea14b57e ("sched/cpufreq: Provide migration hint")
      Link: http://lkml.kernel.org/r/cdda0d42-880d-4229-a9f7-5899c977a063@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      882a78a9
    • V
      sched/fair: Fix load_balance redo for !imbalance · bb3485c8
      Vincent Guittot 提交于
      It can happen that load_balance() finds a busiest group and then a
      busiest rq but the calculated imbalance is in fact 0.
      
      In such situation, detach_tasks() returns immediately and lets the
      flag LBF_ALL_PINNED set. The busiest CPU is then wrongly assumed to
      have pinned tasks and removed from the load balance mask. then, we
      redo a load balance without the busiest CPU. This creates wrong load
      balance situation and generates wrong task migration.
      
      If the calculated imbalance is 0, it's useless to try to find a
      busiest rq as no task will be migrated and we can return immediately.
      
      This situation can happen with heterogeneous system or smp system when
      RT tasks are decreasing the capacity of some CPUs.
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: jhugo@codeaurora.org
      Link: http://lkml.kernel.org/r/1536306664-29827-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      bb3485c8
    • V
      sched/fair: Fix scale_rt_capacity() for SMT · 287cdaac
      Vincent Guittot 提交于
      Since commit:
      
        523e979d ("sched/core: Use PELT for scale_rt_capacity()")
      
      scale_rt_capacity() returns the remaining capacity and not a scale factor
      to apply on cpu_capacity_orig. arch_scale_cpu() is directly called by
      scale_rt_capacity() so we must take the sched_domain argument.
      Reported-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 523e979d ("sched/core: Use PELT for scale_rt_capacity()")
      Link: http://lkml.kernel.org/r/20180904093626.GA23936@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      287cdaac
    • S
      sched/fair: Fix vruntime_normalized() for remote non-migration wakeup · d0cdb3ce
      Steve Muckle 提交于
      When a task which previously ran on a given CPU is remotely queued to
      wake up on that same CPU, there is a period where the task's state is
      TASK_WAKING and its vruntime is not normalized. This is not accounted
      for in vruntime_normalized() which will cause an error in the task's
      vruntime if it is switched from the fair class during this time.
      
      For example if it is boosted to RT priority via rt_mutex_setprio(),
      rq->min_vruntime will not be subtracted from the task's vruntime but
      it will be added again when the task returns to the fair class. The
      task's vruntime will have been erroneously doubled and the effective
      priority of the task will be reduced.
      
      Note this will also lead to inflation of all vruntimes since the doubled
      vruntime value will become the rq's min_vruntime when other tasks leave
      the rq. This leads to repeated doubling of the vruntime and priority
      penalty.
      
      Fix this by recognizing a WAKING task's vruntime as normalized only if
      sched_remote_wakeup is true. This indicates a migration, in which case
      the vruntime would have been normalized in migrate_task_rq_fair().
      
      Based on a similar patch from John Dias <joaodias@google.com>.
      Suggested-by: NPeter Zijlstra <peterz@infradead.org>
      Tested-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: NSteve Muckle <smuckle@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Redpath <Chris.Redpath@arm.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miguel de Dios <migueldedios@google.com>
      Cc: Morten Rasmussen <Morten.Rasmussen@arm.com>
      Cc: Patrick Bellasi <Patrick.Bellasi@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: kernel-team@android.com
      Fixes: b5179ac7 ("sched/fair: Prepare to fix fairness problems on migration")
      Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d0cdb3ce
    • V
      sched/pelt: Fix update_blocked_averages() for RT and DL classes · 12b04875
      Vincent Guittot 提交于
      update_blocked_averages() is called to periodiccally decay the stalled load
      of idle CPUs and to sync all loads before running load balance.
      
      When cfs rq is idle, it trigs a load balance during pick_next_task_fair()
      in order to potentially pull tasks and to use this newly idle CPU. This
      load balance happens whereas prev task from another class has not been put
      and its utilization updated yet. This may lead to wrongly account running
      time as idle time for RT or DL classes.
      
      Test that no RT or DL task is running when updating their utilization in
      update_blocked_averages().
      
      We still update RT and DL utilization instead of simply skipping them to
      make sure that all metrics are synced when used during load balance.
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 371bf427 ("sched/rt: Add rt_rq utilization tracking")
      Fixes: 3727e0e1 ("sched/dl: Add dl_rq utilization tracking")
      Link: http://lkml.kernel.org/r/1535728975-22799-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      12b04875
    • S
      sched/topology: Set correct NUMA topology type · e5e96faf
      Srikar Dronamraju 提交于
      With the following commit:
      
        051f3ca0 ("sched/topology: Introduce NUMA identity node sched domain")
      
      the scheduler introduced a new NUMA level. However this leads to the NUMA topology
      on 2 node systems to not be marked as NUMA_DIRECT anymore.
      
      After this commit, it gets reported as NUMA_BACKPLANE, because
      sched_domains_numa_level is now 2 on 2 node systems.
      
      Fix this by allowing setting systems that have up to 2 NUMA levels as
      NUMA_DIRECT.
      
      While here remove code that assumes that level can be 0.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andre Wild <wild@linux.vnet.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
      Fixes: 051f3ca0 "Introduce NUMA identity node sched domain"
      Link: http://lkml.kernel.org/r/1533920419-17410-1-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e5e96faf
    • J
      sched/debug: Fix potential deadlock when writing to sched_features · e73e8197
      Jiada Wang 提交于
      The following lockdep report can be triggered by writing to /sys/kernel/debug/sched_features:
      
        ======================================================
        WARNING: possible circular locking dependency detected
        4.18.0-rc6-00152-gcd3f77d7-dirty #18 Not tainted
        ------------------------------------------------------
        sh/3358 is trying to acquire lock:
        000000004ad3989d (cpu_hotplug_lock.rw_sem){++++}, at: static_key_enable+0x14/0x30
        but task is already holding lock:
        00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428
        which lock already depends on the new lock.
        the existing dependency chain (in reverse order) is:
        -> #3 (&sb->s_type->i_mutex_key#3){+.+.}:
               lock_acquire+0xb8/0x148
               down_write+0xac/0x140
               start_creating+0x5c/0x168
               debugfs_create_dir+0x18/0x220
               opp_debug_register+0x8c/0x120
               _add_opp_dev+0x104/0x1f8
               dev_pm_opp_get_opp_table+0x174/0x340
               _of_add_opp_table_v2+0x110/0x760
               dev_pm_opp_of_add_table+0x5c/0x240
               dev_pm_opp_of_cpumask_add_table+0x5c/0x100
               cpufreq_init+0x160/0x430
               cpufreq_online+0x1cc/0xe30
               cpufreq_add_dev+0x78/0x198
               subsys_interface_register+0x168/0x270
               cpufreq_register_driver+0x1c8/0x278
               dt_cpufreq_probe+0xdc/0x1b8
               platform_drv_probe+0xb4/0x168
               driver_probe_device+0x318/0x4b0
               __device_attach_driver+0xfc/0x1f0
               bus_for_each_drv+0xf8/0x180
               __device_attach+0x164/0x200
               device_initial_probe+0x10/0x18
               bus_probe_device+0x110/0x178
               device_add+0x6d8/0x908
               platform_device_add+0x138/0x3d8
               platform_device_register_full+0x1cc/0x1f8
               cpufreq_dt_platdev_init+0x174/0x1bc
               do_one_initcall+0xb8/0x310
               kernel_init_freeable+0x4b8/0x56c
               kernel_init+0x10/0x138
               ret_from_fork+0x10/0x18
        -> #2 (opp_table_lock){+.+.}:
               lock_acquire+0xb8/0x148
               __mutex_lock+0x104/0xf50
               mutex_lock_nested+0x1c/0x28
               _of_add_opp_table_v2+0xb4/0x760
               dev_pm_opp_of_add_table+0x5c/0x240
               dev_pm_opp_of_cpumask_add_table+0x5c/0x100
               cpufreq_init+0x160/0x430
               cpufreq_online+0x1cc/0xe30
               cpufreq_add_dev+0x78/0x198
               subsys_interface_register+0x168/0x270
               cpufreq_register_driver+0x1c8/0x278
               dt_cpufreq_probe+0xdc/0x1b8
               platform_drv_probe+0xb4/0x168
               driver_probe_device+0x318/0x4b0
               __device_attach_driver+0xfc/0x1f0
               bus_for_each_drv+0xf8/0x180
               __device_attach+0x164/0x200
               device_initial_probe+0x10/0x18
               bus_probe_device+0x110/0x178
               device_add+0x6d8/0x908
               platform_device_add+0x138/0x3d8
               platform_device_register_full+0x1cc/0x1f8
               cpufreq_dt_platdev_init+0x174/0x1bc
               do_one_initcall+0xb8/0x310
               kernel_init_freeable+0x4b8/0x56c
               kernel_init+0x10/0x138
               ret_from_fork+0x10/0x18
        -> #1 (subsys mutex#6){+.+.}:
               lock_acquire+0xb8/0x148
               __mutex_lock+0x104/0xf50
               mutex_lock_nested+0x1c/0x28
               subsys_interface_register+0xd8/0x270
               cpufreq_register_driver+0x1c8/0x278
               dt_cpufreq_probe+0xdc/0x1b8
               platform_drv_probe+0xb4/0x168
               driver_probe_device+0x318/0x4b0
               __device_attach_driver+0xfc/0x1f0
               bus_for_each_drv+0xf8/0x180
               __device_attach+0x164/0x200
               device_initial_probe+0x10/0x18
               bus_probe_device+0x110/0x178
               device_add+0x6d8/0x908
               platform_device_add+0x138/0x3d8
               platform_device_register_full+0x1cc/0x1f8
               cpufreq_dt_platdev_init+0x174/0x1bc
               do_one_initcall+0xb8/0x310
               kernel_init_freeable+0x4b8/0x56c
               kernel_init+0x10/0x138
               ret_from_fork+0x10/0x18
        -> #0 (cpu_hotplug_lock.rw_sem){++++}:
               __lock_acquire+0x203c/0x21d0
               lock_acquire+0xb8/0x148
               cpus_read_lock+0x58/0x1c8
               static_key_enable+0x14/0x30
               sched_feat_write+0x314/0x428
               full_proxy_write+0xa0/0x138
               __vfs_write+0xd8/0x388
               vfs_write+0xdc/0x318
               ksys_write+0xb4/0x138
               sys_write+0xc/0x18
               __sys_trace_return+0x0/0x4
        other info that might help us debug this:
        Chain exists of:
          cpu_hotplug_lock.rw_sem --> opp_table_lock --> &sb->s_type->i_mutex_key#3
         Possible unsafe locking scenario:
               CPU0                    CPU1
               ----                    ----
          lock(&sb->s_type->i_mutex_key#3);
                                       lock(opp_table_lock);
                                       lock(&sb->s_type->i_mutex_key#3);
          lock(cpu_hotplug_lock.rw_sem);
         *** DEADLOCK ***
        2 locks held by sh/3358:
         #0: 00000000a8c4b363 (sb_writers#10){.+.+}, at: vfs_write+0x238/0x318
         #1: 00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428
        stack backtrace:
        CPU: 5 PID: 3358 Comm: sh Not tainted 4.18.0-rc6-00152-gcd3f77d7-dirty #18
        Hardware name: Renesas H3ULCB Kingfisher board based on r8a7795 ES2.0+ (DT)
        Call trace:
         dump_backtrace+0x0/0x288
         show_stack+0x14/0x20
         dump_stack+0x13c/0x1ac
         print_circular_bug.isra.10+0x270/0x438
         check_prev_add.constprop.16+0x4dc/0xb98
         __lock_acquire+0x203c/0x21d0
         lock_acquire+0xb8/0x148
         cpus_read_lock+0x58/0x1c8
         static_key_enable+0x14/0x30
         sched_feat_write+0x314/0x428
         full_proxy_write+0xa0/0x138
         __vfs_write+0xd8/0x388
         vfs_write+0xdc/0x318
         ksys_write+0xb4/0x138
         sys_write+0xc/0x18
         __sys_trace_return+0x0/0x4
      
      This is because when loading the cpufreq_dt module we first acquire
      cpu_hotplug_lock.rw_sem lock, then in cpufreq_init(), we are taking
      the &sb->s_type->i_mutex_key lock.
      
      But when writing to /sys/kernel/debug/sched_features, the
      cpu_hotplug_lock.rw_sem lock depends on the &sb->s_type->i_mutex_key lock.
      
      To fix this bug, reverse the lock acquisition order when writing to
      sched_features, this way cpu_hotplug_lock.rw_sem no longer depends on
      &sb->s_type->i_mutex_key.
      Tested-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: NJiada Wang <jiada_wang@mentor.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Eugeniu Rosca <erosca@de.adit-jv.com>
      Cc: George G. Davis <george_davis@mentor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180731121222.26195-1-jiada_wang@mentor.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e73e8197
  13. 23 8月, 2018 1 次提交
  14. 20 8月, 2018 1 次提交
  15. 04 8月, 2018 1 次提交
    • E
      signal: Add calculate_sigpending() · 088fe47c
      Eric W. Biederman 提交于
      Add a function calculate_sigpending to test to see if any signals are
      pending for a new task immediately following fork.  Signals have to
      happen either before or after fork.  Today our practice is to push
      all of the signals to before the fork, but that has the downside that
      frequent or periodic signals can make fork take much much longer than
      normal or prevent fork from completing entirely.
      
      So we need move signals that we can after the fork to prevent that.
      
      This updates the code to set TIF_SIGPENDING on a new task if there
      are signals or other activities that have moved so that they appear
      to happen after the fork.
      
      As the code today restarts if it sees any such activity this won't
      immediately have an effect, as there will be no reason for it
      to set TIF_SIGPENDING immediately after the fork.
      
      Adding calculate_sigpending means the code in fork can safely be
      changed to not always restart if a signal is pending.
      
      The new calculate_sigpending function sets sigpending if there
      are pending bits in jobctl, pending signals, the freezer needs
      to freeze the new task or the live kernel patching framework
      need the new thread to take the slow path to userspace.
      
      I have verified that setting TIF_SIGPENDING does make a new process
      take the slow path to userspace before it executes it's first userspace
      instruction.
      
      I have looked at the callers of signal_wake_up and the code paths
      setting TIF_SIGPENDING and I don't see anything else that needs to be
      handled.  The code probably doesn't need to set TIF_SIGPENDING for the
      kernel live patching as it uses a separate thread flag as well.  But
      at this point it seems safer reuse the recalc_sigpending logic and get
      the kernel live patching folks to sort out their story later.
      
      V2: I have moved the test into schedule_tail where siglock can
          be grabbed and recalc_sigpending can be reused directly.
          Further as the last action of setting up a new task this
          guarantees that TIF_SIGPENDING will be properly set in the
          new process.
      
          The helper calculate_sigpending takes the siglock and
          uncontitionally sets TIF_SIGPENDING and let's recalc_sigpending
          clear TIF_SIGPENDING if it is unnecessary.  This allows reusing
          the existing code and keeps maintenance of the conditions simple.
      
          Oleg Nesterov <oleg@redhat.com>  suggested the movement
          and pointed out the need to take siglock if this code
          was going to be called while the new task is discoverable.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      088fe47c
  16. 31 7月, 2018 2 次提交
    • J
      tracing: Centralize preemptirq tracepoints and unify their usage · c3bc8fd6
      Joel Fernandes (Google) 提交于
      This patch detaches the preemptirq tracepoints from the tracers and
      keeps it separate.
      
      Advantages:
      * Lockdep and irqsoff event can now run in parallel since they no longer
      have their own calls.
      
      * This unifies the usecase of adding hooks to an irqsoff and irqson
      event, and a preemptoff and preempton event.
        3 users of the events exist:
        - Lockdep
        - irqsoff and preemptoff tracers
        - irqs and preempt trace events
      
      The unification cleans up several ifdefs and makes the code in preempt
      tracer and irqsoff tracers simpler. It gets rid of all the horrific
      ifdeferry around PROVE_LOCKING and makes configuration of the different
      users of the tracepoints more easy and understandable. It also gets rid
      of the time_* function calls from the lockdep hooks used to call into
      the preemptirq tracer which is not needed anymore. The negative delta in
      lines of code in this patch is quite large too.
      
      In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS
      as a single point for registering probes onto the tracepoints. With
      this,
      the web of config options for preempt/irq toggle tracepoints and its
      users becomes:
      
       PREEMPT_TRACER   PREEMPTIRQ_EVENTS  IRQSOFF_TRACER PROVE_LOCKING
             |                 |     \         |           |
             \    (selects)    /      \        \ (selects) /
            TRACE_PREEMPT_TOGGLE       ----> TRACE_IRQFLAGS
                            \                  /
                             \ (depends on)   /
                           PREEMPTIRQ_TRACEPOINTS
      
      Other than the performance tests mentioned in the previous patch, I also
      ran the locking API test suite. I verified that all tests cases are
      passing.
      
      I also injected issues by not registering lockdep probes onto the
      tracepoints and I see failures to confirm that the probes are indeed
      working.
      
      This series + lockdep probes not registered (just to inject errors):
      [    0.000000]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]          hard-safe-A + irqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]          soft-safe-A + irqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]          hard-safe-A + irqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]          soft-safe-A + irqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]     hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      [    0.000000]     soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      
      With this series + lockdep probes registered, all locking tests pass:
      
      [    0.000000]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]          hard-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]          soft-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]          hard-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]          soft-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]     hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      [    0.000000]     soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      
      Link: http://lkml.kernel.org/r/20180730222423.196630-4-joel@joelfernandes.orgAcked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NNamhyung Kim <namhyung@kernel.org>
      Signed-off-by: NJoel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      c3bc8fd6
    • P
      sched/clock: Disable interrupts when calling generic_sched_clock_init() · bd9f943e
      Pavel Tatashin 提交于
      sched_clock_init() used be called early during boot when interrupts were
      still disabled. After the recent changes to utilize sched clock early the
      sched_clock_init() call happens when interrupts are already enabled, which
      triggers the following warning:
      
      WARNING: CPU: 0 PID: 0 at kernel/time/sched_clock.c:180 sched_clock_register+0x44/0x278
      [<c001a13c>] (warn_slowpath_null) from [<c052367c>] (sched_clock_register+0x44/0x278)
      [<c052367c>] (sched_clock_register) from [<c05238d8>] (generic_sched_clock_init+0x28/0x88)
      [<c05238d8>] (generic_sched_clock_init) from [<c0521a00>] (sched_clock_init+0x54/0x74)
      [<c0521a00>] (sched_clock_init) from [<c0519c18>] (start_kernel+0x310/0x3e4)
      [<c0519c18>] (start_kernel) from [<00000000>] (  (null))
      
      Disable IRQs for the duration of generic_sched_clock_init().
      
      Fixes: 857baa87 ("sched/clock: Enable sched clock early")
      Signed-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reported-by: NGuenter Roeck <linux@roeck-us.net>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: steven.sistare@oracle.com
      Cc: daniel.m.jordan@oracle.com
      Link: https://lkml.kernel.org/r/20180730135252.24599-1-pasha.tatashin@oracle.com
      bd9f943e
  17. 25 7月, 2018 10 次提交
    • S
      sched/numa: Move task_numa_placement() closer to numa_migrate_preferred() · b6a60cf3
      Srikar Dronamraju 提交于
      numa_migrate_preferred() is called periodically or when task preferred
      node changes. Preferred node evaluations happen once per scan sequence.
      
      If the scan completion happens just after the periodic NUMA migration,
      then we try to migrate to the preferred node and the preferred node might
      change, needing another node migration.
      
      Avoid this by checking for scan sequence completion only when checking
      for periodic migration.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25862.6     26158.1     1.14258
      1     74357       72725       -2.19482
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     117019      113992      -2.58
      1     179095      174947      -2.31
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      449.46      770.77      615.22      101.70
      numa01.sh       Sys:      132.72      208.17      170.46       24.96
      numa01.sh      User:    39185.26    60290.89    50066.76     6807.84
      numa02.sh      Real:       60.85       61.79       61.28        0.37
      numa02.sh       Sys:       15.34       24.71       21.08        3.61
      numa02.sh      User:     5204.41     5249.85     5231.21       17.60
      numa03.sh      Real:      785.50      916.97      840.77       44.98
      numa03.sh       Sys:      108.08      133.60      119.43        8.82
      numa03.sh      User:    61422.86    70919.75    64720.87     3310.61
      numa04.sh      Real:      429.57      587.37      480.80       57.40
      numa04.sh       Sys:      240.61      321.97      290.84       33.58
      numa04.sh      User:    34597.65    40498.99    37079.48     2060.72
      numa05.sh      Real:      392.09      431.25      414.65       13.82
      numa05.sh       Sys:      229.41      372.48      297.54       53.14
      numa05.sh      User:    33390.86    34697.49    34222.43      556.42
      
      Testcase       Time:         Min         Max         Avg      StdDev 	%Change
      numa01.sh      Real:      424.63      566.18      498.12       59.26 	 23.50%
      numa01.sh       Sys:      160.19      256.53      208.98       37.02 	 -18.4%
      numa01.sh      User:    37320.00    46225.58    42001.57     3482.45 	 19.20%
      numa02.sh      Real:       60.17       62.47       60.91        0.85 	 0.607%
      numa02.sh       Sys:       15.30       22.82       17.04        2.90 	 23.70%
      numa02.sh      User:     5202.13     5255.51     5219.08       20.14 	 0.232%
      numa03.sh      Real:      823.91      844.89      833.86        8.46 	 0.828%
      numa03.sh       Sys:      130.69      148.29      140.47        6.21 	 -14.9%
      numa03.sh      User:    62519.15    64262.20    63613.38      620.05 	 1.740%
      numa04.sh      Real:      515.30      603.74      548.56       30.93 	 -12.3%
      numa04.sh       Sys:      459.73      525.48      489.18       21.63 	 -40.5%
      numa04.sh      User:    40561.96    44919.18    42047.87     1526.85 	 -11.8%
      numa05.sh      Real:      396.58      454.37      421.13       19.71 	 -1.53%
      numa05.sh       Sys:      208.72      422.02      348.90       73.60 	 -14.7%
      numa05.sh      User:    33124.08    36109.35    34846.47     1089.74 	 -1.79%
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-20-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b6a60cf3
    • S
      sched/numa: Use group_weights to identify if migration degrades locality · f35678b6
      Srikar Dronamraju 提交于
      On NUMA_BACKPLANE and NUMA_GLUELESS_MESH systems, tasks/memory should be
      consolidated to the closest group of nodes. In such a case, relying on
      group_fault metric may not always help to consolidate. There can always
      be a case where a node closer to the preferred node may have lesser
      faults than a node further away from the preferred node. In such a case,
      moving to node with more faults might avoid numa consolidation.
      
      Using group_weight would help to consolidate task/memory around the
      preferred_node.
      
      While here, to be on the conservative side, don't override migrate thread
      degrades locality logic for CPU_NEWLY_IDLE load balancing.
      
      Note: Similar problems exist with should_numa_migrate_memory and will be
      dealt separately.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25645.4     25960       1.22
      1     72142       73550       1.95
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     110199      120071      8.958
      1     176303      176249      -0.03
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      490.04      774.86      596.26       96.46
      numa01.sh       Sys:      151.52      242.88      184.82       31.71
      numa01.sh      User:    41418.41    60844.59    48776.09     6564.27
      numa02.sh      Real:       60.14       62.94       60.98        1.00
      numa02.sh       Sys:       16.11       30.77       21.20        5.28
      numa02.sh      User:     5184.33     5311.09     5228.50       44.24
      numa03.sh      Real:      790.95      856.35      826.41       24.11
      numa03.sh       Sys:      114.93      118.85      117.05        1.63
      numa03.sh      User:    60990.99    64959.28    63470.43     1415.44
      numa04.sh      Real:      434.37      597.92      504.87       59.70
      numa04.sh       Sys:      237.63      397.40      289.74       55.98
      numa04.sh      User:    34854.87    41121.83    38572.52     2615.84
      numa05.sh      Real:      386.77      448.90      417.22       22.79
      numa05.sh       Sys:      149.23      379.95      303.04       79.55
      numa05.sh      User:    32951.76    35959.58    34562.18     1034.05
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      493.19      672.88      597.51       59.38 	 -0.20%
      numa01.sh       Sys:      150.09      245.48      207.76       34.26 	 -11.0%
      numa01.sh      User:    41928.51    53779.17    48747.06     3901.39 	 0.059%
      numa02.sh      Real:       60.63       62.87       61.22        0.83 	 -0.39%
      numa02.sh       Sys:       16.64       27.97       20.25        4.06 	 4.691%
      numa02.sh      User:     5222.92     5309.60     5254.03       29.98 	 -0.48%
      numa03.sh      Real:      821.52      902.15      863.60       32.41 	 -4.30%
      numa03.sh       Sys:      112.04      130.66      118.35        7.08 	 -1.09%
      numa03.sh      User:    62245.16    69165.14    66443.04     2450.32 	 -4.47%
      numa04.sh      Real:      414.53      519.57      476.25       37.00 	 6.009%
      numa04.sh       Sys:      181.84      335.67      280.41       54.07 	 3.327%
      numa04.sh      User:    33924.50    39115.39    37343.78     1934.26 	 3.290%
      numa05.sh      Real:      408.30      441.45      417.90       12.05 	 -0.16%
      numa05.sh       Sys:      233.41      381.60      295.58       57.37 	 2.523%
      numa05.sh      User:    33301.31    35972.50    34335.19      938.94 	 0.661%
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-16-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f35678b6
    • S
      sched/numa: Update the scan period without holding the numa_group lock · 30619c89
      Srikar Dronamraju 提交于
      The metrics for updating scan periods are local or task specific.
      Currently this update happens under the numa_group lock, which seems
      unnecessary. Hence move this update outside the lock.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25355.9     25645.4     1.141
      1     72812       72142       -0.92
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-15-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      30619c89
    • S
      sched/numa: Remove numa_has_capacity() · 2d4056fa
      Srikar Dronamraju 提交于
      task_numa_find_cpu() helps to find the CPU to swap/move the task to.
      It's guarded by numa_has_capacity(). However node not having capacity
      shouldn't deter a task swapping if it helps NUMA placement.
      
      Further load_too_imbalanced(), which evaluates possibilities of move/swap,
      provides similar checks as numa_has_capacity.
      
      Hence remove numa_has_capacity() to enhance possibilities of task
      swapping even if load is imbalanced.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25657.9     25804.1     0.569
      1     74435       73413       -1.37
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NRik van Riel <riel@surriel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-13-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2d4056fa
    • S
      sched/numa: Modify migrate_swap() to accept additional parameters · 0ad4e3df
      Srikar Dronamraju 提交于
      There are checks in migrate_swap_stop() that check if the task/CPU
      combination is as per migrate_swap_arg before migrating.
      
      However atleast one of the two tasks to be swapped by migrate_swap() could
      have migrated to a completely different CPU before updating the
      migrate_swap_arg. The new CPU where the task is currently running could
      be a different node too. If the task has migrated, numa balancer might
      end up placing a task in a wrong node.  Instead of achieving node
      consolidation, it may end up spreading the load across nodes.
      
      To avoid that pass the CPUs as additional parameters.
      
      While here, place migrate_swap under CONFIG_NUMA_BALANCING.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25377.3     25226.6     -0.59
      1     72287       73326       1.437
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-10-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0ad4e3df
    • S
      sched/numa: Remove unused task_capacity from 'struct numa_stats' · 10864a9e
      Srikar Dronamraju 提交于
      The task_capacity field in 'struct numa_stats' is redundant.
      Also move nr_running for better packing within the struct.
      
      No functional changes.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25308.6     25377.3     0.271
      1     72964       72287       -0.92
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NRik van Riel <riel@surriel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-9-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      10864a9e
    • S
      sched/numa: Skip nodes that are at 'hoplimit' · 0ee7e74d
      Srikar Dronamraju 提交于
      When comparing two nodes at a distance of 'hoplimit', we should consider
      nodes only up to 'hoplimit'. Currently we also consider nodes at 'oplimit'
      distance too. Hence two nodes at a distance of 'hoplimit' will have same
      groupweight. Fix this by skipping nodes at hoplimit.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25375.3     25308.6     -0.26
      1     72617       72964       0.477
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     113372      108750      -4.07684
      1     177403      183115      3.21979
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      478.45      565.90      515.11       30.87
      numa01.sh       Sys:      207.79      271.04      232.94       21.33
      numa01.sh      User:    39763.93    47303.12    43210.73     2644.86
      numa02.sh      Real:       60.00       61.46       60.78        0.49
      numa02.sh       Sys:       15.71       25.31       20.69        3.42
      numa02.sh      User:     5175.92     5265.86     5235.97       32.82
      numa03.sh      Real:      776.42      834.85      806.01       23.22
      numa03.sh       Sys:      114.43      128.75      121.65        5.49
      numa03.sh      User:    60773.93    64855.25    62616.91     1576.39
      numa04.sh      Real:      456.93      511.95      482.91       20.88
      numa04.sh       Sys:      178.09      460.89      356.86       94.58
      numa04.sh      User:    36312.09    42553.24    39623.21     2247.96
      numa05.sh      Real:      393.98      493.48      436.61       35.59
      numa05.sh       Sys:      164.49      329.15      265.87       61.78
      numa05.sh      User:    33182.65    36654.53    35074.51     1187.71
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      414.64      819.20      556.08      147.70 	 -7.36%
      numa01.sh       Sys:       77.52      205.04      139.40       52.05 	 67.10%
      numa01.sh      User:    37043.24    61757.88    45517.48     9290.38 	 -5.06%
      numa02.sh      Real:       60.80       63.32       61.63        0.88 	 -1.37%
      numa02.sh       Sys:       17.35       39.37       25.71        7.33 	 -19.5%
      numa02.sh      User:     5213.79     5374.73     5268.90       55.09 	 -0.62%
      numa03.sh      Real:      780.09      948.64      831.43       63.02 	 -3.05%
      numa03.sh       Sys:      104.96      136.92      116.31       11.34 	 4.591%
      numa03.sh      User:    60465.42    73339.78    64368.03     4700.14 	 -2.72%
      numa04.sh      Real:      412.60      681.92      521.29       96.64 	 -7.36%
      numa04.sh       Sys:      210.32      314.10      251.77       37.71 	 41.74%
      numa04.sh      User:    34026.38    45581.20    38534.49     4198.53 	 2.825%
      numa05.sh      Real:      394.79      439.63      411.35       16.87 	 6.140%
      numa05.sh       Sys:      238.32      330.09      292.31       38.32 	 -9.04%
      numa05.sh      User:    33456.45    34876.07    34138.62      609.45 	 2.741%
      
      While there is a regression with this change, this change is needed from a
      correctness perspective. Also it helps consolidation as seen from perf bench
      output.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-8-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0ee7e74d
    • S
      sched/debug: Reverse the order of printing faults · 67d9f6c2
      Srikar Dronamraju 提交于
      Fix the order in which the private and shared numa faults are getting
      printed.
      
      No functional changes.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25215.7     25375.3     0.63
      1     72107       72617       0.70
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-7-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      67d9f6c2
    • S
      sched/numa: Use task faults only if numa_group is not yet set up · f03bb676
      Srikar Dronamraju 提交于
      When numa_group faults are available, task_numa_placement only uses
      numa_group faults to evaluate preferred node. However it still accounts
      task faults and even evaluates the preferred node just based on task
      faults just to discard it in favour of preferred node chosen on the
      basis of numa_group.
      
      Instead use task faults only if numa_group is not set.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25549.6     25215.7     -1.30
      1     73190       72107       -1.47
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     113437      113372      -0.05
      1     196130      177403      -9.54
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      506.35      794.46      599.06      104.26
      numa01.sh       Sys:      150.37      223.56      195.99       24.94
      numa01.sh      User:    43450.69    61752.04    49281.50     6635.33
      numa02.sh      Real:       60.33       62.40       61.31        0.90
      numa02.sh       Sys:       18.12       31.66       24.28        5.89
      numa02.sh      User:     5203.91     5325.32     5260.29       49.98
      numa03.sh      Real:      696.47      853.62      745.80       57.28
      numa03.sh       Sys:       85.68      123.71       97.89       13.48
      numa03.sh      User:    55978.45    66418.63    59254.94     3737.97
      numa04.sh      Real:      444.05      514.83      497.06       26.85
      numa04.sh       Sys:      230.39      375.79      316.23       48.58
      numa04.sh      User:    35403.12    41004.10    39720.80     2163.08
      numa05.sh      Real:      423.09      460.41      439.57       13.92
      numa05.sh       Sys:      287.38      480.15      369.37       68.52
      numa05.sh      User:    34732.12    38016.80    36255.85     1070.51
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      478.45      565.90      515.11       30.87 	 16.29%
      numa01.sh       Sys:      207.79      271.04      232.94       21.33 	 -15.8%
      numa01.sh      User:    39763.93    47303.12    43210.73     2644.86 	 14.04%
      numa02.sh      Real:       60.00       61.46       60.78        0.49 	 0.871%
      numa02.sh       Sys:       15.71       25.31       20.69        3.42 	 17.35%
      numa02.sh      User:     5175.92     5265.86     5235.97       32.82 	 0.464%
      numa03.sh      Real:      776.42      834.85      806.01       23.22 	 -7.47%
      numa03.sh       Sys:      114.43      128.75      121.65        5.49 	 -19.5%
      numa03.sh      User:    60773.93    64855.25    62616.91     1576.39 	 -5.36%
      numa04.sh      Real:      456.93      511.95      482.91       20.88 	 2.930%
      numa04.sh       Sys:      178.09      460.89      356.86       94.58 	 -11.3%
      numa04.sh      User:    36312.09    42553.24    39623.21     2247.96 	 0.246%
      numa05.sh      Real:      393.98      493.48      436.61       35.59 	 0.677%
      numa05.sh       Sys:      164.49      329.15      265.87       61.78 	 38.92%
      numa05.sh      User:    33182.65    36654.53    35074.51     1187.71 	 3.368%
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-6-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f03bb676
    • S
      sched/numa: Set preferred_node based on best_cpu · 8cd45eee
      Srikar Dronamraju 提交于
      Currently preferred node is set to dst_nid which is the last node in the
      iteration whose group weight or task weight is greater than the current
      node. However it doesn't guarantee that dst_nid has the numa capacity
      to move. It also doesn't guarantee that dst_nid has the best_cpu which
      is the CPU/node ideal for node migration.
      
      Lets consider faults on a 4 node system with group weight numbers
      in different nodes being in 0 < 1 < 2 < 3 proportion. Consider the task
      is running on 3 and 0 is its preferred node but its capacity is full.
      Consider nodes 1, 2 and 3 have capacity. Then the task should be
      migrated to node 1. Currently the task gets moved to node 2. env.dst_nid
      points to the last node whose faults were greater than current node.
      
      Modify to set the preferred node based of best_cpu. Earlier setting
      preferred node was skipped if nr_active_nodes is 1. This could result in
      the task being moved out of the preferred node to a random node during
      regular load balancing.
      
      Also while modifying task_numa_migrate(), use sched_setnuma to set
      preferred node. This ensures out numa accounting is correct.
      
      Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      16    25122.9     25549.6     1.698
      1     73850       73190       -0.89
      
      Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
      JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
      8     105930      113437      7.08676
      1     178624      196130      9.80047
      
      (numbers from v1 based on v4.17-rc5)
      Testcase       Time:         Min         Max         Avg      StdDev
      numa01.sh      Real:      435.78      653.81      534.58       83.20
      numa01.sh       Sys:      121.93      187.18      145.90       23.47
      numa01.sh      User:    37082.81    51402.80    43647.60     5409.75
      numa02.sh      Real:       60.64       61.63       61.19        0.40
      numa02.sh       Sys:       14.72       25.68       19.06        4.03
      numa02.sh      User:     5210.95     5266.69     5233.30       20.82
      numa03.sh      Real:      746.51      808.24      780.36       23.88
      numa03.sh       Sys:       97.26      108.48      105.07        4.28
      numa03.sh      User:    58956.30    61397.05    60162.95     1050.82
      numa04.sh      Real:      465.97      519.27      484.81       19.62
      numa04.sh       Sys:      304.43      359.08      334.68       20.64
      numa04.sh      User:    37544.16    41186.15    39262.44     1314.91
      numa05.sh      Real:      411.57      457.20      433.29       16.58
      numa05.sh       Sys:      230.05      435.48      339.95       67.58
      numa05.sh      User:    33325.54    36896.31    35637.84     1222.64
      
      Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
      numa01.sh      Real:      506.35      794.46      599.06      104.26 	 -10.76%
      numa01.sh       Sys:      150.37      223.56      195.99       24.94 	 -25.55%
      numa01.sh      User:    43450.69    61752.04    49281.50     6635.33 	 -11.43%
      numa02.sh      Real:       60.33       62.40       61.31        0.90 	 -0.195%
      numa02.sh       Sys:       18.12       31.66       24.28        5.89 	 -21.49%
      numa02.sh      User:     5203.91     5325.32     5260.29       49.98 	 -0.513%
      numa03.sh      Real:      696.47      853.62      745.80       57.28 	 4.6339%
      numa03.sh       Sys:       85.68      123.71       97.89       13.48 	 7.3347%
      numa03.sh      User:    55978.45    66418.63    59254.94     3737.97 	 1.5323%
      numa04.sh      Real:      444.05      514.83      497.06       26.85 	 -2.464%
      numa04.sh       Sys:      230.39      375.79      316.23       48.58 	 5.8343%
      numa04.sh      User:    35403.12    41004.10    39720.80     2163.08 	 -1.153%
      numa05.sh      Real:      423.09      460.41      439.57       13.92 	 -1.428%
      numa05.sh       Sys:      287.38      480.15      369.37       68.52 	 -7.964%
      numa05.sh      User:    34732.12    38016.80    36255.85     1070.51 	 -1.704%
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1529514181-9842-5-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8cd45eee