1. 30 Sep, 2017 15 commits
    • sched/fair: Align PELT windows between cfs_rq and its se · f207934f
      Committed by Peter Zijlstra
      The PELT _sum values are a saw-tooth function, dropping on the decay
      edge and then growing back up again during the window.
      
      When these window-edges are not aligned between cfs_rq and se, we can
      have the situation where, for example, on dequeue, the se decays
      first.
      
      Its _sum values will be small(er), while the cfs_rq _sum values will
      still be on their way up. Because of this, the subtraction:
      cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This
      will then, once the cfs_rq reaches an edge, translate into its _avg
      value jumping up.
      
      This is especially visible with the runnable_load bits, since they get
      added/subtracted a lot.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Implement synchronous PELT detach on load-balance migrate · 144d8487
      Committed by Peter Zijlstra
      Vincent wondered why his self migrating task had a roughly 50% dip in
      load_avg when landing on the new CPU. This is because we unconditionally
      take the asynchronous detach_entity route, which can lead to the
      attach on the new CPU still seeing the old CPU's contribution to
      tg->load_avg, effectively halving the new CPU's shares.
      
      While in general this is something we have to live with, there is the
      special case of runnable migration where we can do better.
      Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Propagate an effective runnable_load_avg · 1ea6c46a
      Committed by Peter Zijlstra
      The load balancer uses runnable_load_avg as load indicator. For
      !cgroup this is:
      
        runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq
      
      That is, a direct sum of all runnable tasks on that runqueue. As
      opposed to load_avg, which is a sum of all tasks on the runqueue,
      which includes a blocked component.
      
      However, in the cgroup case, this comes apart since the group entities
      are always runnable, even if most of their constituent entities are
      blocked.
      
      Therefore introduce a runnable_weight which for task entities is the
      same as the regular weight, but for group entities is a fraction of
      the entity weight and represents the runnable part of the group
      runqueue.
      
      Then propagate this load through the PELT hierarchy to arrive at an
      effective runnable load average -- which we should not confuse with
      the canonical runnable load average.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
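The difference between load_avg and runnable_load_avg in the !cgroup case of commit 1ea6c46a can be sketched in a few lines of C. This is a toy model, not kernel code: `struct toy_se` and both helper names are hypothetical, and a flat array stands in for the runqueue.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a scheduling entity. */
struct toy_se {
    unsigned long load_avg; /* PELT load contribution of this entity */
    bool on_rq;             /* true while the entity is runnable */
};

/* load_avg of the runqueue: every entity counts, blocked ones included. */
unsigned long toy_load_avg(const struct toy_se *se, size_t n)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < n; i++)
        sum += se[i].load_avg;
    return sum;
}

/* runnable_load_avg: only entities with on_rq set contribute. */
unsigned long toy_runnable_load_avg(const struct toy_se *se, size_t n)
{
    unsigned long sum = 0;

    for (size_t i = 0; i < n; i++)
        if (se[i].on_rq)
            sum += se[i].load_avg;
    return sum;
}
```

With one blocked entity in the array, the two sums differ by exactly that entity's load_avg. A group entity that is always on_rq would hide this difference, which is the gap the runnable_weight in the commit is meant to close.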
    • sched/fair: Rewrite PELT migration propagation · 0e2d2aaa
      Committed by Peter Zijlstra
      When an entity migrates in (or out) of a runqueue, we need to add (or
      remove) its contribution from the entire PELT hierarchy, because even
      non-runnable entities are included in the load average sums.
      
      In order to do this we have some propagation logic that updates the
      PELT tree, however the way it 'propagates' the runnable (or load)
      change is (more or less):
      
                           tg->weight * grq->avg.load_avg
        ge->avg.load_avg = ------------------------------
                                     tg->load_avg
      
      But that is the expression for ge->weight, and per the definition of
      load_avg:
      
        ge->avg.load_avg := ge->weight * ge->avg.runnable_avg
      
      That destroys the runnable_avg (by setting it to 1) we wanted to
      propagate.
      
      Instead directly propagate runnable_sum.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
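The argument in commit 0e2d2aaa can be written out as a short derivation. The symbols are shorthand introduced here, not kernel names: w_ge is ge->weight, w_tg is tg->weight, L_grq is grq->avg.load_avg, L_tg is tg->load_avg, L_ge is ge->avg.load_avg and R_ge is ge->avg.runnable_avg. The block assumes the amsmath package.

```latex
\begin{align*}
w_{ge} &= \frac{w_{tg}\, L_{grq}}{L_{tg}}
  && \text{(expression for \texttt{ge->weight})} \\
L_{ge} &:= w_{ge}\, R_{ge}
  && \text{(definition of \texttt{load\_avg})} \\
L_{ge}^{\text{old}} &= \frac{w_{tg}\, L_{grq}}{L_{tg}} = w_{ge}
  \;\Longrightarrow\;
  R_{ge} = \frac{L_{ge}^{\text{old}}}{w_{ge}} = 1
  && \text{(old propagation)}
\end{align*}
```

The old propagation therefore always yields a runnable_avg of 1, whatever the group's real runnable fraction is.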
    • sched/fair: Rewrite cfs_rq->removed_*avg · 2a2f5d4e
      Committed by Peter Zijlstra
      Since on wakeup migration we don't hold the rq->lock for the old CPU
      we cannot update its state. Instead we add the removed 'load' to an
      atomic variable and have the next update on that CPU collect and
      process it.
      
      Currently we have 2 atomic variables; which already have the issue
      that they can be read out-of-sync. Also, two atomic ops on a single
      cacheline is already more expensive than an uncontended lock.
      
      Since we want to add more, convert the thing over to an explicit
      cacheline with a lock in.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
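A user-space sketch of the idea in commit 2a2f5d4e: several "removed" counters share one cache line and one lock, so a reader always sees a consistent set. Everything here is an assumption made for illustration: the `toy_` names, the 64-byte line size, and the C11 `atomic_flag` spin lock standing in for the kernel's raw spinlock.

```c
#include <stdatomic.h>

#define TOY_CACHELINE 64 /* assumed cache line size */

/* All removed-load state lives together, guarded by one lock. */
struct toy_removed {
    _Alignas(TOY_CACHELINE) atomic_flag lock;
    int nr;                 /* number of pending removals */
    unsigned long load_avg; /* accumulated removed load */
    unsigned long util_avg; /* accumulated removed utilization */
};

static void toy_lock(struct toy_removed *r)
{
    while (atomic_flag_test_and_set_explicit(&r->lock, memory_order_acquire))
        ; /* spin until the flag was previously clear */
}

static void toy_unlock(struct toy_removed *r)
{
    atomic_flag_clear_explicit(&r->lock, memory_order_release);
}

void toy_removed_init(struct toy_removed *r)
{
    atomic_flag_clear(&r->lock);
    r->nr = 0;
    r->load_avg = 0;
    r->util_avg = 0;
}

/* Called on behalf of a task that left this CPU without its rq lock. */
void toy_removed_add(struct toy_removed *r, unsigned long load,
                     unsigned long util)
{
    toy_lock(r);
    r->nr++;
    r->load_avg += load;
    r->util_avg += util;
    toy_unlock(r);
}

/* Called by the owning CPU on its next update: snapshot and reset. */
int toy_removed_collect(struct toy_removed *r, unsigned long *load,
                        unsigned long *util)
{
    int nr;

    toy_lock(r);
    nr = r->nr;
    *load = r->load_avg;
    *util = r->util_avg;
    r->nr = 0;
    r->load_avg = 0;
    r->util_avg = 0;
    toy_unlock(r);
    return nr;
}
```

Because both values are read under the same lock, the collector cannot observe a load update without the matching utilization update, which two independent atomics cannot guarantee.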
    • sched/fair: Use reweight_entity() for set_user_nice() · 9059393e
      Committed by Vincent Guittot
      Now that we directly change load_avg and propagate that change into
      the sums, sys_nice() and co should do the same, otherwise it's possible
      to confuse load accounting when we migrate near the weight change.
      Fixes-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      [ Added changelog, fixed the call condition. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170517095045.GA8420@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: More accurate reweight_entity() · 840c5abc
      Committed by Peter Zijlstra
      When a (group) entity changes its weight we should instantly change
      its load_avg and propagate that change into the sums it is part of,
      because we use these values to predict future behaviour and are not
      interested in their historical value.
      
      Without this change, the change in load would need to propagate
      through the average, by which time it could again have changed etc..
      always chasing itself.
      
      With this change, the cfs_rq load_avg sum will more accurately reflect
      the current runnable and expected return of blocked load.
      Reported-by: Paul Turner <pjt@google.com>
      [josef: compile fix !SMP || !FAIR_GROUP]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
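A minimal sketch of the "instant" reweight described in commit 840c5abc, assuming a weight-free load_sum and a fixed divider. The `toy_` types, the function name and `TOY_DIVIDER` are hypothetical; the kernel computes its divider differently and also handles runnable load, which is left out here.

```c
/* Illustrative fixed divider; the kernel derives its own per update. */
#define TOY_DIVIDER 47742ULL

struct toy_se {
    unsigned long weight;
    unsigned long long load_sum; /* weight-free running sum */
    unsigned long load_avg;      /* weight * load_sum / divider */
};

struct toy_cfs_rq {
    unsigned long load_avg;      /* sum of its entities' load_avg */
};

/*
 * Change the weight of se and immediately reflect the new load_avg in
 * the runqueue sum, rather than waiting for the average to catch up.
 */
void toy_reweight_entity(struct toy_cfs_rq *rq, struct toy_se *se,
                         unsigned long weight)
{
    rq->load_avg -= se->load_avg; /* take the old contribution out */
    se->weight = weight;
    se->load_avg = (unsigned long)(se->weight * se->load_sum / TOY_DIVIDER);
    rq->load_avg += se->load_avg; /* put the new contribution in */
}
```

The runqueue sum changes in the same step as the weight, so a later migration or weight change starts from a value that already matches the current weight.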
    • sched/fair: Introduce {en,de}queue_load_avg() · 8d5b9025
      Committed by Peter Zijlstra
      Analogous to the existing {en,de}queue_runnable_load_avg() add helpers
      for {en,de}queue_load_avg(). More users will follow.

      Includes some code movement to avoid forward declarations.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Rename {en,de}queue_entity_load_avg() · b5b3e35f
      Committed by Peter Zijlstra
      Since they're now purely about runnable_load, rename them.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Move enqueue migrate handling · b382a531
      Committed by Peter Zijlstra
      Move the entity migrate handling from enqueue_entity_load_avg() to
      update_load_avg(). This has two benefits:
      
       - {en,de}queue_entity_load_avg() will become purely about managing
         runnable_load
      
       - we can avoid a double update_tg_load_avg() and reduce pressure on
         the global tg->shares cacheline
      
      The reason we do this is so that we can change update_cfs_shares() to
      change both weight and (future) runnable_weight. For this to work we
      need to have the cfs_rq averages up-to-date (which means having done
      the attach), but we need the cfs_rq->avg.runnable_avg to not yet
      include the se's contribution (since se->on_rq == 0).
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Change update_load_avg() arguments · 88c0616e
      Committed by Peter Zijlstra
      Most call sites of update_load_avg() already have cfs_rq_of(se)
      available, pass it down instead of recomputing it.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Remove se->load.weight from se->avg.load_sum · c7b50216
      Committed by Peter Zijlstra
      Remove the load from the load_sum for sched_entities, basically
      turning load_sum into runnable_sum.  This prepares for better
      reweighting of group entities.
      
      Since we now have different rules for computing load_avg, split
      ___update_load_avg() into two parts, ___update_load_sum() and
      ___update_load_avg().
      
      So for se:

        ___update_load_sum(.weight = 1)
        ___update_load_avg(.weight = se->load.weight)

      and for cfs_rq:

        ___update_load_sum(.weight = cfs_rq->load.weight)
        ___update_load_avg(.weight = 1)
      
      Since the primary consumable is load_avg, most things will not be
      affected. Only those few sites that initialize/modify load_sum need
      attention.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
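The two-step split in commit c7b50216 can be illustrated with a toy model in which the weight is applied either when accumulating the sum (cfs_rq) or when deriving the average (se). The `toy_` names and the fixed `TOY_DIVIDER` are assumptions for illustration, and decay of older contributions is deliberately left out.

```c
/* Illustrative fixed divider; the kernel derives its own per update. */
#define TOY_DIVIDER 47742ULL

struct toy_avg {
    unsigned long long load_sum;
    unsigned long load_avg;
};

/* Step 1: accumulate a contribution, scaled by the given weight. */
void toy_update_load_sum(struct toy_avg *sa, unsigned long weight,
                         unsigned long contrib)
{
    sa->load_sum += (unsigned long long)weight * contrib;
}

/* Step 2: derive the average from the sum, scaled by the given weight. */
void toy_update_load_avg(struct toy_avg *sa, unsigned long weight)
{
    sa->load_avg = (unsigned long)(weight * sa->load_sum / TOY_DIVIDER);
}
```

For an entity, step 1 runs with weight 1 and step 2 with se->load.weight; for a cfs_rq it is the other way round. In this toy model both orders give the same load_avg for the same contribution, which is why most consumers of load_avg are unaffected, while the entity's load_sum stays weight-free and is easy to rescale on reweight.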
    • sched/fair: Cure calc_cfs_shares() vs. reweight_entity() · 3d4b60d3
      Committed by Peter Zijlstra
      Vincent reported that when running in a cgroup, his root
      cfs_rq->avg.load_avg dropped to 0 on task idle.
      
      This is because reweight_entity() will now immediately propagate the
      weight change of the group entity to its cfs_rq, and as it happens,
      our approximation (5) for calc_cfs_shares() results in 0 when the group
      is idle.
      
      Avoid this by using the correct (3) as a lower bound on (5). This way
      the empty cgroup will slowly decay instead of instantly drop to 0.
      Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Add comment to calc_cfs_shares() · cef27403
      Committed by Peter Zijlstra
      Explain the magic equation in calc_cfs_shares() a bit better.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Clean up calc_cfs_shares() · 7c80cfc9
      Committed by Peter Zijlstra
      For consistency's sake, we should have only a single reading of tg->shares.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 29 Sep, 2017 2 commits
  3. 15 Sep, 2017 2 commits
    • sched/wait: Introduce wakeup bookmark in wake_up_page_bit · 11a19c7b
      Committed by Tim Chen
      Now that we have added breaks in the wait queue scan and allow a bookmark
      on the scan position, we put this logic in the wake_up_page_bit function.

      We can have a very long page wait list on large systems where multiple
      pages share the same wait list. We break the wake up walk here to give
      other CPUs a chance to access the list, and to avoid disabling interrupts
      for too long while traversing the list.  This reduces the interrupt and
      rescheduling latency, and excessive page wait queue lock hold time.
      
      [ v2: Remove bookmark_wake_function ]
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sched/wait: Break up long wake list walk · 2554db91
      Committed by Tim Chen
      We encountered workloads that have very long wake up lists on large
      systems. A waker takes a long time to traverse the entire wake list and
      execute all the wake functions.

      We saw page wait lists that are up to 3700+ entries long in tests of
      large 4 and 8 socket systems. It took 0.8 sec to traverse such a list
      during wake up. Any other CPU that contends for the list spin lock will
      spin for a long time. It is a result of the NUMA balancing migration of
      hot pages that are shared by many threads.

      Multiple CPUs waking are queued up behind the lock, and the last one
      queued has to wait until all CPUs did all the wakeups.
      
      The page wait list is traversed with interrupt disabled, which caused
      various problems. This was the original cause that triggered the NMI
      watch dog timer in: https://patchwork.kernel.org/patch/9800303/ . Only
      extending the NMI watch dog timer there helped.
      
      This patch bookmarks the waker's scan position in the wake list and breaks
      the wake up walk, to allow access to the list before the waker resumes
      its walk down the rest of the wait list. It lowers the interrupt and
      rescheduling latency.
      
      This patch also provides a performance boost when combined with the next
      patch to break up page wakeup list walk. We saw 22% improvement in the
      will-it-scale file pread2 test on a Xeon Phi system running 256 threads.
      
      [ v2: Merged in Linus' changes to remove the bookmark_wake_function, and
        simplify access to flags. ]
      Reported-by: Kan Liang <kan.liang@intel.com>
      Tested-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
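The bookmark-and-break walk from commit 2554db91 can be sketched with an array standing in for the wait list and an index standing in for the bookmark entry. All names are hypothetical, the batch size of 64 is an assumption for illustration, and the comments mark where the real code would take and drop the wait queue lock.

```c
#include <stddef.h>

#define TOY_WALK_BREAK_CNT 64 /* assumed batch size */

struct toy_waiter {
    int woken;
};

/*
 * Wake up to TOY_WALK_BREAK_CNT waiters starting at *bookmark and
 * advance the bookmark. Returns nonzero if waiters remain.
 */
int toy_wake_batch(struct toy_waiter *q, size_t n, size_t *bookmark)
{
    size_t done = 0;

    while (*bookmark < n && done < TOY_WALK_BREAK_CNT) {
        q[*bookmark].woken = 1;
        (*bookmark)++;
        done++;
    }
    return *bookmark < n;
}

/* Walk the whole list in batches; returns the number of batches used. */
size_t toy_wake_all(struct toy_waiter *q, size_t n)
{
    size_t bookmark = 0;
    size_t batches = 0;
    int more;

    do {
        /* lock taken, interrupts disabled (not modelled here) */
        more = toy_wake_batch(q, n, &bookmark);
        /* lock dropped: other CPUs can get at the list between batches */
        batches++;
    } while (more);
    return batches;
}
```

A list of 200 waiters is handled in four batches instead of one long critical section, so the longest lock hold time is bounded by the batch size rather than by the list length.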
  4. 12 Sep, 2017 4 commits
  5. 11 Sep, 2017 1 commit
  6. 09 Sep, 2017 3 commits
  7. 07 Sep, 2017 2 commits
  8. 29 Aug, 2017 1 commit
    • smp: Avoid using two cache lines for struct call_single_data · 966a9671
      Committed by Ying Huang
      struct call_single_data is used in IPIs to transfer information between
      CPUs.  Its size is bigger than sizeof(unsigned long) and less than
      cache line size.  Currently it is not allocated with any explicit alignment
      requirements.  This makes it possible for allocated call_single_data to
      cross two cache lines, which results in double the number of the cache lines
      that need to be transferred among CPUs.
      
      This can be fixed by requiring call_single_data to be aligned with the
      size of call_single_data. Currently the size of call_single_data is a
      power of 2.  If we add new fields to call_single_data, we may need to
      add padding to make sure the size of the new definition is a power of 2
      as well.

      Fortunately, this is enforced by GCC, which will report bad sizes.

      To set the alignment requirement of call_single_data to the size of
      call_single_data, a struct definition and a typedef are used.

      To test the effect of the patch, I used the vm-scalability multiple
      thread swap test case (swap-w-seq-mt).  The test will create multiple
      threads and each thread will eat memory until all RAM and part of swap
      is used, so that a huge number of IPIs are triggered when unmapping
      memory.  In the test, the throughput of memory writing improves ~5%
      compared with misaligned call_single_data, because of faster IPIs.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Huang, Ying <ying.huang@intel.com>
      [ Add call_single_data_t and align with size of call_single_data. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
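The "struct definition plus typedef" approach from commit 966a9671 can be sketched as follows. The `toy_` struct is a stand-in with a similar shape, not the kernel's definition, and the helper `toy_crosses_line()` is hypothetical. The `aligned` attribute is GCC/Clang specific.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in structure; its size is a power of 2 on common ABIs. */
struct toy_call_single_data {
    void *next;                /* placeholder for a list node */
    void (*func)(void *info);
    void *info;
    unsigned int flags;
};

/* Objects of this type are aligned to their own size. */
typedef struct toy_call_single_data toy_call_single_data_t
    __attribute__((aligned(sizeof(struct toy_call_single_data))));

/* The compiler rejects the typedef if the size is not a power of 2. */
_Static_assert(_Alignof(toy_call_single_data_t) ==
               sizeof(struct toy_call_single_data),
               "alignment should equal the structure size");

/* Nonzero if [addr, addr + size) spans more than one line of `line` bytes. */
int toy_crosses_line(uintptr_t addr, size_t size, size_t line)
{
    return addr / line != (addr + size - 1) / line;
}
```

An object whose size is a power of 2 no larger than the cache line, and which is aligned to that size, cannot straddle two lines. An object of the same size at an arbitrary offset can, for example 32 bytes starting at offset 48 of a 64-byte line.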
  9. 28 Aug, 2017 1 commit
    • Minor page waitqueue cleanups · 3510ca20
      Committed by Linus Torvalds
      Tim Chen and Kan Liang have been battling a customer load that shows
      extremely long page wakeup lists.  The cause seems to be constant NUMA
      migration of a hot page that is shared across a lot of threads, but the
      actual root cause for the exact behavior has not been found.
      
      Tim has a patch that batches the wait list traversal at wakeup time, so
      that we at least don't get long uninterruptible cases where we traverse
      and wake up thousands of processes and get nasty latency spikes.  That
      is likely 4.14 material, but we're still discussing the page waitqueue
      specific parts of it.
      
      In the meantime, I've tried to look at making the page wait queues less
      expensive, and failing miserably.  If you have thousands of threads
      waiting for the same page, it will be painful.  We'll need to try to
      figure out the NUMA balancing issue some day, in addition to avoiding
      the excessive spinlock hold times.
      
      That said, having tried to rewrite the page wait queues, I can at least
      fix up some of the braindamage in the current situation. In particular:
      
       (a) we don't want to continue walking the page wait list if the bit
           we're waiting for already got set again (which seems to be one of
           the patterns of the bad load).  That makes no progress and just
           causes pointless cache pollution chasing the pointers.
      
       (b) we don't want to put the non-locking waiters always on the front of
           the queue, and the locking waiters always on the back.  Not only is
           that unfair, it means that we wake up thousands of reading threads
           that will just end up being blocked by the writer later anyway.
      
      Also add a comment about the layout of 'struct wait_page_key' - there is
      an external user of it in the cachefiles code that means that it has to
      match the layout of 'struct wait_bit_key' in the two first members.  It
      so happens to match, because 'struct page *' and 'unsigned long *' end
      up having the same values simply because the page flags are the first
      member in struct page.
      
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 25 Aug, 2017 4 commits
    • sched/debug: Optimize sched_domain sysctl generation · bbdacdfe
      Committed by Peter Zijlstra
      Currently we unconditionally destroy all sysctl bits and regenerate
      them after we've rebuilt the domains (even if that rebuild is a
      no-op).
      
      And since we unconditionally (re)build the sysctl for all possible
      CPUs, onlining all CPUs gets us O(n^2) time. Instead change this to
      only rebuild the bits for CPUs we've actually installed new domains
      on.
      Reported-by: Ofer Levi(SW) <oferle@mellanox.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/topology: Avoid pointless rebuild · 09e0dd8e
      Committed by Peter Zijlstra
      Fix partition_sched_domains() to try and preserve the existing machine
      wide domain instead of unconditionally destroying it. We do this by
      attempting to allocate the new single domain; only when that fails do
      we reuse the fallback_doms.
      
      When using fallback_doms we need to first destroy and then recreate
      because both the old and new could be backed by it.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ofer Levi(SW) <oferle@mellanox.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com>
      Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/topology: Improve comments · a090c4f2
      Committed by Peter Zijlstra
      Mike provided a better comment for destroy_sched_domain() ...
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/topology: Fix memory leak in __sdt_alloc() · 213c5a45
      Committed by Shu Wang
      Found this issue with kmemleak: the 'sg' and 'sgc' pointers from
      __sdt_alloc() might be leaked, as each domain holds references to many
      groups, but destroy_sched_domain() only dropped the first group's ref.
      
      Onlining and offlining a CPU can trigger this leak, and cause OOM.
      
      Reproducer for my 6 CPUs machine:
      
        while true
        do
            echo 0 > /sys/devices/system/cpu/cpu5/online;
            echo 1 > /sys/devices/system/cpu/cpu5/online;
        done
      
        unreferenced object 0xffff88007d772a80 (size 64):
          comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s)
          hex dump (first 32 bytes):
            c0 22 77 7d 00 88 ff ff 02 00 00 00 01 00 00 00  ."w}............
            40 2a 77 7d 00 88 ff ff 00 00 00 00 00 00 00 00  @*w}............
          backtrace:
            [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0
            [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280
            [<ffffffff810d94a8>] build_sched_domains+0x1e8/0xf20
            [<ffffffff810da674>] partition_sched_domains+0x304/0x360
            [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40
            [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0
            [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400
            [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0
            [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0
            [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160
            [<ffffffff810af5d9>] kthread+0x109/0x140
            [<ffffffff81770e45>] ret_from_fork+0x25/0x30
            [<ffffffffffffffff>] 0xffffffffffffffff
      
        unreferenced object 0xffff88007d772a40 (size 64):
          comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s)
          hex dump (first 32 bytes):
            03 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00  ................
            00 04 00 00 00 00 00 00 4f 3c fc ff 00 00 00 00  ........O<......
          backtrace:
            [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0
            [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280
            [<ffffffff810da16d>] build_sched_domains+0xead/0xf20
            [<ffffffff810da674>] partition_sched_domains+0x304/0x360
            [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40
            [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0
            [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400
            [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0
            [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0
            [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160
            [<ffffffff810af5d9>] kthread+0x109/0x140
            [<ffffffff81770e45>] ret_from_fork+0x25/0x30
            [<ffffffffffffffff>] 0xffffffffffffffff
      Reported-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Shu Wang <shuwang@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Chunyu Hu <chuhu@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: liwang@redhat.com
      Link: http://lkml.kernel.org/r/1502351536-9108-1-git-send-email-shuwang@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 18 Aug, 2017 2 commits
    • cpufreq: schedutil: Always process remote callback with slow switching · c49cbc19
      Committed by Viresh Kumar
      The frequency update from the utilization update handlers can be divided
      into two parts:
      
      (A) Finding the next frequency
      (B) Updating the frequency
      
      While any CPU can do (A), (B) can be restricted to a group of CPUs only,
      depending on the current platform.
      
      For platforms where fast cpufreq switching is possible, both (A) and (B)
      are always done from the same CPU and that CPU should be capable of
      changing the frequency of the target CPU.
      
      But for platforms where fast cpufreq switching isn't possible, after
      doing (A) we wake up a kthread which will eventually do (B). This
      kthread is already bound to the right set of CPUs, i.e. only those which
      can change the frequency of CPUs of a cpufreq policy. And so any CPU
      can actually do (A) in this case, as the frequency is updated from the
      right set of CPUs only.
      
      Check cpufreq_can_do_remote_dvfs() only for the fast switching case.
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily · e2cabe48
      Committed by Viresh Kumar
      Utilization update callbacks are now processed remotely, even on the
      CPUs that don't share cpufreq policy with the target CPU (if
      dvfs_possible_from_any_cpu flag is set).
      
      But in non-fast switch paths, the frequency is changed only from one of
      policy->related_cpus. This happens because the kthread which does the
      actual update is bound to a subset of CPUs (i.e. related_cpus).
      
      Allow frequency to be remotely updated as well (i.e. call
      __cpufreq_driver_target()) if dvfs_possible_from_any_cpu flag is set.
      Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  12. 17 Aug, 2017 3 commits
    • completion: Replace spin_unlock_wait() with lock/unlock pair · dec13c42
      Committed by Paul E. McKenney
      There is no agreed-upon definition of spin_unlock_wait()'s semantics,
      and it appears that all callers could do just as well with a lock/unlock
      pair.  This commit therefore replaces the spin_unlock_wait() call in
      completion_done() with spin_lock() followed immediately by spin_unlock().
      This should be safe from a performance perspective because the lock
      will be held only if the wakeup happens really quickly.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • membarrier: Provide expedited private command · 22e4ebb9
      Committed by Mathieu Desnoyers
      Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
      from all runqueues for which current thread's mm is the same as the
      thread calling sys_membarrier. It executes faster than the non-expedited
      variant (no blocking). It also works on NOHZ_FULL configurations.
      
      Scheduler-wise, it requires a memory barrier before and after context
      switching between processes (which have different mm). The memory
      barrier before context switch is already present. For the barrier after
      context switch:
      
      * Our TSO archs can do RELEASE without being a full barrier. Look at
        x86 spin_unlock() being a regular STORE for example.  But for those
        archs, all atomics imply smp_mb and all of them have atomic ops in
        switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
        barrier.
      
      * From all weakly ordered machines, only ARM64 and PPC can do RELEASE,
        the rest does indeed do smp_mb(), so there the spin_unlock() is a full
        barrier and we're good.
      
      * ARM64 has a very heavy barrier in switch_to(), which suffices.
      
      * PPC just removed its barrier from switch_to(), but appears to be
        talking about adding something to switch_mm(). So add a
        smp_mb__after_unlock_lock() for now, until this is settled on the PPC
        side.
      
      Changes since v3:
      - Properly document the memory barriers provided by each architecture.
      
      Changes since v2:
      - Address comments from Peter Zijlstra,
      - Add smp_mb__after_unlock_lock() after finish_lock_switch() in
        finish_task_switch() to add the memory barrier we need after storing
        to rq->curr. This is much simpler than the previous approach relying
        on atomic_dec_and_test() in mmdrop(), which actually added a memory
        barrier in the common case of switching between userspace processes.
      - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
        kernel, rather than having the whole membarrier system call returning
        -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
        Adapt the CMD_QUERY mask accordingly.
      
      Changes since v1:
      - move membarrier code under kernel/sched/ because it uses the
        scheduler runqueue,
      - only add the barrier when we switch from a kernel thread. The case
        where we switch from a user-space thread is already handled by
        the atomic_dec_and_test() in mmdrop().
      - add a comment to mmdrop() documenting the requirement on the implicit
        memory barrier.
      
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      CC: Boqun Feng <boqun.feng@gmail.com>
      CC: Andrew Hunter <ahh@google.com>
      CC: Maged Michael <maged.michael@gmail.com>
      CC: gromer@google.com
      CC: Avi Kivity <avi@scylladb.com>
      CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: Paul Mackerras <paulus@samba.org>
      CC: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Dave Watson <davejwatson@fb.com>
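From user space, the command added in commit 22e4ebb9 might be used roughly as below. This is a Linux-only sketch that goes through the raw system call; the `toy_` function names are hypothetical. The registration step is an assumption that goes beyond the commit text: later kernels expect a process to issue MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED once before using the expedited command, so the sketch registers whenever the query mask advertises it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw system call. */
static int toy_membarrier(int cmd, int flags)
{
    return (int)syscall(__NR_membarrier, cmd, flags);
}

/*
 * Issue a private expedited barrier if the running kernel supports it.
 * Returns 0 on success, -1 if the command is unavailable or fails.
 */
int toy_private_expedited_barrier(void)
{
    /* CMD_QUERY returns a bit mask of the supported commands. */
    int mask = toy_membarrier(MEMBARRIER_CMD_QUERY, 0);

    if (mask < 0 || !(mask & MEMBARRIER_CMD_PRIVATE_EXPEDITED))
        return -1;

    /* Assumption: register first when the kernel offers registration. */
    if (mask & MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED) {
        if (toy_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0) < 0)
            return -1;
    }

    return toy_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) < 0 ? -1 : 0;
}
```

Querying first keeps the caller portable across kernels that lack the command, in line with the CMD_QUERY mask handling mentioned in the changelog.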
    • sched/completion: Document that reinit_completion() must be called after complete_all() · 9c878320
      Committed by Steven Rostedt
      The complete_all() function modifies the completion's "done" variable to
      UINT_MAX, and no other caller (wait_for_completion(), etc) will modify
      it back to zero. That means that any call to complete_all() must have a
      reinit_completion() before that completion can be used again.
      
      Document this fact by the complete_all() function.
      
      Also document that completion_done() will always return true if
      complete_all() is called.
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170816131202.195c2f4b@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
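The rule documented by commit 9c878320 can be modelled with a single-threaded toy that tracks only the "done" counter. The `toy_` names are hypothetical and there is no locking or sleeping here; `toy_try_wait()` is a non-blocking stand-in for a waiter consuming one completion.

```c
#include <limits.h>
#include <stdbool.h>

struct toy_completion {
    unsigned int done;
};

void toy_init_completion(struct toy_completion *c)   { c->done = 0; }
void toy_reinit_completion(struct toy_completion *c) { c->done = 0; }

/* Signal one waiter; a saturated counter is left alone. */
void toy_complete(struct toy_completion *c)
{
    if (c->done != UINT_MAX)
        c->done++;
}

/* Signal all current and future waiters by saturating the counter. */
void toy_complete_all(struct toy_completion *c)
{
    c->done = UINT_MAX;
}

/* Consume one completion if available; never undoes complete_all(). */
bool toy_try_wait(struct toy_completion *c)
{
    if (c->done == 0)
        return false;
    if (c->done != UINT_MAX)
        c->done--;
    return true;
}

bool toy_completion_done(const struct toy_completion *c)
{
    return c->done != 0;
}
```

After `toy_complete_all()` every wait succeeds and `toy_completion_done()` stays true, because no waiter brings the counter back to zero. Only `toy_reinit_completion()` makes the object usable for a fresh round.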