1. 06 2月, 2018 10 次提交
    • M
      sched/fair: Remove unnecessary parameters from wake_affine_idle() · 89a55f56
      Mel Gorman 提交于
      wake_affine_idle() takes parameters it never uses so clean it up.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      89a55f56
    • W
      sched/rt: Make update_curr_rt() more accurate · e7ad2031
      Wen Yang 提交于
      rq->clock_task may be updated between the two calls of
      rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only
      once makes it more accurate and efficient, taking update_curr() as
      reference.
      Signed-off-by: NWen Yang <wen.yang99@zte.com.cn>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NJiang Biao <jiang.biao2@zte.com.cn>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1517800721-42092-1-git-send-email-wen.yang99@zte.com.cnSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e7ad2031
    • S
      sched/rt: Up the root domain ref count when passing it around via IPIs · 364f5665
      Steven Rostedt (VMware) 提交于
      When issuing an IPI RT push, where an IPI is sent to each CPU that has more
      than one RT task scheduled on it, it references the root domain's rto_mask,
      that contains all the CPUs within the root domain that has more than one RT
      task in the runable state. The problem is, after the IPIs are initiated, the
      rq->lock is released. This means that the root domain that is associated to
      the run queue could be freed while the IPIs are going around.
      
      Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
      the root domain's ref count respectively. This way when initiating the IPIs,
      the scheduler will up the root domain's ref count before releasing the
      rq->lock, ensuring that the root domain does not go away until the IPI round
      is complete.
      Reported-by: NPavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 4bdced5c ("sched/rt: Simplify the IPI based RT balancing logic")
      Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      364f5665
    • S
      sched/rt: Use container_of() to get root domain in rto_push_irq_work_func() · ad0f1d9d
      Steven Rostedt (VMware) 提交于
      When the rto_push_irq_work_func() is called, it looks at the RT overloaded
      bitmask in the root domain via the runqueue (rq->rd). The problem is that
      during CPU up and down, nothing here stops rq->rd from changing between
      taking the rq->rd->rto_lock and releasing it. That means the lock that is
      released is not the same lock that was taken.
      
      Instead of using this_rq()->rd to get the root domain, as the irq work is
      part of the root domain, we can simply get the root domain from the irq work
      that is passed to the routine:
      
       container_of(work, struct root_domain, rto_push_work)
      
      This keeps the root domain consistent.
      Reported-by: NPavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 4bdced5c ("sched/rt: Simplify the IPI based RT balancing logic")
      Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ad0f1d9d
    • P
      sched/core: Optimize update_stats_*() · 2ed41a55
      Peter Zijlstra 提交于
      These functions are already gated by schedstats_enabled(), there is no
      point in then issuing another static_branch for every individual
      update in them.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      2ed41a55
    • P
      sched/core: Optimize ttwu_stat() · b85c8b71
      Peter Zijlstra 提交于
      The whole of ttwu_stat() is guarded by a single schedstat_enabled(),
      there is absolutely no point in then issuing another static_branch for
      every single schedstat_inc() in there.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      b85c8b71
    • M
      membarrier: Provide core serializing command, *_SYNC_CORE · 70216e18
      Mathieu Desnoyers 提交于
      Provide core serializing membarrier command to support memory reclaim
      by JIT.
      
      Each architecture needs to explicitly opt into that support by
      documenting in their architecture code how they provide the core
      serializing instructions required when returning from the membarrier
      IPI, and after the scheduler has updated the curr->mm pointer (before
      going back to user-space). They should then select
      ARCH_HAS_MEMBARRIER_SYNC_CORE to enable support for that command on
      their architecture.
      
      Architectures selecting this feature need to either document that
      they issue core serializing instructions when returning to user-space,
      or implement their architecture-specific sync_core_before_usermode().
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-9-mathieu.desnoyers@efficios.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      70216e18
    • M
      membarrier: Provide GLOBAL_EXPEDITED command · c5f58bd5
      Mathieu Desnoyers 提交于
      Allow expedited membarrier to be used for data shared between processes
      through shared memory.
      
      Processes wishing to receive the membarriers register with
      MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED. Those which want to issue
      membarrier invoke MEMBARRIER_CMD_GLOBAL_EXPEDITED.
      
      This allows extremely simple kernel-level implementation: we have almost
      everything we need with the PRIVATE_EXPEDITED barrier code. All we need
      to do is to add a flag in the mm_struct that will be used to check
      whether we need to send the IPI to the current thread of each CPU.
      
      There is a slight downside to this approach compared to targeting
      specific shared memory users: when performing a membarrier operation,
      all registered "global" receivers will get the barrier, even if they
      don't share a memory mapping with the sender issuing
      MEMBARRIER_CMD_GLOBAL_EXPEDITED.
      
      This registration approach seems to fit the requirement of not
      disturbing processes that really deeply care about real-time: they
      simply should not register with MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED.
      
      In order to align the membarrier command names, the "MEMBARRIER_CMD_SHARED"
      command is renamed to "MEMBARRIER_CMD_GLOBAL", keeping an alias of
      MEMBARRIER_CMD_SHARED to MEMBARRIER_CMD_GLOBAL for UAPI header backward
      compatibility.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-5-mathieu.desnoyers@efficios.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c5f58bd5
    • M
      membarrier: Document scheduler barrier requirements · 306e0604
      Mathieu Desnoyers 提交于
      Document the membarrier requirement on having a full memory barrier in
      __schedule() after coming from user-space, before storing to rq->curr.
      It is provided by smp_mb__after_spinlock() in __schedule().
      
      Document that membarrier requires a full barrier on transition from
      kernel thread to userspace thread. We currently have an implicit barrier
      from atomic_dec_and_test() in mmdrop() that ensures this.
      
      The x86 switch_mm_irqs_off() full barrier is currently provided by many
      cpumask update operations as well as write_cr3(). Document that
      write_cr3() provides this barrier.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-4-mathieu.desnoyers@efficios.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      306e0604
    • M
      powerpc, membarrier: Skip memory barrier in switch_mm() · 3ccfebed
      Mathieu Desnoyers 提交于
      Allow PowerPC to skip the full memory barrier in switch_mm(), and
      only issue the barrier when scheduling into a task belonging to a
      process that has registered to use expedited private.
      
      Threads targeting the same VM but which belong to different thread
      groups is a tricky case. It has a few consequences:
      
      It turns out that we cannot rely on get_nr_threads(p) to count the
      number of threads using a VM. We can use
      (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
      instead to skip the synchronize_sched() for cases where the VM only has
      a single user, and that user only has a single thread.
      
      It also turns out that we cannot use for_each_thread() to set
      thread flags in all threads using a VM, as it only iterates on the
      thread group.
      
      Therefore, test the membarrier state variable directly rather than
      relying on thread flags. This means
      membarrier_register_private_expedited() needs to set the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
      only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
      private expedited membarrier commands to succeed.
      membarrier_arch_switch_mm() now tests for the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-3-mathieu.desnoyers@efficios.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3ccfebed
  2. 24 1月, 2018 1 次提交
  3. 16 1月, 2018 1 次提交
    • J
      delayacct: Account blkio completion on the correct task · c96f5471
      Josh Snyder 提交于
      Before commit:
      
        e33a9bba ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
      
      delayacct_blkio_end() was called after context-switching into the task which
      completed I/O.
      
      This resulted in double counting: the task would account a delay both waiting
      for I/O and for time spent in the runqueue.
      
      With e33a9bba, delayacct_blkio_end() is called by try_to_wake_up().
      In ttwu, we have not yet context-switched. This is more correct, in that
      the delay accounting ends when the I/O is complete.
      
      But delayacct_blkio_end() relies on 'get_current()', and we have not yet
      context-switched into the task whose I/O completed. This results in the
      wrong task having its delay accounting statistics updated.
      
      Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
      so that it can update the statistics of the correct task.
      Signed-off-by: NJosh Snyder <joshs@netflix.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NBalbir Singh <bsingharora@gmail.com>
      Cc: <stable@vger.kernel.org>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-block@vger.kernel.org
      Fixes: e33a9bba ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
      Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c96f5471
  4. 10 1月, 2018 18 次提交
    • J
      sched/deadline: Make bandwidth enforcement scale-invariant · 07881166
      Juri Lelli 提交于
      Apply frequency and CPU scale-invariance correction factor to bandwidth
      enforcement (similar to what we already do to fair utilization tracking).
      
      Each delta_exec gets scaled considering current frequency and maximum
      CPU capacity; which means that the reservation runtime parameter (that
      need to be specified profiling the task execution at max frequency on
      biggest capacity core) gets thus scaled accordingly.
      Signed-off-by: NJuri Lelli <juri.lelli@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: alessio.balsini@arm.com
      Cc: bristot@redhat.com
      Cc: dietmar.eggemann@arm.com
      Cc: joelaf@google.com
      Cc: juri.lelli@redhat.com
      Cc: mathieu.poirier@linaro.org
      Cc: morten.rasmussen@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: rjw@rjwysocki.net
      Cc: rostedt@goodmis.org
      Cc: tkjos@android.com
      Cc: tommaso.cucinotta@santannapisa.it
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/20171204102325.5110-9-juri.lelli@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      07881166
    • J
      sched/cpufreq: Move arch_scale_{freq,cpu}_capacity() outside of #ifdef CONFIG_SMP · 7e1a9208
      Juri Lelli 提交于
      Currently, frequency and cpu capacity scaling is only performed on
      CONFIG_SMP systems (as CFS PELT signals are only present for such
      systems). However, other scheduling classes want to do freq/cpu scaling,
      and for !CONFIG_SMP configurations as well.
      
      arch_scale_freq_capacity() is useful to implement frequency scaling even
      on !CONFIG_SMP platforms, so we simply move it outside CONFIG_SMP
      ifdeffery.
      
      Even if arch_scale_cpu_capacity() is not useful on !CONFIG_SMP platforms,
      we make a default implementation available for such configurations anyway
      to simplify scheduler code doing CPU scale invariance.
      Signed-off-by: NJuri Lelli <juri.lelli@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: alessio.balsini@arm.com
      Cc: bristot@redhat.com
      Cc: claudio@evidence.eu.com
      Cc: dietmar.eggemann@arm.com
      Cc: joelaf@google.com
      Cc: juri.lelli@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: mathieu.poirier@linaro.org
      Cc: morten.rasmussen@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: rjw@rjwysocki.net
      Cc: tkjos@android.com
      Cc: tommaso.cucinotta@santannapisa.it
      Cc: vincent.guittot@linaro.org
      Cc: viresh.kumar@linaro.org
      Link: http://lkml.kernel.org/r/20171204102325.5110-8-juri.lelli@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7e1a9208
    • J
      sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter · 7673c8a4
      Juri Lelli 提交于
      The 'sd' parameter is never used in arch_scale_freq_capacity() (and it's hard to
      see where information coming from scheduling domains might help doing
      frequency invariance scaling).
      
      Remove it; also in anticipation of moving arch_scale_freq_capacity()
      outside CONFIG_SMP.
      Signed-off-by: NJuri Lelli <juri.lelli@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: alessio.balsini@arm.com
      Cc: bristot@redhat.com
      Cc: claudio@evidence.eu.com
      Cc: dietmar.eggemann@arm.com
      Cc: joelaf@google.com
      Cc: juri.lelli@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: mathieu.poirier@linaro.org
      Cc: morten.rasmussen@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: rjw@rjwysocki.net
      Cc: rostedt@goodmis.org
      Cc: tkjos@android.com
      Cc: tommaso.cucinotta@santannapisa.it
      Cc: vincent.guittot@linaro.org
      Cc: viresh.kumar@linaro.org
      Link: http://lkml.kernel.org/r/20171204102325.5110-7-juri.lelli@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7673c8a4
    • J
      sched/cpufreq: Always consider all CPUs when deciding next freq · 0fa7d181
      Juri Lelli 提交于
      No assumption can be made upon the rate at which frequency updates get
      triggered, as there are scheduling policies (like SCHED_DEADLINE) which
      don't trigger them so frequently.
      
      Remove such assumption from the code, by always considering
      SCHED_DEADLINE utilization signal as not stale.
      Signed-off-by: NJuri Lelli <juri.lelli@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: alessio.balsini@arm.com
      Cc: bristot@redhat.com
      Cc: dietmar.eggemann@arm.com
      Cc: joelaf@google.com
      Cc: juri.lelli@redhat.com
      Cc: mathieu.poirier@linaro.org
      Cc: morten.rasmussen@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: rjw@rjwysocki.net
      Cc: rostedt@goodmis.org
      Cc: tkjos@android.com
      Cc: tommaso.cucinotta@santannapisa.it
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/20171204102325.5110-6-juri.lelli@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0fa7d181
    • J
      sched/cpufreq: Split utilization signals · d18be45d
      Juri Lelli 提交于
      To be able to treat utilization signals of different scheduling classes
      in different ways (e.g., CFS signal might be stale while DEADLINE signal
      is never stale by design) we need to split sugov_cpu::util signal in two:
      util_cfs and util_dl.
      
      This patch does that by also changing sugov_get_util() parameter list.
      After this change, aggregation of the different signals has to be performed
      by sugov_get_util() users (so that they can decide what to do with the
      different signals).
      Suggested-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NJuri Lelli <juri.lelli@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: alessio.balsini@arm.com
      Cc: bristot@redhat.com
      Cc: dietmar.eggemann@arm.com
      Cc: joelaf@google.com
      Cc: juri.lelli@redhat.com
      Cc: mathieu.poirier@linaro.org
      Cc: morten.rasmussen@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: rjw@rjwysocki.net
      Cc: rostedt@goodmis.org
      Cc: tkjos@android.com
      Cc: tommaso.cucinotta@santannapisa.it
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/20171204102325.5110-5-juri.lelli@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d18be45d
    • J
      sched/cpufreq: Change the worker kthread to SCHED_DEADLINE · 794a56eb
      Juri Lelli 提交于
      Worker kthread needs to be able to change frequency for all other
      threads.
      
      Make it special, just under STOP class.
      Signed-off-by: NJuri Lelli <juri.lelli@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: alessio.balsini@arm.com
      Cc: bristot@redhat.com
      Cc: dietmar.eggemann@arm.com
      Cc: joelaf@google.com
      Cc: juri.lelli@redhat.com
      Cc: mathieu.poirier@linaro.org
      Cc: morten.rasmussen@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: rjw@rjwysocki.net
      Cc: rostedt@goodmis.org
      Cc: tkjos@android.com
      Cc: tommaso.cucinotta@santannapisa.it
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/20171204102325.5110-4-juri.lelli@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      794a56eb
    • J
      sched/deadline: Move CPU frequency selection triggering points · e0367b12
      Juri Lelli 提交于
      Since SCHED_DEADLINE doesn't track utilization signal (but reserves a
      fraction of CPU bandwidth to tasks admitted to the system), there is no
      point in evaluating frequency changes during each tick event.
      
      Move frequency selection triggering points to where running_bw changes.
      Co-authored-by: NClaudio Scordino <claudio@evidence.eu.com>
      Signed-off-by: NJuri Lelli <juri.lelli@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NViresh Kumar <viresh.kumar@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: alessio.balsini@arm.com
      Cc: bristot@redhat.com
      Cc: dietmar.eggemann@arm.com
      Cc: joelaf@google.com
      Cc: juri.lelli@redhat.com
      Cc: mathieu.poirier@linaro.org
      Cc: morten.rasmussen@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: rjw@rjwysocki.net
      Cc: rostedt@goodmis.org
      Cc: tkjos@android.com
      Cc: tommaso.cucinotta@santannapisa.it
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/20171204102325.5110-3-juri.lelli@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e0367b12
    • J
      sched/cpufreq: Use the DEADLINE utilization signal · d4edd662
      Juri Lelli 提交于
      SCHED_DEADLINE tracks active utilization signal with a per dl_rq
      variable named running_bw.
      
      Make use of that to drive CPU frequency selection: add up FAIR and
      DEADLINE contribution to get the required CPU capacity to handle both
      requirements (while RT still selects max frequency).
      Co-authored-by: NClaudio Scordino <claudio@evidence.eu.com>
      Signed-off-by: NJuri Lelli <juri.lelli@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: alessio.balsini@arm.com
      Cc: bristot@redhat.com
      Cc: dietmar.eggemann@arm.com
      Cc: joelaf@google.com
      Cc: juri.lelli@redhat.com
      Cc: mathieu.poirier@linaro.org
      Cc: morten.rasmussen@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: rjw@rjwysocki.net
      Cc: rostedt@goodmis.org
      Cc: tkjos@android.com
      Cc: tommaso.cucinotta@santannapisa.it
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/20171204102325.5110-2-juri.lelli@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d4edd662
    • J
      sched/deadline: Implement "runtime overrun signal" support · 34be3930
      Juri Lelli 提交于
      This patch adds the possibility of getting the delivery of a SIGXCPU
      signal whenever there is a runtime overrun. The request is done through
      the sched_flags field within the sched_attr structure.
      
      Forward port of https://lkml.org/lkml/2009/10/16/170Tested-by: NMathieu Poirier <mathieu.poirier@linaro.org>
      Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: NClaudio Scordino <claudio@evidence.eu.com>
      Signed-off-by: NLuca Abeni <luca.abeni@santannapisa.it>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/1513077024-25461-1-git-send-email-claudio@evidence.eu.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      34be3930
    • M
      sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache · 7332dec0
      Mel Gorman 提交于
      If waking from an idle CPU due to an interrupt then it's possible that
      the waker task will be pulled to wake on the current CPU. Unfortunately,
      depending on the type of interrupt and IRQ configuration, there may not
      be a strong relationship between the CPU an interrupt was delivered on
      and the CPU a task was running on. For example, the interrupts could all
      be delivered to CPUs on one particular node due to the machine topology
      or IRQ affinity configuration. Another example is an interrupt for an IO
      completion which can be delivered to any CPU where there is no guarantee
      the data is either cache hot or even local.
      
      This patch was motivated by the observation that an IO workload was
      being pulled cross-node on a frequent basis when IO completed.  From a
      wakeup latency perspective, it's still useful to know that an idle CPU is
      immediately available for use but lets only consider an automatic migration
      if the CPUs share cache to limit damage due to NUMA migrations. Migrations
      may still occur if wake_affine_weight determines it's appropriate.
      
      These are the throughput results for dbench running on ext4 comparing
      4.15-rc3 and this patch on a 2-socket machine where interrupts due to IO
      completions can happen on any CPU.
      
                                4.15.0-rc3             4.15.0-rc3
                                   vanilla            lessmigrate
      Hmean     1        854.64 (   0.00%)      865.01 (   1.21%)
      Hmean     2       1229.60 (   0.00%)     1274.44 (   3.65%)
      Hmean     4       1591.81 (   0.00%)     1628.08 (   2.28%)
      Hmean     8       1845.04 (   0.00%)     1831.80 (  -0.72%)
      Hmean     16      2038.61 (   0.00%)     2091.44 (   2.59%)
      Hmean     32      2327.19 (   0.00%)     2430.29 (   4.43%)
      Hmean     64      2570.61 (   0.00%)     2568.54 (  -0.08%)
      Hmean     128     2481.89 (   0.00%)     2499.28 (   0.70%)
      Stddev    1         14.31 (   0.00%)        5.35 (  62.65%)
      Stddev    2         21.29 (   0.00%)       11.09 (  47.92%)
      Stddev    4          7.22 (   0.00%)        6.80 (   5.92%)
      Stddev    8         26.70 (   0.00%)        9.41 (  64.76%)
      Stddev    16        22.40 (   0.00%)       20.01 (  10.70%)
      Stddev    32        45.13 (   0.00%)       44.74 (   0.85%)
      Stddev    64        93.10 (   0.00%)       93.18 (  -0.09%)
      Stddev    128      184.28 (   0.00%)      177.85 (   3.49%)
      
      Note the small increase in throughput for low thread counts but also
      note that the standard deviation for each sample during the test run is
      lower. The throughput figures for dbench can be misleading so the benchmark
      is actually modified to time the latency of the processing of one load
      file with many samples taken. The difference in latency is
      
                                 4.15.0-rc3             4.15.0-rc3
                                    vanilla            lessmigrate
      Amean      1         21.71 (   0.00%)       21.47 (   1.08%)
      Amean      2         30.89 (   0.00%)       29.58 (   4.26%)
      Amean      4         47.54 (   0.00%)       46.61 (   1.97%)
      Amean      8         82.71 (   0.00%)       82.81 (  -0.12%)
      Amean      16       149.45 (   0.00%)      145.01 (   2.97%)
      Amean      32       265.49 (   0.00%)      248.43 (   6.42%)
      Amean      64       463.23 (   0.00%)      463.55 (  -0.07%)
      Amean      128      933.97 (   0.00%)      935.50 (  -0.16%)
      Stddev     1          1.58 (   0.00%)        1.54 (   2.26%)
      Stddev     2          2.84 (   0.00%)        2.95 (  -4.15%)
      Stddev     4          6.78 (   0.00%)        6.85 (  -0.99%)
      Stddev     8         16.85 (   0.00%)       16.37 (   2.85%)
      Stddev     16        41.59 (   0.00%)       41.04 (   1.32%)
      Stddev     32       111.05 (   0.00%)      105.11 (   5.35%)
      Stddev     64       285.94 (   0.00%)      288.01 (  -0.72%)
      Stddev     128      803.39 (   0.00%)      809.73 (  -0.79%)
      
      It's a small improvement which is not surprising given that migrations that
      migrate to a different node as not that common. However, it is noticeable
      in the CPU migration statistics which are reduced by 24%.
      
      There was a query for v1 of this patch about NAS so here are the results
      for C-class using MPI for parallelisation on the same machine
      
      nas-mpi
                            4.15.0-rc3             4.15.0-rc3
                               vanilla                  noirq
      Time cg.C       24.25 (   0.00%)       23.17 (   4.45%)
      Time ep.C        8.22 (   0.00%)        8.29 (  -0.85%)
      Time ft.C       22.67 (   0.00%)       20.34 (  10.28%)
      Time is.C        1.42 (   0.00%)        1.47 (  -3.52%)
      Time lu.C       55.62 (   0.00%)       54.81 (   1.46%)
      Time mg.C        7.93 (   0.00%)        7.91 (   0.25%)
      
                4.15.0-rc3  4.15.0-rc3
                   vanilla  noirq-v1r1
      User         3799.96     3748.34
      System        672.10      626.15
      Elapsed        91.91       79.49
      
      lu.C sees a small gain, ft.C a large gain and ep.C and is.C see small
      regressions but in terms of absolute time, the difference is small and
      likely within run-to-run variance. System CPU usage is slightly reduced.
      
      schbench from Facebook was also requested. This is a bit of a mixed bag but
      it's important to note that this workload should not be heavily impacted
      by wakeups from interrupt context.
      
                                       4.15.0-rc3             4.15.0-rc3
                                          vanilla             noirq-v1r1
      Lat 50.00th-qrtle-1        41.00 (   0.00%)       41.00 (   0.00%)
      Lat 75.00th-qrtle-1        42.00 (   0.00%)       42.00 (   0.00%)
      Lat 90.00th-qrtle-1        43.00 (   0.00%)       44.00 (  -2.33%)
      Lat 95.00th-qrtle-1        44.00 (   0.00%)       46.00 (  -4.55%)
      Lat 99.00th-qrtle-1        57.00 (   0.00%)       58.00 (  -1.75%)
      Lat 99.50th-qrtle-1        59.00 (   0.00%)       59.00 (   0.00%)
      Lat 99.90th-qrtle-1        67.00 (   0.00%)       78.00 ( -16.42%)
      Lat 50.00th-qrtle-2        40.00 (   0.00%)       51.00 ( -27.50%)
      Lat 75.00th-qrtle-2        45.00 (   0.00%)       56.00 ( -24.44%)
      Lat 90.00th-qrtle-2        53.00 (   0.00%)       59.00 ( -11.32%)
      Lat 95.00th-qrtle-2        57.00 (   0.00%)       61.00 (  -7.02%)
      Lat 99.00th-qrtle-2        67.00 (   0.00%)       71.00 (  -5.97%)
      Lat 99.50th-qrtle-2        69.00 (   0.00%)       74.00 (  -7.25%)
      Lat 99.90th-qrtle-2        83.00 (   0.00%)       77.00 (   7.23%)
      Lat 50.00th-qrtle-4        51.00 (   0.00%)       51.00 (   0.00%)
      Lat 75.00th-qrtle-4        57.00 (   0.00%)       56.00 (   1.75%)
      Lat 90.00th-qrtle-4        60.00 (   0.00%)       59.00 (   1.67%)
      Lat 95.00th-qrtle-4        62.00 (   0.00%)       62.00 (   0.00%)
      Lat 99.00th-qrtle-4        73.00 (   0.00%)       72.00 (   1.37%)
      Lat 99.50th-qrtle-4        76.00 (   0.00%)       74.00 (   2.63%)
      Lat 99.90th-qrtle-4        85.00 (   0.00%)       78.00 (   8.24%)
      Lat 50.00th-qrtle-8        54.00 (   0.00%)       58.00 (  -7.41%)
      Lat 75.00th-qrtle-8        59.00 (   0.00%)       62.00 (  -5.08%)
      Lat 90.00th-qrtle-8        65.00 (   0.00%)       66.00 (  -1.54%)
      Lat 95.00th-qrtle-8        67.00 (   0.00%)       70.00 (  -4.48%)
      Lat 99.00th-qrtle-8        78.00 (   0.00%)       79.00 (  -1.28%)
      Lat 99.50th-qrtle-8        81.00 (   0.00%)       80.00 (   1.23%)
      Lat 99.90th-qrtle-8       116.00 (   0.00%)       83.00 (  28.45%)
      Lat 50.00th-qrtle-16       65.00 (   0.00%)       64.00 (   1.54%)
      Lat 75.00th-qrtle-16       77.00 (   0.00%)       71.00 (   7.79%)
      Lat 90.00th-qrtle-16       83.00 (   0.00%)       82.00 (   1.20%)
      Lat 95.00th-qrtle-16       87.00 (   0.00%)       87.00 (   0.00%)
      Lat 99.00th-qrtle-16       95.00 (   0.00%)       96.00 (  -1.05%)
      Lat 99.50th-qrtle-16       99.00 (   0.00%)      103.00 (  -4.04%)
      Lat 99.90th-qrtle-16      104.00 (   0.00%)      122.00 ( -17.31%)
      Lat 50.00th-qrtle-32       71.00 (   0.00%)       73.00 (  -2.82%)
      Lat 75.00th-qrtle-32       91.00 (   0.00%)       92.00 (  -1.10%)
      Lat 90.00th-qrtle-32      108.00 (   0.00%)      107.00 (   0.93%)
      Lat 95.00th-qrtle-32      118.00 (   0.00%)      115.00 (   2.54%)
      Lat 99.00th-qrtle-32      134.00 (   0.00%)      129.00 (   3.73%)
      Lat 99.50th-qrtle-32      138.00 (   0.00%)      133.00 (   3.62%)
      Lat 99.90th-qrtle-32      149.00 (   0.00%)      146.00 (   2.01%)
      Lat 50.00th-qrtle-39       83.00 (   0.00%)       81.00 (   2.41%)
      Lat 75.00th-qrtle-39      105.00 (   0.00%)      102.00 (   2.86%)
      Lat 90.00th-qrtle-39      120.00 (   0.00%)      119.00 (   0.83%)
      Lat 95.00th-qrtle-39      129.00 (   0.00%)      128.00 (   0.78%)
      Lat 99.00th-qrtle-39      153.00 (   0.00%)      149.00 (   2.61%)
      Lat 99.50th-qrtle-39      166.00 (   0.00%)      156.00 (   6.02%)
      Lat 99.90th-qrtle-39    12304.00 (   0.00%)    12848.00 (  -4.42%)
      
      When heavily loaded (e.g. 99.50th-qrtle-39 indicates 39 threads), there
      are small gains in many cases. Otherwise it depends on the quartile used
      where it can be bad -- e.g. 75.00th-qrtle-2. However, even these results
      are probably a co-incidence. For this workload, much depends on what node
      the threads get placed on and their relative locality and not wakeups from
      interrupt context. A larger component on how it behaves would be automatic
      NUMA balancing where a fault incurred to measure locality would be a much
      larger contributer to latency than the wakeup path.
      
      This is the results from an almost identical machine that happened to run
      the same test.  They only differ in terms of storage which is irrelevant
      for this test.
      
                                       4.15.0-rc3             4.15.0-rc3
                                          vanilla             noirq-v1r1
      Lat 50.00th-qrtle-1        41.00 (   0.00%)       41.00 (   0.00%)
      Lat 75.00th-qrtle-1        42.00 (   0.00%)       42.00 (   0.00%)
      Lat 90.00th-qrtle-1        44.00 (   0.00%)       43.00 (   2.27%)
      Lat 95.00th-qrtle-1        53.00 (   0.00%)       45.00 (  15.09%)
      Lat 99.00th-qrtle-1        59.00 (   0.00%)       58.00 (   1.69%)
      Lat 99.50th-qrtle-1        60.00 (   0.00%)       59.00 (   1.67%)
      Lat 99.90th-qrtle-1        86.00 (   0.00%)       61.00 (  29.07%)
      Lat 50.00th-qrtle-2        52.00 (   0.00%)       41.00 (  21.15%)
      Lat 75.00th-qrtle-2        57.00 (   0.00%)       46.00 (  19.30%)
      Lat 90.00th-qrtle-2        60.00 (   0.00%)       53.00 (  11.67%)
      Lat 95.00th-qrtle-2        62.00 (   0.00%)       57.00 (   8.06%)
      Lat 99.00th-qrtle-2        73.00 (   0.00%)       68.00 (   6.85%)
      Lat 99.50th-qrtle-2        74.00 (   0.00%)       71.00 (   4.05%)
      Lat 99.90th-qrtle-2        90.00 (   0.00%)       75.00 (  16.67%)
      Lat 50.00th-qrtle-4        57.00 (   0.00%)       52.00 (   8.77%)
      Lat 75.00th-qrtle-4        60.00 (   0.00%)       58.00 (   3.33%)
      Lat 90.00th-qrtle-4        62.00 (   0.00%)       62.00 (   0.00%)
      Lat 95.00th-qrtle-4        65.00 (   0.00%)       65.00 (   0.00%)
      Lat 99.00th-qrtle-4        76.00 (   0.00%)       75.00 (   1.32%)
      Lat 99.50th-qrtle-4        77.00 (   0.00%)       77.00 (   0.00%)
      Lat 99.90th-qrtle-4        87.00 (   0.00%)       81.00 (   6.90%)
      Lat 50.00th-qrtle-8        59.00 (   0.00%)       57.00 (   3.39%)
      Lat 75.00th-qrtle-8        63.00 (   0.00%)       62.00 (   1.59%)
      Lat 90.00th-qrtle-8        66.00 (   0.00%)       67.00 (  -1.52%)
      Lat 95.00th-qrtle-8        68.00 (   0.00%)       70.00 (  -2.94%)
      Lat 99.00th-qrtle-8        79.00 (   0.00%)       80.00 (  -1.27%)
      Lat 99.50th-qrtle-8        80.00 (   0.00%)       84.00 (  -5.00%)
      Lat 99.90th-qrtle-8        84.00 (   0.00%)       90.00 (  -7.14%)
      Lat 50.00th-qrtle-16       65.00 (   0.00%)       65.00 (   0.00%)
      Lat 75.00th-qrtle-16       77.00 (   0.00%)       75.00 (   2.60%)
      Lat 90.00th-qrtle-16       84.00 (   0.00%)       83.00 (   1.19%)
      Lat 95.00th-qrtle-16       88.00 (   0.00%)       87.00 (   1.14%)
      Lat 99.00th-qrtle-16       97.00 (   0.00%)       96.00 (   1.03%)
      Lat 99.50th-qrtle-16      100.00 (   0.00%)      104.00 (  -4.00%)
      Lat 99.90th-qrtle-16      110.00 (   0.00%)      126.00 ( -14.55%)
      Lat 50.00th-qrtle-32       70.00 (   0.00%)       71.00 (  -1.43%)
      Lat 75.00th-qrtle-32       92.00 (   0.00%)       94.00 (  -2.17%)
      Lat 90.00th-qrtle-32      110.00 (   0.00%)      110.00 (   0.00%)
      Lat 95.00th-qrtle-32      121.00 (   0.00%)      118.00 (   2.48%)
      Lat 99.00th-qrtle-32      135.00 (   0.00%)      137.00 (  -1.48%)
      Lat 99.50th-qrtle-32      140.00 (   0.00%)      146.00 (  -4.29%)
      Lat 99.90th-qrtle-32      150.00 (   0.00%)      160.00 (  -6.67%)
      Lat 50.00th-qrtle-39       80.00 (   0.00%)       71.00 (  11.25%)
      Lat 75.00th-qrtle-39      102.00 (   0.00%)       91.00 (  10.78%)
      Lat 90.00th-qrtle-39      118.00 (   0.00%)      108.00 (   8.47%)
      Lat 95.00th-qrtle-39      128.00 (   0.00%)      117.00 (   8.59%)
      Lat 99.00th-qrtle-39      149.00 (   0.00%)      133.00 (  10.74%)
      Lat 99.50th-qrtle-39      160.00 (   0.00%)      139.00 (  13.12%)
      Lat 99.90th-qrtle-39    13808.00 (   0.00%)     4920.00 (  64.37%)
      
      Despite being nearly identical, it showed a variety of major gains so
      I'm not convinced that heavy emphasis should be placed on this particular
      workload in terms of evaluating this particular patch. Further evidence of
      this is the fact that testing on a UMA machine showed small gains/losses
      even though the patch should be a no-op on UMA.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171219085947.13136-2-mgorman@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7332dec0
    • J
      sched/fair: Correct obsolete comment about cpufreq_update_util() · 9783be2c
      Joel Fernandes 提交于
      Since the remote cpufreq callback work, the cpufreq_update_util() call can happen
      from remote CPUs. The comment about local CPUs is thus obsolete. Update it
      accordingly.
      Signed-off-by: NJoel Fernandes <joelaf@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NViresh Kumar <viresh.kumar@linaro.org>
      Cc: Android Kernel <kernel-team@android.com>
      Cc: Atish Patra <atish.patra@oracle.com>
      Cc: Chris Redpath <Chris.Redpath@arm.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: EAS Dev <eas-dev@lists.linaro.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Ramussen <morten.rasmussen@arm.com>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Rohit Jain <rohit.k.jain@oracle.com>
      Cc: Saravana Kannan <skannan@quicinc.com>
      Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vikram Mulukutla <markivx@codeaurora.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20171215153944.220146-2-joelaf@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9783be2c
    • J
      sched/fair: Remove impossible condition from find_idlest_group_cpu() · 18cec7e0
      Joel Fernandes 提交于
      find_idlest_group_cpu() goes through CPUs of a group previous selected by
      find_idlest_group(). find_idlest_group() returns NULL if the local group is the
      selected one and doesn't execute find_idlest_group_cpu if the group to which
      'cpu' belongs to is chosen. So we're always guaranteed to call
      find_idlest_group_cpu() with a group to which 'cpu' is non-local.
      
      This makes one of the conditions in find_idlest_group_cpu() an impossible one,
      which we can get rid off.
      Signed-off-by: NJoel Fernandes <joelaf@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NBrendan Jackman <brendan.jackman@arm.com>
      Reviewed-by: NVincent Guittot <vincent.guittot@linaro.org>
      Cc: Android Kernel <kernel-team@android.com>
      Cc: Atish Patra <atish.patra@oracle.com>
      Cc: Chris Redpath <Chris.Redpath@arm.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: EAS Dev <eas-dev@lists.linaro.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Ramussen <morten.rasmussen@arm.com>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Rohit Jain <rohit.k.jain@oracle.com>
      Cc: Saravana Kannan <skannan@quicinc.com>
      Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vikram Mulukutla <markivx@codeaurora.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20171215153944.220146-3-joelaf@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      18cec7e0
    • V
      sched/cpufreq: Don't pass flags to sugov_set_iowait_boost() · 5083452f
      Viresh Kumar 提交于
      We are already passing sg_cpu as argument to sugov_set_iowait_boost()
      helper and the same can be used to retrieve the flags value. Get rid of
      the redundant argument.
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael Wysocki <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: dietmar.eggemann@arm.com
      Cc: joelaf@google.com
      Cc: juri.lelli@redhat.com
      Cc: morten.rasmussen@arm.com
      Cc: tkjos@android.com
      Link: http://lkml.kernel.org/r/4ec5562b1a87e146ebab11fb5dde1ca9c763a7fb.1513158452.git.viresh.kumar@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5083452f
    • V
      sched/cpufreq: Initialize sg_cpu->flags to 0 · 6257e704
      Viresh Kumar 提交于
      Initializing sg_cpu->flags to SCHED_CPUFREQ_RT has no obvious benefit.
      The flags field wouldn't be used until the utilization update handler is
      called for the first time, and once that is called we will overwrite
      flags anyway.
      
      Initialize it to 0.
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NJuri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael Wysocki <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: dietmar.eggemann@arm.com
      Cc: joelaf@google.com
      Cc: morten.rasmussen@arm.com
      Cc: tkjos@android.com
      Link: http://lkml.kernel.org/r/763feda6424ced8486b25a0c52979634e6104478.1513158452.git.viresh.kumar@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6257e704
    • J
      sched/fair: Consider RT/IRQ pressure in capacity_spare_wake() · f453ae22
      Joel Fernandes 提交于
      capacity_spare_wake() in the slow path influences choice of idlest groups,
      as we search for groups with maximum spare capacity. In scenarios where
      RT pressure is high, a sub optimal group can be chosen and hurt
      performance of the task being woken up.
      
      Fix this by using capacity_of() instead of capacity_orig_of() in capacity_spare_wake().
      
      Tests results from improvements with this change are below. More tests
      were also done by myself and Matt Fleming to ensure no degradation in
      different benchmarks.
      
      1) Rohit ran barrier.c test (details below) with following improvements:
      ------------------------------------------------------------------------
      This was Rohit's original use case for a patch he posted at [1] however
      from his recent tests he showed my patch can replace his slow path
      changes [1] and there's no need to selectively scan/skip CPUs in
      find_idlest_group_cpu in the slow path to get the improvement he sees.
      
      barrier.c (open_mp code) as a micro-benchmark. It does a number of
      iterations and barrier sync at the end of each for loop.
      
      Here barrier,c is running in along with ping on CPU 0 and 1 as:
      'ping -l 10000 -q -s 10 -f hostX'
      
      barrier.c can be found at:
      http://www.spinics.net/lists/kernel/msg2506955.html
      
      Following are the results for the iterations per second with this
      micro-benchmark (higher is better), on a 44 core, 2 socket 88 Threads
      Intel x86 machine:
      +--------+------------------+---------------------------+
      |Threads | Without patch    | With patch                |
      |        |                  |                           |
      +--------+--------+---------+-----------------+---------+
      |        | Mean   | Std Dev | Mean            | Std Dev |
      +--------+--------+---------+-----------------+---------+
      |1       | 539.36 | 60.16   | 572.54 (+6.15%) | 40.95   |
      |2       | 481.01 | 19.32   | 530.64 (+10.32%)| 56.16   |
      |4       | 474.78 | 22.28   | 479.46 (+0.99%) | 18.89   |
      |8       | 450.06 | 24.91   | 447.82 (-0.50%) | 12.36   |
      |16      | 436.99 | 22.57   | 441.88 (+1.12%) | 7.39    |
      |32      | 388.28 | 55.59   | 429.4  (+10.59%)| 31.14   |
      |64      | 314.62 | 6.33    | 311.81 (-0.89%) | 11.99   |
      +--------+--------+---------+-----------------+---------+
      
      2) ping+hackbench test on bare-metal sever (by Rohit)
      -----------------------------------------------------
      Here hackbench is running in threaded mode along
      with, running ping on CPU 0 and 1 as:
      'ping -l 10000 -q -s 10 -f hostX'
      
      This test is running on 2 socket, 20 core and 40 threads Intel x86
      machine:
      Number of loops is 10000 and runtime is in seconds (Lower is better).
      
      +--------------+-----------------+--------------------------+
      |Task Groups   | Without patch   |  With patch              |
      |              +-------+---------+----------------+---------+
      |(Groups of 40)| Mean  | Std Dev |  Mean          | Std Dev |
      +--------------+-------+---------+----------------+---------+
      |1             | 0.851 | 0.007   |  0.828 (+2.77%)| 0.032   |
      |2             | 1.083 | 0.203   |  1.087 (-0.37%)| 0.246   |
      |4             | 1.601 | 0.051   |  1.611 (-0.62%)| 0.055   |
      |8             | 2.837 | 0.060   |  2.827 (+0.35%)| 0.031   |
      |16            | 5.139 | 0.133   |  5.107 (+0.63%)| 0.085   |
      |25            | 7.569 | 0.142   |  7.503 (+0.88%)| 0.143   |
      +--------------+-------+---------+----------------+---------+
      
      [1] https://patchwork.kernel.org/patch/9991635/
      
      Matt Fleming also ran several different hackbench tests and cyclic test
      to santiy-check that the patch doesn't harm other usecases.
      Tested-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Tested-by: NRohit Jain <rohit.k.jain@oracle.com>
      Signed-off-by: NJoel Fernandes <joelaf@google.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NVincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Atish Patra <atish.patra@oracle.com>
      Cc: Brendan Jackman <brendan.jackman@arm.com>
      Cc: Chris Redpath <Chris.Redpath@arm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Ramussen <morten.rasmussen@arm.com>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Saravana Kannan <skannan@quicinc.com>
      Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vikram Mulukutla <markivx@codeaurora.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20171214212158.188190-1-joelaf@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f453ae22
    • P
      sched/fair: Use 'unsigned long' for utilization, consistently · f01415fd
      Patrick Bellasi 提交于
      Utilization and capacity are tracked as 'unsigned long', however some
      functions using them return an 'int' which is ultimately assigned back to
      'unsigned long' variables.
      
      Since there is not scope on using a different and signed type,
      consolidate the signature of functions returning utilization to always
      use the native type.
      
      This change improves code consistency, and it also benefits
      code paths where utilizations should be clamped by avoiding
      further type conversions or ugly type casts.
      Signed-off-by: NPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NChris Redpath <chris.redpath@arm.com>
      Reviewed-by: NBrendan Jackman <brendan.jackman@arm.com>
      Reviewed-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@android.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: http://lkml.kernel.org/r/20171205171018.9203-2-patrick.bellasi@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f01415fd
    • R
      sched/core: Rework and clarify prepare_lock_switch() · 31cb1bc0
      rodrigosiqueira 提交于
      The prepare_lock_switch() function has an unused parameter, and also the
      function name was not descriptive. To improve readability and remove
      the extra parameter, do the following changes:
      
      * Move prepare_lock_switch() from kernel/sched/sched.h to
        kernel/sched/core.c, rename it to prepare_task(), and remove the
        unused parameter.
      
      * Split the smp_store_release() out from finish_lock_switch() to a
        function named finish_task.
      
      * Comments ajdustments.
      Signed-off-by: NRodrigo Siqueira <rodrigosiqueiramelo@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171215140603.gxe5i2y6fg5ojfpp@smtp.gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      31cb1bc0
    • M
      membarrier: Disable preemption when calling smp_call_function_many() · 54167607
      Mathieu Desnoyers 提交于
      smp_call_function_many() requires disabling preemption around the call.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: <stable@vger.kernel.org> # v4.14+
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171215192310.25293-1-mathieu.desnoyers@efficios.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      54167607
  5. 09 1月, 2018 1 次提交
    • I
      locking/lockdep: Remove cross-release leftovers · 527187d2
      Ingo Molnar 提交于
      There's two cross-release leftover facilities:
      
       - the crossrelease_hist_*() irq-tracing callbacks (NOPs currently)
       - the complete_release_commit() callback (NOP as well)
      
      Remove them.
      
      Cc: David Sterba <dsterba@suse.com>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      527187d2
  6. 28 12月, 2017 1 次提交
  7. 15 12月, 2017 1 次提交
    • S
      sched/rt: Do not pull from current CPU if only one CPU to pull · f73c52a5
      Steven Rostedt 提交于
      Daniel Wagner reported a crash on the BeagleBone Black SoC.
      
      This is a single CPU architecture, and does not have a functional
      arch_send_call_function_single_ipi() implementation which can crash
      the kernel if that is called.
      
      As it only has one CPU, it shouldn't be called, but if the kernel is
      compiled for SMP, the push/pull RT scheduling logic now calls it for
      irq_work if the one CPU is overloaded, it can use that function to call
      itself and crash the kernel.
      
      Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
      only has a single CPU. But SCHED_FEAT is a constant if sched debugging
      is turned off. Another fix can also be used, and this should also help
      with normal SMP machines. That is, do not initiate the pull code if
      there's only one RT overloaded CPU, and that CPU happens to be the
      current CPU that is scheduling in a lower priority task.
      
      Even on a system with many CPUs, if there's many RT tasks waiting to
      run on a single CPU, and that CPU schedules in another RT task of lower
      priority, it will initiate the PULL logic in case there's a higher
      priority RT task on another CPU that is waiting to run. But if there is
      no other CPU with waiting RT tasks, it will initiate the RT pull logic
      on itself (as it still has RT tasks waiting to run). This is a wasted
      effort.
      
      Not only does this help with SMP code where the current CPU is the only
      one with RT overloaded tasks, it should also solve the issue that
      Daniel encountered, because it will prevent the PULL logic from
      executing, as there's only one CPU on the system, and the check added
      here will cause it to exit the RT pull code.
      Reported-by: NDaniel Wagner <wagi@monom.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-rt-users <linux-rt-users@vger.kernel.org>
      Cc: stable@vger.kernel.org
      Fixes: 4bdced5c ("sched/rt: Simplify the IPI based RT balancing logic")
      Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.homeSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f73c52a5
  8. 11 12月, 2017 1 次提交
  9. 08 12月, 2017 1 次提交
  10. 07 12月, 2017 2 次提交
    • V
      sched/fair: Update and fix the runnable propagation rule · a4c3c049
      Vincent Guittot 提交于
      Unlike running, the runnable part can't be directly propagated through
      the hierarchy when we migrate a task. The main reason is that runnable
      time can be shared with other sched_entities that stay on the rq and
      this runnable time will also remain on prev cfs_rq and must not be
      removed.
      
      Instead, we can estimate what should be the new runnable of the prev
      cfs_rq and check that this estimation stay in a possible range. The
      prop_runnable_sum is a good estimation when adding runnable_sum but
      fails most often when we remove it. Instead, we could use the formula
      below instead:
      
        gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight
      
      which assumes that tasks are equally runnable which is not true but
      easy to compute.
      
      Beside these estimates, we have several simple rules that help us to filter
      out wrong ones:
      
       - ge->avg.runnable_sum <= than LOAD_AVG_MAX
       - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
       - ge->avg.runnable_sum can't increase when we detach a task
      
      The effect of these fixes is better cgroups balancing.
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Link: http://lkml.kernel.org/r/1510842112-21028-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a4c3c049
    • O
      sched/wait: Fix add_wait_queue() behavioral change · c6b9d9a3
      Omar Sandoval 提交于
      The following cleanup commit:
      
        50816c48 ("sched/wait: Standardize internal naming of wait-queue entries")
      
      ... unintentionally changed the behavior of add_wait_queue() from
      inserting the wait entry at the head of the wait queue to the tail
      of the wait queue.
      
      Beyond a negative performance impact this change in behavior
      theoretically also breaks wait queues which mix exclusive and
      non-exclusive waiters, as non-exclusive waiters will not be
      woken up if they are queued behind enough exclusive waiters.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team@fb.com
      Fixes: ("sched/wait: Standardize internal naming of wait-queue entries")
      Link: http://lkml.kernel.org/r/a16c8ccffd39bd08fdaa45a5192294c784b803a7.1512544324.git.osandov@fb.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c6b9d9a3
  11. 05 12月, 2017 1 次提交
  12. 29 11月, 2017 2 次提交
    • P
      sched: Stop switched_to_rt() from sending IPIs to offline CPUs · 2fe25826
      Paul E. McKenney 提交于
      The rcutorture test suite occasionally provokes a splat due to invoking
      rt_mutex_lock() which needs to boost the priority of a task currently
      sitting on a runqueue that belongs to an offline CPU:
      
      WARNING: CPU: 0 PID: 12 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
      Modules linked in:
      CPU: 0 PID: 12 Comm: rcub/7 Not tainted 4.14.0-rc4+ #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      task: ffff9ed3de5f8cc0 task.stack: ffffbbf80012c000
      RIP: 0010:native_smp_send_reschedule+0x37/0x40
      RSP: 0018:ffffbbf80012fd10 EFLAGS: 00010082
      RAX: 000000000000002f RBX: ffff9ed3dd9cb300 RCX: 0000000000000004
      RDX: 0000000080000004 RSI: 0000000000000086 RDI: 00000000ffffffff
      RBP: ffffbbf80012fd10 R08: 000000000009da7a R09: 0000000000007b9d
      R10: 0000000000000001 R11: ffffffffbb57c2cd R12: 000000000000000d
      R13: ffff9ed3de5f8cc0 R14: 0000000000000061 R15: ffff9ed3ded59200
      FS:  0000000000000000(0000) GS:ffff9ed3dea00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000080686f0 CR3: 000000001b9e0000 CR4: 00000000000006f0
      Call Trace:
       resched_curr+0x61/0xd0
       switched_to_rt+0x8f/0xa0
       rt_mutex_setprio+0x25c/0x410
       task_blocks_on_rt_mutex+0x1b3/0x1f0
       rt_mutex_slowlock+0xa9/0x1e0
       rt_mutex_lock+0x29/0x30
       rcu_boost_kthread+0x127/0x3c0
       kthread+0x104/0x140
       ? rcu_report_unblock_qs_rnp+0x90/0x90
       ? kthread_create_on_node+0x40/0x40
       ret_from_fork+0x22/0x30
      Code: f0 00 0f 92 c0 84 c0 74 14 48 8b 05 34 74 c5 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 a0 c6 fc b9 e8 d5 b5 06 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 a2 d1 13 02 85 c0 75 38 55 48
      
      But the target task's priority has already been adjusted, so the only
      purpose of switched_to_rt() invoking resched_curr() is to wake up the
      CPU running some task that needs to be preempted by the boosted task.
      But the CPU is offline, which presumably means that the task must be
      migrated to some other CPU, and that this other CPU will undertake any
      needed preemption at the time of migration.  Because the runqueue lock
      is held when resched_curr() is invoked, we know that the boosted task
      cannot go anywhere, so it is not necessary to invoke resched_curr()
      in this particular case.
      
      This commit therefore makes switched_to_rt() refrain from invoking
      resched_curr() when the target CPU is offline.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      2fe25826
    • P
      sched: Stop resched_cpu() from sending IPIs to offline CPUs · a0982dfa
      Paul E. McKenney 提交于
      The rcutorture test suite occasionally provokes a splat due to invoking
      resched_cpu() on an offline CPU:
      
      WARNING: CPU: 2 PID: 8 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
      Modules linked in:
      CPU: 2 PID: 8 Comm: rcu_preempt Not tainted 4.14.0-rc4+ #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      task: ffff902ede9daf00 task.stack: ffff96c50010c000
      RIP: 0010:native_smp_send_reschedule+0x37/0x40
      RSP: 0018:ffff96c50010fdb8 EFLAGS: 00010096
      RAX: 000000000000002e RBX: ffff902edaab4680 RCX: 0000000000000003
      RDX: 0000000080000003 RSI: 0000000000000000 RDI: 00000000ffffffff
      RBP: ffff96c50010fdb8 R08: 0000000000000000 R09: 0000000000000001
      R10: 0000000000000000 R11: 00000000299f36ae R12: 0000000000000001
      R13: ffffffff9de64240 R14: 0000000000000001 R15: ffffffff9de64240
      FS:  0000000000000000(0000) GS:ffff902edfc80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000f7d4c642 CR3: 000000001e0e2000 CR4: 00000000000006e0
      Call Trace:
       resched_curr+0x8f/0x1c0
       resched_cpu+0x2c/0x40
       rcu_implicit_dynticks_qs+0x152/0x220
       force_qs_rnp+0x147/0x1d0
       ? sync_rcu_exp_select_cpus+0x450/0x450
       rcu_gp_kthread+0x5a9/0x950
       kthread+0x142/0x180
       ? force_qs_rnp+0x1d0/0x1d0
       ? kthread_create_on_node+0x40/0x40
       ret_from_fork+0x27/0x40
      Code: 14 01 0f 92 c0 84 c0 74 14 48 8b 05 14 4f f4 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 38 89 ca 9d e8 e5 56 08 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 52 9e 37 02 85 c0 75 38 55 48
      ---[ end trace 26df9e5df4bba4ac ]---
      
      This splat cannot be generated by expedited grace periods because they
      always invoke resched_cpu() on the current CPU, which is good because
      expedited grace periods require that resched_cpu() unconditionally
      succeed.  However, other parts of RCU can tolerate resched_cpu() acting
      as a no-op, at least as long as it doesn't happen too often.
      
      This commit therefore makes resched_cpu() invoke resched_curr() only if
      the CPU is either online or is the current CPU.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      a0982dfa