1. 04 May, 2018 1 commit
    • sched/core: Introduce set_special_state() · b5bf9a90
      Peter Zijlstra committed
      Gaurav reported a perceived problem with TASK_PARKED, which turned out
      to be a broken wait-loop pattern in __kthread_parkme(), but the
      reported issue can (and does) in fact happen for states that do not do
      condition based sleeps.
      
      When the 'current->state = TASK_RUNNING' store of a previous
      (concurrent) try_to_wake_up() collides with the setting of a 'special'
      sleep state, we can lose the sleep state.
      
      Normal condition based wait-loops are immune to this problem, but
      sleep states that are not condition based are subject to it.
      
      There already is a fix for TASK_DEAD. Abstract that and also apply it
      to TASK_STOPPED and TASK_TRACED, both of which are also without
      condition based wait-loop.
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
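The idea behind set_special_state() can be sketched in userspace. This is a minimal model, not the kernel code: the `struct task`, the `atomic_flag` standing in for the task's pi_lock, the helper names `task_init()`, `task_lock()` and `task_unlock()`, and the state values are all assumptions made for illustration. The point it shows is that the special-state store and the wakeup's TASK_RUNNING store are serialized by the same lock, so a late store from an earlier wakeup cannot overwrite the special state.

```c
#include <stdatomic.h>

/* Illustrative state bits; the values are assumptions, not the kernel's. */
enum { TASK_RUNNING = 0, TASK_INTERRUPTIBLE = 1, TASK_STOPPED = 4 };

struct task {
    atomic_flag pi_lock;    /* stand-in for the kernel's per-task pi_lock */
    int state;
};

static void task_init(struct task *t, int state)
{
    atomic_flag_clear(&t->pi_lock);
    t->state = state;
}

/* Trivial spinlock built on atomic_flag. */
static void task_lock(struct task *t)
{
    while (atomic_flag_test_and_set(&t->pi_lock)) {
        /* spin until the holder releases the lock */
    }
}

static void task_unlock(struct task *t)
{
    atomic_flag_clear(&t->pi_lock);
}

/* Store a non-condition-based sleep state under the same lock the waker uses. */
static void set_special_state(struct task *t, int state)
{
    task_lock(t);
    t->state = state;
    task_unlock(t);
}

/* Wake the task only if its current state matches the wakeup mask. */
static int try_to_wake_up(struct task *t, int mask)
{
    int woken = 0;

    task_lock(t);
    if (t->state & mask) {
        t->state = TASK_RUNNING;
        woken = 1;
    }
    task_unlock(t);
    return woken;
}
```

In this model a wakeup aimed at TASK_INTERRUPTIBLE leaves a task in TASK_STOPPED untouched, because the state check and the TASK_RUNNING store happen atomically with respect to set_special_state().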
  2. 03 May, 2018 1 commit
    • kthread, sched/wait: Fix kthread_parkme() completion issue · 85f1abe0
      Peter Zijlstra committed
      Even with the wait-loop fixed, there is a further issue with
      kthread_parkme(). Upon hotplug, when we do takedown_cpu(),
      smpboot_park_threads() can return before all those threads are in fact
      blocked, due to the placement of the complete() in __kthread_parkme().
      
      When that happens, sched_cpu_dying() -> migrate_tasks() can end up
      migrating such a still runnable task onto another CPU.
      
      Normally the task will have hit schedule() and gone to sleep by the
      time we do kthread_unpark(), which will then do __kthread_bind() to
      re-bind the task to the correct CPU.
      
      However, when we lose the initial TASK_PARKED store to the concurrent
      wakeup issue described previously, do the complete(), get migrated, it
      is possible to either:
      
       - observe kthread_unpark()'s clearing of SHOULD_PARK and terminate
         the park and set TASK_RUNNING, or
      
       - have __kthread_bind()'s wait_task_inactive() observe the competing
         TASK_RUNNING store.
      
      Either way the WARN() in __kthread_bind() will trigger and fail to
      correctly set the CPU affinity.
      
      Fix this by only issuing the complete() when the kthread has scheduled
      out. This does away with all the icky 'still running' nonsense.
      
      The alternative is to promote TASK_PARKED to a special state. This
      guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
      and we'll end up doing the right thing, but it preserves the whole
      icky business of potentially migrating the still runnable thing.
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 06 Apr, 2018 1 commit
  4. 05 Apr, 2018 1 commit
  5. 03 Apr, 2018 1 commit
  6. 27 Mar, 2018 1 commit
  7. 09 Mar, 2018 4 commits
  8. 04 Mar, 2018 2 commits
    • sched/core: Undefine tracepoint creation at the end of core.c · 14a7405b
      Ingo Molnar committed
      Make it easier to concatenate all the scheduler .c files for single-module
      compilation.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/headers: Simplify and clean up header usage in the scheduler · 325ea10c
      Ingo Molnar committed
      Do the following cleanups and simplifications:
      
       - sched/sched.h already includes <asm/paravirt.h>, so no need to
         include it in sched/core.c again.
      
       - order the <linux/sched/*.h> headers alphabetically
      
       - add all <linux/sched/*.h> headers to kernel/sched/sched.h
      
       - remove all unnecessary includes from the .c files that
         are already included in kernel/sched/sched.h.
      
      Finally, make all scheduler .c files use a single common header:
      
        #include "sched.h"
      
      ... which now contains a union of the relied upon headers.
      
      This makes the various .c files easier to read and easier to handle.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 03 Mar, 2018 1 commit
    • sched: Clean up and harmonize the coding style of the scheduler code base · 97fb7a0a
      Ingo Molnar committed
      A good number of small style inconsistencies have accumulated
      in the scheduler core, so do a pass over them to harmonize
      all these details:
      
       - fix spelling in comments,
      
       - use curly braces for multi-line statements,
      
       - remove unnecessary parentheses from integer literals,
      
       - capitalize consistently,
      
       - remove stray newlines,
      
       - add comments where necessary,
      
       - remove invalid/unnecessary comments,
      
       - align structure definitions and other data types vertically,
      
       - add missing newlines for increased readability,
      
       - fix vertical tabulation where it's misaligned,
      
       - harmonize preprocessor conditional block labeling
         and vertical alignment,
      
       - remove line-breaks where they uglify the code,
      
       - add newline after local variable definitions.
      
      No change in functionality:
      
        md5:
           1191fa0a890cfa8132156d2959d7e9e2  built-in.o.before.asm
           1191fa0a890cfa8132156d2959d7e9e2  built-in.o.after.asm
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 21 Feb, 2018 3 commits
  11. 13 Feb, 2018 2 commits
    • sched/core: Fix DEBUG_SPINLOCK annotation for rq->lock · 269d5992
      Peter Zijlstra committed
      Mark noticed that he had sporadic "spinlock recursion" warnings from
      the DEBUG_SPINLOCK code. Now rq->lock is special in that the owner
      changes in the middle of a context switch.
      
      It so happens that we fix up the lock.owner too late: @prev can run
      (remotely) the moment prev->on_cpu is cleared, which then allows @prev
      to again try and acquire this rq->lock and trigger this warning.
      
      So we have to switch lock.owner before clearing prev->on_cpu.
      
      Do this by moving the DEBUG_SPINLOCK annotation from after switch_to()
      to before switch_to() and collect all lockdep annotations there into
      prepare_lock_switch() to mirror the existing finish_lock_switch().
      Debugged-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched, cgroup: Don't reject lower cpu.max on ancestors · c53593e5
      Tejun Heo committed
      While adding cgroup2 interface for the cpu controller, 0d593634
      ("sched: Implement interface for cgroup unified hierarchy") forgot to
      update input validation and left it to reject cpu.max config if any
      descendant has set a higher value.
      
      cgroup2 officially supports delegation and a descendant must not be
      able to restrict what its ancestors can configure.  For absolute
      limits such as cpu.max and memory.max, this means that the config at
      each level should only act as the upper limit at that level and
      shouldn't interfere with what other cgroups can configure.
      
      This patch updates config validation on cgroup2 so that the cpu
      controller follows the same convention.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 0d593634 ("sched: Implement interface for cgroup unified hierarchy")
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org # v4.15+
  12. 07 Feb, 2018 1 commit
  13. 06 Feb, 2018 5 commits
    • sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS · 32e839dd
      Mel Gorman committed
      The select_idle_sibling() (SIS) rewrite in commit:
      
        10e2f1ac ("sched/core: Rewrite and improve select_idle_siblings()")
      
      ... replaced a domain iteration with a search that broadly speaking
      does a wrapped walk of the scheduler domain sharing a last-level-cache.
      
      While this had a number of improvements, one consequence is that two tasks
      that share a waker/wakee relationship push each other around a socket. Even
      though two tasks may be active, all cores are evenly used. This is great from
      a search perspective and spreads a load across individual cores, but it has
      adverse consequences for cpufreq. As each CPU has relatively low utilisation,
      cpufreq may decide the utilisation is too low to use a higher P-state and
      overall computation throughput suffers.
      
      While individual cpufreq and cpuidle drivers may compensate by artificially
      boosting P-state (at C0) or avoiding lower C-states (during idle), it does
      not help if hardware-based cpufreq (e.g. HWP) is used.
      
      This patch tracks a recently used CPU based on what CPU a task was running
      on when it last was a waker, or a CPU it was recently using when a task is a
      wakee. During SIS, the recently used CPU is used as a target if it's still
      allowed by the task and is idle.
      
      The benefit may be non-obvious so consider an example of two tasks
      communicating back and forth. Task A may be an application doing IO where
      task B is a kworker or kthread like journald. Task A may issue IO, wake
      B and B wakes up A on completion.  With the existing scheme this may look
      like the following (potentially different IDs if SMT is in use but a similar
      principle applies).
      
       A (cpu 0)	wake	B (wakes on cpu 1)
       B (cpu 1)	wake	A (wakes on cpu 2)
       A (cpu 2)	wake	B (wakes on cpu 3)
       etc.
      
      A careful reader may wonder why CPU 0 was not idle when B wakes A the
      first time and it's simply due to the fact that A can be rescheduled to
      another CPU and the pattern is that prev == target when B tries to wakeup A
      and the information about CPU 0 has been lost.
      
      With this patch, the pattern is more likely to be:
      
       A (cpu 0)	wake	B (wakes on cpu 1)
       B (cpu 1)	wake	A (wakes on cpu 0)
       A (cpu 0)	wake	B (wakes on cpu 1)
       etc
      
      i.e. two communicating tasks are more likely to use just two cores instead
      of all available cores sharing a LLC.
      
      The most dramatic speedup was noticed on dbench using the XFS filesystem on
      UMA as clients interact heavily with workqueues in that configuration. Note
      that a similar speedup is not observed on ext4 as the wakeup pattern
      is different:
      
                                4.15.0-rc9             4.15.0-rc9
                                 waprev-v1        biasancestor-v1
       Hmean      1      287.54 (   0.00%)      817.01 ( 184.14%)
       Hmean      2     1268.12 (   0.00%)     1781.24 (  40.46%)
       Hmean      4     1739.68 (   0.00%)     1594.47 (  -8.35%)
       Hmean      8     2464.12 (   0.00%)     2479.56 (   0.63%)
       Hmean     64     1455.57 (   0.00%)     1434.68 (  -1.44%)
      
      The results can be less dramatic on NUMA where automatic balancing interferes
      with the test. Network benchmarks running on localhost are also known to
      benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
      and TCP depending on the machine). Hackbench also sees small improvements
      (6-11% depending on machine and thread count). The facebook schbench was also
      tested but in most cases showed little or no difference to wakeup latencies.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
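The selection rule described in the message (prefer the recently used CPU when it is idle and still allowed) can be sketched as follows. This is a simplified model under stated assumptions: `struct task`, the `cpu_idle[]` array, the bitmask in `cpus_allowed`, and the function name `pick_wakeup_cpu()` are hypothetical, and the real select_idle_sibling() performs further checks that are omitted here.

```c
#include <stdbool.h>

#define NR_CPUS 4

/* Hypothetical, simplified task: only what the sketch needs. */
struct task {
    int recent_used_cpu;          /* CPU the task was recently running on */
    unsigned int cpus_allowed;    /* bit N set => task may run on CPU N */
};

/* Hypothetical idle map; the kernel queries the runqueue instead. */
static bool cpu_idle[NR_CPUS];

/*
 * Return the recently used CPU if it differs from both prev and target,
 * is idle, and is still allowed for the task; otherwise keep target.
 */
static int pick_wakeup_cpu(const struct task *p, int prev, int target)
{
    int recent = p->recent_used_cpu;

    if (recent >= 0 && recent < NR_CPUS &&
        recent != prev && recent != target &&
        cpu_idle[recent] &&
        (p->cpus_allowed & (1u << recent)))
        return recent;

    return target;
}
```

With task A last seen on CPU 0 and CPU 0 idle, a wakeup from CPU 1 that would otherwise target CPU 2 lands back on CPU 0, which matches the two-core ping-pong pattern in the message.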
    • sched/core: Optimize ttwu_stat() · b85c8b71
      Peter Zijlstra committed
      The whole of ttwu_stat() is guarded by a single schedstat_enabled(),
      there is absolutely no point in then issuing another static_branch for
      every single schedstat_inc() in there.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
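A rough userspace sketch of the redundancy being removed: once the function is guarded by one enabled-check, each per-counter check is wasted work. The `branch_checks` counter, `struct stats`, and the two `ttwu_stat_*()` variants are assumptions for illustration; in the kernel the check is a static branch rather than a function call, and the macro names here only mirror the kernel's checked and unchecked increment helpers.

```c
#include <stdbool.h>

static bool schedstats_on;             /* models the schedstats switch */
static unsigned long branch_checks;    /* counts how often we test it */

static bool schedstat_enabled(void)
{
    branch_checks++;
    return schedstats_on;
}

/* Checked increment: tests the switch every time. */
#define schedstat_inc(var)   do { if (schedstat_enabled()) (var)++; } while (0)
/* Unchecked increment: caller has already tested the switch. */
#define __schedstat_inc(var) do { (var)++; } while (0)

struct stats {
    unsigned long ttwu_count;
    unsigned long ttwu_local;
};

/* Before: one guard plus one check per counter. */
static void ttwu_stat_checked(struct stats *s)
{
    if (!schedstat_enabled())
        return;
    schedstat_inc(s->ttwu_count);
    schedstat_inc(s->ttwu_local);
}

/* After: the single guard covers every increment. */
static void ttwu_stat_optimized(struct stats *s)
{
    if (!schedstat_enabled())
        return;
    __schedstat_inc(s->ttwu_count);
    __schedstat_inc(s->ttwu_local);
}
```

Both variants update the counters identically; only the number of enabled-checks differs (three versus one in this model).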
    • membarrier: Provide core serializing command, *_SYNC_CORE · 70216e18
      Mathieu Desnoyers committed
      Provide core serializing membarrier command to support memory reclaim
      by JIT.
      
      Each architecture needs to explicitly opt into that support by
      documenting in their architecture code how they provide the core
      serializing instructions required when returning from the membarrier
      IPI, and after the scheduler has updated the curr->mm pointer (before
      going back to user-space). They should then select
      ARCH_HAS_MEMBARRIER_SYNC_CORE to enable support for that command on
      their architecture.
      
      Architectures selecting this feature need to either document that
      they issue core serializing instructions when returning to user-space,
      or implement their architecture-specific sync_core_before_usermode().
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-9-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • membarrier: Document scheduler barrier requirements · 306e0604
      Mathieu Desnoyers committed
      Document the membarrier requirement on having a full memory barrier in
      __schedule() after coming from user-space, before storing to rq->curr.
      It is provided by smp_mb__after_spinlock() in __schedule().
      
      Document that membarrier requires a full barrier on transition from
      kernel thread to userspace thread. We currently have an implicit barrier
      from atomic_dec_and_test() in mmdrop() that ensures this.
      
      The x86 switch_mm_irqs_off() full barrier is currently provided by many
      cpumask update operations as well as write_cr3(). Document that
      write_cr3() provides this barrier.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-4-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • powerpc, membarrier: Skip memory barrier in switch_mm() · 3ccfebed
      Mathieu Desnoyers committed
      Allow PowerPC to skip the full memory barrier in switch_mm(), and
      only issue the barrier when scheduling into a task belonging to a
      process that has registered to use private expedited membarrier
      commands.
      
      Threads targeting the same VM but belonging to different thread
      groups are a tricky case. This has a few consequences:
      
      It turns out that we cannot rely on get_nr_threads(p) to count the
      number of threads using a VM. We can use
      (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
      instead to skip the synchronize_sched() for cases where the VM only has
      a single user, and that user only has a single thread.
      
      It also turns out that we cannot use for_each_thread() to set
      thread flags in all threads using a VM, as it only iterates on the
      thread group.
      
      Therefore, test the membarrier state variable directly rather than
      relying on thread flags. This means
      membarrier_register_private_expedited() needs to set the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
      only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
      private expedited membarrier commands to succeed.
      membarrier_arch_switch_mm() now tests for the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-3-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
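The registration ordering described in the message can be sketched in userspace with C11 atomics. Everything here is a model under assumptions: `struct mm`, the flag bit values, the `synchronize_sched_stub()` counter, and the helper names are illustrative, and the real kernel functions take different arguments. The sketch shows the three steps in order (set the EXPEDITED flag, wait for a grace period unless the VM has a single user with a single thread, then set READY) and that commands fail until READY is set.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative flag bits; treat the exact values as assumptions. */
enum {
    MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY = 1 << 0,
    MEMBARRIER_STATE_PRIVATE_EXPEDITED       = 1 << 1,
};

/* Hypothetical, simplified address-space descriptor. */
struct mm {
    atomic_int membarrier_state;
    int mm_users;                /* number of users of this VM */
};

static int sync_calls;           /* counts modelled grace-period waits */

static void synchronize_sched_stub(void)
{
    sync_calls++;
}

/* Register: EXPEDITED first, then a grace period, and only then READY. */
static void membarrier_register_private_expedited(struct mm *mm, int nr_threads)
{
    if (atomic_load(&mm->membarrier_state) &
        MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY)
        return;

    atomic_fetch_or(&mm->membarrier_state, MEMBARRIER_STATE_PRIVATE_EXPEDITED);

    /* Skip the wait only for a single-user, single-thread VM. */
    if (!(mm->mm_users == 1 && nr_threads == 1))
        synchronize_sched_stub();

    atomic_fetch_or(&mm->membarrier_state,
                    MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY);
}

/* Commands succeed only once registration has completed. */
static int membarrier_private_expedited(struct mm *mm)
{
    if (!(atomic_load(&mm->membarrier_state) &
          MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY))
        return -EPERM;
    return 0;
}

/* The switch_mm() side tests the state variable, not per-thread flags. */
static bool switch_mm_needs_barrier(struct mm *next)
{
    return atomic_load(&next->membarrier_state) &
           MEMBARRIER_STATE_PRIVATE_EXPEDITED;
}
```

Testing the per-mm state rather than per-thread flags is what makes this robust when threads from different thread groups share one VM.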
  14. 16 Jan, 2018 1 commit
    • delayacct: Account blkio completion on the correct task · c96f5471
      Josh Snyder committed
      Before commit:
      
        e33a9bba ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
      
      delayacct_blkio_end() was called after context-switching into the task which
      completed I/O.
      
      This resulted in double counting: the task would account a delay both waiting
      for I/O and for time spent in the runqueue.
      
      With e33a9bba, delayacct_blkio_end() is called by try_to_wake_up().
      In ttwu, we have not yet context-switched. This is more correct, in that
      the delay accounting ends when the I/O is complete.
      
      But delayacct_blkio_end() relies on 'get_current()', and we have not yet
      context-switched into the task whose I/O completed. This results in the
      wrong task having its delay accounting statistics updated.
      
      Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
      so that it can update the statistics of the correct task.
      Signed-off-by: Josh Snyder <joshs@netflix.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: <stable@vger.kernel.org>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-block@vger.kernel.org
      Fixes: e33a9bba ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
      Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
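The bug and the fix can be illustrated with a small model. The `struct task`, the `current_task` pointer standing in for the kernel's `current`, and the `_old`/`_new` function variants are hypothetical names for this sketch; only the idea of passing the woken task explicitly comes from the message.

```c
/* Hypothetical, simplified task with one delay-accounting counter. */
struct task {
    const char *comm;
    unsigned long blkio_end_calls;
};

/* Stand-in for the kernel's 'current': the task running the wakeup. */
static struct task *current_task;

/* Old behaviour: implicitly charges whichever task is running now. */
static void delayacct_blkio_end_old(void)
{
    current_task->blkio_end_calls++;
}

/* Fixed behaviour: the caller names the task whose I/O completed. */
static void delayacct_blkio_end(struct task *p)
{
    p->blkio_end_calls++;
}

/* In the wakeup path we are still running as the waker, not as p. */
static void try_to_wake_up_old(struct task *p)
{
    (void)p;
    delayacct_blkio_end_old();    /* updates the waker: wrong task */
}

static void try_to_wake_up_new(struct task *p)
{
    delayacct_blkio_end(p);       /* updates the task being woken */
}
```

With a waker running and a sleeper being woken, the old path bumps the waker's counter while the new path bumps the sleeper's.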
  15. 10 Jan, 2018 3 commits
  16. 11 Dec, 2017 1 commit
  17. 05 Dec, 2017 1 commit
  18. 29 Nov, 2017 1 commit
    • sched: Stop resched_cpu() from sending IPIs to offline CPUs · a0982dfa
      Paul E. McKenney committed
      The rcutorture test suite occasionally provokes a splat due to invoking
      resched_cpu() on an offline CPU:
      
      WARNING: CPU: 2 PID: 8 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
      Modules linked in:
      CPU: 2 PID: 8 Comm: rcu_preempt Not tainted 4.14.0-rc4+ #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      task: ffff902ede9daf00 task.stack: ffff96c50010c000
      RIP: 0010:native_smp_send_reschedule+0x37/0x40
      RSP: 0018:ffff96c50010fdb8 EFLAGS: 00010096
      RAX: 000000000000002e RBX: ffff902edaab4680 RCX: 0000000000000003
      RDX: 0000000080000003 RSI: 0000000000000000 RDI: 00000000ffffffff
      RBP: ffff96c50010fdb8 R08: 0000000000000000 R09: 0000000000000001
      R10: 0000000000000000 R11: 00000000299f36ae R12: 0000000000000001
      R13: ffffffff9de64240 R14: 0000000000000001 R15: ffffffff9de64240
      FS:  0000000000000000(0000) GS:ffff902edfc80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000f7d4c642 CR3: 000000001e0e2000 CR4: 00000000000006e0
      Call Trace:
       resched_curr+0x8f/0x1c0
       resched_cpu+0x2c/0x40
       rcu_implicit_dynticks_qs+0x152/0x220
       force_qs_rnp+0x147/0x1d0
       ? sync_rcu_exp_select_cpus+0x450/0x450
       rcu_gp_kthread+0x5a9/0x950
       kthread+0x142/0x180
       ? force_qs_rnp+0x1d0/0x1d0
       ? kthread_create_on_node+0x40/0x40
       ret_from_fork+0x27/0x40
      Code: 14 01 0f 92 c0 84 c0 74 14 48 8b 05 14 4f f4 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 38 89 ca 9d e8 e5 56 08 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 52 9e 37 02 85 c0 75 38 55 48
      ---[ end trace 26df9e5df4bba4ac ]---
      
      This splat cannot be generated by expedited grace periods because they
      always invoke resched_cpu() on the current CPU, which is good because
      expedited grace periods require that resched_cpu() unconditionally
      succeed.  However, other parts of RCU can tolerate resched_cpu() acting
      as a no-op, at least as long as it doesn't happen too often.
      
      This commit therefore makes resched_cpu() invoke resched_curr() only if
      the CPU is either online or is the current CPU.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
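The guard added by this commit can be modelled in a few lines. The `cpu_online_map[]` array, the `this_cpu` variable, and the `resched_requests[]` counters are assumptions standing in for the kernel's online mask, smp_processor_id(), and the actual reschedule request; the boolean return value is also an addition for the sketch, so that the no-op case is observable.

```c
#include <stdbool.h>

#define NR_CPUS 4

static bool cpu_online_map[NR_CPUS];    /* models the online CPU mask */
static int this_cpu;                     /* models the current CPU's id */
static int resched_requests[NR_CPUS];    /* counts modelled resched requests */

/*
 * Request a reschedule only if the target is online or is the CPU we
 * are running on; otherwise act as a no-op and report that.
 */
static bool resched_cpu(int cpu)
{
    if (cpu_online_map[cpu] || cpu == this_cpu) {
        resched_requests[cpu]++;
        return true;
    }
    return false;
}
```

The "or is the current CPU" half of the condition keeps the call unconditional for the expedited grace-period case mentioned above, which always targets the current CPU.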
  19. 28 Nov, 2017 1 commit
  20. 09 Nov, 2017 1 commit
    • sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds · 765cc3a4
      Patrick Bellasi committed
      When the kernel is compiled with !CONFIG_SCHED_DEBUG support, we expect that
      all SCHED_FEAT are turned into compile time constants being propagated
      to support compiler optimizations.
      
      Specifically, we expect that code blocks like this:
      
         if (sched_feat(FEATURE_NAME) [&& <other_conditions>]) {
      	/* FEATURE CODE */
         }
      
      are turned into dead code if FEATURE_NAME defaults to FALSE, and are thus
      removed by the compiler from the final image.
      
      For this mechanism to properly work it's required for the compiler to
      have full access, from each translation unit, to whatever is the value
      defined by the sched_feat macro. This macro is defined as:
      
         #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
      
      and thus, the compiler can optimize that code only if the value of
      sysctl_sched_features is visible within each translation unit.
      
      Since:
      
         029632fb ("sched: Make separate sched*.c translation units")
      
      the scheduler code has been split into separate translation units
      however the definition of sysctl_sched_features is part of
      kernel/sched/core.c while, for all the other scheduler modules, it is
      visible only via kernel/sched/sched.h as an:
      
         extern const_debug unsigned int sysctl_sched_features
      
      Unfortunately, an extern reference does not allow the compiler to apply
      constant propagation. Thus, on !CONFIG_SCHED_DEBUG kernels we still end up
      with code to load a memory reference and (eventually) do an unconditional
      jump over a chunk of code.
      
      This mechanism is unavoidable when sched_features can be turned on and off at
      run-time. However, this is not the case for "production" kernels compiled with
      !CONFIG_SCHED_DEBUG. In this case, sysctl_sched_features is just a constant value
      which cannot be changed at run-time and thus memory loads and jumps can be
      avoided altogether.
      
      This patch fixes the case of !CONFIG_SCHED_DEBUG kernels by declaring a local
      version of the sysctl_sched_features constant for each translation unit. This
      will ultimately allow the compiler to perform constant propagation and
      dead-code pruning.
      
      Tests have been done, with !CONFIG_SCHED_DEBUG on a v4.14-rc8 with and without
      the patch, by running 30 iterations of:
      
         perf bench sched messaging --pipe --thread --group 4 --loop 50000
      
      on a 40-core Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz using the
      powersave governor to rule out variations due to frequency scaling.
      
      Statistics on the reported completion time:
      
                         count     mean       std     min       99%     max
        v4.14-rc8         30.0  15.7831  0.176032  15.442  16.01226  16.014
        v4.14-rc8+patch   30.0  15.5033  0.189681  15.232  15.93938  15.962
      
      ... show a 1.8% speedup on average completion time and 0.5% speedup in the
      99th percentile.
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Chris Redpath <chris.redpath@arm.com>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20171108184101.16006-1-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
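The approach of giving each translation unit its own constant copy of the feature mask can be sketched as follows. This is a minimal model, assuming two made-up features (`FEATURE_ON`, `FEATURE_OFF`) listed inline instead of in a separate features header, and a hypothetical `feature_off_path_taken()` helper; it is not the kernel's actual header layout. Because the mask is a `static const` with a visible initializer, an optimizing compiler can evaluate `sched_feat()` at compile time and drop the dead branch, which an `extern` declaration would not allow.

```c
/* Hypothetical feature indices for the sketch. */
enum {
    __SCHED_FEAT_FEATURE_ON,
    __SCHED_FEAT_FEATURE_OFF,
};

/* Each entry contributes its bit if enabled; the list ends with a 0. */
#define SCHED_FEAT(name, enabled) \
    ((1UL << __SCHED_FEAT_##name) * (enabled)) |

/* Per-translation-unit constant: its value is visible to the compiler. */
static const unsigned long sysctl_sched_features =
    SCHED_FEAT(FEATURE_ON, 1)
    SCHED_FEAT(FEATURE_OFF, 0)
    0;

#undef SCHED_FEAT

#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))

/* With optimization enabled, the guarded branch below can be pruned. */
static int feature_off_path_taken(void)
{
    if (sched_feat(FEATURE_OFF)) {
        /* FEATURE CODE: unreachable for this constant mask */
        return 1;
    }
    return 0;
}
```

Whether the branch is actually removed depends on the compiler and optimization level; the sketch only shows why the constant has to be visible in every translation unit for that to be possible.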
  21. 27 Oct, 2017 5 commits
  22. 10 Oct, 2017 2 commits
    • rcutorture: Dump writer stack if stalled · 0032f4e8
      Paul E. McKenney committed
      Right now, rcutorture warns if an rcu_torture_writer() kthread stalls,
      but this warning is not always all that helpful.  This commit therefore
      makes the first such warning include a stack dump.
      
      This in turn requires that sched_show_task() be exported to GPL modules,
      so this commit makes that change as well.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • sched,rcu: Make cond_resched() provide RCU quiescent state · f79c3ad6
      Paul E. McKenney committed
      There is some confusion as to which of cond_resched() or
      cond_resched_rcu_qs() should be added to long in-kernel loops.
      This commit therefore eliminates the decision by adding RCU quiescent
      states to cond_resched().  This commit also simplifies the code that
      used to interact with cond_resched_rcu_qs(), and that now interacts with
      cond_resched(), to reduce its overhead.  This reduction is necessary to
      allow the heavier-weight cond_resched_rcu_qs() mechanism to be invoked
      everywhere that cond_resched() is invoked.
      
      Part of that reduction in overhead converts the jiffies_till_sched_qs
      kernel parameter to read-only at runtime, thus eliminating the need for
      bounds checking.
      Reported-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      [ paulmck: Keep PREEMPT=n cond_resched a no-op, per Peter Zijlstra. ]