1. 06 Jun 2018 (1 commit)
    • rseq: Introduce restartable sequences system call · d7822b1e
      Committed by Mathieu Desnoyers
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartable sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
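
      The per-CPU update pattern can be sketched in plain C. This is a hedged
      simulation, not the rseq ABI: sched_getcpu() stands in for reading the
      registered cpu_id field, and an atomic compare-exchange stands in for the
      kernel-assisted commit that real restartable sequences use precisely to
      avoid atomic instructions on the fast path.

      ```c
      /* Portable simulation of the rseq per-CPU counter pattern.
       * The real rseq fast path commits with a plain store that the
       * kernel aborts on preemption/migration; the CAS below is only
       * a stand-in, so this sketch shows the control flow, not the
       * performance benefit. */
      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdatomic.h>
      #include <stdio.h>

      #define NR_SLOTS 256

      static atomic_long percpu_counter[NR_SLOTS];

      static void percpu_inc(void)
      {
          for (;;) {
              int cpu = sched_getcpu();   /* rseq: load rseq->cpu_id */
              if (cpu < 0 || cpu >= NR_SLOTS)
                  cpu = 0;                /* fallback if getcpu is unavailable */
              long old = atomic_load(&percpu_counter[cpu]);
              /* rseq: a plain store here, aborted by the kernel if the
               * thread migrated; simulated with a CAS retry loop. */
              if (atomic_compare_exchange_weak(&percpu_counter[cpu],
                                               &old, old + 1))
                  return;
          }
      }

      int main(void)
      {
          long total = 0;
          for (int i = 0; i < 1000; i++)
              percpu_inc();
          for (int i = 0; i < NR_SLOTS; i++)
              total += atomic_load(&percpu_counter[i]);
          printf("total = %ld\n", total);  /* 1000, spread across slots */
          return 0;
      }
      ```

      Single-threaded, the increments always land, so the slot totals always
      sum to 1000 no matter how often the thread migrates.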
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitate porting to other
      architectures and speed up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload shows a 1-2% gain in average response-time
      latency, and the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up to date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
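
      For comparison, the two pre-rseq read paths benchmarked below can be
      exercised directly from user-space (a Linux-specific sketch:
      sched_getcpu() typically goes through glibc and the vdso, while
      syscall(SYS_getcpu, ...) forces the actual system call):

      ```c
      /* Read the current CPU number through the glibc/vdso wrapper
       * and through the raw getcpu() system call. */
      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      int main(void)
      {
          unsigned int cpu_syscall = 0, node = 0;

          int cpu_vdso = sched_getcpu();               /* fast vdso path */
          long ret = syscall(SYS_getcpu, &cpu_syscall, /* slow syscall path */
                             &node, NULL);

          if (cpu_vdso < 0 || ret != 0) {
              printf("getcpu unavailable\n");
              return 0;
          }
          /* The two values can differ if the thread migrated in between. */
          printf("vdso=%d syscall=%u\n", cpu_vdso, cpu_syscall);
          return 0;
      }
      ```

      Neither path can be used from inline assembly the way the rseq cpu_id
      field can, which is the building-block property the text stresses.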
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space improves on the existing mechanisms for reading the current
      CPU number, with the following benefits over alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc.
      - 20x speedup on x86 compared to calling glibc, which calls the vdso
        executing an "lsl" instruction.
      - 14x speedup on x86 compared to an inlined "lsl" instruction.
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of the patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
  2. 31 May 2018 (4 commits)
    • sched/headers: Fix typo · 595058b6
      Committed by Davidlohr Bueso
      I cannot spell 'throttling'.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180530224940.17839-1-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Fix missing clock update · ecda2b66
      Committed by Juri Lelli
      A missing clock update is causing the following warning:
      
       rq->clock_update_flags < RQCF_ACT_SKIP
       WARNING: CPU: 10 PID: 0 at kernel/sched/sched.h:963 inactive_task_timer+0x5d6/0x720
       Call Trace:
        <IRQ>
        __hrtimer_run_queues+0x10f/0x530
        hrtimer_interrupt+0xe5/0x240
        smp_apic_timer_interrupt+0x79/0x2b0
        apic_timer_interrupt+0xf/0x20
        </IRQ>
        do_idle+0x203/0x280
        cpu_startup_entry+0x6f/0x80
        start_secondary+0x1b0/0x200
        secondary_startup_64+0xa5/0xb0
       hardirqs last  enabled at (793919): [<ffffffffa27c5f6e>] cpuidle_enter_state+0x9e/0x360
       hardirqs last disabled at (793920): [<ffffffffa2a0096e>] interrupt_entry+0xce/0xe0
       softirqs last  enabled at (793922): [<ffffffffa20bef78>] irq_enter+0x68/0x70
       softirqs last disabled at (793921): [<ffffffffa20bef5d>] irq_enter+0x4d/0x70
      
      This happens because inactive_task_timer() calls sub_running_bw() (if
      TASK_DEAD and non_contending) that might trigger a schedutil update,
      which might access the clock. Clock is however currently updated only
      later in inactive_task_timer() function.
      
      Fix the problem by updating the clock right after task_rq_lock().
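
      The ordering the fix establishes can be shown with a toy model (names
      and the fake time source are illustrative, not kernel code): refresh the
      runqueue clock immediately after taking the lock, before any callee may
      read it.

      ```c
      /* Toy model of "update the clock right after task_rq_lock()":
       * a callee (here standing in for the schedutil update triggered
       * by sub_running_bw()) must never observe a stale clock. */
      #include <stdio.h>

      struct toy_rq { unsigned long clock; int locked; };

      static unsigned long now = 100;                /* fake time source */

      static void rq_lock(struct toy_rq *rq)   { rq->locked = 1; }
      static void rq_unlock(struct toy_rq *rq) { rq->locked = 0; }
      static void update_rq_clock(struct toy_rq *rq) { rq->clock = now; }

      /* Stands in for a callee that reads rq->clock. */
      static unsigned long read_clock(struct toy_rq *rq) { return rq->clock; }

      int main(void)
      {
          struct toy_rq rq = { .clock = 0 };

          rq_lock(&rq);
          update_rq_clock(&rq);        /* the fix: update right after lock */
          unsigned long seen = read_clock(&rq); /* callee sees fresh clock */
          rq_unlock(&rq);

          printf("seen=%lu\n", seen);
          return 0;
      }
      ```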
      Reported-by: kernel test robot <xiaolong.ye@intel.com>
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180530160809.9074-1-juri.lelli@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Require cpu_active() in select_task_rq(), for user tasks · 7af443ee
      Committed by Paul Burton
      select_task_rq() is used in a few paths to select the CPU upon which a
      thread should be run - for example it is used by try_to_wake_up() & by
      fork or exec balancing. As-is it allows use of any online CPU that is
      present in the task's cpus_allowed mask.
      
      This presents a problem because there is a period whilst CPUs are
      brought online where a CPU is marked online, but is not yet fully
      initialized - ie. the period where CPUHP_AP_ONLINE_IDLE <= state <
      CPUHP_ONLINE. Usually we don't run any user tasks during this window,
      but there are corner cases where this can happen. An example observed
      is:
      
        - Some user task A, running on CPU X, forks to create task B.
      
        - sched_fork() calls __set_task_cpu() with cpu=X, setting task B's
          task_struct::cpu field to X.
      
        - CPU X is offlined.
      
        - Task A, currently somewhere between the __set_task_cpu() in
          copy_process() and the call to wake_up_new_task(), is migrated to
          CPU Y by migrate_tasks() when CPU X is offlined.
      
        - CPU X is onlined, but still in the CPUHP_AP_ONLINE_IDLE state. The
          scheduler is now active on CPU X, but there are no user tasks on
          the runqueue.
      
        - Task A runs on CPU Y & reaches wake_up_new_task(). This calls
          select_task_rq() with cpu=X, taken from task B's task_struct,
          and select_task_rq() allows CPU X to be returned.
      
        - Task A enqueues task B on CPU X's runqueue, via activate_task() &
          enqueue_task().
      
        - CPU X now has a user task on its runqueue before it has reached the
          CPUHP_ONLINE state.
      
      In most cases, the user tasks that schedule on the newly onlined CPU
      have no idea that anything went wrong, but one case observed to be
      problematic is if the task goes on to invoke the sched_setaffinity
      syscall. The newly onlined CPU reaches the CPUHP_AP_ONLINE_IDLE state
      before the CPU that brought it online calls stop_machine_unpark(). This
      means that for a portion of the window of time between
      CPUHP_AP_ONLINE_IDLE & CPUHP_ONLINE the newly onlined CPU's struct
      cpu_stopper has its enabled field set to false. If a user thread is
      executed on the CPU during this window and it invokes sched_setaffinity
      with a CPU mask that does not include the CPU it's running on, then when
      __set_cpus_allowed_ptr() calls stop_one_cpu() intending to invoke
      migration_cpu_stop() and perform the actual migration away from the CPU
      it will simply return -ENOENT rather than calling migration_cpu_stop().
      We then return from the sched_setaffinity syscall back to the user task
      that is now running on a CPU which it just asked not to run on, and
      which is not present in its cpus_allowed mask.
      
      This patch resolves the problem by having select_task_rq() enforce that
      user tasks run on CPUs that are active - the same requirement that
      select_fallback_rq() already enforces. This should ensure that newly
      onlined CPUs reach the CPUHP_AP_ACTIVE state before being able to
      schedule user tasks, and also implies that bringup_wait_for_ap() will
      have called stop_machine_unpark() which resolves the sched_setaffinity
      issue above.
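
      The enforced rule can be sketched with a toy placement function (the
      names and masks are illustrative, not the kernel's data structures):
      user tasks may only land on CPUs that are both allowed and active, and
      anything else falls back, mirroring select_fallback_rq().

      ```c
      /* Toy sketch of "user tasks only run on active CPUs":
       * CPU 2 models a CPU mid-bringup, online but not yet active. */
      #include <stdio.h>

      #define NR_CPUS 4

      static int cpu_online[NR_CPUS] = { 1, 1, 1, 1 };
      static int cpu_active[NR_CPUS] = { 1, 1, 0, 1 };

      static int select_task_rq(int wanted, const int *cpus_allowed)
      {
          if (cpus_allowed[wanted] && cpu_online[wanted] && cpu_active[wanted])
              return wanted;
          /* select_fallback_rq(): first allowed and active CPU */
          for (int cpu = 0; cpu < NR_CPUS; cpu++)
              if (cpus_allowed[cpu] && cpu_active[cpu])
                  return cpu;
          return 0;
      }

      int main(void)
      {
          int allowed[NR_CPUS] = { 1, 1, 1, 1 };

          /* CPU 2 is online but !active: the request is refused. */
          printf("want 2 -> %d\n", select_task_rq(2, allowed));
          printf("want 1 -> %d\n", select_task_rq(1, allowed));
          return 0;
      }
      ```

      In the scenario above, task B's stale task_struct::cpu pointing at the
      half-initialized CPU X is exactly the "want 2" case: the placement is
      redirected instead of being honored.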
      
      I haven't yet investigated them, but it may be of interest to review
      whether any of the actions performed by hotplug states between
      CPUHP_AP_ONLINE_IDLE & CPUHP_AP_ACTIVE could have similar unintended
      effects on user tasks that might schedule before they are reached, which
      might widen the scope of the problem from just affecting the behaviour
      of sched_setaffinity.
      Signed-off-by: Paul Burton <paul.burton@mips.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180526154648.11635-2-paul.burton@mips.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix rules for running on online && !active CPUs · 175f0e25
      Committed by Peter Zijlstra
      As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules
      for running on an online && !active CPU are stricter than just being a
      kthread, you need to be a per-cpu kthread.
      
      If you're not strictly per-CPU, you have better CPUs to run on and
      don't need the partially booted one to get your work done.
      
      The exception is to allow smpboot threads to bootstrap the CPU itself
      and get kernel 'services' initialized before we allow userspace on it.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 955dbdf4 ("sched: Allow migrating kthreads into online but inactive CPUs")
      Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 25 May 2018 (3 commits)
  4. 18 May 2018 (2 commits)
  5. 16 May 2018 (2 commits)
  6. 14 May 2018 (2 commits)
    • sched/core: Distinguish between idle_cpu() calls based on desired effect, introduce available_idle_cpu() · 943d355d
      Committed by Rohit Jain
      
      In the following commit:
      
        247f2f6f ("sched/core: Don't schedule threads on pre-empted vCPUs")
      
      ... idle_cpu() was changed to treat a preempted vCPU as not idle, for
      the purpose of scheduling threads.
      
      However, the idle_cpu() function is used in other places for
      actually checking whether the state of the CPU is idle or not.
      
      Hence split the use of that function based on the desired return value,
      by introducing the available_idle_cpu() function.
      
      This fixes a (slight) regression in that initial vCPU commit, because
      some code paths (like the load-balancer) don't care and shouldn't care
      whether the vCPU is preempted or not; they just want to know whether
      there are any tasks on the CPU.
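
      The split can be modeled in a few lines (field names here are
      illustrative, not the kernel's struct rq): idle_cpu() reports only
      whether the runqueue is empty, while available_idle_cpu() additionally
      requires that the (v)CPU is not preempted on the host.

      ```c
      /* Toy model of the idle_cpu() / available_idle_cpu() split. */
      #include <stdio.h>

      struct toy_cpu { int nr_running; int vcpu_preempted; };

      static int idle_cpu(const struct toy_cpu *c)
      {
          return c->nr_running == 0;
      }

      static int available_idle_cpu(const struct toy_cpu *c)
      {
          return idle_cpu(c) && !c->vcpu_preempted;
      }

      int main(void)
      {
          /* Empty runqueue, but the vCPU was preempted on the host. */
          struct toy_cpu preempted_idle = { .nr_running = 0,
                                            .vcpu_preempted = 1 };

          /* Load balancer: only cares that the runqueue is empty. */
          printf("idle=%d\n", idle_cpu(&preempted_idle));
          /* Wakeup placement: also avoids preempted vCPUs. */
          printf("available=%d\n", available_idle_cpu(&preempted_idle));
          return 0;
      }
      ```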
      Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dhaval.giani@oracle.com
      Cc: linux-kernel@vger.kernel.org
      Cc: matt@codeblueprint.co.uk
      Cc: steven.sistare@oracle.com
      Cc: subhra.mazumdar@oracle.com
      Link: http://lkml.kernel.org/r/1525883988-10356-1-git-send-email-rohit.k.jain@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Stagger NUMA balancing scan periods for new threads · 13784475
      Committed by Mel Gorman
      Threads share an address space and each can change the protections of the
      same address space to trap NUMA faults. This is redundant and potentially
      counter-productive as any thread doing the update will suffice. Potentially
      only one thread is required but that thread may be idle or it may not have
      any locality concerns and pick an unsuitable scan rate.
      
      This patch uses independent scan periods, but they are staggered based on
      the number of address space users when the thread is created.  The intent
      is that threads will avoid scanning at the same time and have a chance
      to adapt their scan rate later if necessary. This reduces the total scan
      activity early in the lifetime of the threads.
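
      One plausible staggering scheme can be sketched as follows; the
      constants and the exact scaling are illustrative assumptions, not the
      kernel's implementation. The idea is that a thread's first scan delay
      grows with the number of existing address space users at creation time.

      ```c
      /* Sketch of staggering initial NUMA scan periods by the number
       * of address-space users at thread creation (constants are
       * illustrative, not the kernel's). */
      #include <stdio.h>

      #define BASE_SCAN_DELAY_MS 1000
      #define MAX_SCAN_DELAY_MS  60000

      static unsigned int initial_scan_delay(unsigned int mm_users)
      {
          unsigned long delay = (unsigned long)BASE_SCAN_DELAY_MS * mm_users;
          return delay > MAX_SCAN_DELAY_MS ? MAX_SCAN_DELAY_MS
                                           : (unsigned int)delay;
      }

      int main(void)
      {
          /* Each new thread starts scanning later than its predecessors,
           * so they do not all update the same page tables at once. */
          for (unsigned int users = 1; users <= 4; users++)
              printf("users=%u delay=%ums\n", users,
                     initial_scan_delay(users));
          return 0;
      }
      ```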
      
      The difference in headline performance across a range of machines and
      workloads is marginal but the system CPU usage is reduced as well as overall
      scan activity.  The following is the time reported by NAS Parallel Benchmark
      using unbound openmp threads and a D size class:
      
      			      4.17.0-rc1             4.17.0-rc1
      				 vanilla           stagger-v1r1
      	Time bt.D      442.77 (   0.00%)      419.70 (   5.21%)
      	Time cg.D      171.90 (   0.00%)      180.85 (  -5.21%)
      	Time ep.D       33.10 (   0.00%)       32.90 (   0.60%)
      	Time is.D        9.59 (   0.00%)        9.42 (   1.77%)
      	Time lu.D      306.75 (   0.00%)      304.65 (   0.68%)
      	Time mg.D       54.56 (   0.00%)       52.38 (   4.00%)
      	Time sp.D     1020.03 (   0.00%)      903.77 (  11.40%)
      	Time ua.D      400.58 (   0.00%)      386.49 (   3.52%)
      
      Note that it is not a universal win: we have no prior knowledge of which
      thread matters, and the number of threads created often exceeds the size
      of the node when the threads are not bound. However, there is a reduction
      of overall system CPU usage:
      
      				    4.17.0-rc1             4.17.0-rc1
      				       vanilla           stagger-v1r1
      	sys-time-bt.D         48.78 (   0.00%)       48.22 (   1.15%)
      	sys-time-cg.D         25.31 (   0.00%)       26.63 (  -5.22%)
      	sys-time-ep.D          1.65 (   0.00%)        0.62 (  62.42%)
      	sys-time-is.D         40.05 (   0.00%)       24.45 (  38.95%)
      	sys-time-lu.D         37.55 (   0.00%)       29.02 (  22.72%)
      	sys-time-mg.D         47.52 (   0.00%)       34.92 (  26.52%)
      	sys-time-sp.D        119.01 (   0.00%)      109.05 (   8.37%)
      	sys-time-ua.D         51.52 (   0.00%)       45.13 (  12.40%)
      
      NUMA scan activity is also reduced:
      
      	NUMA alloc local               1042828     1342670
      	NUMA base PTE updates        140481138    93577468
      	NUMA huge PMD updates           272171      180766
      	NUMA page range updates      279832690   186129660
      	NUMA hint faults               1395972     1193897
      	NUMA hint local faults          877925      855053
      	NUMA hint local percent             62          71
      	NUMA pages migrated           12057909     9158023
      
      Similar observations are made for other thread-intensive workloads. System
      CPU usage is lower even though the headline gains in performance tend to be
      small. For example, specjbb 2005 shows almost no difference in performance
      but scan activity is reduced by a third on a 4-socket box. I didn't find
      a workload (thread intensive or otherwise) that suffered badly.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 12 May 2018 (1 commit)
    • Revert "sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()" · 789ba280
      Committed by Mel Gorman
      This reverts commit 7347fc87.
      
      Srikar Dronamraju pointed out that while the commit in question did show
      a performance improvement on ppc64, it did so at the cost of disabling
      active CPU migration by automatic NUMA balancing, which was not the intent.
      The issue was a serious flaw in the logic: it would never actively balance
      if SD_WAKE_AFFINE was disabled on scheduler domains. Even when it's
      enabled, the logic is still bizarre and against the original intent.
      
      Investigation showed that fixing the patch, either in the way he
      suggested, by using the correct comparison for jiffies values, or by
      introducing a new numa_migrate_deferred variable in task_struct, performs
      similarly to a revert, with a mix of gains and losses depending on the
      workload, machine and socket count.
      
      The original intent of the commit was to handle a problem whereby
      wake_affine, idle balancing and automatic NUMA balancing disagree on the
      appropriate placement for a task. This was particularly true for cases where
      a single task was a massive waker of tasks but where wake_wide logic did
      not apply.  This was particularly noticeable when a futex (a barrier) woke
      all worker threads and tried pulling the wakees to the waker nodes. In that
      specific case, it could be handled by tuning MPI or openMP appropriately,
      but the behavior is not illogical and was worth attempting to fix. However,
      the approach was wrong. Given that we're at rc4 and a fix is not obvious,
      it's better to play safe, revert this commit and retry later.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: efault@gmx.de
      Cc: ggherdovich@suse.cz
      Cc: hpa@zytor.com
      Cc: matt@codeblueprint.co.uk
      Cc: mpe@ellerman.id.au
      Link: http://lkml.kernel.org/r/20180509163115.6fnnyeg4vdm2ct4v@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 09 May 2018 (2 commits)
  9. 05 May 2018 (2 commits)
  10. 04 May 2018 (4 commits)
    • sched/core: Don't schedule threads on pre-empted vCPUs · 247f2f6f
      Committed by Rohit Jain
      In paravirt configurations today, spinlocks figure out whether a vCPU is
      running to determine whether or not the spinlock should bother spinning.
      We can use the same logic to prioritize CPUs when scheduling threads. If
      a vCPU has been pre-empted, it will incur the extra cost of a VMENTER
      and of the time until it actually resumes running on the host CPU. If we
      have other vCPUs which are actually running on the host CPU and idle, we
      should schedule threads there.
      
      Performance numbers:
      
      Note: With patch is referred to as Paravirt in the following and without
      patch is referred to as Base.
      
      1) When only 1 VM is running:
      
          a) Hackbench test on KVM 8 vCPUs, 10,000 loops (lower is better):
      
      	+-------+-----------------+----------------+
      	|Number |Paravirt         |Base            |
      	|of     +---------+-------+-------+--------+
      	|Threads|Average  |Std Dev|Average| Std Dev|
      	+-------+---------+-------+-------+--------+
      	|1      |1.817    |0.076  |1.721  | 0.067  |
      	|2      |3.467    |0.120  |3.468  | 0.074  |
      	|4      |6.266    |0.035  |6.314  | 0.068  |
      	|8      |11.437   |0.105  |11.418 | 0.132  |
      	|16     |21.862   |0.167  |22.161 | 0.129  |
      	|25     |33.341   |0.326  |33.692 | 0.147  |
      	+-------+---------+-------+-------+--------+
      
      2) When two VMs are running with same CPU affinities:
      
          a) tbench test on VM 8 cpus
      
          Base:
      
      	VM1:
      
      	Throughput 220.59 MB/sec   1 clients  1 procs  max_latency=12.872 ms
      	Throughput 448.716 MB/sec  2 clients  2 procs  max_latency=7.555 ms
      	Throughput 861.009 MB/sec  4 clients  4 procs  max_latency=49.501 ms
      	Throughput 1261.81 MB/sec  7 clients  7 procs  max_latency=76.990 ms
      
      	VM2:
      
      	Throughput 219.937 MB/sec  1 clients  1 procs  max_latency=12.517 ms
      	Throughput 470.99 MB/sec   2 clients  2 procs  max_latency=12.419 ms
      	Throughput 841.299 MB/sec  4 clients  4 procs  max_latency=37.043 ms
      	Throughput 1240.78 MB/sec  7 clients  7 procs  max_latency=77.489 ms
      
          Paravirt:
      
      	VM1:
      
      	Throughput 222.572 MB/sec  1 clients  1 procs  max_latency=7.057 ms
      	Throughput 485.993 MB/sec  2 clients  2 procs  max_latency=26.049 ms
      	Throughput 947.095 MB/sec  4 clients  4 procs  max_latency=45.338 ms
      	Throughput 1364.26 MB/sec  7 clients  7 procs  max_latency=145.124 ms
      
      	VM2:
      
      	Throughput 224.128 MB/sec  1 clients  1 procs  max_latency=4.564 ms
      	Throughput 501.878 MB/sec  2 clients  2 procs  max_latency=11.061 ms
      	Throughput 965.455 MB/sec  4 clients  4 procs  max_latency=45.370 ms
      	Throughput 1359.08 MB/sec  7 clients  7 procs  max_latency=168.053 ms
      
          b) Hackbench with 4 fd 1,000,000 loops
      
      	+-------+--------------------------------------+----------------------------------------+
      	|Number |Paravirt                              |Base                                    |
      	|of     +----------+--------+---------+--------+----------+--------+---------+----------+
      	|Threads|Average1  |Std Dev1|Average2 | Std Dev|Average1  |Std Dev1|Average2 | Std Dev 2|
      	+-------+----------+--------+---------+--------+----------+--------+---------+----------+
      	|  1    | 3.748    | 0.620  | 3.576   | 0.432  | 4.006    | 0.395  | 3.446   | 0.787    |
      	+-------+----------+--------+---------+--------+----------+--------+---------+----------+
      
          Note that this test was run just to show the interference effect
          over-subscription can have in baseline
      
          c) schbench results with 2 message groups on 8 vCPU VMs
      
      	+-----------+-------+---------------+--------------+------------+
      	|           |       | Paravirt      | Base         |            |
      	+-----------+-------+-------+-------+-------+------+------------+
      	|           |Threads| VM1   | VM2   |  VM1  | VM2  |%Improvement|
      	+-----------+-------+-------+-------+-------+------+------------+
      	|50.0000th  |    1  | 52    | 53    |  58   | 54   |  +6.25%    |
      	|75.0000th  |    1  | 69    | 61    |  83   | 59   |  +8.45%    |
      	|90.0000th  |    1  | 80    | 80    |  89   | 83   |  +6.98%    |
      	|95.0000th  |    1  | 83    | 83    |  93   | 87   |  +7.78%    |
      	|*99.0000th |    1  | 92    | 94    |  99   | 97   |  +5.10%    |
      	|99.5000th  |    1  | 95    | 100   |  102  | 103  |  +4.88%    |
      	|99.9000th  |    1  | 107   | 123   |  105  | 203  |  +25.32%   |
      	+-----------+-------+-------+-------+-------+------+------------+
      	|50.0000th  |    2  | 56    | 62    |  67   | 59   |  +6.35%    |
      	|75.0000th  |    2  | 69    | 75    |  80   | 71   |  +4.64%    |
      	|90.0000th  |    2  | 80    | 82    |  90   | 81   |  +5.26%    |
      	|95.0000th  |    2  | 85    | 87    |  97   | 91   |  +8.51%    |
      	|*99.0000th |    2  | 98    | 99    |  107  | 109  |  +8.79%    |
      	|99.5000th  |    2  | 107   | 105   |  109  | 116  |  +5.78%    |
      	|99.9000th  |    2  | 9968  | 609   |  875  | 3116 | -165.02%   |
      	+-----------+-------+-------+-------+-------+------+------------+
      	|50.0000th  |    4  | 78    | 77    |  78   | 79   |  +1.27%    |
      	|75.0000th  |    4  | 98    | 106   |  100  | 104  |   0.00%    |
      	|90.0000th  |    4  | 987   | 1001  |  995  | 1015 |  +1.09%    |
      	|95.0000th  |    4  | 4136  | 5368  |  5752 | 5192 |  +13.16%   |
      	|*99.0000th |    4  | 11632 | 11344 |  11024| 10736|  -5.59%    |
      	|99.5000th  |    4  | 12624 | 13040 |  12720| 12144|  -3.22%    |
      	|99.9000th  |    4  | 13168 | 18912 |  14992| 17824|  +2.24%    |
      	+-----------+-------+-------+-------+-------+------+------------+
      
          Note: Improvement is measured for (VM1+VM2)
      Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dhaval.giani@oracle.com
      Cc: matt@codeblueprint.co.uk
      Cc: steven.sistare@oracle.com
      Cc: subhra.mazumdar@oracle.com
      Link: http://lkml.kernel.org/r/1525294330-7759-1-git-send-email-rohit.k.jain@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Avoid calling sync_entity_load_avg() unnecessarily · c976a862
      Committed by Viresh Kumar
      Call sync_entity_load_avg() directly from find_idlest_cpu() instead of
      select_task_rq_fair(), as that's where we need to use the task's
      utilization value. And call sync_entity_load_avg() only after making sure
      the sched domain spans over one of the allowed CPUs for the task.
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/cd019d1753824c81130eae7b43e2bbcec47cc1ad.1524738578.git.viresh.kumar@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c976a862
    • sched/fair: Rearrange select_task_rq_fair() to optimize it · f1d88b44
      Viresh Kumar authored
      Rearrange select_task_rq_fair() a bit to avoid executing some
      conditional statements in a few specific code paths. This also gets
      rid of the goto.
      
      This shouldn't result in any functional changes.
      Tested-by: Rohit Jain <rohit.k.jain@oracle.com>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20831b8d237bf3a20e4e328286f678b425ff04c9.1524738578.git.viresh.kumar@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f1d88b44
    • sched/core: Introduce set_special_state() · b5bf9a90
      Peter Zijlstra authored
      Gaurav reported a perceived problem with TASK_PARKED, which turned out
      to be a broken wait-loop pattern in __kthread_parkme(), but the
      reported issue can (and does) in fact happen for states that do not do
      condition based sleeps.
      
      When the 'current->state = TASK_RUNNING' store of a previous
      (concurrent) try_to_wake_up() collides with the setting of a 'special'
      sleep state, we can lose the sleep state.
      
      Normal condition-based wait-loops are immune to this problem, but
      sleep states that are not condition based are subject to it.
      
      There already is a fix for TASK_DEAD. Abstract that and also apply it
      to TASK_STOPPED and TASK_TRACED, both of which also lack a
      condition-based wait-loop.
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b5bf9a90
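      The lost-store race above can be sketched in plain userspace C. This is a hypothetical stand-in, not kernel code: a pthread mutex plays the role of the task's pi_lock, and the state values are simplified constants.

```c
#include <assert.h>
#include <pthread.h>

/* Userspace sketch of the set_special_state() idea (not kernel code).
 * A plain store to the task state can be overwritten by a concurrent
 * waker storing TASK_RUNNING; a 'special' state takes the same lock
 * the waker holds, so the store cannot be lost. */

#define TASK_RUNNING 0
#define TASK_PARKED  4

static pthread_mutex_t pi_lock = PTHREAD_MUTEX_INITIALIZER;
static int task_state = TASK_RUNNING;

/* Plain store: safe only for condition-based wait loops that re-check
 * their condition after waking. */
static void set_current_state(int state)
{
    task_state = state;
}

/* Locked store: serialized against the waker below, so a concurrent
 * wakeup cannot clobber the special sleep state mid-update. */
static void set_special_state(int state)
{
    pthread_mutex_lock(&pi_lock);
    task_state = state;
    pthread_mutex_unlock(&pi_lock);
}

static void try_to_wake_up(void)
{
    pthread_mutex_lock(&pi_lock);
    task_state = TASK_RUNNING;
    pthread_mutex_unlock(&pi_lock);
}
```

      In the kernel the special states are TASK_DEAD, TASK_STOPPED and TASK_TRACED; the sketch only shows why taking the waker's lock closes the window.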
  11. 03 May 2018, 2 commits
    • kthread, sched/wait: Fix kthread_parkme() completion issue · 85f1abe0
      Peter Zijlstra authored
      Even with the wait-loop fixed, there is a further issue with
      kthread_parkme(). Upon hotplug, when we do takedown_cpu(),
      smpboot_park_threads() can return before all those threads are in fact
      blocked, due to the placement of the complete() in __kthread_parkme().
      
      When that happens, sched_cpu_dying() -> migrate_tasks() can end up
      migrating such a still runnable task onto another CPU.
      
      Normally the task will have hit schedule() and gone to sleep by the
      time we do kthread_unpark(), which will then do __kthread_bind() to
      re-bind the task to the correct CPU.
      
      However, when we lose the initial TASK_PARKED store to the concurrent
      wakeup issue described previously, do the complete() and get migrated,
      it is possible to either:
      
       - observe kthread_unpark()'s clearing of SHOULD_PARK and terminate
         the park and set TASK_RUNNING, or
      
       - have __kthread_bind()'s wait_task_inactive() observe the competing
         TASK_RUNNING store.
      
      Either way the WARN() in __kthread_bind() will trigger and fail to
      correctly set the CPU affinity.
      
      Fix this by only issuing the complete() when the kthread has scheduled
      out. This does away with all the icky 'still running' nonsense.
      
      The alternative is to promote TASK_PARKED to a special state; this
      guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
      and we'll end up doing the right thing, but it preserves the whole
      icky business of potentially migrating the still runnable thing.
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      85f1abe0
    • sched/fair: Fix the update of blocked load when newly idle · 457be908
      Vincent Guittot authored
      With commit:
      
        31e77c93 ("sched/fair: Update blocked load when newly idle")
      
      ... we release the rq->lock when updating blocked load of idle CPUs.
      
      This opens a time window during which another CPU can add a task to this
      CPU's cfs_rq.
      
      The check in idle_balance() for a newly added task is not in the
      common path. Move the 'out' label so that this check is included.
      Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
      Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 31e77c93 ("sched/fair: Update blocked load when newly idle")
      Link: http://lkml.kernel.org/r/20180426103133.GA6953@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      457be908
  12. 09 April 2018, 1 commit
    • sched: idle: Select idle state before stopping the tick · 554c8aa8
      Rafael J. Wysocki authored
      In order to address the issue with short idle duration predictions
      by the idle governor after the scheduler tick has been stopped,
      reorder the code in cpuidle_idle_call() so that the governor idle
      state selection runs before tick_nohz_idle_stop_tick() and use the
      "nohz" hint returned by cpuidle_select() to decide whether or not
      to stop the tick.
      
      This isn't straightforward, because menu_select() invokes
      tick_nohz_get_sleep_length() to get the time to the next timer
      event and the number returned by the latter comes from
      __tick_nohz_idle_stop_tick().  Fortunately, however, it is possible
      to compute that number without actually stopping the tick and with
      the help of the existing code.
      
      Namely, tick_nohz_get_sleep_length() can be made to call
      tick_nohz_next_event(), introduced earlier, to get the time to the
      next non-highres timer event.  If that happens, tick_nohz_next_event()
      need not be called by __tick_nohz_idle_stop_tick() again.
      
      If it turns out that the scheduler tick cannot be stopped going
      forward or the next timer event is too close for the tick to be
      stopped, tick_nohz_get_sleep_length() can simply return the time to
      the next event currently programmed into the corresponding clock
      event device.
      
      In addition to knowing the return value of tick_nohz_next_event(),
      however, tick_nohz_get_sleep_length() needs to know the time to the
      next highres timer event, but with the scheduler tick timer excluded,
      which can be computed with the help of hrtimer_get_next_event().
      
      The minimum of that number and the tick_nohz_next_event() return
      value is the total time to the next timer event with the assumption
      that the tick will be stopped.  It can be returned to the idle
      governor which can use it for predicting idle duration (under the
      assumption that the tick will be stopped) and deciding whether or
      not it makes sense to stop the tick before putting the CPU into the
      selected idle state.
      
      With the above, the sleep_length field in struct tick_sched is not
      necessary any more, so drop it.
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=199227
      Reported-by: Doug Smythies <dsmythies@telus.net>
      Reported-by: Thomas Ilsche <thomas.ilsche@tu-dresden.de>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      554c8aa8
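      The "minimum of the two next-event times" computation above can be shown as a one-line helper. This is a hypothetical userspace sketch; the real tick_nohz_get_sleep_length() works on ktime_t values, with plain nanosecond integers standing in here.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the idea described above: assuming the tick will be
 * stopped, the time to the next timer event is the minimum of the
 * next non-highres event (tick_nohz_next_event()) and the next
 * highres timer event excluding the tick itself
 * (hrtimer_get_next_event()). uint64_t nanoseconds stand in
 * for ktime_t. */
static uint64_t sleep_length_ns(uint64_t next_event_ns,
                                uint64_t next_hrtimer_ns)
{
    return next_event_ns < next_hrtimer_ns ? next_event_ns
                                           : next_hrtimer_ns;
}
```

      The idle governor then predicts idle duration from this value under the assumption that the tick will in fact be stopped.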
  13. 06 April 2018, 5 commits
    • cpuidle: Return nohz hint from cpuidle_select() · 45f1ff59
      Rafael J. Wysocki authored
      Add a new pointer argument to cpuidle_select() and to the ->select
      cpuidle governor callback to allow a boolean value indicating
      whether or not the tick should be stopped before entering the
      selected state to be returned from there.
      
      Make the ladder governor ignore that pointer (to preserve its
      current behavior) and make the menu governor return 'false' through
      it if:
       (1) the idle exit latency is constrained at 0, or
       (2) the selected state is a polling one, or
       (3) the expected idle period duration is within the tick period
           range.
      
      In addition to that, the correction factor computations in the menu
      governor need to take the possibility that the tick may not be
      stopped into account to avoid artificially small correction factor
      values.  To that end, add a mechanism to record tick wakeups, as
      suggested by Peter Zijlstra, and use it to modify the menu_update()
      behavior when tick wakeup occurs.  Namely, if the CPU is woken up by
      the tick and the return value of tick_nohz_get_sleep_length() is not
      within the tick boundary, the predicted idle duration is likely too
      short, so make menu_update() try to compensate for that by updating
      the governor statistics as though the CPU was idle for a long time.
      
      Since the value returned through the new argument pointer of
      cpuidle_select() is not used by its caller yet, this change by
      itself is not expected to alter the functionality of the code.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      45f1ff59
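      The three conditions under which the menu governor keeps the tick running can be sketched as a predicate. The signature is hypothetical (the real ->select callback writes its answer through the new bool pointer argument); the three cases are the ones listed in the commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch (not the kernel's code) of the menu governor's tick-stop
 * decision: keep the tick (return false) when
 *  (1) the idle exit latency is constrained at 0,
 *  (2) the selected state is a polling one, or
 *  (3) the predicted idle duration is within the tick period. */
static bool menu_stop_tick(uint64_t latency_req_ns, bool polling_state,
                           uint64_t predicted_idle_ns, uint64_t tick_ns)
{
    if (latency_req_ns == 0)
        return false;           /* case (1) */
    if (polling_state)
        return false;           /* case (2) */
    if (predicted_idle_ns < tick_ns)
        return false;           /* case (3) */
    return true;                /* long idle expected: stop the tick */
}
```

      The caller (cpuidle_idle_call(), in the later patch above) uses the returned hint to decide whether tick_nohz_idle_stop_tick() is worth calling.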
    • kernel/fork.c: detect early free of a live mm · 3eda69c9
      Mark Rutland authored
      KASAN splats indicate that in some cases we free a live mm, then
      continue to access it, with potentially disastrous results.  This is
      likely due to a mismatched mmdrop() somewhere in the kernel, but so far
      the culprit remains elusive.
      
      Let's have __mmdrop() verify that the mm isn't live for the current
      task, similar to the existing check for init_mm.  This way, we can catch
      this class of issue earlier, and without requiring KASAN.
      
      Currently, idle_task_exit() leaves active_mm stale after it switches to
      init_mm.  This isn't harmful, but will trigger the new assertions, so we
      must adjust idle_task_exit() to update active_mm.
      
      Link: http://lkml.kernel.org/r/20180312140103.19235-1-mark.rutland@arm.com
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3eda69c9
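      The new sanity check amounts to asserting that the mm being dropped is not the caller's mm or active_mm. A minimal userspace sketch with simplified stand-in types (not the kernel's struct definitions):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in types for illustration only. */
struct mm_struct { int users; };
struct task_struct { struct mm_struct *mm, *active_mm; };

/* Returns true when freeing 'mm' would free a live mm for 'tsk' --
 * the condition the new assertion in __mmdrop() catches (the kernel
 * reports it via a WARN rather than returning a value). */
static bool mm_is_live(const struct task_struct *tsk,
                       const struct mm_struct *mm)
{
    return tsk->mm == mm || tsk->active_mm == mm;
}
```

      This is why idle_task_exit() must update active_mm after switching to init_mm: otherwise the stale pointer would trip the check.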
    • sched: idle: Do not stop the tick before cpuidle_idle_call() · ed98c349
      Rafael J. Wysocki authored
      Make cpuidle_idle_call() decide whether or not to stop the tick.
      
      First, the cpuidle_enter_s2idle() path deals with the tick (and with
      the entire timekeeping for that matter) by itself and it doesn't need
      the tick to be stopped beforehand.
      
      Second, to address the issue with short idle duration predictions
      by the idle governor after the tick has been stopped, it will be
      necessary to change the ordering of cpuidle_select() with respect
      to tick_nohz_idle_stop_tick().  To prepare for that, put a
      tick_nohz_idle_stop_tick() call in the same branch in which
      cpuidle_select() is called.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      ed98c349
    • sched: idle: Do not stop the tick upfront in the idle loop · 2aaf709a
      Rafael J. Wysocki authored
      Push the decision whether or not to stop the tick somewhat deeper
      into the idle loop.
      
      Stopping the tick upfront leads to unpleasant outcomes in case the
      idle governor doesn't agree with the nohz code on the duration of the
      upcoming idle period.  Specifically, if the tick has been stopped and
      the idle governor predicts short idle, the situation is bad regardless
      of whether or not the prediction is accurate.  If it is accurate, the
      tick has been stopped unnecessarily which means excessive overhead.
      If it is not accurate, the CPU is likely to spend too much time in
      the (shallow, because short idle has been predicted) idle state
      selected by the governor [1].
      
      As the first step towards addressing this problem, change the code
      to make the tick stopping decision inside of the loop in do_idle().
      In particular, do not stop the tick in the cpu_idle_poll() code path.
      Also don't do that in tick_nohz_irq_exit() which doesn't really have
      enough information on whether or not to stop the tick.
      
      Link: https://marc.info/?l=linux-pm&m=150116085925208&w=2 # [1]
      Link: https://tu-dresden.de/zih/forschung/ressourcen/dateien/projekte/haec/powernightmares.pdf
      Suggested-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      2aaf709a
    • time: tick-sched: Reorganize idle tick management code · 0e776768
      Rafael J. Wysocki authored
      Prepare the scheduler tick code for reworking the idle loop to
      avoid stopping the tick in some cases.
      
      The idea is to split the nohz idle entry call to decouple the idle
      time stats accounting and preparatory work from the actual tick stop
      code, in order to later be able to delay the tick stop once we reach
      more power-knowledgeable callers.
      
      Move away the tick_nohz_start_idle() invocation from
      __tick_nohz_idle_enter(), rename the latter to
      __tick_nohz_idle_stop_tick() and define tick_nohz_idle_stop_tick()
      as a wrapper around it for calling it from the outside.
      
      Make tick_nohz_idle_enter() only call tick_nohz_start_idle() instead
      of calling the entire __tick_nohz_idle_enter(), add another wrapper
      disabling and enabling interrupts around tick_nohz_idle_stop_tick()
      and make the current callers of tick_nohz_idle_enter() call it too
      to retain their current functionality.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      0e776768
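      The resulting call structure can be sketched with userspace stubs. Only the function names come from the commit; the bodies are stand-in flags used purely to show the decoupling of idle-time accounting from the tick stop.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flags for illustration. */
static bool idle_accounting_started, tick_stopped, irqs_disabled;

static void tick_nohz_start_idle(void)
{
    idle_accounting_started = true;
}

/* After the split, entering idle only starts accounting -- it no
 * longer stops the tick. */
static void tick_nohz_idle_enter(void)
{
    tick_nohz_start_idle();
}

static void __tick_nohz_idle_stop_tick(void)
{
    tick_stopped = true;
}

/* Separate entry point for the tick stop ... */
static void tick_nohz_idle_stop_tick(void)
{
    __tick_nohz_idle_stop_tick();
}

/* ... plus a wrapper that disables/enables "interrupts" around it
 * for callers outside the idle loop. */
static void tick_nohz_idle_stop_tick_protected(void)
{
    irqs_disabled = true;
    tick_nohz_idle_stop_tick();
    irqs_disabled = false;
}
```

      Deferring the tick_nohz_idle_stop_tick() call is what later allows cpuidle_select() to run first and decide whether stopping the tick is worthwhile at all.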
  14. 05 April 2018, 2 commits
  15. 03 April 2018, 1 commit
  16. 01 April 2018, 1 commit
  17. 27 March 2018, 1 commit
  18. 24 March 2018, 1 commit
  19. 20 March 2018, 3 commits
    • sched/debug: Adjust newlines for better alignment · e9ca2670
      Joe Lawrence authored
      Scheduler debug stats include newlines that display out of alignment
      when prefixed by timestamps.  For example, the dmesg utility:
      
        % echo t > /proc/sysrq-trigger
        % dmesg
        ...
        [   83.124251]
        runnable tasks:
         S           task   PID         tree-key  switches  prio     wait-time
        sum-exec        sum-sleep
        -----------------------------------------------------------------------------------------------------------
      
      At the same time, some syslog utilities (like rsyslog by default) don't
      like the additional newlines control characters, saving lines like this
      to /var/log/messages:
      
        Mar 16 16:02:29 localhost kernel: #012runnable tasks:#012 S           task   PID         tree-key ...
                                          ^^^^               ^^^^
      Clean these up by moving newline characters to their own SEQ_printf
      invocation.  This leaves the /proc/sched_debug unchanged, but brings the
      entire output into alignment when prefixed:
      
        % echo t > /proc/sysrq-trigger
        % dmesg
        ...
        [   62.410368] runnable tasks:
        [   62.410368]  S           task   PID         tree-key  switches  prio     wait-time             sum-exec        sum-sleep
        [   62.410369] -----------------------------------------------------------------------------------------------------------
        [   62.410369]  I  kworker/u12:0     5      1932.215593       332   120         0.000000         3.621252         0.000000 0 0 /
      
      and no escaped control characters from rsyslog in /var/log/messages:
      
        Mar 16 16:15:06 localhost kernel: runnable tasks:
        Mar 16 16:15:06 localhost kernel: S           task   PID         tree-key  ...
      Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1521484555-8620-3-git-send-email-joe.lawrence@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e9ca2670
    • sched/debug: Fix per-task line continuation for console output · a8c024cd
      Joe Lawrence authored
      When the SEQ_printf() macro prints to the console, it runs a simple
      printk() without KERN_CONT "continued" line printing.  The result of
      this is oddly wrapped task info, for example:
      
        % echo t > /proc/sysrq-trigger
        % dmesg
        ...
        runnable tasks:
        ...
        [   29.608611]  I
        [   29.608613]       rcu_sched     8      3252.013846      4087   120
        [   29.608614]         0.000000        29.090111         0.000000
        [   29.608615]  0 0
        [   29.608616]  /
      
      Modify SEQ_printf to use pr_cont() for expected one-line results:
      
        % echo t > /proc/sysrq-trigger
        % dmesg
        ...
        runnable tasks:
        ...
        [  106.716329]  S        cpuhp/5    37      2006.315026        14   120         0.000000         0.496893         0.000000 0 0 /
      Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1521484555-8620-2-git-send-email-joe.lawrence@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a8c024cd
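      The fix can be sketched as a userspace analogue of the SEQ_printf() macro. Here FILE * stands in for struct seq_file *, and printf() stands in for pr_cont(), which continues the current console line instead of starting a new log record.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Userspace sketch of the SEQ_printf() idea (not the kernel macro):
 * with a sequence file, print into it; on the console path, continue
 * the current line rather than emitting a fresh record per call. */
#define SEQ_printf(m, fmt, ...)                        \
    do {                                               \
        if (m)                                         \
            fprintf((FILE *)(m), fmt, ##__VA_ARGS__);  \
        else                                           \
            printf(fmt, ##__VA_ARGS__);                \
    } while (0)
```

      With line continuation on the console path, the many SEQ_printf() calls that build one task row produce a single dmesg line instead of the fragmented output shown above.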
    • sched/wait: Improve __var_waitqueue() code generation · b3fc5c9b
      Peter Zijlstra authored
      Since we fixed hash_64() to not suck, there is no need to play games
      to attempt to improve the hash value on 64-bit.
      
      Also, since we don't use the bit value for the variables, use hash_ptr()
      directly.
      
      No change in functionality.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b3fc5c9b
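      The hashing step behind __var_waitqueue() can be sketched in userspace C. The multiplicative constant is the kernel's 64-bit golden-ratio value; the 8-bit table size is an assumption for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of hashing a variable's address into a waitqueue index, as
 * __var_waitqueue() does with hash_ptr(): multiply by the 64-bit
 * golden-ratio constant and keep the top bits. Assumed table size of
 * 2^8 buckets for illustration. */
#define GOLDEN_RATIO_64 0x61C8864680B583EBull
#define WAIT_TABLE_BITS 8

static unsigned int hash_ptr(const void *ptr, unsigned int bits)
{
    return (unsigned int)(((uint64_t)(uintptr_t)ptr * GOLDEN_RATIO_64)
                          >> (64 - bits));
}
```

      Because the bit value of the variable is never used, hashing the pointer directly like this is sufficient, which is the simplification the commit makes.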