1. 14 May 2018 (1 commit)
    • sched/numa: Stagger NUMA balancing scan periods for new threads · 13784475
      Committed by Mel Gorman
      Threads share an address space and each can change the protections of the
      same address space to trap NUMA faults. This is redundant and potentially
      counter-productive as any thread doing the update will suffice. Potentially
      only one thread is required but that thread may be idle or it may not have
      any locality concerns and pick an unsuitable scan rate.
      
      This patch uses independent scan periods, but they are staggered based on
      the number of address space users when the thread is created.  The intent
      is that threads will avoid scanning at the same time and have a chance
      to adapt their scan rate later if necessary. This reduces the total scan
      activity early in the lifetime of the threads.
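
      A minimal sketch of the idea follows; the helper name and the exact
      clamping are illustrative rather than the literal mainline code. A new
      thread's first NUMA scan is deferred in proportion to how many users the
      address space already has, capped at the maximum scan period:

        /* Sketch: stagger the first NUMA scan of a newly created thread. */
        static void stagger_numa_scan(struct task_struct *p, struct mm_struct *mm)
        {
                unsigned int mm_users = atomic_read(&mm->mm_users);
                u64 delay;

                /* Later threads wait longer before their first scan ... */
                delay = (u64)p->numa_scan_period * mm_users * NSEC_PER_MSEC;
                /* ... but never longer than the maximum scan period. */
                delay = min_t(u64, delay,
                              (u64)p->numa_scan_period_max * NSEC_PER_MSEC);

                /* First scan only after this much runtime has accumulated. */
                p->node_stamp = delay;
        }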
      
      The difference in headline performance across a range of machines and
      workloads is marginal but the system CPU usage is reduced as well as overall
      scan activity.  The following is the time reported by NAS Parallel Benchmark
      using unbound openmp threads and a D size class:
      
      			      4.17.0-rc1             4.17.0-rc1
      				 vanilla           stagger-v1r1
      	Time bt.D      442.77 (   0.00%)      419.70 (   5.21%)
      	Time cg.D      171.90 (   0.00%)      180.85 (  -5.21%)
      	Time ep.D       33.10 (   0.00%)       32.90 (   0.60%)
      	Time is.D        9.59 (   0.00%)        9.42 (   1.77%)
      	Time lu.D      306.75 (   0.00%)      304.65 (   0.68%)
      	Time mg.D       54.56 (   0.00%)       52.38 (   4.00%)
      	Time sp.D     1020.03 (   0.00%)      903.77 (  11.40%)
      	Time ua.D      400.58 (   0.00%)      386.49 (   3.52%)
      
      Note that it's not a universal win; we have no prior knowledge of which
      thread matters, and the number of threads created often exceeds the size
      of the node when the threads are not bound. However, there is a reduction
      in overall system CPU usage:
      
      				    4.17.0-rc1             4.17.0-rc1
      				       vanilla           stagger-v1r1
      	sys-time-bt.D         48.78 (   0.00%)       48.22 (   1.15%)
      	sys-time-cg.D         25.31 (   0.00%)       26.63 (  -5.22%)
      	sys-time-ep.D          1.65 (   0.00%)        0.62 (  62.42%)
      	sys-time-is.D         40.05 (   0.00%)       24.45 (  38.95%)
      	sys-time-lu.D         37.55 (   0.00%)       29.02 (  22.72%)
      	sys-time-mg.D         47.52 (   0.00%)       34.92 (  26.52%)
      	sys-time-sp.D        119.01 (   0.00%)      109.05 (   8.37%)
      	sys-time-ua.D         51.52 (   0.00%)       45.13 (  12.40%)
      
      NUMA scan activity is also reduced:
      
      	NUMA alloc local               1042828     1342670
      	NUMA base PTE updates        140481138    93577468
      	NUMA huge PMD updates           272171      180766
      	NUMA page range updates      279832690   186129660
      	NUMA hint faults               1395972     1193897
      	NUMA hint local faults          877925      855053
      	NUMA hint local percent             62          71
      	NUMA pages migrated           12057909     9158023
      
      Similar observations are made for other thread-intensive workloads. System
      CPU usage is lower even though the headline gains in performance tend to be
      small. For example, specjbb 2005 shows almost no difference in performance
      but scan activity is reduced by a third on a 4-socket box. I didn't find
      a workload (thread intensive or otherwise) that suffered badly.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 05 May 2018 (1 commit)
  3. 04 May 2018 (2 commits)
    • sched/core: Don't schedule threads on pre-empted vCPUs · 247f2f6f
      Committed by Rohit Jain
      In paravirt configurations today, spinlocks figure out whether a vCPU is
      running to determine whether or not the spinlock should bother spinning.
      We can use the same logic to prioritize CPUs when scheduling threads. If
      a vCPU has been pre-empted, it will incur the extra cost of a VMENTER
      plus the time it waits until it is actually running on the host CPU. If
      other vCPUs are actually running on the host CPU and are idle, we should
      schedule threads there instead.
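
      A rough sketch of the check this enables (the helper name below is made
      up for illustration; idle_cpu() and vcpu_is_preempted() are the existing
      kernel primitives): a CPU only counts as an idle wakeup candidate when
      its vCPU is not currently preempted by the hypervisor.

        /* Sketch: idle AND not a preempted vCPU. */
        static int cpu_usable_for_wakeup(int cpu)
        {
                if (!idle_cpu(cpu))
                        return 0;

                /* A preempted vCPU costs a VMENTER plus the wait until the
                 * hypervisor actually runs it again; skip it. */
                if (vcpu_is_preempted(cpu))
                        return 0;

                return 1;
        }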
      
      Performance numbers:
      
      Note: the kernel with the patch applied is referred to as Paravirt in
      the following, and the kernel without it as Base.
      
      1) When only 1 VM is running:
      
          a) Hackbench test on KVM 8 vCPUs, 10,000 loops (lower is better):
      
      	+-------+-----------------+----------------+
      	|Number |Paravirt         |Base            |
      	|of     +---------+-------+-------+--------+
      	|Threads|Average  |Std Dev|Average| Std Dev|
      	+-------+---------+-------+-------+--------+
      	|1      |1.817    |0.076  |1.721  | 0.067  |
      	|2      |3.467    |0.120  |3.468  | 0.074  |
      	|4      |6.266    |0.035  |6.314  | 0.068  |
      	|8      |11.437   |0.105  |11.418 | 0.132  |
      	|16     |21.862   |0.167  |22.161 | 0.129  |
      	|25     |33.341   |0.326  |33.692 | 0.147  |
      	+-------+---------+-------+-------+--------+
      
      2) When two VMs are running with the same CPU affinities:
      
          a) tbench test on VM 8 cpus
      
          Base:
      
      	VM1:
      
      	Throughput 220.59 MB/sec   1 clients  1 procs  max_latency=12.872 ms
      	Throughput 448.716 MB/sec  2 clients  2 procs  max_latency=7.555 ms
      	Throughput 861.009 MB/sec  4 clients  4 procs  max_latency=49.501 ms
      	Throughput 1261.81 MB/sec  7 clients  7 procs  max_latency=76.990 ms
      
      	VM2:
      
      	Throughput 219.937 MB/sec  1 clients  1 procs  max_latency=12.517 ms
      	Throughput 470.99 MB/sec   2 clients  2 procs  max_latency=12.419 ms
      	Throughput 841.299 MB/sec  4 clients  4 procs  max_latency=37.043 ms
      	Throughput 1240.78 MB/sec  7 clients  7 procs  max_latency=77.489 ms
      
          Paravirt:
      
      	VM1:
      
      	Throughput 222.572 MB/sec  1 clients  1 procs  max_latency=7.057 ms
      	Throughput 485.993 MB/sec  2 clients  2 procs  max_latency=26.049 ms
      	Throughput 947.095 MB/sec  4 clients  4 procs  max_latency=45.338 ms
      	Throughput 1364.26 MB/sec  7 clients  7 procs  max_latency=145.124 ms
      
      	VM2:
      
      	Throughput 224.128 MB/sec  1 clients  1 procs  max_latency=4.564 ms
      	Throughput 501.878 MB/sec  2 clients  2 procs  max_latency=11.061 ms
      	Throughput 965.455 MB/sec  4 clients  4 procs  max_latency=45.370 ms
      	Throughput 1359.08 MB/sec  7 clients  7 procs  max_latency=168.053 ms
      
          b) Hackbench with 4 fd 1,000,000 loops
      
      	+-------+--------------------------------------+----------------------------------------+
      	|Number |Paravirt                              |Base                                    |
      	|of     +----------+--------+---------+--------+----------+--------+---------+----------+
      	|Threads|Average1  |Std Dev1|Average2 | Std Dev|Average1  |Std Dev1|Average2 | Std Dev 2|
      	+-------+----------+--------+---------+--------+----------+--------+---------+----------+
      	|  1    | 3.748    | 0.620  | 3.576   | 0.432  | 4.006    | 0.395  | 3.446   | 0.787    |
      	+-------+----------+--------+---------+--------+----------+--------+---------+----------+
      
          Note that this test was run just to show the interference effect
          over-subscription can have on the baseline.
      
          c) schbench results with 2 message groups on 8 vCPU VMs
      
      	+-----------+-------+---------------+--------------+------------+
      	|           |       | Paravirt      | Base         |            |
      	+-----------+-------+-------+-------+-------+------+------------+
      	|           |Threads| VM1   | VM2   |  VM1  | VM2  |%Improvement|
      	+-----------+-------+-------+-------+-------+------+------------+
      	|50.0000th  |    1  | 52    | 53    |  58   | 54   |  +6.25%    |
      	|75.0000th  |    1  | 69    | 61    |  83   | 59   |  +8.45%    |
      	|90.0000th  |    1  | 80    | 80    |  89   | 83   |  +6.98%    |
      	|95.0000th  |    1  | 83    | 83    |  93   | 87   |  +7.78%    |
      	|*99.0000th |    1  | 92    | 94    |  99   | 97   |  +5.10%    |
      	|99.5000th  |    1  | 95    | 100   |  102  | 103  |  +4.88%    |
      	|99.9000th  |    1  | 107   | 123   |  105  | 203  |  +25.32%   |
      	+-----------+-------+-------+-------+-------+------+------------+
      	|50.0000th  |    2  | 56    | 62    |  67   | 59   |  +6.35%    |
      	|75.0000th  |    2  | 69    | 75    |  80   | 71   |  +4.64%    |
      	|90.0000th  |    2  | 80    | 82    |  90   | 81   |  +5.26%    |
      	|95.0000th  |    2  | 85    | 87    |  97   | 91   |  +8.51%    |
      	|*99.0000th |    2  | 98    | 99    |  107  | 109  |  +8.79%    |
      	|99.5000th  |    2  | 107   | 105   |  109  | 116  |  +5.78%    |
      	|99.9000th  |    2  | 9968  | 609   |  875  | 3116 | -165.02%   |
      	+-----------+-------+-------+-------+-------+------+------------+
      	|50.0000th  |    4  | 78    | 77    |  78   | 79   |  +1.27%    |
      	|75.0000th  |    4  | 98    | 106   |  100  | 104  |   0.00%    |
      	|90.0000th  |    4  | 987   | 1001  |  995  | 1015 |  +1.09%    |
      	|95.0000th  |    4  | 4136  | 5368  |  5752 | 5192 |  +13.16%   |
      	|*99.0000th |    4  | 11632 | 11344 |  11024| 10736|  -5.59%    |
      	|99.5000th  |    4  | 12624 | 13040 |  12720| 12144|  -3.22%    |
      	|99.9000th  |    4  | 13168 | 18912 |  14992| 17824|  +2.24%    |
      	+-----------+-------+-------+-------+-------+------+------------+
      
          Note: Improvement is measured for (VM1+VM2)
      Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dhaval.giani@oracle.com
      Cc: matt@codeblueprint.co.uk
      Cc: steven.sistare@oracle.com
      Cc: subhra.mazumdar@oracle.com
      Link: http://lkml.kernel.org/r/1525294330-7759-1-git-send-email-rohit.k.jain@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Introduce set_special_state() · b5bf9a90
      Committed by Peter Zijlstra
      Gaurav reported a perceived problem with TASK_PARKED, which turned out
      to be a broken wait-loop pattern in __kthread_parkme(), but the
      reported issue can (and does) in fact happen for states that do not do
      condition based sleeps.
      
      When the 'current->state = TASK_RUNNING' store of a previous
      (concurrent) try_to_wake_up() collides with the setting of a 'special'
      sleep state, we can lose the sleep state.
      
      Normal condition-based wait-loops are immune to this problem, but sleep
      states that are not condition based are subject to it.
      
      There already is a fix for TASK_DEAD. Abstract that and also apply it
      to TASK_STOPPED and TASK_TRACED, both of which also lack a
      condition-based wait-loop.
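
      The resulting helper is roughly of the following shape: it serializes
      the store against a concurrent try_to_wake_up() by holding ->pi_lock,
      so the wakeup's TASK_RUNNING store cannot overwrite the special state.

        /* Sketch of set_special_state(); treat the details as illustrative. */
        #define set_special_state(state_value)                                  \
                do {                                                            \
                        unsigned long flags; /* may shadow */                   \
                        raw_spin_lock_irqsave(&current->pi_lock, flags);        \
                        current->state = (state_value);                         \
                        raw_spin_unlock_irqrestore(&current->pi_lock, flags);   \
                } while (0)
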
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 03 May 2018 (1 commit)
    • kthread, sched/wait: Fix kthread_parkme() completion issue · 85f1abe0
      Committed by Peter Zijlstra
      Even with the wait-loop fixed, there is a further issue with
      kthread_parkme(). Upon hotplug, when we do takedown_cpu(),
      smpboot_park_threads() can return before all those threads are in fact
      blocked, due to the placement of the complete() in __kthread_parkme().
      
      When that happens, sched_cpu_dying() -> migrate_tasks() can end up
      migrating such a still runnable task onto another CPU.
      
      Normally the task will have hit schedule() and gone to sleep by the
      time we do kthread_unpark(), which will then do __kthread_bind() to
      re-bind the task to the correct CPU.
      
      However, when we lose the initial TASK_PARKED store to the concurrent
      wakeup issue described previously, do the complete(), and get migrated,
      it is possible to either:
      
       - observe kthread_unpark()'s clearing of SHOULD_PARK, terminate the
         park and set TASK_RUNNING, or
      
       - have __kthread_bind()'s wait_task_inactive() observe the competing
         TASK_RUNNING store.
      
      Either way the WARN() in __kthread_bind() will trigger and fail to
      correctly set the CPU affinity.
      
      Fix this by only issuing the complete() when the kthread has scheduled
      out. This does away with all the icky 'still running' nonsense.
      
      The alternative is to promote TASK_PARKED to a special state; this
      guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
      and we'll end up doing the right thing, but it preserves the whole
      icky business of potentially migrating the still-runnable thing.
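
      Conceptually, __kthread_parkme() then looks like this sketch (the exact
      hook that fires the completion from the schedule-out path is omitted and
      should be treated as illustrative):

        static void __kthread_parkme(struct kthread *self)
        {
                for (;;) {
                        set_current_state(TASK_PARKED);
                        if (!test_bit(KTHREAD_SHOULD_PARK, &self->flags))
                                break;
                        /* No complete() here: the parker is only notified
                         * once this task has truly scheduled out. */
                        schedule();
                }
                __set_current_state(TASK_RUNNING);
        }
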
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 06 April 2018 (1 commit)
  6. 05 April 2018 (1 commit)
  7. 03 April 2018 (1 commit)
  8. 27 March 2018 (1 commit)
  9. 09 March 2018 (4 commits)
  10. 04 March 2018 (2 commits)
    • sched/core: Undefine tracepoint creation at the end of core.c · 14a7405b
      Committed by Ingo Molnar
      Make it easier to concatenate all the scheduler .c files for single-module
      compilation.
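
      Conceptually, core.c ends up bracketed like this (sketch), so that any
      .c file concatenated after it can create its own tracepoints:

        /* Top of kernel/sched/core.c: instantiate the scheduler tracepoints. */
        #define CREATE_TRACE_POINTS
        #include <trace/events/sched.h>

        /* ... rest of core.c ... */

        /* End of file: undo the definition so concatenated units stay clean. */
        #undef CREATE_TRACE_POINTS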
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/headers: Simplify and clean up header usage in the scheduler · 325ea10c
      Committed by Ingo Molnar
      Do the following cleanups and simplifications:
      
       - sched/sched.h already includes <asm/paravirt.h>, so no need to
         include it in sched/core.c again.
      
       - order the <linux/sched/*.h> headers alphabetically
      
       - add all <linux/sched/*.h> headers to kernel/sched/sched.h
      
       - remove all unnecessary includes from the .c files that
         are already included in kernel/sched/sched.h.
      
      Finally, make all scheduler .c files use a single common header:
      
        #include "sched.h"
      
      ... which now contains a union of the relied upon headers.
      
      This makes the various .c files easier to read and easier to handle.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 03 March 2018 (1 commit)
    • sched: Clean up and harmonize the coding style of the scheduler code base · 97fb7a0a
      Committed by Ingo Molnar
      A good number of small style inconsistencies have accumulated
      in the scheduler core, so do a pass over them to harmonize
      all these details:
      
       - fix spelling in comments,
      
       - use curly braces for multi-line statements,
      
       - remove unnecessary parentheses from integer literals,
      
       - capitalize consistently,
      
       - remove stray newlines,
      
       - add comments where necessary,
      
       - remove invalid/unnecessary comments,
      
       - align structure definitions and other data types vertically,
      
       - add missing newlines for increased readability,
      
       - fix vertical tabulation where it's misaligned,
      
       - harmonize preprocessor conditional block labeling
         and vertical alignment,
      
       - remove line-breaks where they uglify the code,
      
       - add newline after local variable definitions,
      
      No change in functionality:
      
        md5:
           1191fa0a890cfa8132156d2959d7e9e2  built-in.o.before.asm
           1191fa0a890cfa8132156d2959d7e9e2  built-in.o.after.asm
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  12. 21 February 2018 (3 commits)
  13. 13 February 2018 (2 commits)
    • sched/core: Fix DEBUG_SPINLOCK annotation for rq->lock · 269d5992
      Committed by Peter Zijlstra
      Mark noticed that he had sporadic "spinlock recursion" warnings from
      the DEBUG_SPINLOCK code. Now rq->lock is special in that the owner
      changes in the middle of a context switch.
      
      It so happens that we fix up lock.owner too late: @prev can run
      (remotely) the moment prev->on_cpu is cleared, which then allows @prev
      to again try to acquire this rq->lock and trigger the warning.
      
      So we have to switch lock.owner before clearing prev->on_cpu.
      
      Do this by moving the DEBUG_SPINLOCK annotation from after switch_to()
      to before switch_to() and collect all lockdep annotations there into
      prepare_lock_switch() to mirror the existing finish_lock_switch().
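
      A simplified sketch of the new ordering (the real prepare_lock_switch()
      also collects the other lockdep annotations):

        /* Sketch: hand rq->lock ownership to @next *before* switch_to(),
         * i.e. before prev->on_cpu is cleared and @prev can run remotely. */
        static inline void prepare_lock_switch(struct rq *rq, struct task_struct *next)
        {
        #ifdef CONFIG_DEBUG_SPINLOCK
                /* rq->lock stays held across the context switch. */
                rq->lock.owner = next;
        #endif
        }
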
      Debugged-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched, cgroup: Don't reject lower cpu.max on ancestors · c53593e5
      Committed by Tejun Heo
      While adding cgroup2 interface for the cpu controller, 0d593634
      ("sched: Implement interface for cgroup unified hierarchy") forgot to
      update input validation and left it to reject cpu.max config if any
      descendant has set a higher value.
      
      cgroup2 officially supports delegation and a descendant must not be
      able to restrict what its ancestors can configure.  For absolute
      limits such as cpu.max and memory.max, this means that the config at
      each level should only act as the upper limit at that level and
      shouldn't interfere with what other cgroups can configure.
      
      This patch updates config validation on cgroup2 so that the cpu
      controller follows the same convention.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 0d593634 ("sched: Implement interface for cgroup unified hierarchy")
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org # v4.15+
  14. 07 February 2018 (1 commit)
  15. 06 February 2018 (5 commits)
    • sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS · 32e839dd
      Committed by Mel Gorman
      The select_idle_sibling() (SIS) rewrite in commit:
      
        10e2f1ac ("sched/core: Rewrite and improve select_idle_siblings()")
      
      ... replaced a domain iteration with a search that broadly speaking
      does a wrapped walk of the scheduler domain sharing a last-level-cache.
      
      While this had a number of improvements, one consequence is that two tasks
      that share a waker/wakee relationship push each other around a socket. Even
      though only two tasks may be active, all cores end up evenly used. This is
      great from a search perspective and spreads load across individual cores,
      but it has adverse consequences for cpufreq. As each CPU has relatively low
      utilisation, cpufreq may decide the utilisation is too low to use a higher
      P-state and overall computation throughput suffers.
      
      While individual cpufreq and cpuidle drivers may compensate by artificially
      boosting P-state (at c0) or avoiding lower C-states (during idle), it does
      not help if hardware-based cpufreq (e.g. HWP) is used.
      
      This patch tracks a recently used CPU: the CPU a task was running on when
      it last acted as a waker, or a CPU it was recently using when it was last
      a wakee. During SIS, the recently used CPU is used as a target if it's
      still allowed by the task and is idle.
      
      The benefit may be non-obvious so consider an example of two tasks
      communicating back and forth. Task A may be an application doing IO where
      task B is a kworker or kthread like journald. Task A may issue IO, wake
      B and B wakes up A on completion.  With the existing scheme this may look
      like the following (potentially different IDs if SMT is in use but a
      similar principle applies).
      
       A (cpu 0)	wake	B (wakes on cpu 1)
       B (cpu 1)	wake	A (wakes on cpu 2)
       A (cpu 2)	wake	B (wakes on cpu 3)
       etc.
      
      A careful reader may wonder why CPU 0 was not idle when B wakes A the
      first time; it's simply because A can be rescheduled to another CPU, the
      pattern is that prev == target when B tries to wake up A, and the
      information about CPU 0 has been lost.
      
      With this patch, the pattern is more likely to be:
      
       A (cpu 0)	wake	B (wakes on cpu 1)
       B (cpu 1)	wake	A (wakes on cpu 0)
       A (cpu 0)	wake	B (wakes on cpu 1)
       etc
      
      i.e. two communicating tasks are more likely to use just two cores instead
      of all available cores sharing an LLC.
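
      Inside select_idle_sibling() the new check is roughly of this shape
      (simplified sketch):

        /* Consider the recently used CPU before doing a full idle search. */
        recent_used_cpu = p->recent_used_cpu;
        if (recent_used_cpu != prev &&
            recent_used_cpu != target &&
            cpus_share_cache(recent_used_cpu, target) &&
            idle_cpu(recent_used_cpu) &&
            cpumask_test_cpu(recent_used_cpu, &p->cpus_allowed)) {
                /* Remember prev as a candidate for the next wakeup. */
                p->recent_used_cpu = prev;
                return recent_used_cpu;
        }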
      
      The most dramatic speedup was noticed on dbench using the XFS filesystem on
      UMA as clients interact heavily with workqueues in that configuration. Note
      that a similar speedup is not observed on ext4 as the wakeup pattern
      is different:
      
                                4.15.0-rc9             4.15.0-rc9
                                 waprev-v1        biasancestor-v1
       Hmean      1      287.54 (   0.00%)      817.01 ( 184.14%)
       Hmean      2     1268.12 (   0.00%)     1781.24 (  40.46%)
       Hmean      4     1739.68 (   0.00%)     1594.47 (  -8.35%)
       Hmean      8     2464.12 (   0.00%)     2479.56 (   0.63%)
       Hmean     64     1455.57 (   0.00%)     1434.68 (  -1.44%)
      
      The results can be less dramatic on NUMA where automatic balancing interferes
      with the test. It's also known that network benchmarks running on localhost
      benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
      and TCP depending on the machine). Hackbench also sees small improvements
      (6-11% depending on machine and thread count). The facebook schbench was also
      tested but in most cases showed little or no difference in wakeup latencies.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Optimize ttwu_stat() · b85c8b71
      Committed by Peter Zijlstra
      The whole of ttwu_stat() is guarded by a single schedstat_enabled(),
      there is absolutely no point in then issuing another static_branch for
      every single schedstat_inc() in there.
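
      In other words, once execution is inside a schedstat_enabled() block the
      unguarded __schedstat_*() helpers can be used; roughly (sketch, with the
      counters abbreviated):

        static void ttwu_stat(struct task_struct *p, int cpu, int wake_flags)
        {
                struct rq *rq = this_rq();

                if (!schedstat_enabled())
                        return;

                /* Already behind the static branch above, so skip the
                 * per-increment schedstat_enabled() checks. */
                __schedstat_inc(rq->ttwu_count);
                __schedstat_inc(p->se.statistics.nr_wakeups);
                /* ... remaining counters updated the same way ... */
        }
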
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • membarrier: Provide core serializing command, *_SYNC_CORE · 70216e18
      Committed by Mathieu Desnoyers
      Provide core serializing membarrier command to support memory reclaim
      by JIT.
      
      Each architecture needs to explicitly opt into that support by
      documenting in their architecture code how they provide the core
      serializing instructions required when returning from the membarrier
      IPI, and after the scheduler has updated the curr->mm pointer (before
      going back to user-space). They should then select
      ARCH_HAS_MEMBARRIER_SYNC_CORE to enable support for that command on
      their architecture.
      
      Architectures selecting this feature need to either document that
      they issue core serializing instructions when returning to user-space,
      or implement their architecture-specific sync_core_before_usermode().
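
      From user-space, a JIT would use the command roughly as follows (sketch;
      error handling omitted, and registration is normally done once at
      startup):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static void jit_register_sync_core(void)
        {
                syscall(__NR_membarrier,
                        MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
        }

        static void jit_reclaim_sync_core(void)
        {
                /* After unmapping/rewriting code pages, force every thread of
                 * this process through a core serializing instruction before
                 * the memory is reused. */
                syscall(__NR_membarrier,
                        MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
        }
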
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-9-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • membarrier: Document scheduler barrier requirements · 306e0604
      Committed by Mathieu Desnoyers
      Document the membarrier requirement on having a full memory barrier in
      __schedule() after coming from user-space, before storing to rq->curr.
      It is provided by smp_mb__after_spinlock() in __schedule().
      
      Document that membarrier requires a full barrier on transition from
      kernel thread to userspace thread. We currently have an implicit barrier
      from atomic_dec_and_test() in mmdrop() that ensures this.
      
      The x86 switch_mm_irqs_off() full barrier is currently provided by many
      cpumask update operations as well as write_cr3(). Document that
      write_cr3() provides this barrier.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-4-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • powerpc, membarrier: Skip memory barrier in switch_mm() · 3ccfebed
      Committed by Mathieu Desnoyers
      Allow PowerPC to skip the full memory barrier in switch_mm(), and
      only issue the barrier when scheduling into a task belonging to a
      process that has registered to use expedited private.
      
      Threads targeting the same VM but which belong to different thread
      groups are a tricky case. This has a few consequences:
      
      It turns out that we cannot rely on get_nr_threads(p) to count the
      number of threads using a VM. We can use
      (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
      instead to skip the synchronize_sched() for cases where the VM only has
      a single user, and that user only has a single thread.
      
      It also turns out that we cannot use for_each_thread() to set
      thread flags in all threads using a VM, as it only iterates on the
      thread group.
      
      Therefore, test the membarrier state variable directly rather than
      relying on thread flags. This means
      membarrier_register_private_expedited() needs to set the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
      only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
      private expedited membarrier commands to succeed.
      membarrier_arch_switch_mm() now tests for the
      MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.
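
      The arch hook then reduces to a test of that state; roughly (sketch of
      the powerpc side):

        static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
                                                     struct mm_struct *next,
                                                     struct task_struct *tsk)
        {
                /* Only processes registered for private expedited membarrier
                 * pay for the full barrier when being scheduled in. */
                if (likely(!(atomic_read(&next->membarrier_state) &
                             MEMBARRIER_STATE_PRIVATE_EXPEDITED) || !prev))
                        return;

                /* The membarrier syscall requires a full barrier after
                 * storing to rq->curr, before returning to user-space. */
                smp_mb();
        }
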
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Avi Kivity <avi@scylladb.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: David Sehr <sehr@google.com>
      Cc: Greg Hackmann <ghackmann@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Maged Michael <maged.michael@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20180129202020.8515-3-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  16. 16 January 2018 (1 commit)
    • delayacct: Account blkio completion on the correct task · c96f5471
      Committed by Josh Snyder
      Before commit:
      
        e33a9bba ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
      
      delayacct_blkio_end() was called after context-switching into the task which
      completed I/O.
      
      This resulted in double counting: the task would account a delay both waiting
      for I/O and for time spent in the runqueue.
      
      With e33a9bba, delayacct_blkio_end() is called by try_to_wake_up().
      In ttwu, we have not yet context-switched. This is more correct, in that
      the delay accounting ends when the I/O is complete.
      
      But delayacct_blkio_end() relies on 'get_current()', and we have not yet
      context-switched into the task whose I/O completed. This results in the
      wrong task having its delay accounting statistics updated.
      
      Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
      so that it can update the statistics of the correct task.
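
      Conceptually the interface change looks like this (sketch):

        /* Before: implicitly charged whoever happens to be current. */
        void delayacct_blkio_end(void);

        /* After: charged to the task whose I/O completed, passed explicitly
         * by the waker, e.g. from try_to_wake_up(): */
        void delayacct_blkio_end(struct task_struct *p);

        if (p->in_iowait) {
                delayacct_blkio_end(p);
                atomic_dec(&task_rq(p)->nr_iowait);
        }
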
      Signed-off-by: Josh Snyder <joshs@netflix.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Cc: <stable@vger.kernel.org>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-block@vger.kernel.org
      Fixes: e33a9bba ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
      Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 10 January 2018 (3 commits)
  18. 11 December 2017 (1 commit)
  19. 05 December 2017 (1 commit)
  20. 29 November 2017 (1 commit)
    • sched: Stop resched_cpu() from sending IPIs to offline CPUs · a0982dfa
      Committed by Paul E. McKenney
      The rcutorture test suite occasionally provokes a splat due to invoking
      resched_cpu() on an offline CPU:
      
      WARNING: CPU: 2 PID: 8 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
      Modules linked in:
      CPU: 2 PID: 8 Comm: rcu_preempt Not tainted 4.14.0-rc4+ #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
      task: ffff902ede9daf00 task.stack: ffff96c50010c000
      RIP: 0010:native_smp_send_reschedule+0x37/0x40
      RSP: 0018:ffff96c50010fdb8 EFLAGS: 00010096
      RAX: 000000000000002e RBX: ffff902edaab4680 RCX: 0000000000000003
      RDX: 0000000080000003 RSI: 0000000000000000 RDI: 00000000ffffffff
      RBP: ffff96c50010fdb8 R08: 0000000000000000 R09: 0000000000000001
      R10: 0000000000000000 R11: 00000000299f36ae R12: 0000000000000001
      R13: ffffffff9de64240 R14: 0000000000000001 R15: ffffffff9de64240
      FS:  0000000000000000(0000) GS:ffff902edfc80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000f7d4c642 CR3: 000000001e0e2000 CR4: 00000000000006e0
      Call Trace:
       resched_curr+0x8f/0x1c0
       resched_cpu+0x2c/0x40
       rcu_implicit_dynticks_qs+0x152/0x220
       force_qs_rnp+0x147/0x1d0
       ? sync_rcu_exp_select_cpus+0x450/0x450
       rcu_gp_kthread+0x5a9/0x950
       kthread+0x142/0x180
       ? force_qs_rnp+0x1d0/0x1d0
       ? kthread_create_on_node+0x40/0x40
       ret_from_fork+0x27/0x40
      Code: 14 01 0f 92 c0 84 c0 74 14 48 8b 05 14 4f f4 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 38 89 ca 9d e8 e5 56 08 00 <0f> ff 5d c3 0f 1f 44 00 00 8b 05 52 9e 37 02 85 c0 75 38 55 48
      ---[ end trace 26df9e5df4bba4ac ]---
      
      This splat cannot be generated by expedited grace periods because they
      always invoke resched_cpu() on the current CPU, which is good because
      expedited grace periods require that resched_cpu() unconditionally
      succeed.  However, other parts of RCU can tolerate resched_cpu() acting
      as a no-op, at least as long as it doesn't happen too often.
      
      This commit therefore makes resched_cpu() invoke resched_curr() only if
      the CPU is either online or is the current CPU.
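
      The fix is essentially a one-line guard; roughly:

        void resched_cpu(int cpu)
        {
                struct rq *rq = cpu_rq(cpu);
                unsigned long flags;

                raw_spin_lock_irqsave(&rq->lock, flags);
                /* Only poke CPUs that can actually receive the IPI. */
                if (cpu_online(cpu) || cpu == smp_processor_id())
                        resched_curr(rq);
                raw_spin_unlock_irqrestore(&rq->lock, flags);
        }
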
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
  21. 28 November 2017 (1 commit)
  22. 09 November 2017 (1 commit)
    • sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds · 765cc3a4
      Committed by Patrick Bellasi
      When the kernel is compiled with !CONFIG_SCHED_DEBUG support, we expect
      all SCHED_FEAT flags to be turned into compile-time constants that are
      propagated to support compiler optimizations.
      
      Specifically, we expect that code blocks like this:
      
         if (sched_feat(FEATURE_NAME) [&& <other_conditions>]) {
      	/* FEATURE CODE */
         }
      
      are turned into dead code when FEATURE_NAME defaults to FALSE, and are
      thus removed by the compiler from the final image.
      
      For this mechanism to properly work it's required for the compiler to
      have full access, from each translation unit, to whatever is the value
      defined by the sched_feat macro. This macro is defined as:
      
         #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
      
      and thus, the compiler can optimize that code only if the value of
      sysctl_sched_features is visible within each translation unit.
      
      Since:
      
         029632fb ("sched: Make separate sched*.c translation units")
      
      the scheduler code has been split into separate translation units
      however the definition of sysctl_sched_features is part of
      kernel/sched/core.c while, for all the other scheduler modules, it is
      visible only via kernel/sched/sched.h as an:
      
         extern const_debug unsigned int sysctl_sched_features
      
      Unfortunately, an extern reference does not allow the compiler to apply
      constant propagation. Thus, on !CONFIG_SCHED_DEBUG kernels we still end up
      with code that loads a memory reference and (eventually) takes an
      unconditional jump over a chunk of code.
      
      This mechanism is unavoidable when sched_features can be turned on and off at
      run-time. However, this is not the case for "production" kernels compiled with
      !CONFIG_SCHED_DEBUG. In this case, sysctl_sched_features is just a constant value
      which cannot be changed at run-time and thus memory loads and jumps can be
      avoided altogether.
      
      This patch fixes the case of !CONFIG_SCHED_DEBUG kernels by declaring a
      local version of the sysctl_sched_features constant for each translation
      unit. This will ultimately allow the compiler to perform constant
      propagation and dead-code pruning.
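
      For !CONFIG_SCHED_DEBUG builds, kernel/sched/sched.h then carries
      something along these lines (sketch), so each translation unit sees a
      compile-time constant:

        /* Build the constant feature bitmask from features.h in the header. */
        #define SCHED_FEAT(name, enabled)       \
                (1UL << __SCHED_FEAT_##name) * enabled |
        static const_debug __maybe_unused unsigned int sysctl_sched_features =
        #include "features.h"
                0;
        #undef SCHED_FEAT

        #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))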
      
      Tests have been done, with !CONFIG_SCHED_DEBUG on a v4.14-rc8 with and without
      the patch, by running 30 iterations of:
      
         perf bench sched messaging --pipe --thread --group 4 --loop 50000
      
      on a 40-core Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz using the
      powersave governor to rule out variations due to frequency scaling.
      
      Statistics on the reported completion time:
      
                         count     mean       std     min       99%     max
        v4.14-rc8         30.0  15.7831  0.176032  15.442  16.01226  16.014
        v4.14-rc8+patch   30.0  15.5033  0.189681  15.232  15.93938  15.962
      
      ... show a 1.8% speedup in average completion time and a 0.5% speedup at
      the 99th percentile.
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Chris Redpath <chris.redpath@arm.com>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Reviewed-by: Brendan Jackman <brendan.jackman@arm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20171108184101.16006-1-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  23. 27 October 2017 (4 commits)