1. 20 Oct 2017 (1 commit)
    • membarrier: Provide register expedited private command · a961e409
      Mathieu Desnoyers authored
      This introduces a "register private expedited" membarrier command which
      allows eventual removal of important memory barrier constraints on the
      scheduler fast-paths. It changes how the "private expedited" membarrier
      command (new to 4.14) is used from user-space.
      
      This new command allows processes to register their intent to use the
      private expedited command.  This affects how the expedited private
      command introduced in 4.14-rc is meant to be used, and should be merged
      before 4.14 final.
      
      Processes are now required to register before using
      MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.
      
      This fixes a problem that arose when designing requested extensions to
      sys_membarrier() to allow JITs to efficiently flush old code from
      instruction caches.  Several potential algorithms are much less painful
      if the user registers intent to use this functionality early on, for
      example, before the process spawns the second thread.  Registering at
      this time removes the need to interrupt each and every thread in that
      process at the first expedited sys_membarrier() system call.
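
      A minimal userspace sketch of the resulting flow (not part of the patch;
      it assumes a >= 4.14 kernel and its UAPI headers):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static int membarrier(int cmd, int flags)
        {
                return syscall(__NR_membarrier, cmd, flags);
        }

        int main(void)
        {
                /* Register intent early, e.g. before spawning worker threads. */
                if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0))
                        return 1;

                /* Subsequent calls succeed; without registration this is EPERM. */
                membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
                return 0;
        }
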
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a961e409
  2. 10 Oct 2017 (3 commits)
    • sched/core: Ensure load_balance() respects the active_mask · 024c9d2f
      Peter Zijlstra authored
      While load_balance() masks the source CPUs against active_mask, it had
      a hole against the destination CPU. Ensure the destination CPU is also
      part of the 'domain-mask & active-mask' set.
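
      A conceptual sketch of the added constraint (not the literal diff;
      'this_cpu' is load_balance()'s destination CPU and 'out_balanced' its
      bail-out label):

        /* Bail out unless the destination CPU may be balanced onto. */
        if (!cpumask_test_cpu(this_cpu, sched_domain_span(sd)) ||
            !cpumask_test_cpu(this_cpu, cpu_active_mask))
                goto out_balanced;
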
      Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 77d1dfda ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      024c9d2f
    • sched/core: Address more wake_affine() regressions · f2cdd9cc
      Peter Zijlstra authored
      The trivial wake_affine_idle() implementation is very good for a
      number of workloads, but it comes apart at the moment there are no
      idle CPUs left, IOW, the overloaded case.
      
      hackbench:
      
      		NO_WA_WEIGHT		WA_WEIGHT
      
      hackbench-20  : 7.362717561 seconds	6.450509391 seconds
      
      (win)
      
      netperf:
      
      		  NO_WA_WEIGHT		WA_WEIGHT
      
      TCP_SENDFILE-1	: Avg: 54524.6		Avg: 52224.3
      TCP_SENDFILE-10	: Avg: 48185.2          Avg: 46504.3
      TCP_SENDFILE-20	: Avg: 29031.2          Avg: 28610.3
      TCP_SENDFILE-40	: Avg: 9819.72          Avg: 9253.12
      TCP_SENDFILE-80	: Avg: 5355.3           Avg: 4687.4
      
      TCP_STREAM-1	: Avg: 41448.3          Avg: 42254
      TCP_STREAM-10	: Avg: 24123.2          Avg: 25847.9
      TCP_STREAM-20	: Avg: 15834.5          Avg: 18374.4
      TCP_STREAM-40	: Avg: 5583.91          Avg: 5599.57
      TCP_STREAM-80	: Avg: 2329.66          Avg: 2726.41
      
      TCP_RR-1	: Avg: 80473.5          Avg: 82638.8
      TCP_RR-10	: Avg: 72660.5          Avg: 73265.1
      TCP_RR-20	: Avg: 52607.1          Avg: 52634.5
      TCP_RR-40	: Avg: 57199.2          Avg: 56302.3
      TCP_RR-80	: Avg: 25330.3          Avg: 26867.9
      
      UDP_RR-1	: Avg: 108266           Avg: 107844
      UDP_RR-10	: Avg: 95480            Avg: 95245.2
      UDP_RR-20	: Avg: 68770.8          Avg: 68673.7
      UDP_RR-40	: Avg: 76231            Avg: 75419.1
      UDP_RR-80	: Avg: 34578.3          Avg: 35639.1
      
      UDP_STREAM-1	: Avg: 64684.3          Avg: 66606
      UDP_STREAM-10	: Avg: 52701.2          Avg: 52959.5
      UDP_STREAM-20	: Avg: 30376.4          Avg: 29704
      UDP_STREAM-40	: Avg: 15685.8          Avg: 15266.5
      UDP_STREAM-80	: Avg: 8415.13          Avg: 7388.97
      
      (wins and losses)
      
      sysbench:
      
      		    NO_WA_WEIGHT		WA_WEIGHT
      
      sysbench-mysql-2  :  2135.17 per sec.		 2142.51 per sec.
      sysbench-mysql-5  :  4809.68 per sec.            4800.19 per sec.
      sysbench-mysql-10 :  9158.59 per sec.            9157.05 per sec.
      sysbench-mysql-20 : 14570.70 per sec.           14543.55 per sec.
      sysbench-mysql-40 : 22130.56 per sec.           22184.82 per sec.
      sysbench-mysql-80 : 20995.56 per sec.           21904.18 per sec.
      
      sysbench-psql-2   :  1679.58 per sec.            1705.06 per sec.
      sysbench-psql-5   :  3797.69 per sec.            3879.93 per sec.
      sysbench-psql-10  :  7253.22 per sec.            7258.06 per sec.
      sysbench-psql-20  : 11166.75 per sec.           11220.00 per sec.
      sysbench-psql-40  : 17277.28 per sec.           17359.78 per sec.
      sysbench-psql-80  : 17112.44 per sec.           17221.16 per sec.
      
      (increase on the top end)
      
      tbench:
      
      NO_WA_WEIGHT
      
      Throughput 685.211 MB/sec   2 clients   2 procs  max_latency=0.123 ms
      Throughput 1596.64 MB/sec   5 clients   5 procs  max_latency=0.119 ms
      Throughput 2985.47 MB/sec  10 clients  10 procs  max_latency=0.262 ms
      Throughput 4521.15 MB/sec  20 clients  20 procs  max_latency=0.506 ms
      Throughput 9438.1  MB/sec  40 clients  40 procs  max_latency=2.052 ms
      Throughput 8210.5  MB/sec  80 clients  80 procs  max_latency=8.310 ms
      
      WA_WEIGHT
      
      Throughput 697.292 MB/sec   2 clients   2 procs  max_latency=0.127 ms
      Throughput 1596.48 MB/sec   5 clients   5 procs  max_latency=0.080 ms
      Throughput 2975.22 MB/sec  10 clients  10 procs  max_latency=0.254 ms
      Throughput 4575.14 MB/sec  20 clients  20 procs  max_latency=0.502 ms
      Throughput 9468.65 MB/sec  40 clients  40 procs  max_latency=2.069 ms
      Throughput 8631.73 MB/sec  80 clients  80 procs  max_latency=8.605 ms
      
      (increase on the top end)
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f2cdd9cc
    • sched/core: Fix wake_affine() performance regression · d153b153
      Peter Zijlstra authored
      Eric reported a sysbench regression against commit:
      
        3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      
      Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
      against his v3.10 enterprise kernel.
      
      PRE (current tip/master):
      
       ivb-ep sysbench:
      
         2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
         5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
        10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
        20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
        40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
        80: [30 secs]     transactions:                        355096 (11834.28 per sec.)
      
       hsw-ex NAS:
      
       OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
       OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
       OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
       lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
       lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
       lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83
      
      POST (+patch):
      
       ivb-ep sysbench:
      
         2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
         5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
        10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
        20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
        40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
        80: [30 secs]     transactions:                        631739 (21055.88 per sec.)
      
       hsw-ex NAS:
      
       lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
       lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
       lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52
      
      This patch takes out all the shiny wake_affine() stuff and goes back to
      utter basics. Between the two CPUs involved with the wakeup (the CPU
      doing the wakeup and the CPU we ran on previously) pick the CPU we can
      run on _now_.
      
      This recovers much of the regression against the older kernels,
      but leaves some ground in the overloaded case. The default-enabled
      WA_WEIGHT (which will be introduced in the next patch) is an attempt
      to address the overloaded situation.
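
      A conceptual sketch of that selection (hypothetical helper name; the
      patch itself reworks wake_affine()):

        static int pick_affine_cpu(int this_cpu, int prev_cpu)
        {
                /* Prefer whichever of the two candidates is idle right now. */
                if (idle_cpu(this_cpu))
                        return this_cpu;
                if (idle_cpu(prev_cpu))
                        return prev_cpu;
                /* Neither is idle: the overloaded case that WA_WEIGHT targets. */
                return prev_cpu;
        }
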
      Reported-by: Eric Farman <farman@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jinpuwang@gmail.com
      Cc: vcaputo@pengaru.com
      Fixes: 3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d153b153
  3. 29 Sep 2017 (2 commits)
  4. 15 Sep 2017 (2 commits)
    • sched/wait: Introduce wakeup boomark in wake_up_page_bit · 11a19c7b
      Tim Chen authored
      Now that we have added breaks in the wait queue scan and allow a
      bookmark on the scan position, we put this logic in the
      wake_up_page_bit function.

      We can have very long page wait lists in large systems where multiple
      pages share the same wait list. We break up the wake-up walk here to
      give other CPUs a chance to access the list, and to avoid disabling
      interrupts for too long while traversing the list.  This reduces the
      interrupt and rescheduling latency, and the excessive page wait queue
      lock hold time.
      
      [ v2: Remove bookmark_wake_function ]
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11a19c7b
    • sched/wait: Break up long wake list walk · 2554db91
      Tim Chen authored
      We encountered workloads that have very long wake-up lists on large
      systems. A waker takes a long time to traverse the entire wake list and
      execute all the wake functions.
      
      We saw page wait lists that are up to 3700+ entries long in tests of
      large 4 and 8 socket systems. It took 0.8 sec to traverse such a list
      during wake-up. Any other CPU that contends for the list spinlock will
      spin for a long time. This is a result of the NUMA balancing migration of
      hot pages that are shared by many threads.
      
      Multiple CPUs waking are queued up behind the lock, and the last one
      queued has to wait until all CPUs did all the wakeups.
      
      The page wait list is traversed with interrupts disabled, which caused
      various problems. This was the original cause that triggered the NMI
      watchdog timer in: https://patchwork.kernel.org/patch/9800303/ . Only
      extending the NMI watchdog timer there helped.
      
      This patch bookmarks the waker's scan position in the wake list and
      breaks up the wake-up walk, to allow access to the list before the
      waker resumes its walk down the rest of the wait list. It lowers the
      interrupt and rescheduling latency.
      
      This patch also provides a performance boost when combined with the next
      patch to break up page wakeup list walk. We saw 22% improvement in the
      will-it-scale file pread2 test on a Xeon Phi system running 256 threads.
      
      [ v2: Merged in Linus' changes to remove the bookmark_wake_function, and
        simply access to flags. ]
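
      Roughly the locked-walk shape this results in (a sketch, not the
      verbatim hunk; WQ_FLAG_BOOKMARK marks a walk that stopped early):

        wait_queue_entry_t bookmark;
        unsigned long flags;

        bookmark.flags = 0;
        bookmark.private = NULL;
        bookmark.func = NULL;
        INIT_LIST_HEAD(&bookmark.entry);

        do {
                spin_lock_irqsave(&wq_head->lock, flags);
                /* Wake a bounded batch, recording the resume point in 'bookmark'. */
                nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                                                wake_flags, key, &bookmark);
                spin_unlock_irqrestore(&wq_head->lock, flags);
        } while (bookmark.flags & WQ_FLAG_BOOKMARK);
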
      Reported-by: Kan Liang <kan.liang@intel.com>
      Tested-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2554db91
  5. 12 Sep 2017 (4 commits)
  6. 11 Sep 2017 (1 commit)
  7. 09 Sep 2017 (3 commits)
  8. 07 Sep 2017 (2 commits)
  9. 29 Aug 2017 (1 commit)
    • smp: Avoid using two cache lines for struct call_single_data · 966a9671
      Ying Huang authored
      struct call_single_data is used in IPIs to transfer information between
      CPUs.  Its size is bigger than sizeof(unsigned long) and less than the
      cache line size.  Currently it is not allocated with any explicit
      alignment requirement.  This makes it possible for an allocated
      call_single_data to span two cache lines, which doubles the number of
      cache lines that need to be transferred among CPUs.
      
      This can be fixed by requiring call_single_data to be aligned to the
      size of call_single_data. Currently the size of call_single_data is a
      power of 2.  If we add new fields to call_single_data, we may need to
      add padding to make sure the size of the new definition is a power of 2
      as well.
      
      Fortunately, this is enforced by GCC, which will report bad sizes.
      
      To set the alignment requirement of call_single_data to the size of
      call_single_data, a struct definition and a typedef are used.
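
      A sketch of that construct (member list shown for illustration only):

        struct __call_single_data {
                struct llist_node llist;
                smp_call_func_t func;
                void *info;
                unsigned int flags;
        };

        /* Alignment equals the (power-of-2) size, so one object can never
         * straddle two cache lines. */
        typedef struct __call_single_data call_single_data_t
                __aligned(sizeof(struct __call_single_data));
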
      
      To test the effect of the patch, I used the vm-scalability multiple
      thread swap test case (swap-w-seq-mt).  The test will create multiple
      threads and each thread will eat memory until all RAM and part of swap
      is used, so that a huge number of IPIs is triggered when unmapping
      memory.  In the test, the throughput of memory writing improves ~5%
      compared with misaligned call_single_data, because of faster IPIs.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Huang, Ying <ying.huang@intel.com>
      [ Add call_single_data_t and align with size of call_single_data. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      966a9671
  10. 28 Aug 2017 (1 commit)
    • Minor page waitqueue cleanups · 3510ca20
      Linus Torvalds authored
      Tim Chen and Kan Liang have been battling a customer load that shows
      extremely long page wakeup lists.  The cause seems to be constant NUMA
      migration of a hot page that is shared across a lot of threads, but the
      actual root cause for the exact behavior has not been found.
      
      Tim has a patch that batches the wait list traversal at wakeup time, so
      that we at least don't get long uninterruptible cases where we traverse
      and wake up thousands of processes and get nasty latency spikes.  That
      is likely 4.14 material, but we're still discussing the page waitqueue
      specific parts of it.
      
      In the meantime, I've tried to look at making the page wait queues less
      expensive, and failing miserably.  If you have thousands of threads
      waiting for the same page, it will be painful.  We'll need to try to
      figure out the NUMA balancing issue some day, in addition to avoiding
      the excessive spinlock hold times.
      
      That said, having tried to rewrite the page wait queues, I can at least
      fix up some of the braindamage in the current situation. In particular:
      
       (a) we don't want to continue walking the page wait list if the bit
           we're waiting for already got set again (which seems to be one of
           the patterns of the bad load).  That makes no progress and just
           causes pointless cache pollution chasing the pointers.
      
       (b) we don't want to put the non-locking waiters always on the front of
           the queue, and the locking waiters always on the back.  Not only is
           that unfair, it means that we wake up thousands of reading threads
           that will just end up being blocked by the writer later anyway.
      
      Also add a comment about the layout of 'struct wait_page_key' - there is
      an external user of it in the cachefiles code that means that it has to
      match the layout of 'struct wait_bit_key' in the two first members.  It
      so happens to match, because 'struct page *' and 'unsigned long *' end
      up having the same values simply because the page flags are the first
      member in struct page.
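
      For reference, the two layouts being kept in sync (reproduced here for
      illustration; see the kernel sources for the authoritative definitions):

        struct wait_page_key {
                struct page *page;      /* must line up with wait_bit_key.flags  */
                int bit_nr;             /* must line up with wait_bit_key.bit_nr */
                int page_match;
        };

        struct wait_bit_key {
                void *flags;
                int bit_nr;
                unsigned long timeout;
        };
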
      
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3510ca20
  11. 25 Aug 2017 (4 commits)
    • sched/debug: Optimize sched_domain sysctl generation · bbdacdfe
      Peter Zijlstra authored
      Currently we unconditionally destroy all sysctl bits and regenerate
      them after we've rebuilt the domains (even if that rebuild is a
      no-op).
      
      And since we unconditionally (re)build the sysctl for all possible
      CPUs, onlining all CPUs gets us O(n^2) time. Instead change this to
      only rebuild the bits for CPUs we've actually installed new domains
      on.
      Reported-by: Ofer Levi(SW) <oferle@mellanox.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bbdacdfe
    • sched/topology: Avoid pointless rebuild · 09e0dd8e
      Peter Zijlstra authored
      Fix partition_sched_domains() to try and preserve the existing machine
      wide domain instead of unconditionally destroying it. We do this by
      attempting to allocate the new single domain; only when that fails do
      we reuse the fallback_doms.
      
      When using fallback_doms we need to first destroy and then recreate
      because both the old and new could be backed by it.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ofer Levi(SW) <oferle@mellanox.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com>
      Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      09e0dd8e
    • sched/topology: Improve comments · a090c4f2
      Peter Zijlstra authored
      Mike provided a better comment for destroy_sched_domain() ...
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a090c4f2
    • sched/topology: Fix memory leak in __sdt_alloc() · 213c5a45
      Shu Wang authored
      Found this issue by kmemleak: the 'sg' and 'sgc' pointers from
      __sdt_alloc() might be leaked as each domain holds many groups' refs,
      but destroy_sched_domain() only dropped the first group's ref.
      
      Onlining and offlining a CPU can trigger this leak, and cause OOM.
      
      Reproducer for my 6 CPUs machine:
      
        while true
        do
            echo 0 > /sys/devices/system/cpu/cpu5/online;
            echo 1 > /sys/devices/system/cpu/cpu5/online;
        done
      
        unreferenced object 0xffff88007d772a80 (size 64):
          comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s)
          hex dump (first 32 bytes):
            c0 22 77 7d 00 88 ff ff 02 00 00 00 01 00 00 00  ."w}............
            40 2a 77 7d 00 88 ff ff 00 00 00 00 00 00 00 00  @*w}............
          backtrace:
            [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0
            [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280
            [<ffffffff810d94a8>] build_sched_domains+0x1e8/0xf20
            [<ffffffff810da674>] partition_sched_domains+0x304/0x360
            [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40
            [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0
            [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400
            [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0
            [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0
            [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160
            [<ffffffff810af5d9>] kthread+0x109/0x140
            [<ffffffff81770e45>] ret_from_fork+0x25/0x30
            [<ffffffffffffffff>] 0xffffffffffffffff
      
        unreferenced object 0xffff88007d772a40 (size 64):
          comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s)
          hex dump (first 32 bytes):
            03 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00  ................
            00 04 00 00 00 00 00 00 4f 3c fc ff 00 00 00 00  ........O<......
          backtrace:
            [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0
            [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280
            [<ffffffff810da16d>] build_sched_domains+0xead/0xf20
            [<ffffffff810da674>] partition_sched_domains+0x304/0x360
            [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40
            [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0
            [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400
            [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0
            [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0
            [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160
            [<ffffffff810af5d9>] kthread+0x109/0x140
            [<ffffffff81770e45>] ret_from_fork+0x25/0x30
            [<ffffffffffffffff>] 0xffffffffffffffff
      Reported-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Shu Wang <shuwang@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Chunyu Hu <chuhu@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: liwang@redhat.com
      Link: http://lkml.kernel.org/r/1502351536-9108-1-git-send-email-shuwang@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      213c5a45
  12. 18 Aug 2017 (2 commits)
    • cpufreq: schedutil: Always process remote callback with slow switching · c49cbc19
      Viresh Kumar authored
      The frequency update from the utilization update handlers can be divided
      into two parts:
      
      (A) Finding the next frequency
      (B) Updating the frequency
      
      While any CPU can do (A), (B) can be restricted to a group of CPUs only,
      depending on the current platform.
      
      For platforms where fast cpufreq switching is possible, both (A) and (B)
      are always done from the same CPU and that CPU should be capable of
      changing the frequency of the target CPU.
      
      But for platforms where fast cpufreq switching isn't possible, after
      doing (A) we wake up a kthread which will eventually do (B). This
      kthread is already bound to the right set of CPUs, i.e. only those which
      can change the frequency of CPUs of a cpufreq policy. And so any CPU
      can actually do (A) in this case, as the frequency is updated from the
      right set of CPUs only.
      
      Check cpufreq_can_do_remote_dvfs() only for the fast switching case.
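
      A conceptual sketch of the resulting check (not the literal diff;
      'sg_policy' is schedutil's per-policy state):

        /*
         * Only the fast-switch path updates the frequency from the caller's
         * CPU; the slow path defers to a kthread already bound to
         * policy->related_cpus, so it needs no remote-DVFS capability check.
         */
        if (sg_policy->policy->fast_switch_enabled &&
            !cpufreq_can_do_remote_dvfs(sg_policy->policy))
                return;
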
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      c49cbc19
    • cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily · e2cabe48
      Viresh Kumar authored
      Utilization update callbacks are now processed remotely, even on the
      CPUs that don't share cpufreq policy with the target CPU (if
      dvfs_possible_from_any_cpu flag is set).
      
      But in non-fast switch paths, the frequency is changed only from one of
      policy->related_cpus. This happens because the kthread which does the
      actual update is bound to a subset of CPUs (i.e. related_cpus).
      
      Allow frequency to be remotely updated as well (i.e. call
      __cpufreq_driver_target()) if dvfs_possible_from_any_cpu flag is set.
      Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      e2cabe48
  13. 17 Aug 2017 (3 commits)
    • completion: Replace spin_unlock_wait() with lock/unlock pair · dec13c42
      Paul E. McKenney authored
      There is no agreed-upon definition of spin_unlock_wait()'s semantics,
      and it appears that all callers could do just as well with a lock/unlock
      pair.  This commit therefore replaces the spin_unlock_wait() call in
      completion_done() with spin_lock() followed immediately by spin_unlock().
      This should be safe from a performance perspective because the lock
      will be held only briefly, as the wakeup happens really quickly.
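
      Roughly the resulting completion_done() (a sketch; see
      kernel/sched/completion.c for the real thing):

        bool completion_done(struct completion *x)
        {
                unsigned long flags;

                if (!READ_ONCE(x->done))
                        return false;

                /*
                 * The lock/unlock pair replaces spin_unlock_wait(): it makes
                 * sure complete() has dropped the lock before we report true.
                 */
                spin_lock_irqsave(&x->wait.lock, flags);
                spin_unlock_irqrestore(&x->wait.lock, flags);
                return true;
        }
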
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      dec13c42
    • membarrier: Provide expedited private command · 22e4ebb9
      Mathieu Desnoyers authored
      Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs using cpumask built
      from all runqueues for which current thread's mm is the same as the
      thread calling sys_membarrier. It executes faster than the non-expedited
      variant (no blocking). It also works on NOHZ_FULL configurations.
      
      Scheduler-wise, it requires a memory barrier before and after context
      switching between processes (which have different mm). The memory
      barrier before context switch is already present. For the barrier after
      context switch:
      
      * Our TSO archs can do RELEASE without being a full barrier. Look at
        x86 spin_unlock() being a regular STORE for example.  But for those
        archs, all atomics imply smp_mb and all of them have atomic ops in
        switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
        barrier.
      
      * From all weakly ordered machines, only ARM64 and PPC can do RELEASE,
        the rest does indeed do smp_mb(), so there the spin_unlock() is a full
        barrier and we're good.
      
      * ARM64 has a very heavy barrier in switch_to(), which suffices.
      
      * PPC just removed its barrier from switch_to(), but appears to be
        talking about adding something to switch_mm(). So add a
        smp_mb__after_unlock_lock() for now, until this is settled on the PPC
        side.
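
      Schematically, per the change-log below, the extra barrier lands right
      after the runqueue lock hand-off (illustrative only; 4.14-era code
      shape assumed):

        /* in finish_task_switch(): */
        finish_lock_switch(rq, prev);
        /* Full barrier after the rq->curr update, paired with sys_membarrier. */
        smp_mb__after_unlock_lock();
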
      
      Changes since v3:
      - Properly document the memory barriers provided by each architecture.
      
      Changes since v2:
      - Address comments from Peter Zijlstra,
      - Add smp_mb__after_unlock_lock() after finish_lock_switch() in
        finish_task_switch() to add the memory barrier we need after storing
        to rq->curr. This is much simpler than the previous approach relying
        on atomic_dec_and_test() in mmdrop(), which actually added a memory
        barrier in the common case of switching between userspace processes.
      - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
        kernel, rather than having the whole membarrier system call returning
        -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
        Adapt the CMD_QUERY mask accordingly.
      
      Changes since v1:
      - move membarrier code under kernel/sched/ because it uses the
        scheduler runqueue,
      - only add the barrier when we switch from a kernel thread. The case
        where we switch from a user-space thread is already handled by
        the atomic_dec_and_test() in mmdrop().
      - add a comment to mmdrop() documenting the requirement on the implicit
        memory barrier.
      
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      CC: Boqun Feng <boqun.feng@gmail.com>
      CC: Andrew Hunter <ahh@google.com>
      CC: Maged Michael <maged.michael@gmail.com>
      CC: gromer@google.com
      CC: Avi Kivity <avi@scylladb.com>
      CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: Paul Mackerras <paulus@samba.org>
      CC: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Dave Watson <davejwatson@fb.com>
      22e4ebb9
    • sched/completion: Document that reinit_completion() must be called after complete_all() · 9c878320
      Steven Rostedt authored
      The complete_all() function sets the completion's "done" variable to
      UINT_MAX, and no other caller (wait_for_completion(), etc.) will modify
      it back to zero. That means that any call to complete_all() must be
      followed by a reinit_completion() before that completion can be used
      again.

      Document this fact at the complete_all() function.
      
      Also document that completion_done() will always return true if
      complete_all() is called.
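
      The usage pattern this implies (illustrative; 'example' is a made-up
      caller):

        void example(void)
        {
                DECLARE_COMPLETION_ONSTACK(done);

                complete_all(&done);            /* 'done' becomes UINT_MAX */
                wait_for_completion(&done);     /* returns immediately     */

                /* completion_done() now always returns true, so ...       */
                reinit_completion(&done);       /* ... reinit before reuse. */
        }
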
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170816131202.195c2f4b@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9c878320
  14. 12 Aug 2017 (1 commit)
  15. 11 Aug 2017 (2 commits)
  16. 10 Aug 2017 (8 commits)
    • sched/autogroup: Fix error reporting printk text in autogroup_create() · 1e58565e
      Anshuman Khandual authored
      It's kzalloc(), not kmalloc(), that failed earlier. While here,
      remove a redundant empty line.
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20170802084300.29487-1-khandual@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1e58565e
    • sched/fair: Fix wake_affine() for !NUMA_BALANCING · 90001d67
      Peter Zijlstra authored
      In commit:
      
        3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      
      Rik changed wake_affine to consider NUMA information when balancing
      between LLC domains.
      
      There are a number of problems here which this patch tries to address:
      
       - LLC < NODE; in this case we'd use the wrong information to balance
       - !NUMA_BALANCING: in this case, the new code doesn't do any
         balancing at all
       - re-computes the NUMA data for every wakeup, this can mean iterating
         up to 64 CPUs for every wakeup.
       - default affine wakeups inside a cache
      
      We address these by saving the load/capacity values for each
      sched_domain during regular load-balance and using these values in
      wake_affine_llc(). The obvious down-side to using cached values is
      that they can be too old and poorly reflect reality.
      
      But this way we can use LLC wide information and thus not rely on
      assuming LLC matches NODE. We also don't rely on NUMA_BALANCING nor do
      we have to aggregate two nodes (or even cache domains) worth of CPUs
      for each wakeup.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Fixes: 3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      [ Minor readability improvements. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      90001d67
    • locking/lockdep: Apply crossrelease to completions · cd8084f9
      Byungchul Park authored
      Although wait_for_completion() and its family can cause deadlock, the
      lock correctness validator could not be applied to them until now,
      because things like complete() are usually called in a different context
      from the waiting context, which violates lockdep's assumption.
      
      Thanks to CONFIG_LOCKDEP_CROSSRELEASE, we can now apply the lockdep
      detector to those completion operations. Applied it.
      Signed-off-by: Byungchul Park <byungchul.park@lge.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: boqun.feng@gmail.com
      Cc: kernel-team@lge.com
      Cc: kirill@shutemov.name
      Cc: npiggin@gmail.com
      Cc: walken@google.com
      Cc: willy@infradead.org
      Link: http://lkml.kernel.org/r/1502089981-21272-10-git-send-email-byungchul.park@lge.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cd8084f9
    • locking: Introduce smp_mb__after_spinlock() · d89e588c
      Peter Zijlstra authored
      Since its inception, our understanding of ACQUIRE, esp. as applied to
      spinlocks, has changed somewhat. Also, I wonder if, with a simple
      change, we cannot make it provide more.
      
      The problem with the comment is that the STORE done by spin_lock isn't
      itself ordered by the ACQUIRE, and therefore a later LOAD can pass over
      it and cross with any prior STORE, rendering the default WMB
      insufficient (pointed out by Alan).
      
      Now, this is only really a problem on PowerPC and ARM64, both of
      which already defined smp_mb__before_spinlock() as a smp_mb().
      
      At the same time, we can get a much stronger construct if we place
      that same barrier _inside_ the spin_lock(). In that case we upgrade
      the RCpc spinlock to an RCsc one.  That would make all schedule() calls
      fully transitive against one another.
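
      The intended usage pattern (illustrative; generic lock name):

        raw_spin_lock(&lock);
        smp_mb__after_spinlock();       /* upgrade the ACQUIRE to a full barrier */
        /* ... critical section, e.g. the schedule() path ... */
        raw_spin_unlock(&lock);
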
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d89e588c
    • sched/wait: Remove the lockless swait_active() check in swake_up*() · 35a2897c
      Boqun Feng authored
      Steven Rostedt reported a potential race in RCU core because of
      swake_up():
      
              CPU0                            CPU1
              ----                            ----
                                      __call_rcu_core() {
      
                                       spin_lock(rnp_root)
                                       need_wake = __rcu_start_gp() {
                                        rcu_start_gp_advanced() {
                                         gp_flags = FLAG_INIT
                                        }
                                       }
      
       rcu_gp_kthread() {
         swait_event_interruptible(wq,
              gp_flags & FLAG_INIT) {
         spin_lock(q->lock)
      
                                      *fetch wq->task_list here! *
      
         list_add(wq->task_list, q->task_list)
         spin_unlock(q->lock);
      
         *fetch old value of gp_flags here *
      
                                       spin_unlock(rnp_root)
      
                                       rcu_gp_kthread_wake() {
                                        swake_up(wq) {
                                         swait_active(wq) {
                                          list_empty(wq->task_list)
      
                                         } * return false *
      
        if (condition) * false *
          schedule();
      
      In this case, a wakeup is missed, which could cause the rcu_gp_kthread
      to wait for a long time.
      
      The reason for this is that we do a lockless swait_active() check in
      swake_up(). To fix this, we can either 1) add an smp_mb() in swake_up()
      before swait_active() to provide the proper order or 2) simply remove
      the swait_active() in swake_up().
      
      Solution 2 not only fixes this problem but also keeps the swait and
      wait APIs as close as possible, as wake_up() doesn't provide a full
      barrier and doesn't do a lockless check of the wait queue either.
      Moreover, there are users already using swait_active() to do their quick
      checks for the wait queues, so it makes less sense that swake_up() and
      swake_up_all() do this on their own.
      
      This patch then removes the lockless swait_active() check in swake_up()
      and swake_up_all().
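
      Roughly the resulting swake_up() (a sketch; swake_up_all() gets the
      same treatment):

        void swake_up(struct swait_queue_head *q)
        {
                unsigned long flags;

                /* No lockless swait_active() fast path any more. */
                raw_spin_lock_irqsave(&q->lock, flags);
                swake_up_locked(q);
                raw_spin_unlock_irqrestore(&q->lock, flags);
        }
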
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Krister Johansen <kjlx@templeofstupid.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170615041828.zk3a3sfyudm5p6nl@tardis
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      35a2897c
    • sched/debug: Intruduce task_state_to_char() helper function · 20435d84
      Xie XiuQi authored
      Now that we have more than one place to get the task state,
      introduce the task_state_to_char() helper function to save some code.
      
      No functionality changed.
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <cj.chengjian@huawei.com>
      Cc: <huawei.libin@huawei.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1502095463-160172-3-git-send-email-xiexiuqi@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      20435d84
    • sched/debug: Show task state in /proc/sched_debug · e8c16495
      Xie XiuQi authored
      Currently we print the runnable tasks in /proc/sched_debug, but
      there is no task state information.

      We don't know which tasks are on the runqueue and which are sleeping.
      
      Add task state in the runnable task list, like this:
      
        runnable tasks:
         S           task   PID         tree-key  switches  prio     wait-time             sum-exec        sum-sleep
        -----------------------------------------------------------------------------------------------------------
         S   watchdog/239  1452       -11.917445      2811     0         0.000000         8.949306         0.000000 7 0 /
         S  migration/239  1453     20686.367740         8     0         0.000000     16215.720897         0.000000 7 0 /
         S  ksoftirqd/239  1454    115383.841071        12   120         0.000000         0.200683         0.000000 7 0 /
        >R           test 21287      4872.190970       407   120         0.000000      4874.911790         0.000000 7 0 /autogroup-150
         R           test 21288      4868.385454       401   120         0.000000      3672.341489         0.000000 7 0 /autogroup-150
         R           test 21289      4868.326776       384   120         0.000000      3424.934159         0.000000 7 0 /autogroup-150
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <cj.chengjian@huawei.com>
      Cc: <huawei.libin@huawei.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1502095463-160172-2-git-send-email-xiexiuqi@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e8c16495
    • sched/debug: Use task_pid_nr_ns in /proc/$pid/sched · 74dc3384
      Aleksa Sarai authored
      It appears as though the addition of the PID namespace did not update
      the output code for /proc/*/sched, which resulted in it providing PIDs
      that were not self-consistent with the /proc mount. This additionally
      made it trivial to detect whether a process was inside &init_pid_ns from
      userspace, making container detection trivial:
      
         https://github.com/jessfraz/amicontained
      
      This leads to situations such as:
      
        % unshare -pmf
        % mount -t proc proc /proc
        % head -n1 /proc/1/sched
        head (10047, #threads: 1)
      
      Fix this by just using task_pid_nr_ns for the output of /proc/*/sched.
      All of the other uses of task_pid_nr in kernel/sched/debug.c are from a
      sysctl context and thus don't need to be namespaced.
      Signed-off-by: Aleksa Sarai <asarai@suse.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jess Frazelle <acidburn@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: cyphar@cyphar.com
      Link: http://lkml.kernel.org/r/20170806044141.5093-1-asarai@suse.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      74dc3384