1. 29 September 2017: 2 commits
  2. 15 September 2017: 2 commits
    • sched/wait: Introduce wakeup bookmark in wake_up_page_bit · 11a19c7b
      Committed by Tim Chen
      Now that we have added breaks in the wait queue scan and allow a
      bookmark to record the scan position, put this logic into the
      wake_up_page_bit() function.

      Large systems can have very long page wait lists, where multiple
      pages share the same wait list. Breaking up the wake-up walk here
      gives other CPUs a chance to access the list and avoids keeping
      interrupts disabled while traversing a long list. This reduces
      interrupt and rescheduling latency, as well as excessive page wait
      queue lock hold times.
      
      [ v2: Remove bookmark_wake_function ]
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11a19c7b
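
      The resulting loop in wake_up_page_bit() roughly takes the following
      shape (a paraphrased sketch based on the description above, not the
      verbatim upstream code; __wake_up_locked_key_bookmark() and
      WQ_FLAG_BOOKMARK are the helpers this series adds):

        spin_lock_irqsave(&q->lock, flags);
        __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
        while (bookmark.flags & WQ_FLAG_BOOKMARK) {
                /*
                 * Drop the lock between batches so other CPUs (and
                 * interrupts) can get at the wait queue, then resume the
                 * walk from the bookmark entry.
                 */
                spin_unlock_irqrestore(&q->lock, flags);
                cpu_relax();
                spin_lock_irqsave(&q->lock, flags);
                __wake_up_locked_key_bookmark(q, TASK_NORMAL, &key, &bookmark);
        }
        spin_unlock_irqrestore(&q->lock, flags);
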
    • sched/wait: Break up long wake list walk · 2554db91
      Committed by Tim Chen
      We encountered workloads that have very long wake-up lists on large
      systems. A waker takes a long time to traverse the entire wake list
      and execute all the wake functions.

      We saw page wait lists up to 3700+ entries long in tests on large
      4- and 8-socket systems. It took 0.8 seconds to traverse such a list
      during wake-up, and any other CPU contending for the list spinlock
      spins for that long. This is a result of NUMA-balancing migration of
      hot pages that are shared by many threads.
      
      Multiple waking CPUs queue up behind the lock, and the last one
      queued has to wait until all the others have finished their wake-ups.
      
      The page wait list is traversed with interrupts disabled, which
      causes various problems. This was the original trigger for the NMI
      watchdog timeouts in https://patchwork.kernel.org/patch/9800303/ ;
      at the time, only extending the NMI watchdog timeout helped.
      
      This patch bookmarks the waker's scan position in the wake list and
      breaks up the wake-up walk, allowing other CPUs to access the list
      before the waker resumes its walk down the rest of the wait list.
      This lowers interrupt and rescheduling latency.
      
      This patch also provides a performance boost when combined with the
      next patch, which breaks up the page wakeup list walk. We saw a 22%
      improvement in the will-it-scale file pread2 test on a Xeon Phi
      system running 256 threads.
      
      [ v2: Merged in Linus' changes to remove the bookmark_wake_function, and
        simplify access to flags. ]
      Reported-by: Kan Liang <kan.liang@intel.com>
      Tested-by: Kan Liang <kan.liang@intel.com>
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2554db91
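
      Below is a self-contained userspace analogue of the batching idea (not
      kernel code): wake at most a fixed batch of waiters per lock hold and
      release the lock between batches. The kernel needs a dummy bookmark
      entry because woken waiters dequeue themselves; here the waker dequeues
      them, so the list head itself serves as the bookmark.

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define BATCH 64                        /* entries woken per lock hold */

        struct waiter {
                struct waiter *next;
                int id;
        };

        static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
        static struct waiter *wait_list;        /* stand-in for a long page wait list */

        static void wake_one(struct waiter *w)
        {
                /* Stand-in for the real wake-up work. */
                printf("wake %d\n", w->id);
                free(w);
        }

        static void wake_all_batched(void)
        {
                int more;

                do {
                        int n = 0;

                        pthread_mutex_lock(&list_lock);
                        while (wait_list && n < BATCH) {
                                struct waiter *w = wait_list;

                                wait_list = w->next;
                                wake_one(w);
                                n++;
                        }
                        more = wait_list != NULL;
                        pthread_mutex_unlock(&list_lock);
                        /* Lock released here: other threads get a turn at the list. */
                } while (more);
        }

        int main(void)
        {
                /* Build a long list, mimicking thousands of waiters on one page. */
                for (int i = 0; i < 1000; i++) {
                        struct waiter *w = malloc(sizeof(*w));

                        w->id = i;
                        w->next = wait_list;
                        wait_list = w;
                }
                wake_all_batched();
                return 0;
        }
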
  3. 12 September 2017: 4 commits
  4. 11 September 2017: 1 commit
  5. 09 September 2017: 3 commits
  6. 07 September 2017: 2 commits
  7. 29 August 2017: 1 commit
    • smp: Avoid using two cache lines for struct call_single_data · 966a9671
      Committed by Ying Huang
      struct call_single_data is used in IPIs to transfer information between
      CPUs.  Its size is bigger than sizeof(unsigned long) but smaller than a
      cache line, and it is currently not allocated with any explicit alignment
      requirement.  This makes it possible for an allocated call_single_data to
      straddle two cache lines, which doubles the number of cache lines that
      need to be transferred among CPUs.

      This can be fixed by aligning call_single_data to its own size.  The size
      of call_single_data is currently a power of 2; if new fields are added,
      padding may be needed to keep the size of the new definition a power of 2
      as well.

      Fortunately, this is enforced by GCC, which will report bad sizes.
      
      To set the alignment of call_single_data to its own size, a struct
      definition plus a typedef are used.
      
      To test the effect of the patch, I used the vm-scalability multi-threaded
      swap test case (swap-w-seq-mt).  The test creates multiple threads, and
      each thread eats memory until all RAM and part of swap is used, so a huge
      number of IPIs is triggered when unmapping memory.  In the test, memory
      write throughput improves by ~5% compared with misaligned
      call_single_data, because of faster IPIs.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Huang, Ying <ying.huang@intel.com>
      [ Add call_single_data_t and align with size of call_single_data. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      966a9671
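
      A compilable sketch of the alignment trick (the field types here are
      simplified stand-ins; the real struct uses llist_node and friends, and
      the kernel spells the attribute __aligned()):

        #include <stdio.h>

        struct fake_llist_node { struct fake_llist_node *next; };
        typedef void (*smp_call_func_t)(void *info);

        /* Bigger than a long, smaller than a cache line, like call_single_data. */
        struct __call_single_data {
                struct fake_llist_node llist;
                smp_call_func_t func;
                void *info;
                unsigned int flags;
        };

        /*
         * Align the type to its own size so one object never straddles two
         * cache lines.  This only works because the size is a power of 2.
         */
        typedef struct __call_single_data call_single_data_t
                __attribute__((aligned(sizeof(struct __call_single_data))));

        _Static_assert((sizeof(struct __call_single_data) &
                        (sizeof(struct __call_single_data) - 1)) == 0,
                       "size must be a power of 2");

        int main(void)
        {
                printf("size=%zu align=%zu\n",
                       sizeof(call_single_data_t), _Alignof(call_single_data_t));
                return 0;
        }
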
  8. 28 August 2017: 1 commit
    • Minor page waitqueue cleanups · 3510ca20
      Committed by Linus Torvalds
      Tim Chen and Kan Liang have been battling a customer load that shows
      extremely long page wakeup lists.  The cause seems to be constant NUMA
      migration of a hot page that is shared across a lot of threads, but the
      actual root cause for the exact behavior has not been found.
      
      Tim has a patch that batches the wait list traversal at wakeup time, so
      that we at least don't get long uninterruptible cases where we traverse
      and wake up thousands of processes and get nasty latency spikes.  That
      is likely 4.14 material, but we're still discussing the page waitqueue
      specific parts of it.
      
      In the meantime, I've tried to look at making the page wait queues less
      expensive, and failing miserably.  If you have thousands of threads
      waiting for the same page, it will be painful.  We'll need to try to
      figure out the NUMA balancing issue some day, in addition to avoiding
      the excessive spinlock hold times.
      
      That said, having tried to rewrite the page wait queues, I can at least
      fix up some of the braindamage in the current situation. In particular:
      
       (a) we don't want to continue walking the page wait list if the bit
           we're waiting for already got set again (which seems to be one of
           the patterns of the bad load).  That makes no progress and just
           causes pointless cache pollution chasing the pointers.
      
       (b) we don't want to put the non-locking waiters always on the front of
           the queue, and the locking waiters always on the back.  Not only is
           that unfair, it means that we wake up thousands of reading threads
           that will just end up being blocked by the writer later anyway.
      
      Also add a comment about the layout of 'struct wait_page_key' - there is
      an external user of it in the cachefiles code that means that it has to
      match the layout of 'struct wait_bit_key' in the first two members.  It
      so happens to match, because 'struct page *' and 'unsigned long *' end
      up having the same values simply because the page flags are the first
      member in struct page.
      
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3510ca20
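
      A schematic of idea (a), with hypothetical types and helpers rather
      than the actual patch: the waker re-checks the contended bit between
      wake-ups and gives up once it is set again, since the remaining
      waiters would only block on it anyway.

        struct waiter {
                struct waiter *next;
                void (*wake)(struct waiter *w);
        };

        static int wake_waiters_for_bit(struct waiter *head,
                                        const unsigned long *word,
                                        unsigned int bit_nr)
        {
                int woken = 0;

                for (struct waiter *w = head; w; w = w->next) {
                        if (*word & (1UL << bit_nr))    /* bit set again: stop */
                                break;
                        w->wake(w);
                        woken++;
                }
                return woken;
        }
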
  9. 25 August 2017: 4 commits
    • sched/debug: Optimize sched_domain sysctl generation · bbdacdfe
      Committed by Peter Zijlstra
      Currently we unconditionally destroy all sysctl bits and regenerate
      them after we've rebuilt the domains (even if that rebuild is a
      no-op).
      
      And since we unconditionally (re)build the sysctl for all possible
      CPUs, onlining all CPUs gets us O(n^2) time. Instead change this to
      only rebuild the bits for CPUs we've actually installed new domains
      on.
      Reported-by: Ofer Levi(SW) <oferle@mellanox.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bbdacdfe
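
      The shape of the optimization, as a small standalone sketch with
      hypothetical helper names: regenerate per-CPU entries only for CPUs
      whose domains were actually rebuilt, instead of for every possible
      CPU every time.

        #include <stdint.h>
        #include <stdio.h>

        #define NR_CPUS 64

        static void regen_cpu_entry(int cpu)
        {
                printf("regenerating sysctl entries for cpu%d\n", cpu);
        }

        static void regen_sched_sysctl(uint64_t rebuilt_mask)
        {
                for (int cpu = 0; cpu < NR_CPUS; cpu++)
                        if (rebuilt_mask & (UINT64_C(1) << cpu))
                                regen_cpu_entry(cpu);   /* untouched CPUs skipped */
        }

        int main(void)
        {
                regen_sched_sysctl(UINT64_C(1) << 3 | UINT64_C(1) << 5);
                return 0;
        }
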
    • sched/topology: Avoid pointless rebuild · 09e0dd8e
      Committed by Peter Zijlstra
      Fix partition_sched_domains() to try and preserve the existing
      machine-wide domain instead of unconditionally destroying it. We do
      this by attempting to allocate the new single domain; only when that
      fails do we reuse the fallback_doms.

      When using fallback_doms we need to destroy first and then recreate,
      because both the old and the new set could be backed by it.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ofer Levi(SW) <oferle@mellanox.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet.Gupta1@synopsys.com <Vineet.Gupta1@synopsys.com>
      Cc: rusty@rustcorp.com.au <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      09e0dd8e
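
      A rough standalone sketch of the allocate-or-fall-back part described
      above (simplified types; the comparison/preservation logic of
      partition_sched_domains() is omitted):

        #include <stdlib.h>

        struct domain_set { int ndoms; };

        static struct domain_set fallback_doms;        /* static, always available */

        static void build_domains(struct domain_set *d) { d->ndoms = 1; }

        static void destroy_domains(struct domain_set *d)
        {
                d->ndoms = 0;
                if (d != &fallback_doms)
                        free(d);
        }

        static struct domain_set *rebuild_single_domain(struct domain_set *old)
        {
                struct domain_set *new = malloc(sizeof(*new));

                if (new) {
                        /* Preferred path: build the new set, then retire the old. */
                        build_domains(new);
                        destroy_domains(old);
                } else {
                        /*
                         * Allocation failed: reuse the static fallback.  Destroy
                         * first, then recreate, because 'old' may itself be
                         * backed by fallback_doms.
                         */
                        destroy_domains(old);
                        new = &fallback_doms;
                        build_domains(new);
                }
                return new;
        }
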
    • sched/topology: Improve comments · a090c4f2
      Committed by Peter Zijlstra
      Mike provided a better comment for destroy_sched_domain() ...
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a090c4f2
    • sched/topology: Fix memory leak in __sdt_alloc() · 213c5a45
      Committed by Shu Wang
      Found this issue with kmemleak: the 'sg' and 'sgc' pointers from
      __sdt_alloc() can be leaked because each domain holds references to
      many groups, but destroy_sched_domain() only drops the first group's
      reference.

      Onlining and offlining a CPU can trigger this leak and eventually
      cause OOM.
      
      Reproducer on my 6-CPU machine:
      
        while true
        do
            echo 0 > /sys/devices/system/cpu/cpu5/online;
            echo 1 > /sys/devices/system/cpu/cpu5/online;
        done
      
        unreferenced object 0xffff88007d772a80 (size 64):
          comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s)
          hex dump (first 32 bytes):
            c0 22 77 7d 00 88 ff ff 02 00 00 00 01 00 00 00  ."w}............
            40 2a 77 7d 00 88 ff ff 00 00 00 00 00 00 00 00  @*w}............
          backtrace:
            [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0
            [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280
            [<ffffffff810d94a8>] build_sched_domains+0x1e8/0xf20
            [<ffffffff810da674>] partition_sched_domains+0x304/0x360
            [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40
            [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0
            [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400
            [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0
            [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0
            [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160
            [<ffffffff810af5d9>] kthread+0x109/0x140
            [<ffffffff81770e45>] ret_from_fork+0x25/0x30
            [<ffffffffffffffff>] 0xffffffffffffffff
      
        unreferenced object 0xffff88007d772a40 (size 64):
          comm "cpuhp/5", pid 39, jiffies 4294719962 (age 35.251s)
          hex dump (first 32 bytes):
            03 00 00 00 00 00 00 00 00 04 00 00 00 00 00 00  ................
            00 04 00 00 00 00 00 00 4f 3c fc ff 00 00 00 00  ........O<......
          backtrace:
            [<ffffffff8176525a>] kmemleak_alloc+0x4a/0xa0
            [<ffffffff8121efe1>] __kmalloc_node+0xf1/0x280
            [<ffffffff810da16d>] build_sched_domains+0xead/0xf20
            [<ffffffff810da674>] partition_sched_domains+0x304/0x360
            [<ffffffff81139557>] cpuset_update_active_cpus+0x17/0x40
            [<ffffffff810bdb2e>] sched_cpu_activate+0xae/0xc0
            [<ffffffff810900e0>] cpuhp_invoke_callback+0x90/0x400
            [<ffffffff81090597>] cpuhp_up_callbacks+0x37/0xb0
            [<ffffffff81090887>] cpuhp_thread_fun+0xd7/0xf0
            [<ffffffff810b37e0>] smpboot_thread_fn+0x110/0x160
            [<ffffffff810af5d9>] kthread+0x109/0x140
            [<ffffffff81770e45>] ret_from_fork+0x25/0x30
            [<ffffffffffffffff>] 0xffffffffffffffff
      Reported-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Shu Wang <shuwang@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Chunyu Hu <chuhu@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: liwang@redhat.com
      Link: http://lkml.kernel.org/r/1502351536-9108-1-git-send-email-shuwang@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      213c5a45
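
      The leak pattern, as a simplified standalone sketch (illustrative
      structures, not the scheduler's sched_group code): the old code
      dropped only the first group's reference, while the fix walks the
      whole circular group list and drops every reference.

        #include <stdlib.h>

        struct group {
                struct group *next;     /* circular, singly linked */
                int ref;
                void *capacity;         /* second allocation tied to the group ('sgc') */
        };

        static void put_group(struct group *g)
        {
                if (--g->ref == 0) {
                        free(g->capacity);
                        free(g);
                }
        }

        static void put_all_groups(struct group *first)
        {
                struct group *g = first;

                do {
                        struct group *next = g->next;   /* read before a possible free */

                        put_group(g);
                        g = next;
                } while (g != first);
        }
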
  10. 18 August 2017: 2 commits
    • cpufreq: schedutil: Always process remote callback with slow switching · c49cbc19
      Committed by Viresh Kumar
      The frequency update from the utilization update handlers can be divided
      into two parts:
      
      (A) Finding the next frequency
      (B) Updating the frequency
      
      While any CPU can do (A), (B) can be restricted to a group of CPUs only,
      depending on the current platform.
      
      For platforms where fast cpufreq switching is possible, both (A) and (B)
      are always done from the same CPU and that CPU should be capable of
      changing the frequency of the target CPU.
      
      But for platforms where fast cpufreq switching isn't possible, after
      doing (A) we wake up a kthread which will eventually do (B). This
      kthread is already bound to the right set of CPUs, i.e. only those which
      can change the frequency of CPUs of a cpufreq policy. And so any CPU
      can actually do (A) in this case, as the frequency is updated from the
      right set of CPUs only.
      
      Check cpufreq_can_do_remote_dvfs() only for the fast switching case.
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      c49cbc19
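
      A standalone sketch of where the check ends up (struct policy,
      can_do_remote_dvfs() and the other names below are illustrative
      stand-ins, not the schedutil source): part (A) computes the next
      frequency, part (B) programs it.

        #include <stdbool.h>

        struct policy {
                bool fast_switch_enabled;
                bool dvfs_possible_from_any_cpu;
        };

        /* Stand-in for cpufreq_can_do_remote_dvfs(); the real check differs. */
        static bool can_do_remote_dvfs(const struct policy *p, int this_cpu, int target)
        {
                return this_cpu == target || p->dvfs_possible_from_any_cpu;
        }

        static unsigned int compute_next_freq(const struct policy *p, int target)
        {
                (void)p; (void)target;
                return 1000000;                 /* kHz, placeholder */
        }

        static void fast_switch(struct policy *p, unsigned int f)     { (void)p; (void)f; }
        static void queue_freq_work(struct policy *p, unsigned int f) { (void)p; (void)f; }

        static void utilization_update(struct policy *p, int this_cpu, int target_cpu)
        {
                /*
                 * Only the fast-switch case needs the capability check: with
                 * slow switching, (B) runs later in a kthread bound to the
                 * policy's CPUs, so any CPU may safely run (A) here.
                 */
                if (p->fast_switch_enabled &&
                    !can_do_remote_dvfs(p, this_cpu, target_cpu))
                        return;

                unsigned int next = compute_next_freq(p, target_cpu);   /* (A) */

                if (p->fast_switch_enabled)
                        fast_switch(p, next);           /* (B) on this CPU */
                else
                        queue_freq_work(p, next);       /* (B) deferred to the kthread */
        }
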
    • cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily · e2cabe48
      Committed by Viresh Kumar
      Utilization update callbacks are now processed remotely, even on the
      CPUs that don't share cpufreq policy with the target CPU (if
      dvfs_possible_from_any_cpu flag is set).
      
      But in non-fast switch paths, the frequency is changed only from one of
      policy->related_cpus. This happens because the kthread which does the
      actual update is bound to a subset of CPUs (i.e. related_cpus).
      
      Allow frequency to be remotely updated as well (i.e. call
      __cpufreq_driver_target()) if dvfs_possible_from_any_cpu flag is set.
      Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      e2cabe48
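
      The complementary slow-path side, again as an illustrative standalone
      sketch with hypothetical names: the worker applies the update even when
      running on a remote CPU, provided the dvfs_possible_from_any_cpu-style
      flag is set.

        #include <stdbool.h>

        struct slow_policy {
                bool dvfs_possible_from_any_cpu;
                unsigned long related_cpus;     /* bitmask of CPUs sharing the policy */
        };

        static int current_cpu(void) { return 0; }      /* stand-in */

        static void driver_set_freq(struct slow_policy *p, unsigned int f)
        {
                (void)p; (void)f;       /* __cpufreq_driver_target()-like call */
        }

        static void apply_freq_update(struct slow_policy *p, unsigned int next_freq)
        {
                bool local = p->related_cpus & (1UL << current_cpu());

                /* Previously only a 'local' CPU was allowed to program the frequency. */
                if (local || p->dvfs_possible_from_any_cpu)
                        driver_set_freq(p, next_freq);
        }
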
  11. 17 August 2017: 3 commits
    • completion: Replace spin_unlock_wait() with lock/unlock pair · dec13c42
      Committed by Paul E. McKenney
      There is no agreed-upon definition of spin_unlock_wait()'s semantics,
      and it appears that all callers could do just as well with a lock/unlock
      pair.  This commit therefore replaces the spin_unlock_wait() call in
      completion_done() with spin_lock() followed immediately by spin_unlock().
      This should be safe from a performance perspective because the lock is
      held only briefly, as the wakeup happens very quickly.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      dec13c42
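
      A minimal userspace sketch of the substitution (a pthread spinlock
      stands in for the completion's wait-queue lock, and the 'done' handling
      is simplified):

        #include <pthread.h>
        #include <stdbool.h>

        static pthread_spinlock_t wait_lock;    /* init with pthread_spin_init() */
        static unsigned int done;

        static bool completion_done_like(void)
        {
                if (!done)
                        return false;

                /*
                 * Was: spin_unlock_wait(&wait_lock).  Taking and immediately
                 * releasing the lock still orders us after any wake-up that
                 * was in progress, and the lock is only ever held briefly.
                 */
                pthread_spin_lock(&wait_lock);
                pthread_spin_unlock(&wait_lock);
                return true;
        }
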
    • membarrier: Provide expedited private command · 22e4ebb9
      Committed by Mathieu Desnoyers
      Implement MEMBARRIER_CMD_PRIVATE_EXPEDITED with IPIs, using a cpumask
      built from all runqueues on which the current thread's mm matches that
      of the thread calling sys_membarrier. It executes faster than the
      non-expedited variant (no blocking). It also works on NOHZ_FULL
      configurations.
      
      Scheduler-wise, it requires a memory barrier before and after context
      switching between processes (which have different mm). The memory
      barrier before context switch is already present. For the barrier after
      context switch:
      
      * Our TSO archs can do RELEASE without being a full barrier. Look at
        x86 spin_unlock() being a regular STORE for example.  But for those
        archs, all atomics imply smp_mb and all of them have atomic ops in
        switch_mm() for mm_cpumask(), and on x86 the CR3 load acts as a full
        barrier.
      
      * Of all weakly ordered machines, only ARM64 and PPC can do RELEASE;
        the rest do indeed do smp_mb(), so there the spin_unlock() is a full
        barrier and we're good.
      
      * ARM64 has a very heavy barrier in switch_to(), which suffices.
      
      * PPC just removed its barrier from switch_to(), but appears to be
        talking about adding something to switch_mm(). So add a
        smp_mb__after_unlock_lock() for now, until this is settled on the PPC
        side.
      
      Changes since v3:
      - Properly document the memory barriers provided by each architecture.
      
      Changes since v2:
      - Address comments from Peter Zijlstra,
      - Add smp_mb__after_unlock_lock() after finish_lock_switch() in
        finish_task_switch() to add the memory barrier we need after storing
        to rq->curr. This is much simpler than the previous approach relying
        on atomic_dec_and_test() in mmdrop(), which actually added a memory
        barrier in the common case of switching between userspace processes.
      - Return -EINVAL when MEMBARRIER_CMD_SHARED is used on a nohz_full
        kernel, rather than having the whole membarrier system call returning
        -ENOSYS. Indeed, CMD_PRIVATE_EXPEDITED is compatible with nohz_full.
        Adapt the CMD_QUERY mask accordingly.
      
      Changes since v1:
      - move membarrier code under kernel/sched/ because it uses the
        scheduler runqueue,
      - only add the barrier when we switch from a kernel thread. The case
        where we switch from a user-space thread is already handled by
        the atomic_dec_and_test() in mmdrop().
      - add a comment to mmdrop() documenting the requirement on the implicit
        memory barrier.
      
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      CC: Boqun Feng <boqun.feng@gmail.com>
      CC: Andrew Hunter <ahh@google.com>
      CC: Maged Michael <maged.michael@gmail.com>
      CC: gromer@google.com
      CC: Avi Kivity <avi@scylladb.com>
      CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: Paul Mackerras <paulus@samba.org>
      CC: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Dave Watson <davejwatson@fb.com>
      22e4ebb9
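
      A userspace usage sketch of the new command (requires a kernel with
      this series; note that later kernels also demand a prior
      MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED registration, which is not
      shown here):

        #define _GNU_SOURCE
        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdio.h>

        static long membarrier_syscall(int cmd, int flags)
        {
                return syscall(__NR_membarrier, cmd, flags);
        }

        int main(void)
        {
                long supported = membarrier_syscall(MEMBARRIER_CMD_QUERY, 0);

                if (supported < 0 || !(supported & MEMBARRIER_CMD_PRIVATE_EXPEDITED)) {
                        fprintf(stderr, "private expedited membarrier unavailable\n");
                        return 1;
                }

                /*
                 * Issues a memory barrier, via IPI, on every CPU currently
                 * running a thread that shares this process's mm.
                 */
                if (membarrier_syscall(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0)) {
                        perror("membarrier");
                        return 1;
                }
                return 0;
        }
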
    • sched/completion: Document that reinit_completion() must be called after complete_all() · 9c878320
      Committed by Steven Rostedt
      The complete_all() function sets the completion's "done" variable to
      UINT_MAX, and no other caller (wait_for_completion(), etc.) will set
      it back to zero. That means that after any call to complete_all(),
      reinit_completion() must be called before that completion can be used
      again.

      Document this fact at the complete_all() function.

      Also document that completion_done() will always return true after
      complete_all() has been called.
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170816131202.195c2f4b@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9c878320
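
      A kernel-style usage sketch of the documented rule (illustrative code,
      not from the patch): after complete_all(), the completion stays "done"
      until it is explicitly re-armed.

        #include <linux/completion.h>

        static DECLARE_COMPLETION(setup_done);

        static void producer(void)
        {
                /* ... finish setup ... */
                complete_all(&setup_done);      /* releases current and future waiters */
        }

        static void consumer(void)
        {
                wait_for_completion(&setup_done);       /* returns at once when done */
        }

        static void start_new_round(void)
        {
                /*
                 * Mandatory before reusing the completion: nothing else ever
                 * resets 'done' (== UINT_MAX) after complete_all().
                 */
                reinit_completion(&setup_done);
        }
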
  12. 12 August 2017: 1 commit
  13. 11 August 2017: 2 commits
  14. 10 August 2017: 12 commits