1. 09 Mar 2018 (9 commits)
  2. 04 Mar 2018 (4 commits)
    • sched/core: Undefine tracepoint creation at the end of core.c · 14a7405b
      Ingo Molnar authored
      Make it easier to concatenate all the scheduler .c files for single-module
      compilation.
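
      The change is presumably just this at the bottom of
      kernel/sched/core.c (a sketch based on the changelog, not the
      verbatim patch):

        /*
         * Undo the tracepoint-definition state so that a subsequent .c
         * file in a concatenated build can create its own tracepoints.
         */
        #undef CREATE_TRACE_POINTS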
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline, rt: Rename queue_push_tasks/queue_pull_task to create separate namespace · 02d8ec94
      Ingo Molnar authored
      There are similarly named functions in both of these modules:
      
        kernel/sched/deadline.c:static inline void queue_push_tasks(struct rq *rq)
        kernel/sched/deadline.c:static inline void queue_pull_task(struct rq *rq)
        kernel/sched/deadline.c:static inline void queue_push_tasks(struct rq *rq)
        kernel/sched/deadline.c:static inline void queue_pull_task(struct rq *rq)
        kernel/sched/deadline.c:	queue_push_tasks(rq);
        kernel/sched/deadline.c:	queue_pull_task(rq);
        kernel/sched/deadline.c:			queue_push_tasks(rq);
        kernel/sched/deadline.c:			queue_pull_task(rq);
        kernel/sched/rt.c:static inline void queue_push_tasks(struct rq *rq)
        kernel/sched/rt.c:static inline void queue_pull_task(struct rq *rq)
        kernel/sched/rt.c:static inline void queue_push_tasks(struct rq *rq)
        kernel/sched/rt.c:	queue_push_tasks(rq);
        kernel/sched/rt.c:	queue_pull_task(rq);
        kernel/sched/rt.c:			queue_push_tasks(rq);
        kernel/sched/rt.c:			queue_pull_task(rq);
      
      ... which makes it harder to grep for them. Prefix them with
      deadline_ and rt_, respectively.
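
      After the rename the helpers presumably read as follows (a sketch of
      the resulting declarations, not the verbatim patch):

        /* kernel/sched/deadline.c */
        static inline void deadline_queue_push_tasks(struct rq *rq);
        static inline void deadline_queue_pull_task(struct rq *rq);

        /* kernel/sched/rt.c */
        static inline void rt_queue_push_tasks(struct rq *rq);
        static inline void rt_queue_pull_task(struct rq *rq);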
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/idle: Merge kernel/sched/idle.c and kernel/sched/idle_task.c · a92057e1
      Ingo Molnar authored
      Merge these two small .c modules as they implement two aspects
      of idle task handling.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/headers: Simplify and clean up header usage in the scheduler · 325ea10c
      Ingo Molnar authored
      Do the following cleanups and simplifications:
      
       - sched/sched.h already includes <asm/paravirt.h>, so no need to
         include it in sched/core.c again.
      
       - order the <linux/sched/*.h> headers alphabetically
      
       - add all <linux/sched/*.h> headers to kernel/sched/sched.h
      
       - remove all unnecessary includes from the .c files that
         are already included in kernel/sched/sched.h.
      
      Finally, make all scheduler .c files use a single common header:
      
        #include "sched.h"
      
      ... which now contains a union of the relied upon headers.
      
      This makes the various .c files easier to read and easier to handle.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 03 Mar 2018 (3 commits)
    • sched: Clean up and harmonize the coding style of the scheduler code base · 97fb7a0a
      Ingo Molnar authored
      A good number of small style inconsistencies have accumulated
      in the scheduler core, so do a pass over them to harmonize
      all these details:
      
       - fix spelling in comments,
      
       - use curly braces for multi-line statements,
      
       - remove unnecessary parentheses from integer literals,
      
       - capitalize consistently,
      
       - remove stray newlines,
      
       - add comments where necessary,
      
       - remove invalid/unnecessary comments,
      
       - align structure definitions and other data types vertically,
      
       - add missing newlines for increased readability,
      
       - fix vertical tabulation where it's misaligned,
      
       - harmonize preprocessor conditional block labeling
         and vertical alignment,
      
       - remove line-breaks where they uglify the code,
      
       - add newlines after local variable definitions.
      
      No change in functionality:
      
        md5:
           1191fa0a890cfa8132156d2959d7e9e2  built-in.o.before.asm
           1191fa0a890cfa8132156d2959d7e9e2  built-in.o.after.asm
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Clean up various coding style details · c2e51382
      Mario Leinweber authored
      - Fixed style error: Missing space before the open parenthesis
      - Fixed style warnings: 2x Missing blank line after declaration
      
      One warning left: else after return
       (I don't feel comfortable fixing that without side effects)
      Signed-off-by: Mario Leinweber <marioleinweber@web.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180302182007.28691-1-marioleinweber@web.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • memremap: fix softlockup reports at teardown · 949b9325
      Dan Williams authored
      The cond_resched() currently in the setup path needs to be duplicated in
      the teardown path. Rather than require each instance of
      for_each_device_pfn() to open code the same sequence, embed it in the
      helper.
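
      The embedded helper presumably takes this shape (a sketch following
      the description above, not the verbatim patch; the 1024-pfn stride
      is an assumed batching interval):

        /* Yield the CPU periodically while walking a huge pfn range */
        static unsigned long pfn_next(unsigned long pfn)
        {
                if (pfn % 1024 == 0)
                        cond_resched();
                return pfn + 1;
        }

        #define for_each_device_pfn(pfn, map) \
                for (pfn = pfn_first(map); pfn < pfn_end(map); pfn = pfn_next(pfn))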
      
      Link: https://github.com/intel/ixpdimm_sw/issues/11
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: <stable@vger.kernel.org>
      Fixes: 71389703 ("mm, zone_device: Replace {get, put}_zone_device_page()...")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  4. 01 Mar 2018 (1 commit)
    • timers: Forward timer base before migrating timers · c52232a4
      Lingutla Chandrasekhar authored
      On CPU hotunplug the enqueued timers of the unplugged CPU are migrated to a
      live CPU. This happens from the control thread which initiated the unplug.
      
      If the CPU on which the control thread runs came out from a longer idle
      period then the base clock of that CPU might be stale because the control
      thread runs prior to any event which forwards the clock.
      
      In such a case the timers from the unplugged CPU are queued on the live CPU
      based on the stale clock which can cause large delays due to increased
      granularity of the outer timer wheels which are far away from base->clk.
      
      But there is a worse problem than that. The following sequence of events
      illustrates it:
      
       - CPU0: timer1 is queued with expires = 59969 and base->clk = 59131.
      
         The timer is queued at wheel level 2, with resulting expiry time = 60032
         (due to level granularity).
      
       - CPU1 enters idle @60007, with next timer expiry @60020.
      
       - CPU0 is unplugged @60009
      
       - CPU1 exits idle and runs the control thread which migrates the
         timers from CPU0
      
         timer1 is now queued in level 0 for immediate handling in the next
         softirq because the requested expiry time 59969 is before CPU1 base->clk
         60007
      
       - CPU1 runs code which forwards the base clock, which succeeds because the
         next expiring timer, which was collected at idle entry time, is still set
         to 60020.
      
         So it forwards beyond 60007 and therefore misses to expire the migrated
         timer1. That timer gets expired when the wheel wraps around again, which
         takes between 63 and 630ms depending on the HZ setting.
      
      Address both problems by invoking forward_timer_base() for the control CPU's
      timer base. All other places, which might run into a similar problem
      (mod_timer()/add_timer_on()) already invoke forward_timer_base() to avoid
      that.
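
      In the migration path this presumably amounts to one call (a sketch
      assuming the timers_dead_cpu()/forward_timer_base() naming of
      kernel/time/timer.c):

        /*
         * In timers_dead_cpu(): the base clock of the CPU collecting the
         * timers might be stale, so forward it before queueing the dead
         * CPU's timers on it.
         */
        forward_timer_base(new_base);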
      
      [ tglx: Massaged comment and changelog ]
      
      Fixes: a683f390 ("timers: Forward the wheel clock whenever possible")
      Co-developed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Signed-off-by: Neeraj Upadhyay <neeraju@codeaurora.org>
      Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Cc: linux-arm-msm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180118115022.6368-1-clingutla@codeaurora.org
  5. 27 Feb 2018 (1 commit)
    • printk: Wake klogd when passing console_lock owner · c14376de
      Petr Mladek authored
      wake_klogd is a local variable in console_unlock(). The information
      is lost when console_lock is handed over to a busy-waiting owner,
      the mechanism added by commit dbdda842 ("printk: Add console owner
      and waiter logic to load balance console writes"). The following
      race is possible:
      
      CPU0				CPU1
      console_unlock()
      
        for (;;)
           /* calling console for last message */
      
      				printk()
      				  log_store()
      				    log_next_seq++;
      
           /* see new message */
           if (seen_seq != log_next_seq) {
      	wake_klogd = true;
      	seen_seq = log_next_seq;
           }
      
           console_lock_spinning_enable();
      
      				  if (console_trylock_spinning())
      				     /* spinning */
      
           if (console_lock_spinning_disable_and_check()) {
      	printk_safe_exit_irqrestore(flags);
      	return;
      
      				  console_unlock()
      				    if (seen_seq != log_next_seq) {
      				    /* already seen */
      				    /* nothing to do */
      
      Result: Nobody would wake up klogd.
      
      One solution would be to make wake_klogd a global variable. But then
      it would need to be manipulated under a lock or similar.
      
      This patch wakes klogd also when console_lock is passed to the
      spinning waiter. It looks like the right way to go. Also userspace
      should have a chance to see and store any "flood" of messages.
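
      A minimal sketch of the idea, assuming the console_unlock() names
      shown in the race diagram above:

        if (console_lock_spinning_disable_and_check()) {
                printk_safe_exit_irqrestore(flags);
                /* The lock moved to the spinning waiter: still wake klogd */
                if (wake_klogd)
                        wake_up_klogd();
                return;
        }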
      
      Note that the very late klogd wake up was a historic solution.
      It made sense on single CPU systems or when sys_syslog() operations
      were synchronized using the big kernel lock like in v2.1.113.
      But it is questionable these days.
      
      Fixes: dbdda842 ("printk: Add console owner and waiter logic to load balance console writes")
      Link: http://lkml.kernel.org/r/20180226155734.dzwg3aovqnwtvkoy@pathway.suse.cz
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: linux-kernel@vger.kernel.org
      Cc: Tejun Heo <tj@kernel.org>
      Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
  6. 24 Feb 2018 (1 commit)
    • bpf: allow xadd only on aligned memory · ca369602
      Daniel Borkmann authored
      The requirements around atomic_add() / atomic64_add() and their
      JIT implementations differ across architectures. E.g. while x86_64
      seems just fine with BPF's xadd on unaligned memory, on arm64 it
      triggers the following crash via both the interpreter and the JIT:
      
        [  830.864985] Unable to handle kernel paging request at virtual address ffff8097d7ed6703
        [...]
        [  830.916161] Internal error: Oops: 96000021 [#1] SMP
        [  830.984755] CPU: 37 PID: 2788 Comm: test_verifier Not tainted 4.16.0-rc2+ #8
        [  830.991790] Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.29 07/17/2017
        [  830.998998] pstate: 80400005 (Nzcv daif +PAN -UAO)
        [  831.003793] pc : __ll_sc_atomic_add+0x4/0x18
        [  831.008055] lr : ___bpf_prog_run+0x1198/0x1588
        [  831.012485] sp : ffff00001ccabc20
        [  831.015786] x29: ffff00001ccabc20 x28: ffff8017d56a0f00
        [  831.021087] x27: 0000000000000001 x26: 0000000000000000
        [  831.026387] x25: 000000c168d9db98 x24: 0000000000000000
        [  831.031686] x23: ffff000008203878 x22: ffff000009488000
        [  831.036986] x21: ffff000008b14e28 x20: ffff00001ccabcb0
        [  831.042286] x19: ffff0000097b5080 x18: 0000000000000a03
        [  831.047585] x17: 0000000000000000 x16: 0000000000000000
        [  831.052885] x15: 0000ffffaeca8000 x14: 0000000000000000
        [  831.058184] x13: 0000000000000000 x12: 0000000000000000
        [  831.063484] x11: 0000000000000001 x10: 0000000000000000
        [  831.068783] x9 : 0000000000000000 x8 : 0000000000000000
        [  831.074083] x7 : 0000000000000000 x6 : 000580d428000000
        [  831.079383] x5 : 0000000000000018 x4 : 0000000000000000
        [  831.084682] x3 : ffff00001ccabcb0 x2 : 0000000000000001
        [  831.089982] x1 : ffff8097d7ed6703 x0 : 0000000000000001
        [  831.095282] Process test_verifier (pid: 2788, stack limit = 0x0000000018370044)
        [  831.102577] Call trace:
        [  831.105012]  __ll_sc_atomic_add+0x4/0x18
        [  831.108923]  __bpf_prog_run32+0x4c/0x70
        [  831.112748]  bpf_test_run+0x78/0xf8
        [  831.116224]  bpf_prog_test_run_xdp+0xb4/0x120
        [  831.120567]  SyS_bpf+0x77c/0x1110
        [  831.123873]  el0_svc_naked+0x30/0x34
        [  831.127437] Code: 97fffe97 17ffffec 00000000 f9800031 (885f7c31)
      
      The reason is that the memory is required to be aligned. In
      case of BPF, we always enforce alignment in terms of stack access,
      but not when accessing map values or packet data when the underlying
      arch (e.g. arm64) has CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS set.
      
      xadd on packet data that is local to us anyway is just wrong, so
      forbid this case entirely. The only place where xadd in fact makes
      sense is map values; xadd on stack is wrong as well, but it's been
      around for much longer. Specifically enforce strict alignment in case
      of xadd, so that we handle this case generically and avoid such crashes
      in the first place.
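
      Conceptually the verifier change looks like this (a sketch using
      kernel/bpf/verifier.c naming, not the verbatim diff):

        /* In check_xadd(): never allow xadd on packet data... */
        if (reg->type == PTR_TO_PACKET) {
                verbose(env, "BPF_XADD not allowed on packet data\n");
                return -EACCES;
        }
        /*
         * ...and perform the subsequent memory-access check with strict
         * alignment enforced, even when the architecture sets
         * CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
         */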
      
      Fixes: 17a52670 ("bpf: verifier (add verifier core)")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  7. 23 Feb 2018 (4 commits)
    • genirq/matrix: Handle CPU offlining proper · 651ca2c0
      Thomas Gleixner authored
      At CPU hotunplug the corresponding per cpu matrix allocator is shut down and
      the allocated interrupt bits are discarded under the assumption that all
      allocated bits have been either migrated away or shut down through the
      managed interrupts mechanism.
      
      This is not true because interrupts which are not started up might have a
      vector allocated on the outgoing CPU. When the interrupt is started up
      later or completely shutdown and freed then the allocated vector is handed
      back, triggering warnings or causing accounting issues which result in
      suspend failures and other issues.
      
      Change the CPU hotplug mechanism of the matrix allocator so that the
      remaining allocations at unplug time are preserved and global accounting at
      hotplug is correctly readjusted to take the dormant vectors into account.
      
      Fixes: 2f75d9e1 ("genirq: Implement bitmap matrix allocator")
      Reported-by: Yuriy Vostrikov <delamonpansie@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Yuriy Vostrikov <delamonpansie@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180222112316.849980972@linutronix.de
    • bpf: fix rcu lockdep warning for lpm_trie map_free callback · 6c5f6102
      Yonghong Song authored
      Commit 9a3efb6b ("bpf: fix memory leak in lpm_trie map_free callback function")
      fixed a memory leak and removed unnecessary locks in the map_free callback
      function. Unfortunately, it introduced a lockdep warning. When lockdep
      checking is turned on, running tools/testing/selftests/bpf/test_lpm_map
      produces:
      
        [   98.294321] =============================
        [   98.294807] WARNING: suspicious RCU usage
        [   98.295359] 4.16.0-rc2+ #193 Not tainted
        [   98.295907] -----------------------------
        [   98.296486] /home/yhs/work/bpf/kernel/bpf/lpm_trie.c:572 suspicious rcu_dereference_check() usage!
        [   98.297657]
        [   98.297657] other info that might help us debug this:
        [   98.297657]
        [   98.298663]
        [   98.298663] rcu_scheduler_active = 2, debug_locks = 1
        [   98.299536] 2 locks held by kworker/2:1/54:
        [   98.300152]  #0:  ((wq_completion)"events"){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
        [   98.301381]  #1:  ((work_completion)(&map->work)){+.+.}, at: [<00000000196bc1f0>] process_one_work+0x157/0x5c0
      
      Since actual trie tree removal happens only after no other
      accesses to the tree are possible, replacing
        rcu_dereference_protected(*slot, lockdep_is_held(&trie->lock))
      with
        rcu_dereference_protected(*slot, 1)
      fixed the issue.
      
      Fixes: 9a3efb6b ("bpf: fix memory leak in lpm_trie map_free callback function")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: add schedule points in percpu arrays management · 32fff239
      Eric Dumazet authored
      syzbot managed to trigger RCU-detected stalls in
      bpf_array_free_percpu().
      
      It takes time to allocate a huge percpu map, but even more time to free
      it.
      
      Since we run in process context, use cond_resched() to yield the CPU if
      needed.
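
      A sketch of the resulting free path (assuming the bpf_array layout
      from kernel/bpf/arraymap.c):

        static void bpf_array_free_percpu(struct bpf_array *array)
        {
                int i;

                for (i = 0; i < array->map.max_entries; i++) {
                        free_percpu(array->pptrs[i]);
                        cond_resched();	/* freeing can take a while, yield */
                }
        }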
      
      Fixes: a10423b8 ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • efivarfs: Limit the rate for non-root to read files · bef3efbe
      Tony Luck authored
      Each read from a file in efivarfs results in two calls to EFI
      (one to get the file size, another to get the actual data).
      
      On X86 these EFI calls result in broadcast system management
      interrupts (SMI) which affect performance of the whole system.
      A malicious user can loop performing reads from efivarfs bringing
      the system to its knees.
      
      Linus suggested a per-user rate limit to solve this.
      
      So we add a ratelimit structure to "user_struct" and initialize it
      with no limit for the root user. When allocating the user_struct for
      other users we set the limit to 100 per second. This could be used
      for other places that want to limit the rate of some detrimental
      user action.

      In efivarfs, if the limit is exceeded when reading, we take an
      interruptible nap for 50ms and check the rate limit again.
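
      The read-side check presumably looks something like this (a sketch,
      assuming a 'ratelimit' field added to struct user_struct):

        while (!__ratelimit(&file->f_cred->user->ratelimit)) {
                if (msleep_interruptible(50))
                        return -EINTR;	/* interrupted by a signal */
        }
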
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 22 Feb 2018 (3 commits)
  9. 21 Feb 2018 (14 commits)
    • extable: Make init_kernel_text() global · 9fbcc57a
      Josh Poimboeuf authored
      Convert init_kernel_text() to a global function and use it in a few
      places instead of manually comparing _sinittext and _einittext.
      
      Note that kallsyms.h has a very similar function called
      is_kernel_inittext(), but its end check is inclusive.  I'm not sure
      whether that's intentional behavior, so I didn't touch it.
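
      The now-global helper is essentially a bounds check against the
      init section markers (a sketch matching the description above):

        int init_kernel_text(unsigned long addr)
        {
                if (addr >= (unsigned long)_sinittext &&
                    addr < (unsigned long)_einittext)
                        return 1;

                return 0;
        }
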
      Suggested-by: Jason Baron <jbaron@akamai.com>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/4335d02be8d45ca7d265d2f174251d0b7ee6c5fd.1519051220.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • jump_label: Warn on failed jump_label patching attempt · dc1dd184
      Josh Poimboeuf authored
      Currently when the jump label code encounters an address which isn't
      recognized by kernel_text_address(), it just silently fails.
      
      This can be dangerous because jump labels are used in a variety of
      places, and are generally expected to work.  Convert the silent failure
      to a warning.
      
      This won't warn about attempted writes to tracepoints in __init code
      after initmem has been freed, as those are already guarded by the
      entry->code check.
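
      The intended behavior is presumably along these lines (a sketch of
      __jump_label_update(), not the verbatim patch):

        if (entry->code && kernel_text_address(entry->code))
                arch_jump_label_transform(entry, jump_label_type(entry));
        else
                WARN_ONCE(1, "can't patch jump_label at %pS",
                          (void *)(unsigned long)entry->code);
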
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/de3a271c93807adb7ed48f4e946b4f9156617680.1519051220.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • jump_label: Explicitly disable jump labels in __init code · 33352244
      Josh Poimboeuf authored
      After initmem has been freed, any jump labels in __init code are
      prevented from being written to by the kernel_text_address() check in
      __jump_label_update().  However, this check is quite broad.  If
      kernel_text_address() were to return false for any other reason, the
      jump label write would fail silently with no warning.
      
      For jump labels in module init code, entry->code is set to zero to
      indicate that the entry is disabled.  Do the same thing for core kernel
      init code.  This makes the behavior more consistent, and will also make
      it more straightforward to detect non-init jump label write failures in
      the next patch.
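
      For core kernel init entries this presumably boils down to (a sketch
      following the module-init precedent described above):

        /* Once initmem is freed, disable jump entries in init text */
        static void jump_label_invalidate_init(void)
        {
                struct jump_entry *entry = __start___jump_table;
                struct jump_entry *stop = __stop___jump_table;

                for (; entry < stop; entry++) {
                        if (init_kernel_text(entry->code))
                                entry->code = 0;
                }
        }
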
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/c52825c73f3a174e8398b6898284ec20d4deb126.1519051220.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/nohz: Remove the 1 Hz tick code · dcdedb24
      Frederic Weisbecker authored
      Now that the 1Hz tick is offloaded to workqueues, we can safely remove
      the residual code that used to handle it locally.
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-7-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/isolation: Offload residual 1Hz scheduler tick · d84b3131
      Frederic Weisbecker authored
      When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
      keep the scheduler stats alive. However this residual tick is a burden
      for bare metal tasks that can't stand any interruption at all, or want
      to minimize them.
      
      The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
      outsource these scheduler ticks to the global workqueue so that a
      housekeeping CPU handles those remotely. The sched_class::task_tick()
      implementations have been audited and look safe to be called remotely
      as the target runqueue and its current task are passed in parameter
      and don't seem to be accessed locally.
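
      A hypothetical sketch of the offloaded tick (assuming a per-CPU
      struct tick_work { int cpu; struct delayed_work work; }; locking and
      bookkeeping omitted):

        /* A housekeeping CPU ticks the remote CPU at ~1Hz and re-arms */
        static void sched_tick_remote(struct work_struct *work)
        {
                struct delayed_work *dwork = to_delayed_work(work);
                struct tick_work *twork =
                        container_of(dwork, struct tick_work, work);
                struct rq *rq = cpu_rq(twork->cpu);

                if (!idle_cpu(twork->cpu))
                        rq->curr->sched_class->task_tick(rq, rq->curr, 0);

                queue_delayed_work(system_unbound_wq, dwork, HZ);
        }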
      
      Note that in the case of using isolcpus, it's still up to the user to
      affine the global workqueues to the housekeeping CPUs through
      /sys/devices/virtual/workqueue/cpumask or domains isolation
      "isolcpus=nohz,domain".
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-6-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/isolation: Isolate workqueues when "nohz_full=" is set · 1bda3f80
      Frederic Weisbecker authored
      As we prepare for offloading the residual 1Hz scheduler ticks to
      workqueue, let's affine those to housekeepers so that they don't
      interrupt the CPUs that don't want to be disturbed.
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-5-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • nohz: Allow to check if remote CPU tick is stopped · 22ab8bc0
      Frederic Weisbecker authored
      This check is racy but provides a good heuristic to determine whether
      a CPU may need a remote tick or not.
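
      The accessor is presumably as simple as (a sketch, assuming the
      per-CPU tick_cpu_sched data of kernel/time/tick-sched.c):

        bool tick_nohz_tick_stopped_cpu(int cpu)
        {
                struct tick_sched *ts = per_cpu_ptr(&tick_cpu_sched, cpu);

                return ts->tick_stopped;
        }
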
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-4-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • nohz: Convert tick_nohz_tick_stopped() to bool · a3642983
      Frederic Weisbecker authored
      This makes the function more self-explanatory about what it does and how
      to use it.
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-3-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Rename init_rq_hrtick() to hrtick_rq_init() · 77a021be
      Frederic Weisbecker authored
      Do that rename in order to normalize the hrtick namespace.
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Link: http://lkml.kernel.org/r/1519186649-3242-2-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine() · 7347fc87
      Mel Gorman authored
      If wake_affine() pulls a task to another node for any reason and the node is
      no longer preferred then temporarily stop automatic NUMA balancing pulling
      the task back. Otherwise, tasks with a strong waker/wakee relationship
      may constantly fight automatic NUMA balancing over where a task should
      be placed.
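
      A plausible shape for the back-off (a hypothetical sketch; the helper
      name is illustrative, while numa_migrate_retry and numa_scan_period
      are existing task_struct fields):

        /* Pause NUMA-balancing migrations for roughly one scan period */
        static void delay_numa_placement_retry(struct task_struct *p)
        {
                p->numa_migrate_retry = jiffies +
                        msecs_to_jiffies(p->numa_scan_period);
        }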
      
      Once again netperf is interesting here. The performance barely changes
      but automatic NUMA balancing is interesting:
      
       Hmean     send-64         354.67 (   0.00%)      352.15 (  -0.71%)
       Hmean     send-128        702.91 (   0.00%)      693.84 (  -1.29%)
       Hmean     send-256       1350.07 (   0.00%)     1344.19 (  -0.44%)
       Hmean     send-1024      5124.38 (   0.00%)     4941.24 (  -3.57%)
       Hmean     send-2048      9687.44 (   0.00%)     9624.45 (  -0.65%)
       Hmean     send-3312     14577.64 (   0.00%)    14514.35 (  -0.43%)
       Hmean     send-4096     16393.62 (   0.00%)    16488.30 (   0.58%)
       Hmean     send-8192     26877.26 (   0.00%)    26431.63 (  -1.66%)
       Hmean     send-16384    38683.43 (   0.00%)    38264.91 (  -1.08%)
       Hmean     recv-64         354.67 (   0.00%)      352.15 (  -0.71%)
       Hmean     recv-128        702.91 (   0.00%)      693.84 (  -1.29%)
       Hmean     recv-256       1350.07 (   0.00%)     1344.19 (  -0.44%)
       Hmean     recv-1024      5124.38 (   0.00%)     4941.24 (  -3.57%)
       Hmean     recv-2048      9687.43 (   0.00%)     9624.45 (  -0.65%)
       Hmean     recv-3312     14577.59 (   0.00%)    14514.35 (  -0.43%)
       Hmean     recv-4096     16393.55 (   0.00%)    16488.20 (   0.58%)
       Hmean     recv-8192     26876.96 (   0.00%)    26431.29 (  -1.66%)
       Hmean     recv-16384    38682.41 (   0.00%)    38263.94 (  -1.08%)
      
       NUMA alloc hit                 1465986     1423090
       NUMA alloc miss                      0           0
       NUMA interleave hit                  0           0
       NUMA alloc local               1465897     1423003
       NUMA base PTE updates             1473        1420
       NUMA huge PMD updates                0           0
       NUMA page range updates           1473        1420
       NUMA hint faults                  1383        1312
       NUMA hint local faults             451         124
       NUMA hint local percent             32           9
      
      There is a slight degradation in performance but there are slightly fewer
      NUMA faults. There is a large drop in the percentage of local faults but
      the bulk of migrations for netperf are in small shared libraries so it's
      reflecting the fact that automatic NUMA balancing has backed off. This is
      a case where, despite wake_affine() and automatic NUMA balancing fighting
      for placement, there is a marginal benefit to rescheduling to local
      data quickly. However, it should be noted that wake_affine() and automatic
      NUMA balancing fighting each other constantly is undesirable.
      
      However, the benefit in other cases is large. This is the result for NAS
      with the D class sizing on a 4-socket machine:
      
       nas-mpi
                                 4.15.0                 4.15.0
                           sdnuma-v1r23       delayretry-v1r23
       Time cg.D      557.00 (   0.00%)      431.82 (  22.47%)
       Time ep.D       77.83 (   0.00%)       79.01 (  -1.52%)
       Time is.D       26.46 (   0.00%)       26.64 (  -0.68%)
       Time lu.D      727.14 (   0.00%)      597.94 (  17.77%)
       Time mg.D      191.35 (   0.00%)      146.85 (  23.26%)
      
                     4.15.0           4.15.0
               sdnuma-v1r23 delayretry-v1r23
       User        75665.20    70413.30
       System      20321.59     8861.67
       Elapsed       766.13      634.92
      
       Minor Faults                  16528502     7127941
       Major Faults                      4553        5068
       NUMA alloc local               6963197     6749135
       NUMA base PTE updates        366409093   107491434
       NUMA huge PMD updates           687556      198880
       NUMA page range updates      718437765   209317994
       NUMA hint faults              13643410     4601187
       NUMA hint local faults         9212593     3063996
       NUMA hint local percent             67          66
      
      Note the massive reduction in system CPU usage even though the percentage
      of local faults is barely affected. There is a massive reduction in the
      number of PTE updates showing that automatic NUMA balancing has backed off.
      A critical observation is also that there is a massive reduction in minor
      faults which is due to far fewer NUMA hinting faults being trapped.
      
      There were questions about NAS OMP and how it behaves when threads are
      bound to CPUs. First, there are more gains than losses with this patch
      applied, and a reduction in system CPU usage:
      
      nas-omp
                            4.16.0-rc1             4.16.0-rc1
                           sdnuma-v2r1        delayretry-v2r1
      Time bt.D      436.71 (   0.00%)      430.05 (   1.53%)
      Time cg.D      201.02 (   0.00%)      180.87 (  10.02%)
      Time ep.D       32.84 (   0.00%)       32.68 (   0.49%)
      Time is.D        9.63 (   0.00%)        9.64 (  -0.10%)
      Time lu.D      331.20 (   0.00%)      304.80 (   7.97%)
      Time mg.D       54.87 (   0.00%)       52.72 (   3.92%)
      Time sp.D     1108.78 (   0.00%)      917.10 (  17.29%)
      Time ua.D      378.81 (   0.00%)      398.83 (  -5.28%)
      
                4.16.0-rc1       4.16.0-rc1
               sdnuma-v2r1  delayretry-v2r1
      User       305633.08   296751.91
      System        451.75      357.80
      Elapsed      2595.73     2368.13
      
      However, it does not close the gap between binding and being unbound. There
      is negligible difference between the performance of the baseline and a
      patched kernel when threads are bound so it is not presented here:
      
                            4.16.0-rc1             4.16.0-rc1
                       delayretry-bind     delayretry-unbound
      Time bt.D      385.02 (   0.00%)      430.05 ( -11.70%)
      Time cg.D      144.02 (   0.00%)      180.87 ( -25.59%)
      Time ep.D       32.85 (   0.00%)       32.68 (   0.52%)
      Time is.D       10.52 (   0.00%)        9.64 (   8.37%)
      Time lu.D      285.31 (   0.00%)      304.80 (  -6.83%)
      Time mg.D       43.21 (   0.00%)       52.72 ( -22.01%)
      Time sp.D      820.24 (   0.00%)      917.10 ( -11.81%)
      Time ua.D      337.09 (   0.00%)      398.83 ( -18.32%)
      
                4.16.0-rc1          4.16.0-rc1
           delayretry-bind  delayretry-unbound
      User       277731.25   296751.91
      System        261.29      357.80
      Elapsed      2100.55     2368.13
      
      Unfortunately, while performance is improved by the patch, there is still
      quite a long way to go before it's equivalent to hard binding.
      
      Other workloads like hackbench, tbench, dbench and schbench are barely
      affected. dbench shows a mix of gains and losses depending on the machine
      although in general, the results are more stable.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-7-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on · 2c833627
      Mel Gorman authored
      find_idlest_group() compares a local group with each other group to select
      the one that is most idle. When comparing groups in different NUMA domains,
      a very slight imbalance is enough to select a remote NUMA node even if the
      runnable load on both groups is 0 or close to 0. This ignores the cost of
      remote accesses entirely and is a problem when selecting the CPU for a
      newly forked task to run on. A forking server is then almost guaranteed
      to run on a remote node, incurring numerous remote accesses and
      potentially causing automatic NUMA balancing to try to migrate the task
      back or migrate the data to another node. Similar weirdness is
      observed if a basic shell command pipes output to another as each process
      in the pipeline is likely to start on different nodes and then get adjusted
      later by wake_affine().
      
      This patch adds imbalance to remote domains when considering whether to
      select CPUs from remote domains. If the local domain is selected, imbalance
      will still be used to try select a CPU from a lower scheduler domain's group
      instead of stacking tasks on the same CPU.
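
      Conceptually the selection gains a bias along these lines (a
      hypothetical sketch of find_idlest_group(), not the verbatim diff):

        /* Discount a remote group unless it beats local by 'imbalance' */
        if ((sd->flags & SD_NUMA) &&
            min_runnable_load + imbalance >= this_runnable_load)
                return NULL;	/* stick with the local group */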
      
      A variety of workloads and machines were tested and as expected, there is no
      difference on UMA. The difference on NUMA can be dramatic. This is a comparison
      of elapsed times running the git regression test suite. It's fork-intensive with
      short-lived processes:
      
                                        4.15.0                 4.15.0
                                  noexit-v1r23           sdnuma-v1r23
       Elapsed min          1706.06 (   0.00%)     1435.94 (  15.83%)
       Elapsed mean         1709.53 (   0.00%)     1436.98 (  15.94%)
       Elapsed stddev          2.16 (   0.00%)        1.01 (  53.38%)
       Elapsed coeffvar        0.13 (   0.00%)        0.07 (  44.54%)
       Elapsed max          1711.59 (   0.00%)     1438.01 (  15.98%)
      
                     4.15.0      4.15.0
               noexit-v1r23 sdnuma-v1r23
       User         5434.12     5188.41
       System       4878.77     3467.09
       Elapsed     10259.06     8624.21
      
      That shows a considerable reduction in elapsed times. It's important to
      note that automatic NUMA balancing does not affect this load as processes
      are too short-lived.
      
      There is also a noticeable impact on hackbench such as this example using
      processes and pipes:
      
       hackbench-process-pipes
                                     4.15.0                 4.15.0
                               noexit-v1r23           sdnuma-v1r23
       Amean     1        1.0973 (   0.00%)      0.9393 (  14.40%)
       Amean     4        1.3427 (   0.00%)      1.3730 (  -2.26%)
       Amean     7        1.4233 (   0.00%)      1.6670 ( -17.12%)
       Amean     12       3.0250 (   0.00%)      3.3013 (  -9.13%)
       Amean     21       9.0860 (   0.00%)      9.5343 (  -4.93%)
       Amean     30      14.6547 (   0.00%)     13.2433 (   9.63%)
       Amean     48      22.5447 (   0.00%)     20.4303 (   9.38%)
       Amean     79      29.2010 (   0.00%)     26.7853 (   8.27%)
       Amean     110     36.7443 (   0.00%)     35.8453 (   2.45%)
       Amean     141     45.8533 (   0.00%)     42.6223 (   7.05%)
       Amean     172     55.1317 (   0.00%)     50.6473 (   8.13%)
       Amean     203     64.4420 (   0.00%)     58.3957 (   9.38%)
       Amean     234     73.2293 (   0.00%)     67.1047 (   8.36%)
       Amean     265     80.5220 (   0.00%)     75.7330 (   5.95%)
       Amean     296     88.7567 (   0.00%)     82.1533 (   7.44%)
      
      It's not a universal win as there are occasions when spreading wide and
      quickly is a benefit but it's more of a win than it is a loss. For other
      workloads, there is little difference but netperf is interesting. Without
      the patch, the server and client start on different nodes but quickly get
      migrated due to wake_affine(). Hence, the difference in overall performance
      is marginal but detectable:
      
                                            4.15.0                 4.15.0
                                      noexit-v1r23           sdnuma-v1r23
       Hmean     send-64         349.09 (   0.00%)      354.67 (   1.60%)
       Hmean     send-128        699.16 (   0.00%)      702.91 (   0.54%)
       Hmean     send-256       1316.34 (   0.00%)     1350.07 (   2.56%)
       Hmean     send-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
       Hmean     send-2048      9705.19 (   0.00%)     9687.44 (  -0.18%)
       Hmean     send-3312     14359.48 (   0.00%)    14577.64 (   1.52%)
       Hmean     send-4096     16324.20 (   0.00%)    16393.62 (   0.43%)
       Hmean     send-8192     26112.61 (   0.00%)    26877.26 (   2.93%)
       Hmean     send-16384    37208.44 (   0.00%)    38683.43 (   3.96%)
       Hmean     recv-64         349.09 (   0.00%)      354.67 (   1.60%)
       Hmean     recv-128        699.16 (   0.00%)      702.91 (   0.54%)
       Hmean     recv-256       1316.34 (   0.00%)     1350.07 (   2.56%)
       Hmean     recv-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
       Hmean     recv-2048      9705.16 (   0.00%)     9687.43 (  -0.18%)
       Hmean     recv-3312     14359.42 (   0.00%)    14577.59 (   1.52%)
       Hmean     recv-4096     16323.98 (   0.00%)    16393.55 (   0.43%)
       Hmean     recv-8192     26111.85 (   0.00%)    26876.96 (   2.93%)
       Hmean     recv-16384    37206.99 (   0.00%)    38682.41 (   3.97%)
      
      However, what is very interesting is how automatic NUMA balancing behaves.
      Each netperf instance runs long enough for balancing to activate:
      
       NUMA base PTE updates             4620        1473
       NUMA huge PMD updates                0           0
       NUMA page range updates           4620        1473
       NUMA hint faults                  4301        1383
       NUMA hint local faults            1309         451
       NUMA hint local percent             30          32
       NUMA pages migrated               1335         491
       AutoNUMA cost                      21%          6%
      
      There is an unfortunate number of remote faults although tracing indicated
      that the vast majority are in shared libraries. However, the tendency to
      start tasks on the same node if there is capacity means that there were
      far fewer PTE updates and faults incurred overall.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-6-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Do not migrate due to a sync wakeup on exit · 24d0c1d6
      Peter Zijlstra authored
      When a task exits, it notifies the parent that it has exited. This is a
      sync wakeup and the exiting task may pull the parent towards the waker's
      CPU. For simple workloads like using a shell, it was observed that the
      shell is pulled across nodes by exiting processes. This is daft as the
      parent may be long-lived and properly placed. This patch special cases a
      sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range
      of workloads and machines showed very little differences in performance
      although there was a small 3% boost on some machines running a shellscript
      intensive workload (git regression test suite).
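
      The special casing is essentially one line in select_task_rq_fair()
      (a sketch matching the description; WF_SYNC and PF_EXITING are the
      relevant flags):

        /* Ignore the sync hint when the waker is on its way out */
        int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
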
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-5-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Do not migrate on wake_affine_weight() if weights are equal · 082f764a
      Mel Gorman authored
      wake_affine_weight() will consider migrating a task to, or near, the current
      CPU if there is a load imbalance. If the CPUs share LLC then either CPU
      is valid as a search-for-idle-sibling target and equally appropriate for
      stacking two tasks on one CPU if an idle sibling is unavailable. If they do
      not share cache then a cross-node migration potentially impacts locality
      so while they are equal from a CPU capacity point of view, they are not
      equal in terms of memory locality. In either case, it's more appropriate
      to migrate only if there is a difference in their effective load.
      
      This patch modifies wake_affine_weight() to only consider migrating a task
      if there is a load imbalance for normal wakeups but will allow potential
      stacking if the loads are equal and it's a sync wakeup.
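
      A hypothetical sketch of the new rule in wake_affine_weight() (names
      follow the changelog, not the verbatim diff):

        /*
         * Pull on equal effective load only for sync wakeups; plain
         * wakeups require a strict imbalance in this CPU's favour.
         */
        if (sync && this_eff_load <= prev_eff_load)
                return this_cpu;

        return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;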
      
      For the most part, the difference in performance is marginal. For example,
      on a 4-socket server running netperf UDP_STREAM on localhost the differences
      are as follows:
      
                                            4.15.0                 4.15.0
                                             16rc0          noequal-v1r23
       Hmean     send-64         355.47 (   0.00%)      349.50 (  -1.68%)
       Hmean     send-128        697.98 (   0.00%)      693.35 (  -0.66%)
       Hmean     send-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
       Hmean     send-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
       Hmean     send-2048      9637.02 (   0.00%)     9601.34 (  -0.37%)
       Hmean     send-3312     14355.37 (   0.00%)    14414.51 (   0.41%)
       Hmean     send-4096     16464.97 (   0.00%)    16301.37 (  -0.99%)
       Hmean     send-8192     26722.42 (   0.00%)    26428.95 (  -1.10%)
       Hmean     send-16384    38137.81 (   0.00%)    38046.11 (  -0.24%)
       Hmean     recv-64         355.47 (   0.00%)      349.50 (  -1.68%)
       Hmean     recv-128        697.98 (   0.00%)      693.35 (  -0.66%)
       Hmean     recv-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
       Hmean     recv-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
       Hmean     recv-2048      9636.95 (   0.00%)     9601.30 (  -0.37%)
       Hmean     recv-3312     14355.32 (   0.00%)    14414.48 (   0.41%)
       Hmean     recv-4096     16464.74 (   0.00%)    16301.16 (  -0.99%)
       Hmean     recv-8192     26721.63 (   0.00%)    26428.17 (  -1.10%)
       Hmean     recv-16384    38136.00 (   0.00%)    38044.88 (  -0.24%)
       Stddev    send-64           7.30 (   0.00%)        4.75 (  34.96%)
       Stddev    send-128         15.15 (   0.00%)       22.38 ( -47.66%)
       Stddev    send-256         13.99 (   0.00%)       19.14 ( -36.81%)
       Stddev    send-1024       105.73 (   0.00%)       67.38 (  36.27%)
       Stddev    send-2048       294.57 (   0.00%)      223.88 (  24.00%)
       Stddev    send-3312       302.28 (   0.00%)      271.74 (  10.10%)
       Stddev    send-4096       195.92 (   0.00%)      121.10 (  38.19%)
       Stddev    send-8192       399.71 (   0.00%)      563.77 ( -41.04%)
       Stddev    send-16384     1163.47 (   0.00%)     1103.68 (   5.14%)
       Stddev    recv-64           7.30 (   0.00%)        4.75 (  34.96%)
       Stddev    recv-128         15.15 (   0.00%)       22.38 ( -47.66%)
       Stddev    recv-256         13.99 (   0.00%)       19.14 ( -36.81%)
       Stddev    recv-1024       105.73 (   0.00%)       67.38 (  36.27%)
       Stddev    recv-2048       294.59 (   0.00%)      223.89 (  24.00%)
       Stddev    recv-3312       302.24 (   0.00%)      271.75 (  10.09%)
       Stddev    recv-4096       196.03 (   0.00%)      121.14 (  38.20%)
       Stddev    recv-8192       399.86 (   0.00%)      563.65 ( -40.96%)
       Stddev    recv-16384     1163.79 (   0.00%)     1103.86 (   5.15%)
      
      The difference in overall performance is marginal but note that most
      measurements are less variable. There were similar observations for other
      netperf comparisons. hackbench with sockets or pipes, and with processes
      or threads, showed minor differences with some reduction of migration.
      tbench showed only marginal differences that were within the noise. dbench,
      regardless of filesystem, showed minor differences, all of which are
      within the noise. Multiple machines, both UMA and NUMA, were tested without
      any regressions showing up.
      
      The biggest risk with a patch like this is affecting wakeup latencies.
      However, the schbench load from Facebook which is very sensitive to wakeup
      latency showed a mixed result with mostly improvements in wakeup latency:
      
                                            4.15.0                 4.15.0
                                             16rc0          noequal-v1r23
       Lat 50.00th-qrtle-1        38.00 (   0.00%)       38.00 (   0.00%)
       Lat 75.00th-qrtle-1        49.00 (   0.00%)       41.00 (  16.33%)
       Lat 90.00th-qrtle-1        52.00 (   0.00%)       50.00 (   3.85%)
       Lat 95.00th-qrtle-1        54.00 (   0.00%)       51.00 (   5.56%)
       Lat 99.00th-qrtle-1        63.00 (   0.00%)       60.00 (   4.76%)
       Lat 99.50th-qrtle-1        66.00 (   0.00%)       61.00 (   7.58%)
       Lat 99.90th-qrtle-1        78.00 (   0.00%)       65.00 (  16.67%)
       Lat 50.00th-qrtle-2        38.00 (   0.00%)       38.00 (   0.00%)
       Lat 75.00th-qrtle-2        42.00 (   0.00%)       43.00 (  -2.38%)
       Lat 90.00th-qrtle-2        46.00 (   0.00%)       48.00 (  -4.35%)
       Lat 95.00th-qrtle-2        49.00 (   0.00%)       50.00 (  -2.04%)
       Lat 99.00th-qrtle-2        55.00 (   0.00%)       57.00 (  -3.64%)
       Lat 99.50th-qrtle-2        58.00 (   0.00%)       60.00 (  -3.45%)
       Lat 99.90th-qrtle-2        65.00 (   0.00%)       68.00 (  -4.62%)
       Lat 50.00th-qrtle-4        41.00 (   0.00%)       41.00 (   0.00%)
       Lat 75.00th-qrtle-4        45.00 (   0.00%)       46.00 (  -2.22%)
       Lat 90.00th-qrtle-4        50.00 (   0.00%)       50.00 (   0.00%)
       Lat 95.00th-qrtle-4        54.00 (   0.00%)       53.00 (   1.85%)
       Lat 99.00th-qrtle-4        61.00 (   0.00%)       61.00 (   0.00%)
       Lat 99.50th-qrtle-4        65.00 (   0.00%)       64.00 (   1.54%)
       Lat 99.90th-qrtle-4        76.00 (   0.00%)       82.00 (  -7.89%)
       Lat 50.00th-qrtle-8        48.00 (   0.00%)       46.00 (   4.17%)
       Lat 75.00th-qrtle-8        55.00 (   0.00%)       54.00 (   1.82%)
       Lat 90.00th-qrtle-8        60.00 (   0.00%)       59.00 (   1.67%)
       Lat 95.00th-qrtle-8        63.00 (   0.00%)       63.00 (   0.00%)
       Lat 99.00th-qrtle-8        71.00 (   0.00%)       69.00 (   2.82%)
       Lat 99.50th-qrtle-8        74.00 (   0.00%)       73.00 (   1.35%)
       Lat 99.90th-qrtle-8        98.00 (   0.00%)       90.00 (   8.16%)
       Lat 50.00th-qrtle-16       56.00 (   0.00%)       55.00 (   1.79%)
       Lat 75.00th-qrtle-16       68.00 (   0.00%)       67.00 (   1.47%)
       Lat 90.00th-qrtle-16       77.00 (   0.00%)       78.00 (  -1.30%)
       Lat 95.00th-qrtle-16       82.00 (   0.00%)       84.00 (  -2.44%)
       Lat 99.00th-qrtle-16       90.00 (   0.00%)       93.00 (  -3.33%)
       Lat 99.50th-qrtle-16       93.00 (   0.00%)       97.00 (  -4.30%)
       Lat 99.90th-qrtle-16      110.00 (   0.00%)      110.00 (   0.00%)
       Lat 50.00th-qrtle-32       68.00 (   0.00%)       62.00 (   8.82%)
       Lat 75.00th-qrtle-32       90.00 (   0.00%)       83.00 (   7.78%)
       Lat 90.00th-qrtle-32      110.00 (   0.00%)      100.00 (   9.09%)
       Lat 95.00th-qrtle-32      122.00 (   0.00%)      111.00 (   9.02%)
       Lat 99.00th-qrtle-32      145.00 (   0.00%)      133.00 (   8.28%)
       Lat 99.50th-qrtle-32      154.00 (   0.00%)      143.00 (   7.14%)
       Lat 99.90th-qrtle-32     2316.00 (   0.00%)      515.00 (  77.76%)
       Lat 50.00th-qrtle-35       69.00 (   0.00%)       72.00 (  -4.35%)
       Lat 75.00th-qrtle-35       92.00 (   0.00%)       95.00 (  -3.26%)
       Lat 90.00th-qrtle-35      111.00 (   0.00%)      114.00 (  -2.70%)
       Lat 95.00th-qrtle-35      122.00 (   0.00%)      124.00 (  -1.64%)
       Lat 99.00th-qrtle-35      142.00 (   0.00%)      144.00 (  -1.41%)
       Lat 99.50th-qrtle-35      150.00 (   0.00%)      154.00 (  -2.67%)
       Lat 99.90th-qrtle-35     6104.00 (   0.00%)     5640.00 (   7.60%)
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-4-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() until needed · eeb60398
      Mel Gorman authored
      On sync wakeups, the previous CPU effective load may not be used so delay
      the calculation until it's needed.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180213133730.24064-3-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>