1. 10 Oct 2017, 4 commits
    • sched/idle: Move quiet_vmstat() into the NOHZ code · 62cb1188
      Peter Zijlstra committed
      quiet_vmstat() is an expensive function that only makes sense when we
      go into NOHZ.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: aubrey.li@linux.intel.com
      Cc: cl@linux.com
      Cc: fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      62cb1188
    • sched/core: Ensure load_balance() respects the active_mask · 024c9d2f
      Peter Zijlstra committed
      While load_balance() masks the source CPUs against active_mask, it had
      a hole against the destination CPU. Ensure the destination CPU is also
      part of the 'domain-mask & active-mask' set.
      Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 77d1dfda ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      024c9d2f
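      To make the destination-mask check in the entry above concrete, here is a minimal
      user-space sketch (plain C with 64-bit masks instead of struct cpumask); it only
      illustrates the idea and is not the kernel's load_balance() code:

        #include <stdint.h>
        #include <stdio.h>

        /* The fix: the destination CPU must sit in 'domain-mask & active-mask',
         * just as the source CPUs already had to. */
        static int dst_cpu_allowed(int dst_cpu, uint64_t domain_mask, uint64_t active_mask)
        {
            return ((domain_mask & active_mask) >> dst_cpu) & 1;
        }

        int main(void)
        {
            uint64_t domain = 0x0f;   /* CPUs 0-3 form this sched domain */
            uint64_t active = 0x07;   /* CPU 3 is going offline          */

            printf("balance to CPU 3? %d\n", dst_cpu_allowed(3, domain, active)); /* 0 */
            printf("balance to CPU 1? %d\n", dst_cpu_allowed(1, domain, active)); /* 1 */
            return 0;
        }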
    • sched/core: Address more wake_affine() regressions · f2cdd9cc
      Peter Zijlstra committed
      The trivial wake_affine_idle() implementation is very good for a
      number of workloads, but it comes apart the moment there are no
      idle CPUs left, IOW, the overloaded case.
      
      hackbench:
      
      		NO_WA_WEIGHT		WA_WEIGHT
      
      hackbench-20  : 7.362717561 seconds	6.450509391 seconds
      
      (win)
      
      netperf:
      
      		  NO_WA_WEIGHT		WA_WEIGHT
      
      TCP_SENDFILE-1	: Avg: 54524.6		Avg: 52224.3
      TCP_SENDFILE-10	: Avg: 48185.2          Avg: 46504.3
      TCP_SENDFILE-20	: Avg: 29031.2          Avg: 28610.3
      TCP_SENDFILE-40	: Avg: 9819.72          Avg: 9253.12
      TCP_SENDFILE-80	: Avg: 5355.3           Avg: 4687.4
      
      TCP_STREAM-1	: Avg: 41448.3          Avg: 42254
      TCP_STREAM-10	: Avg: 24123.2          Avg: 25847.9
      TCP_STREAM-20	: Avg: 15834.5          Avg: 18374.4
      TCP_STREAM-40	: Avg: 5583.91          Avg: 5599.57
      TCP_STREAM-80	: Avg: 2329.66          Avg: 2726.41
      
      TCP_RR-1	: Avg: 80473.5          Avg: 82638.8
      TCP_RR-10	: Avg: 72660.5          Avg: 73265.1
      TCP_RR-20	: Avg: 52607.1          Avg: 52634.5
      TCP_RR-40	: Avg: 57199.2          Avg: 56302.3
      TCP_RR-80	: Avg: 25330.3          Avg: 26867.9
      
      UDP_RR-1	: Avg: 108266           Avg: 107844
      UDP_RR-10	: Avg: 95480            Avg: 95245.2
      UDP_RR-20	: Avg: 68770.8          Avg: 68673.7
      UDP_RR-40	: Avg: 76231            Avg: 75419.1
      UDP_RR-80	: Avg: 34578.3          Avg: 35639.1
      
      UDP_STREAM-1	: Avg: 64684.3          Avg: 66606
      UDP_STREAM-10	: Avg: 52701.2          Avg: 52959.5
      UDP_STREAM-20	: Avg: 30376.4          Avg: 29704
      UDP_STREAM-40	: Avg: 15685.8          Avg: 15266.5
      UDP_STREAM-80	: Avg: 8415.13          Avg: 7388.97
      
      (wins and losses)
      
      sysbench:
      
      		    NO_WA_WEIGHT		WA_WEIGHT
      
      sysbench-mysql-2  :  2135.17 per sec.		 2142.51 per sec.
      sysbench-mysql-5  :  4809.68 per sec.            4800.19 per sec.
      sysbench-mysql-10 :  9158.59 per sec.            9157.05 per sec.
      sysbench-mysql-20 : 14570.70 per sec.           14543.55 per sec.
      sysbench-mysql-40 : 22130.56 per sec.           22184.82 per sec.
      sysbench-mysql-80 : 20995.56 per sec.           21904.18 per sec.
      
      sysbench-psql-2   :  1679.58 per sec.            1705.06 per sec.
      sysbench-psql-5   :  3797.69 per sec.            3879.93 per sec.
      sysbench-psql-10  :  7253.22 per sec.            7258.06 per sec.
      sysbench-psql-20  : 11166.75 per sec.           11220.00 per sec.
      sysbench-psql-40  : 17277.28 per sec.           17359.78 per sec.
      sysbench-psql-80  : 17112.44 per sec.           17221.16 per sec.
      
      (increase on the top end)
      
      tbench:
      
      NO_WA_WEIGHT
      
      Throughput 685.211 MB/sec   2 clients   2 procs  max_latency=0.123 ms
      Throughput 1596.64 MB/sec   5 clients   5 procs  max_latency=0.119 ms
      Throughput 2985.47 MB/sec  10 clients  10 procs  max_latency=0.262 ms
      Throughput 4521.15 MB/sec  20 clients  20 procs  max_latency=0.506 ms
      Throughput 9438.1  MB/sec  40 clients  40 procs  max_latency=2.052 ms
      Throughput 8210.5  MB/sec  80 clients  80 procs  max_latency=8.310 ms
      
      WA_WEIGHT
      
      Throughput 697.292 MB/sec   2 clients   2 procs  max_latency=0.127 ms
      Throughput 1596.48 MB/sec   5 clients   5 procs  max_latency=0.080 ms
      Throughput 2975.22 MB/sec  10 clients  10 procs  max_latency=0.254 ms
      Throughput 4575.14 MB/sec  20 clients  20 procs  max_latency=0.502 ms
      Throughput 9468.65 MB/sec  40 clients  40 procs  max_latency=2.069 ms
      Throughput 8631.73 MB/sec  80 clients  80 procs  max_latency=8.605 ms
      
      (increase on the top end)
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f2cdd9cc
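      A rough user-space sketch of the WA_WEIGHT idea in the entry above: with no idle
      CPU left, compare the load the waker's CPU would end up with against the previous
      CPU's load. The real wake_affine_weight() additionally accounts for CPU capacity
      and an imbalance threshold, which this toy deliberately ignores:

        #include <stdio.h>

        /* Pick whichever of the two candidate CPUs ends up less loaded once the
         * waking task's load is charged to the waker's CPU. */
        static int pick_cpu(unsigned long this_load, unsigned long prev_load,
                            unsigned long task_load, int this_cpu, int prev_cpu)
        {
            unsigned long this_eff = this_load + task_load;  /* task would run here */
            unsigned long prev_eff = prev_load;              /* task stays put      */

            return this_eff <= prev_eff ? this_cpu : prev_cpu;
        }

        int main(void)
        {
            /* Overloaded box: no CPU is idle, the waker's CPU is merely less busy. */
            printf("picked CPU %d\n", pick_cpu(900, 1200, 250, 0, 4));   /* -> 0 */
            printf("picked CPU %d\n", pick_cpu(1100, 1200, 250, 0, 4));  /* -> 4 */
            return 0;
        }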
    • sched/core: Fix wake_affine() performance regression · d153b153
      Peter Zijlstra committed
      Eric reported a sysbench regression against commit:
      
        3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      
      Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
      against his v3.10 enterprise kernel.
      
      PRE (current tip/master):
      
       ivb-ep sysbench:
      
         2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
         5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
        10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
        20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
        40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
        80: [30 secs]     transactions:                        355096 (11834.28 per sec.)
      
       hsw-ex NAS:
      
       OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
       OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
       OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
       lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
       lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
       lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83
      
      POST (+patch):
      
       ivb-ep sysbench:
      
         2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
         5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
        10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
        20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
        40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
        80: [30 secs]     transactions:                        631739 (21055.88 per sec.)
      
       hsw-ex NAS:
      
       lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
       lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
       lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52
      
      This patch takes out all the shiny wake_affine() stuff and goes back to
      utter basics. Between the two CPUs involved with the wakeup (the CPU
      doing the wakeup and the CPU we ran on previously) pick the CPU we can
      run on _now_.
      
      This recovers much of the performance lost relative to the older kernels,
      but leaves some ground in the overloaded case. The default-enabled
      WA_WEIGHT (which will be introduced in the next patch) is an attempt
      to address the overloaded situation.
      Reported-by: Eric Farman <farman@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jinpuwang@gmail.com
      Cc: vcaputo@pengaru.com
      Fixes: 3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d153b153
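      A minimal sketch of the "utter basics" choice described in the entry above,
      assuming the decision reduces to an idle check between the waking CPU and the
      task's previous CPU (the real wake_affine_idle() also considers cache affinity
      and sync wakeups, which this toy leaves out):

        #include <stdio.h>
        #include <stdbool.h>

        static int wake_affine_basic(bool this_idle, bool prev_idle,
                                     int this_cpu, int prev_cpu)
        {
            if (this_idle)
                return this_cpu;   /* waker's CPU is free: run here right now    */
            if (prev_idle)
                return prev_cpu;   /* otherwise fall back to where we last ran   */
            return prev_cpu;       /* both busy: keep the old placement          */
        }

        int main(void)
        {
            printf("CPU %d\n", wake_affine_basic(true,  false, 2, 7));  /* 2 */
            printf("CPU %d\n", wake_affine_basic(false, true,  2, 7));  /* 7 */
            return 0;
        }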
  2. 09 Oct 2017, 1 commit
    • netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1' · 98589a09
      Shmulik Ladkani committed
      Commit 2c16d603 ("netfilter: xt_bpf: support ebpf") introduced
      support for attaching an eBPF object by an fd, with the
      'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
      IPT_SO_SET_REPLACE call.
      
      However, this breaks subsequent iptables calls:
      
       # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
       # iptables -A INPUT -s 5.6.7.8 -j ACCEPT
       iptables: Invalid argument. Run `dmesg' for more information.
      
      That's because iptables works by loading existing rules using
      IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
      the replacement set.
      
      However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
      (from the initial "iptables -m bpf" invocation) - so when the 2nd
      invocation occurs, userspace passes a bogus fd number, which causes
      'bpf_mt_check_v1' to fail.
      
      One suggested solution [1] was to hack iptables userspace, to perform an
      "entries fixup" immediately after IPT_SO_GET_ENTRIES, by opening a new,
      process-local fd for every 'xt_bpf_info_v1' entry seen.
      
      However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested
      deprecating the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.
      
      This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
      '.fd' and instead perform an in-kernel lookup for the bpf object given
      the provided '.path'.
      
      It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
      XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
      expected to provide the path of the pinned object.
      
      Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
      
      References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
                  [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2
      Reported-by: Rafael Buchbinder <rafi@rbk.ms>
      Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      98589a09
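      A user-space sketch of the behaviour described in the entry above, with a toy
      table standing in for the kernel's lookup of a pinned BPF object by path; the
      names and structures here are illustrative only, not the xt_bpf code:

        #include <stdio.h>
        #include <string.h>

        enum { MODE_FD_ELF, MODE_PATH_PINNED };   /* PATH_PINNED aliases FD_PINNED */

        struct toy_bpf_info {
            int         mode;
            int         fd;        /* ignored in PATH_PINNED mode          */
            const char *path;      /* user-supplied pinned-object path     */
            const void *filter;    /* resolved program                     */
        };

        /* Stand-in for the in-kernel lookup of a pinned BPF object by path. */
        static const void *toy_lookup_pinned(const char *path)
        {
            static const int prog = 42;
            return strcmp(path, "/sys/fs/bpf/xxx") == 0 ? &prog : NULL;
        }

        static int toy_mt_check(struct toy_bpf_info *info)
        {
            if (info->mode == MODE_PATH_PINNED) {
                /* The fix: resolve by path, ignore the stale '.fd' that came
                 * back from IPT_SO_GET_ENTRIES. */
                info->filter = toy_lookup_pinned(info->path);
            } else {
                info->filter = NULL;   /* FD_ELF handling elided in this toy */
            }
            return info->filter ? 0 : -1;
        }

        int main(void)
        {
            struct toy_bpf_info info = { MODE_PATH_PINNED, -1, "/sys/fs/bpf/xxx", NULL };
            printf("check: %d\n", toy_mt_check(&info));   /* 0: lookup by path works */
            return 0;
        }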
  3. 08 Oct 2017, 1 commit
    • bpf: fix liveness marking · 8fe2d6cc
      Alexei Starovoitov committed
      While processing an Rx = Ry instruction the verifier does
      regs[insn->dst_reg] = regs[insn->src_reg]
      which often clears the write mark (when Ry doesn't have it)
      that was just set by check_reg_arg(Rx) prior to the assignment.
      That causes mark_reg_read() to keep marking Rx in this block as
      REG_LIVE_READ (since the logic incorrectly misses that it's
      screened by the write) and in many of its parents (until a lucky
      write into the same Rx or the beginning of the program).
      That causes the is_state_visited() logic to miss many pruning opportunities.
      
      Furthermore, the mark_reg_read() logic propagates the read mark
      for BPF_REG_FP as well (though it's read-only), which causes
      harmless but unnecessary work during is_state_visited().
      Note that do_propagate_liveness() skips FP correctly,
      so do the same in mark_reg_read() as well.
      It saves 0.2 seconds for the test below
      
      program               before  after
      bpf_lb-DLB_L3.o       2604    2304
      bpf_lb-DLB_L4.o       11159   3723
      bpf_lb-DUNKNOWN.o     1116    1110
      bpf_lxc-DDROP_ALL.o   34566   28004
      bpf_lxc-DUNKNOWN.o    53267   39026
      bpf_netdev.o          17843   16943
      bpf_overlay.o         8672    7929
      time                  ~11 sec  ~4 sec
      
      Fixes: dc503a8a ("bpf/verifier: track liveness for pruning")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8fe2d6cc
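      A toy model of the two rules in the entry above, not the verifier itself: a read
      only propagates to parent states until it hits a state that wrote the register,
      and reads of the (read-only) frame pointer are never propagated:

        #include <stdio.h>
        #include <stdbool.h>

        #define NREGS   11
        #define REG_FP  10

        struct state {
            bool written[NREGS];      /* write mark: screens the parents  */
            bool read[NREGS];         /* read mark: parents must supply   */
            struct state *parent;
        };

        static void mark_reg_read(struct state *s, int reg)
        {
            if (reg == REG_FP)
                return;                       /* FP is read-only: never propagate */
            for (; s; s = s->parent) {
                if (s->written[reg])
                    break;                    /* a write here screens the parents */
                s->read[reg] = true;          /* parent chain must supply reg     */
            }
        }

        int main(void)
        {
            struct state parent = {0}, child = {0};

            child.parent = &parent;
            child.written[1] = true;          /* e.g. R1 = R2 gave R1 a write mark */

            mark_reg_read(&child, 1);         /* later read of R1: screened here   */
            mark_reg_read(&child, 2);         /* later read of R2: never written   */
            mark_reg_read(&child, REG_FP);    /* frame-pointer reads are ignored   */

            printf("parent must supply R1? %d\n", parent.read[1]);   /* 0 */
            printf("parent must supply R2? %d\n", parent.read[2]);   /* 1 */
            return 0;
        }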
  4. 04 Oct 2017, 14 commits
  5. 03 Oct 2017, 2 commits
    • rcu: Remove extraneous READ_ONCE()s from rcu_irq_{enter,exit}() · f39b536c
      Paul E. McKenney committed
      The read of ->dynticks_nmi_nesting in rcu_irq_enter() and rcu_irq_exit()
      is currently protected with READ_ONCE().  However, this protection is
      unnecessary because (1) ->dynticks_nmi_nesting is updated only by the
      current CPU, (2) Although NMI handlers can update this field, they reset
      it back to its old value before return, and (3) Interrupts are disabled,
      so nothing else can modify it.  The value of ->dynticks_nmi_nesting is
      thus effectively constant, and so no protection is required.
      
      This commit therefore removes the READ_ONCE() protection from these
      two accesses.
      
      Link: http://lkml.kernel.org/r/20170926031902.GA2074@linux.vnet.ibm.com
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      f39b536c
    • ftrace: Fix kmemleak in unregister_ftrace_graph · 2b0b8499
      Shu Wang committed
      The trampoline allocated by the function tracer was overwritten by the
      function_graph tracer, which caused a memory leak. save_global_trampoline
      should have saved the previous trampoline in register_ftrace_graph() and
      restored it in unregister_ftrace_graph(). But as implemented,
      save_global_trampoline was only used in unregister_ftrace_graph() with its
      default value of 0, so it overwrote the previous trampoline's value and the
      previously allocated trampoline was lost.
      
      kmemleak backtrace:
          kmemleak_vmalloc+0x77/0xc0
          __vmalloc_node_range+0x1b5/0x2c0
          module_alloc+0x7c/0xd0
          arch_ftrace_update_trampoline+0xb5/0x290
          ftrace_startup+0x78/0x210
          register_ftrace_function+0x8b/0xd0
          function_trace_init+0x4f/0x80
          tracing_set_tracer+0xe6/0x170
          tracing_set_trace_write+0x90/0xd0
          __vfs_write+0x37/0x170
          vfs_write+0xb2/0x1b0
          SyS_write+0x55/0xc0
          do_syscall_64+0x67/0x180
          return_from_SYSCALL_64+0x0/0x6a
      
      [
        Looking further into this, I found that this was left over from when the
        function and function graph tracers shared the same ftrace_ops. But in
        commit 5f151b24 ("ftrace: Fix function_profiler and function tracer
        together"), the two were separated, and the save_global_trampoline no
        longer was necessary (and it may have been broken back then too).
        -- Steven Rostedt
      ]
      
      Link: http://lkml.kernel.org/r/20170912021454.5976-1-shuwang@redhat.com
      
      Cc: stable@vger.kernel.org
      Fixes: 5f151b24 ("ftrace: Fix function_profiler and function tracer together")
      Signed-off-by: Shu Wang <shuwang@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      2b0b8499
  6. 30 Sep 2017, 18 commits
    • fix infoleak in waitid(2) · 6c85501f
      Al Viro committed
      kernel_waitid() can return a PID, an error or 0.  rusage is filled in the first
      case and waitid(2) rusage should've been copied out exactly in that case, *not*
      whenever kernel_waitid() has not returned an error.  Compat variant shares that
      braino; none of kernel_wait4() callers do, so the below ought to fix it.
      Reported-and-tested-by: Alexander Potapenko <glider@google.com>
      Fixes: ce72a16f ("wait4(2)/waitid(2): separate copying rusage to userland")
      Cc: stable@vger.kernel.org # v4.13
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6c85501f
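      An illustrative-only sketch of the condition being fixed in the entry above
      (names simplified, not the actual sys_waitid() code): rusage is copied to
      userland only when a PID was actually returned, so an untouched on-stack
      struct is never exposed:

        #include <stdio.h>
        #include <string.h>

        struct toy_rusage { long utime, stime; };

        /* 0 = nothing to reap, >0 = PID reaped (rusage filled), <0 = error. */
        static long toy_kernel_waitid(struct toy_rusage *ru)
        {
            (void)ru;
            return 0;            /* pretend there was no child to reap */
        }

        static long toy_sys_waitid(struct toy_rusage *uru)
        {
            struct toy_rusage ru;                /* uninitialized when ret == 0 */
            long ret = toy_kernel_waitid(&ru);

            if (ret > 0 && uru)                  /* the fix: '>', not "no error" */
                memcpy(uru, &ru, sizeof(ru));    /* only copy what was filled    */
            return ret;
        }

        int main(void)
        {
            struct toy_rusage out = { -1, -1 };
            printf("ret=%ld utime=%ld\n", toy_sys_waitid(&out), out.utime);
            return 0;   /* out stays untouched: no stack garbage leaks out */
        }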
    • sched/fair: Update calc_group_*() comments · 17de4ee0
      Peter Zijlstra committed
      I had a wee bit of trouble recalling how the calc_group_runnable()
      stuff worked.. add hopefully better comments.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      17de4ee0
    • sched/fair: Calculate runnable_weight slightly differently · 2c8e4dce
      Josef Bacik committed
      Our runnable_weight currently looks like this
      
      runnable_weight = shares * runnable_load_avg / load_avg
      
      The goal is to scale the runnable weight for the group based on its runnable to
      load_avg ratio.  The problem with this is that it biases us towards tasks that never
      go to sleep.  Tasks that go to sleep are going to have their runnable_load_avg
      decayed pretty hard, which will drastically reduce the runnable weight of groups
      with interactive tasks.  To solve this imbalance we tweak this slightly, so in
      the ideal case it is still the above, but in the interactive case it is
      
      runnable_weight = shares * runnable_weight / load_weight
      
      which will make the weight distribution fairer between interactive and
      non-interactive groups.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team@fb.com
      Cc: linux-kernel@vger.kernel.org
      Cc: riel@redhat.com
      Cc: tj@kernel.org
      Link: http://lkml.kernel.org/r/1501773219-18774-2-git-send-email-jbacik@fb.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2c8e4dce
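      To see the bias the changelog above describes, here are made-up toy numbers
      plugged into both expressions: a group whose task sleeps a lot has a heavily
      decayed runnable_load_avg, so the average-based formula starves it while the
      instantaneous-weight ratio does not:

        #include <stdio.h>

        int main(void)
        {
            double shares          = 1024.0;
            double runnable_load   = 100.0;   /* decayed hard: task keeps sleeping */
            double load_avg        = 900.0;   /* blocked load still remembered     */
            double runnable_weight = 1024.0;  /* instantaneous: task is runnable   */
            double load_weight     = 1024.0;

            printf("avg-based   : %.0f\n", shares * runnable_load / load_avg);        /* ~114 */
            printf("weight-based: %.0f\n", shares * runnable_weight / load_weight);   /* 1024 */
            return 0;
        }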
    • sched/fair: Implement more accurate async detach · 9a2dd585
      Peter Zijlstra committed
      The problem with the overestimate is that it will subtract too big a
      value from the load_sum, thereby pushing it down further than it ought
      to go. Since runnable_load_avg is not subject to a similar 'force',
      this results in the occasional 'runnable_load > load' situation.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9a2dd585
    • sched/fair: Align PELT windows between cfs_rq and its se · f207934f
      Peter Zijlstra committed
      The PELT _sum values are a saw-tooth function, dropping on the decay
      edge and then growing back up again during the window.
      
      When these window-edges are not aligned between cfs_rq and se, we can
      have the situation where, for example, on dequeue, the se decays
      first.
      
      Its _sum values will be small(er), while the cfs_rq _sum values will
      still be on their way up. Because of this, the subtraction:
      cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This
      will then, once the cfs_rq reaches an edge, translate into its _avg
      value jumping up.
      
      This is especially visible with the runnable_load bits, since they get
      added/subtracted a lot.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f207934f
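      Toy numbers for the misalignment described above (the decay factor is roughly
      PELT's per-millisecond y, the other values are made up): the se decays at its
      own window edge first, so subtracting it at dequeue leaves a residue in the
      still-undecayed cfs_rq sum:

        #include <stdio.h>

        int main(void)
        {
            double decay  = 0.978;           /* roughly PELT's per-ms decay factor */
            double se_sum = 1000.0;
            double cfs_sum = 1000.0;         /* same contribution, same history    */

            se_sum  *= decay;                /* se hits its window edge first      */
            cfs_sum -= se_sum;               /* dequeue: subtract se from cfs_rq   */

            /* cfs_rq has not decayed yet, so about 22 is left behind even though
             * the only entity it contained is now gone; once cfs_rq reaches its
             * own edge this residue shows up as a jump in its _avg. */
            printf("residue left in cfs_rq sum: %.1f\n", cfs_sum);
            return 0;
        }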
    • sched/fair: Implement synchronous PELT detach on load-balance migrate · 144d8487
      Peter Zijlstra committed
      Vincent wondered why his self-migrating task had a roughly 50% dip in
      load_avg when landing on the new CPU. This is because we unconditionally
      take the asynchronous detach_entity route, which can lead to the
      attach on the new CPU still seeing the old CPU's contribution to
      tg->load_avg, effectively halving the new CPU's shares.
      
      While in general this is something we have to live with, there is the
      special case of runnable migration where we can do better.
      Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      144d8487
    • sched/fair: Propagate an effective runnable_load_avg · 1ea6c46a
      Peter Zijlstra committed
      The load balancer uses runnable_load_avg as load indicator. For
      !cgroup this is:
      
        runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq
      
      That is, a direct sum of all runnable tasks on that runqueue. As
      opposed to load_avg, which is a sum of all tasks on the runqueue,
      which includes a blocked component.
      
      However, in the cgroup case, this comes apart since the group entities
      are always runnable, even if most of their constituent entities are
      blocked.
      
      Therefore introduce a runnable_weight which for task entities is the
      same as the regular weight, but for group entities is a fraction of
      the entity weight and represents the runnable part of the group
      runqueue.
      
      Then propagate this load through the PELT hierarchy to arrive at an
      effective runnable load average -- which we should not confuse with
      the canonical runnable load average.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1ea6c46a
    • sched/fair: Rewrite PELT migration propagation · 0e2d2aaa
      Peter Zijlstra committed
      When an entity migrates in (or out) of a runqueue, we need to add (or
      remove) its contribution from the entire PELT hierarchy, because even
      non-runnable entities are included in the load average sums.
      
      In order to do this we have some propagation logic that updates the
      PELT tree, however the way it 'propagates' the runnable (or load)
      change is (more or less):
      
                           tg->weight * grq->avg.load_avg
        ge->avg.load_avg = ------------------------------
                                     tg->load_avg
      
      But that is the expression for ge->weight, and per the definition of
      load_avg:
      
        ge->avg.load_avg := ge->weight * ge->avg.runnable_avg
      
      That destroys the runnable_avg we wanted to propagate (by effectively
      setting it to 1).
      
      Instead directly propagate runnable_sum.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0e2d2aaa
    • sched/fair: Rewrite cfs_rq->removed_*avg · 2a2f5d4e
      Peter Zijlstra committed
      Since on wakeup migration we don't hold the rq->lock for the old CPU,
      we cannot update its state. Instead we add the removed 'load' to an
      atomic variable and have the next update on that CPU collect and
      process it.
      
      Currently we have two atomic variables, which already have the issue
      that they can be read out of sync. Also, two atomic ops on a single
      cacheline are already more expensive than an uncontended lock.
      
      Since we want to add more, convert the thing over to an explicit
      cacheline with a lock in it.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2a2f5d4e
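      A user-space sketch of the data-structure direction described above: fold the
      separate "removed" accumulators into one small lock-protected structure so they
      are always read and drained consistently. Field names are illustrative, and a
      pthread mutex stands in for the kernel's spinlock:

        #include <pthread.h>
        #include <stdio.h>

        struct removed_toy {
            pthread_mutex_t lock;       /* one lock for the whole cacheline    */
            unsigned long   load_avg;   /* load removed by departing entities  */
            unsigned long   util_avg;   /* utilization removed                 */
            unsigned long   spare;      /* room to add more fields cheaply     */
        };

        static void removed_add(struct removed_toy *r, unsigned long load, unsigned long util)
        {
            pthread_mutex_lock(&r->lock);
            r->load_avg += load;        /* remote CPU records what it removed  */
            r->util_avg += util;
            pthread_mutex_unlock(&r->lock);
        }

        static void removed_collect(struct removed_toy *r, unsigned long *load, unsigned long *util)
        {
            pthread_mutex_lock(&r->lock);
            *load = r->load_avg;        /* owning CPU drains both consistently */
            *util = r->util_avg;
            r->load_avg = r->util_avg = 0;
            pthread_mutex_unlock(&r->lock);
        }

        int main(void)
        {
            struct removed_toy r = { PTHREAD_MUTEX_INITIALIZER, 0, 0, 0 };
            unsigned long load, util;

            removed_add(&r, 300, 120);
            removed_collect(&r, &load, &util);
            printf("collected load=%lu util=%lu\n", load, util);
            return 0;
        }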
    • sched/fair: Use reweight_entity() for set_user_nice() · 9059393e
      Vincent Guittot committed
      Now that we directly change load_avg and propagate that change into
      the sums, sys_nice() and co should do the same, otherwise it's possible
      to confuse load accounting when we migrate near the weight change.
      Fixes-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      [ Added changelog, fixed the call condition. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170517095045.GA8420@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9059393e
    • sched/fair: More accurate reweight_entity() · 840c5abc
      Peter Zijlstra committed
      When a (group) entity changes its weight we should instantly change
      its load_avg and propagate that change into the sums it is part of,
      because we use these values to predict future behaviour and are not
      interested in their historical value.
      
      Without this change, the change in load would need to propagate
      through the average, by which time it could again have changed, etc.,
      always chasing itself.
      
      With this change, the cfs_rq load_avg sum will more accurately reflect
      the current runnable and expected return of blocked load.
      Reported-by: Paul Turner <pjt@google.com>
      [josef: compile fix !SMP || !FAIR_GROUP]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      840c5abc
    • sched/fair: Introduce {en,de}queue_load_avg() · 8d5b9025
      Peter Zijlstra committed
      Analogous to the existing {en,de}queue_runnable_load_avg() add helpers
      for {en,de}queue_load_avg(). More users will follow.
      
      Includes some code movement to avoid fwd declarations.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8d5b9025
    • sched/fair: Rename {en,de}queue_entity_load_avg() · b5b3e35f
      Peter Zijlstra committed
      Since they're now purely about runnable_load, rename them.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b5b3e35f
    • sched/fair: Move enqueue migrate handling · b382a531
      Peter Zijlstra committed
      Move the entity migrate handling from enqueue_entity_load_avg() to
      update_load_avg(). This has two benefits:
      
       - {en,de}queue_entity_load_avg() will become purely about managing
         runnable_load
      
       - we can avoid a double update_tg_load_avg() and reduce pressure on
         the global tg->shares cacheline
      
      The reason we do this is so that we can change update_cfs_shares() to
      change both weight and (future) runnable_weight. For this to work we
      need to have the cfs_rq averages up-to-date (which means having done
      the attach), but we need the cfs_rq->avg.runnable_avg to not yet
      include the se's contribution (since se->on_rq == 0).
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b382a531
    • sched/fair: Change update_load_avg() arguments · 88c0616e
      Peter Zijlstra committed
      Most call sites of update_load_avg() already have cfs_rq_of(se)
      available, pass it down instead of recomputing it.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      88c0616e
    • sched/fair: Remove se->load.weight from se->avg.load_sum · c7b50216
      Peter Zijlstra committed
      Remove the load from the load_sum for sched_entities, basically
      turning load_sum into runnable_sum.  This prepares for better
      reweighting of group entities.
      
      Since we now have different rules for computing load_avg, split
      ___update_load_avg() into two parts, ___update_load_sum() and
      ___update_load_avg().
      
      So for se:
      
        ___update_load_sum(.weight = 1)
        ___update_load_avg(.weight = se->load.weight)
      
      and for cfs_rq:
      
        ___update_load_sum(.weight = cfs_rq->load.weight)
        ___update_load_avg(.weight = 1)
      
      Since the primary consumable is load_avg, most things will not be
      affected. Only those few sites that initialize/modify load_sum need
      attention.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c7b50216
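      The shape of the split, in toy form with made-up numbers: one helper only
      accumulates and decays the weight-free sum, the other scales a sum into an
      average with whatever weight applies to the caller, mirroring the se vs. cfs_rq
      usage sketched above. The constants are only approximately PELT's, and this is
      not the real ___update_load_sum()/___update_load_avg() code:

        #include <stdio.h>

        #define LOAD_AVG_MAX 47742.0   /* roughly PELT's maximum attainable sum */

        static double update_load_sum(double sum, double decay, double contrib, double weight)
        {
            return sum * decay + weight * contrib;   /* an se passes weight = 1 here */
        }

        static double update_load_avg(double sum, double weight)
        {
            return weight * sum / LOAD_AVG_MAX;      /* a cfs_rq passes weight = 1 here */
        }

        int main(void)
        {
            double se_sum = 0.0;

            /* A task entity: accumulate a weight-free running sum ... */
            for (int i = 0; i < 100; i++)
                se_sum = update_load_sum(se_sum, 0.9785, 1024.0, 1.0);

            /* ... and only scale by the load weight when forming the average;
             * the result approaches the weight (1024) as the sum saturates. */
            printf("se load_avg ~ %.0f\n", update_load_avg(se_sum, 1024.0));
            return 0;
        }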
    • sched/fair: Cure calc_cfs_shares() vs. reweight_entity() · 3d4b60d3
      Peter Zijlstra committed
      Vincent reported that when running in a cgroup, his root
      cfs_rq->avg.load_avg dropped to 0 on task idle.
      
      This is because reweight_entity() will now immediately propagate the
      weight change of the group entity to its cfs_rq, and as it happens,
      our approximation (5) for calc_cfs_shares() results in 0 when the group
      is idle.
      
      Avoid this by using the correct (3) as a lower bound on (5). This way
      the empty cgroup will slowly decay instead of instantly drop to 0.
      Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3d4b60d3
    • sched/fair: Add comment to calc_cfs_shares() · cef27403
      Peter Zijlstra committed
      Explain the magic equation in calc_cfs_shares() a bit better.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cef27403