  1. 27 Oct, 2017 3 commits
  2. 10 Oct, 2017 11 commits
    • sched/fair: Fix usage of find_idlest_group() when the local group is idlest · 93f50f90
      Brendan Jackman authored
      find_idlest_group() returns NULL when the local group is idlest. The
      caller then continues the find_idlest_group() search at a lower level
      of the current CPU's sched_domain hierarchy. find_idlest_group_cpu() is
      not consulted and, crucially, @new_cpu is not updated. This means the
      search is pointless and we return @prev_cpu from select_task_rq_fair().
      
      This is fixed by initialising @new_cpu to @cpu instead of @prev_cpu.
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171005114516.18617-6-brendan.jackman@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
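The effect of the initial value can be illustrated with a small toy model in C. This is only a sketch under stated assumptions: `toy_find_idlest_cpu()`, its parameters, and the convention that a negative entry stands for find_idlest_group() returning NULL are all hypothetical and do not mirror the kernel's data structures.

```c
/*
 * Toy model (hypothetical names): idlest_remote[level] is the CPU picked at
 * that sched_domain level, or a negative value when the local group is
 * idlest (the case where find_idlest_group() returns NULL).
 */
int toy_find_idlest_cpu(const int *idlest_remote, int nlevels,
                        int cpu, int prev_cpu, int init_to_cpu)
{
    /* The fix described above: start from cpu rather than prev_cpu. */
    int new_cpu = init_to_cpu ? cpu : prev_cpu;

    for (int level = 0; level < nlevels; level++) {
        if (idlest_remote[level] < 0)
            continue;               /* local group idlest: new_cpu untouched */
        new_cpu = idlest_remote[level];
    }
    return new_cpu;
}
```

When every level reports the local group as idlest, the `prev_cpu` initialisation returns `prev_cpu` and the search result is discarded, whereas the `cpu` initialisation returns the CPU the search was run for.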
    • sched/fair: Fix usage of find_idlest_group() when no groups are allowed · 6fee85cc
      Brendan Jackman authored
      When 'p' is not allowed on any of the CPUs in the sched_domain, we
      currently return NULL from find_idlest_group(), and pointlessly
      continue the search on lower sched_domain levels (where 'p' is also not
      allowed) before returning prev_cpu regardless (as we have not updated
      new_cpu).
      
      Add an explicit check for this case, and add a comment to
      find_idlest_group(). Now when find_idlest_group() returns NULL, it always
      means that the local group is allowed and idlest.
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171005114516.18617-5-brendan.jackman@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix find_idlest_group() when local group is not allowed · 0d10ab95
      Brendan Jackman authored
      When the local group is not allowed we do not modify this_*_load from
      their initial value of 0. That means that the load checks at the end
      of find_idlest_group cause us to incorrectly return NULL. Fixing the
      initial values to ULONG_MAX means we will instead return the idlest
      remote group in that case.
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171005114516.18617-4-brendan.jackman@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
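A minimal sketch of why the initial value matters, assuming a heavily simplified selection rule (a plain load comparison, where the kernel's checks also involve an imbalance margin and spare capacity). `struct toy_group`, `toy_find_idlest_group()` and the `this_load_init` parameter are hypothetical names used only for illustration; a return value of -1 stands in for NULL.

```c
#include <limits.h>
#include <stdbool.h>

struct toy_group {
    bool allowed;           /* may the task run in this group? */
    unsigned long load;
};

/* Returns the index of the chosen remote group, or -1 ("stay local"). */
int toy_find_idlest_group(const struct toy_group *g, int n, int local,
                          unsigned long this_load_init)
{
    unsigned long this_load = this_load_init; /* only updated if local is allowed */
    unsigned long min_load = ULONG_MAX;
    int idlest = -1;

    for (int i = 0; i < n; i++) {
        if (!g[i].allowed)
            continue;                   /* skipped groups never update the stats */
        if (i == local) {
            this_load = g[i].load;
        } else if (g[i].load < min_load) {
            min_load = g[i].load;
            idlest = i;
        }
    }

    /* Final load check: the local group wins if it is no busier than the idlest remote. */
    if (idlest < 0 || this_load <= min_load)
        return -1;
    return idlest;
}
```

With the local group disallowed, an initial value of 0 makes the local group look perfectly idle and the function returns -1; an initial value of `ULONG_MAX` lets the idlest allowed remote group win instead.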
    • sched/fair: Remove unnecessary comparison with -1 · e90381ea
      Brendan Jackman authored
      Since commit:
      
        83a0a96a ("sched/fair: Leverage the idle state info when choosing the "idlest" cpu")
      
      find_idlest_group_cpu() (formerly find_idlest_cpu) no longer returns -1,
      so we can simplify the checking of the return value in find_idlest_cpu().
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171005114516.18617-3-brendan.jackman@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Move select_task_rq_fair() slow-path into its own function · 18bd1b4b
      Brendan Jackman authored
      In preparation for changes that would otherwise require adding a new
      level of indentation to the while(sd) loop, create a new function
      find_idlest_cpu() which contains this loop, and rename the existing
      find_idlest_cpu() to find_idlest_group_cpu().
      
      Code inside the while(sd) loop is unchanged. @new_cpu is added as a
      variable in the new function, with the same initial value as the
      @new_cpu in select_task_rq_fair().
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20171005114516.18617-2-brendan.jackman@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Force balancing on NOHZ balance if local group has capacity · 583ffd99
      Brendan Jackman authored
      The "goto force_balance" here is intended to mitigate the fact that
      avg_load calculations can result in bad placement decisions when
      priority is asymmetrical.
      
      The original commit that adds it:
      
        fab47622 ("sched: Force balancing on newidle balance if local group has capacity")
      
      explains:
      
          Under certain situations, such as a niced down task (i.e. nice =
          -15) in the presence of nr_cpus NICE0 tasks, the niced task lands
          on a sched group and kicks away other tasks because of its large
          weight. This leads to sub-optimal utilization of the
          machine. Even though the sched group has capacity, it does not
          pull tasks because sds.this_load >> sds.max_load, and f_b_g()
          returns NULL.
      
      A similar but inverted issue also affects ARM big.LITTLE (asymmetrical CPU
      capacity) systems - consider 8 always-running, same-priority tasks on a
      system with 4 "big" and 4 "little" CPUs. Suppose that 5 of them end up on
      the "big" CPUs (which will be represented by one sched_group in the DIE
      sched_domain) and 3 on the "little" (the other sched_group in DIE), leaving
      one CPU unused. Because the "big" group has a higher group_capacity its
      avg_load may not present an imbalance that would cause migrating a
      task to the idle "little".
      
      The force_balance case here solves the problem but currently only for
      CPU_NEWLY_IDLE balances, which in theory might never happen on the
      unused CPU. Including CPU_IDLE in the force_balance case means
      there's an upper bound on the time before we can attempt to solve the
      underutilization: after DIE's sd->balance_interval has passed the
      next nohz balance kick will help us out.
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170807163900.25180-1-brendan.jackman@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
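The big.LITTLE scenario above can be worked through numerically. The sketch below assumes that avg_load is the group load scaled by SCHED_CAPACITY_SCALE and divided by the group capacity, that each always-running task contributes a load of 1024, and that a "little" CPU has half the capacity of a "big" one (512 versus 1024). The capacity figures and the `toy_` names are illustrative assumptions, not values taken from the commit.

```c
#include <assert.h>

#define TOY_CAPACITY_SCALE 1024UL   /* stand-in for SCHED_CAPACITY_SCALE */

/* avg_load of a sched_group: its load normalised by its capacity. */
unsigned long toy_avg_load(unsigned long group_load, unsigned long group_capacity)
{
    assert(group_capacity > 0);
    return group_load * TOY_CAPACITY_SCALE / group_capacity;
}
```

Under these assumptions the "big" group (5 tasks, capacity 4 * 1024) has an avg_load of 1280, while the "little" group (3 tasks, capacity 4 * 512) has an avg_load of 1536. Seen from the idle "little" CPU, its own group already looks busier than the "big" group, so no imbalance is detected even though one CPU sits unused.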
    • sched/fair: Sync task util before slow-path wakeup · ea16f0ea
      Brendan Jackman authored
      We use task_util() in find_idlest_group() via capacity_spare_wake().
      This task_util() is updated in wake_cap(). However wake_cap() is not the
      only reason for ending up in find_idlest_group() - we could have been sent
      there by wake_wide(). So explicitly sync the task util with prev_cpu
      when we are about to head to find_idlest_group().
      
      We could simply do this at the beginning of
      select_task_rq_fair() (i.e. irrespective of whether we're heading to
      select_idle_sibling() or find_idlest_group() & co), but I didn't want to
      slow down the select_idle_sibling() path more than necessary.
      
      Don't do this during fork balancing: we won't need the task_util and
      we'd just clobber the last_update_time, which is supposed to be 0.
      Signed-off-by: Brendan Jackman <brendan.jackman@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andres Oportus <andresoportus@google.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20170808095519.10077-1-brendan.jackman@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Search a task from the tail of the queue · 93824900
      Uladzislau Rezki authored
      As a first step, this patch turns the cfs_tasks list into an MRU list:
      when the next task is picked to run on a physical CPU, it is moved
      to the front of the list.
      
      Therefore, the cfs_tasks list is more or less sorted (except for
      woken tasks), from the tasks that most recently received CPU time
      toward the tasks that have waited longest in the run-queue, i.e. it is an MRU list.
      
      Second, as part of the load balance operation, this approach
      starts detach_tasks()/detach_one_task() from the tail of the
      queue instead of the head, giving some advantages:
      
       - tends to pick a task with the highest wait time;
      
       - tasks located at the tail are less likely to be cache-hot,
         so can_migrate_task() is more likely to allow the migration.
      
      hackbench shows slightly better performance. For example,
      with 1000 samples and 40 groups on an i5-3320M CPU, it reports the
      following figures:
      
       default: 0.657 avg
       patched: 0.646 avg
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Link: http://lkml.kernel.org/r/20170913102430.8985-2-urezki@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
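The two steps described in this commit (move-to-front on pick, detach from the tail) can be sketched with a small array-backed list. This is an illustrative model only: `struct toy_rq`, `toy_pick()` and `toy_detach_tail()` are hypothetical names, and the kernel uses a linked list of tasks rather than an array of integers.

```c
#include <string.h>

#define TOY_MAX_TASKS 8

struct toy_rq {
    int tasks[TOY_MAX_TASKS];   /* tasks[0] is the most recently run task */
    int n;                      /* number of queued tasks */
};

/* When a task is picked to run, move it to the front of the list (MRU order). */
void toy_pick(struct toy_rq *rq, int task)
{
    int i;

    for (i = 0; i < rq->n; i++)
        if (rq->tasks[i] == task)
            break;
    if (i == rq->n)
        return;                 /* task is not on this queue */

    /* Shift the i entries ahead of the task back by one slot. */
    memmove(&rq->tasks[1], &rq->tasks[0], (size_t)i * sizeof(rq->tasks[0]));
    rq->tasks[0] = task;
}

/* Load balancing takes the task at the tail, i.e. the one that ran longest ago. */
int toy_detach_tail(struct toy_rq *rq)
{
    if (rq->n == 0)
        return -1;              /* nothing to migrate */
    return rq->tasks[--rq->n];
}
```

Starting from the order 1, 2, 3 and picking task 3 and then task 1 leaves task 2 at the tail, so it is the one a balancer working from the tail would try to migrate first.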
    • sched/core: Ensure load_balance() respects the active_mask · 024c9d2f
      Peter Zijlstra authored
      While load_balance() masks the source CPUs against active_mask, it had
      a hole against the destination CPU. Ensure the destination CPU is also
      part of the 'domain-mask & active-mask' set.
      Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 77d1dfda ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Address more wake_affine() regressions · f2cdd9cc
      Peter Zijlstra authored
      The trivial wake_affine_idle() implementation is very good for a
      number of workloads, but it comes apart the moment there are no
      idle CPUs left, i.e. in the overloaded case.
      
      hackbench:
      
      		NO_WA_WEIGHT		WA_WEIGHT
      
      hackbench-20  : 7.362717561 seconds	6.450509391 seconds
      
      (win)
      
      netperf:
      
      		  NO_WA_WEIGHT		WA_WEIGHT
      
      TCP_SENDFILE-1	: Avg: 54524.6		Avg: 52224.3
      TCP_SENDFILE-10	: Avg: 48185.2          Avg: 46504.3
      TCP_SENDFILE-20	: Avg: 29031.2          Avg: 28610.3
      TCP_SENDFILE-40	: Avg: 9819.72          Avg: 9253.12
      TCP_SENDFILE-80	: Avg: 5355.3           Avg: 4687.4
      
      TCP_STREAM-1	: Avg: 41448.3          Avg: 42254
      TCP_STREAM-10	: Avg: 24123.2          Avg: 25847.9
      TCP_STREAM-20	: Avg: 15834.5          Avg: 18374.4
      TCP_STREAM-40	: Avg: 5583.91          Avg: 5599.57
      TCP_STREAM-80	: Avg: 2329.66          Avg: 2726.41
      
      TCP_RR-1	: Avg: 80473.5          Avg: 82638.8
      TCP_RR-10	: Avg: 72660.5          Avg: 73265.1
      TCP_RR-20	: Avg: 52607.1          Avg: 52634.5
      TCP_RR-40	: Avg: 57199.2          Avg: 56302.3
      TCP_RR-80	: Avg: 25330.3          Avg: 26867.9
      
      UDP_RR-1	: Avg: 108266           Avg: 107844
      UDP_RR-10	: Avg: 95480            Avg: 95245.2
      UDP_RR-20	: Avg: 68770.8          Avg: 68673.7
      UDP_RR-40	: Avg: 76231            Avg: 75419.1
      UDP_RR-80	: Avg: 34578.3          Avg: 35639.1
      
      UDP_STREAM-1	: Avg: 64684.3          Avg: 66606
      UDP_STREAM-10	: Avg: 52701.2          Avg: 52959.5
      UDP_STREAM-20	: Avg: 30376.4          Avg: 29704
      UDP_STREAM-40	: Avg: 15685.8          Avg: 15266.5
      UDP_STREAM-80	: Avg: 8415.13          Avg: 7388.97
      
      (wins and losses)
      
      sysbench:
      
      		    NO_WA_WEIGHT		WA_WEIGHT
      
      sysbench-mysql-2  :  2135.17 per sec.		 2142.51 per sec.
      sysbench-mysql-5  :  4809.68 per sec.            4800.19 per sec.
      sysbench-mysql-10 :  9158.59 per sec.            9157.05 per sec.
      sysbench-mysql-20 : 14570.70 per sec.           14543.55 per sec.
      sysbench-mysql-40 : 22130.56 per sec.           22184.82 per sec.
      sysbench-mysql-80 : 20995.56 per sec.           21904.18 per sec.
      
      sysbench-psql-2   :  1679.58 per sec.            1705.06 per sec.
      sysbench-psql-5   :  3797.69 per sec.            3879.93 per sec.
      sysbench-psql-10  :  7253.22 per sec.            7258.06 per sec.
      sysbench-psql-20  : 11166.75 per sec.           11220.00 per sec.
      sysbench-psql-40  : 17277.28 per sec.           17359.78 per sec.
      sysbench-psql-80  : 17112.44 per sec.           17221.16 per sec.
      
      (increase on the top end)
      
      tbench:
      
      NO_WA_WEIGHT
      
      Throughput 685.211 MB/sec   2 clients   2 procs  max_latency=0.123 ms
      Throughput 1596.64 MB/sec   5 clients   5 procs  max_latency=0.119 ms
      Throughput 2985.47 MB/sec  10 clients  10 procs  max_latency=0.262 ms
      Throughput 4521.15 MB/sec  20 clients  20 procs  max_latency=0.506 ms
      Throughput 9438.1  MB/sec  40 clients  40 procs  max_latency=2.052 ms
      Throughput 8210.5  MB/sec  80 clients  80 procs  max_latency=8.310 ms
      
      WA_WEIGHT
      
      Throughput 697.292 MB/sec   2 clients   2 procs  max_latency=0.127 ms
      Throughput 1596.48 MB/sec   5 clients   5 procs  max_latency=0.080 ms
      Throughput 2975.22 MB/sec  10 clients  10 procs  max_latency=0.254 ms
      Throughput 4575.14 MB/sec  20 clients  20 procs  max_latency=0.502 ms
      Throughput 9468.65 MB/sec  40 clients  40 procs  max_latency=2.069 ms
      Throughput 8631.73 MB/sec  80 clients  80 procs  max_latency=8.605 ms
      
      (increase on the top end)
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix wake_affine() performance regression · d153b153
      Peter Zijlstra authored
      Eric reported a sysbench regression against commit:
      
        3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      
      Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
      against his v3.10 enterprise kernel.
      
      PRE (current tip/master):
      
       ivb-ep sysbench:
      
         2: [30 secs]     transactions:                        64110  (2136.94 per sec.)
         5: [30 secs]     transactions:                        143644 (4787.99 per sec.)
        10: [30 secs]     transactions:                        274298 (9142.93 per sec.)
        20: [30 secs]     transactions:                        418683 (13955.45 per sec.)
        40: [30 secs]     transactions:                        320731 (10690.15 per sec.)
        80: [30 secs]     transactions:                        355096 (11834.28 per sec.)
      
       hsw-ex NAS:
      
       OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds =                    18.01
       OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds =                    17.89
       OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds =                    17.93
       lu.C.x_threads_144_run_1.log: Time in seconds =                   434.68
       lu.C.x_threads_144_run_2.log: Time in seconds =                   405.36
       lu.C.x_threads_144_run_3.log: Time in seconds =                   433.83
      
      POST (+patch):
      
       ivb-ep sysbench:
      
         2: [30 secs]     transactions:                        64494  (2149.75 per sec.)
         5: [30 secs]     transactions:                        145114 (4836.99 per sec.)
        10: [30 secs]     transactions:                        278311 (9276.69 per sec.)
        20: [30 secs]     transactions:                        437169 (14571.60 per sec.)
        40: [30 secs]     transactions:                        669837 (22326.73 per sec.)
        80: [30 secs]     transactions:                        631739 (21055.88 per sec.)
      
       hsw-ex NAS:
      
       lu.C.x_threads_144_run_1.log: Time in seconds =                    23.36
       lu.C.x_threads_144_run_2.log: Time in seconds =                    22.96
       lu.C.x_threads_144_run_3.log: Time in seconds =                    22.52
      
      This patch takes out all the shiny wake_affine() stuff and goes back to
      utter basics. Between the two CPUs involved with the wakeup (the CPU
      doing the wakeup and the CPU we ran on previously) pick the CPU we can
      run on _now_.
      
      This recovers much of the regression against the older kernels,
      but still leaves some ground lost in the overloaded case. The default-enabled
      WA_WEIGHT (which will be introduced in the next patch) is an attempt
      to address the overloaded situation.
      Reported-by: Eric Farman <farman@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Rosato <mjrosato@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jinpuwang@gmail.com
      Cc: vcaputo@pengaru.com
      Fixes: 3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
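The "back to basics" rule described above (between the waking CPU and the previous CPU, pick the one the task can run on now) might look roughly like the following toy model. Everything here is an assumption made for illustration: `struct toy_cpu`, `toy_wake_affine()`, the use of a bare `nr_running` count as the idleness test, and the sync shortcut (treating the waking CPU as free when the waker is its only runnable task and is about to sleep).

```c
struct toy_cpu {
    int nr_running;     /* runnable tasks currently on this CPU */
};

/* Returns the CPU the woken task should be placed on. */
int toy_wake_affine(const struct toy_cpu *cpus, int this_cpu, int prev_cpu, int sync)
{
    /* The waking CPU is idle: the task can run there immediately. */
    if (cpus[this_cpu].nr_running == 0)
        return this_cpu;

    /* Assumed sync shortcut: the waker is alone and about to go to sleep. */
    if (sync && cpus[this_cpu].nr_running == 1)
        return this_cpu;

    /* Otherwise stay where the task last ran. */
    return prev_cpu;
}
```

Note that this rule says nothing useful once both CPUs are busy, which is the overloaded case that the WA_WEIGHT follow-up is meant to address.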
  3. 30 Sep, 2017 18 commits
    • sched/fair: Update calc_group_*() comments · 17de4ee0
      Peter Zijlstra authored
      I had a wee bit of trouble recalling how the calc_group_runnable()
      stuff worked.. add hopefully better comments.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Calculate runnable_weight slightly differently · 2c8e4dce
      Josef Bacik authored
      Our runnable_weight currently looks like this:
      
      runnable_weight = shares * runnable_load_avg / load_avg
      
      The goal is to scale the runnable weight for the group based on its runnable to
      load_avg ratio.  The problem with this is it biases us towards tasks that never
      go to sleep.  Tasks that go to sleep are going to have their runnable_load_avg
      decayed pretty hard, which will drastically reduce the runnable weight of groups
      with interactive tasks.  To solve this imbalance we tweak this slightly, so in
      the ideal case it is still the above, but in the interactive case it is
      
      runnable_weight = shares * runnable_weight / load_weight
      
      which will make the weight distribution fairer between interactive and
      non-interactive groups.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team@fb.com
      Cc: linux-kernel@vger.kernel.org
      Cc: riel@redhat.com
      Cc: tj@kernel.org
      Link: http://lkml.kernel.org/r/1501773219-18774-2-git-send-email-jbacik@fb.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
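A small arithmetic sketch of the two formulas may help. It assumes one possible way of combining them, namely taking the larger of the averaged and the instantaneous value for both numerator and denominator; that combination, the `toy_` function names, and the example numbers are assumptions for illustration and are not taken from the commit text.

```c
static unsigned long toy_max(unsigned long a, unsigned long b)
{
    return a > b ? a : b;
}

/* Original form: shares * runnable_load_avg / load_avg. */
unsigned long toy_runnable_avg_only(unsigned long shares,
                                    unsigned long runnable_load_avg,
                                    unsigned long load_avg)
{
    return load_avg ? shares * runnable_load_avg / load_avg : shares;
}

/*
 * Tweaked form (assumed combination): fall back to the instantaneous
 * weights when the decayed averages are smaller than them.
 */
unsigned long toy_runnable_tweaked(unsigned long shares,
                                   unsigned long runnable_load_avg,
                                   unsigned long load_avg,
                                   unsigned long runnable_weight,
                                   unsigned long load_weight)
{
    unsigned long runnable = toy_max(runnable_load_avg, runnable_weight);
    unsigned long load = toy_max(load_avg, load_weight);

    return load ? shares * runnable / load : shares;
}
```

For an interactive group with shares = 1024, a heavily decayed runnable_load_avg of 100, a load_avg of 800, a runnable weight of 1024 and a load weight of 2048, the original form yields 128 while the tweaked form yields 512. For a group whose tasks never sleep (all four inputs equal to 1024) both forms yield 1024.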
    • sched/fair: Implement more accurate async detach · 9a2dd585
      Peter Zijlstra authored
      The problem with the overestimate is that it will subtract too big a
      value from the load_sum, thereby pushing it down further than it ought
      to go. Since runnable_load_avg is not subject to a similar 'force',
      this results in the occasional 'runnable_load > load' situation.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Align PELT windows between cfs_rq and its se · f207934f
      Peter Zijlstra authored
      The PELT _sum values are a saw-tooth function, dropping on the decay
      edge and then growing back up again during the window.
      
      When these window-edges are not aligned between cfs_rq and se, we can
      have the situation where, for example, on dequeue, the se decays
      first.
      
      Its _sum values will be small(er), while the cfs_rq _sum values will
      still be on their way up. Because of this, the subtraction:
      cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This
      will then, once the cfs_rq reaches an edge, translate into its _avg
      value jumping up.
      
      This is especially visible with the runnable_load bits, since they get
      added/subtracted a lot.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Implement synchonous PELT detach on load-balance migrate · 144d8487
      Peter Zijlstra authored
      Vincent wondered why his self migrating task had a roughly 50% dip in
      load_avg when landing on the new CPU. This is because we unconditionally
      take the asynchronous detach_entity route, which can lead to the
      attach on the new CPU still seeing the old CPU's contribution to
      tg->load_avg, effectively halving the new CPU's shares.
      
      While in general this is something we have to live with, there is the
      special case of runnable migration where we can do better.
      Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Propagate an effective runnable_load_avg · 1ea6c46a
      Peter Zijlstra authored
      The load balancer uses runnable_load_avg as load indicator. For
      !cgroup this is:
      
        runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq
      
      That is, a direct sum of all runnable tasks on that runqueue. As
      opposed to load_avg, which is a sum of all tasks on the runqueue,
      which includes a blocked component.
      
      However, in the cgroup case, this comes apart since the group entities
      are always runnable, even if most of their constituent entities are
      blocked.
      
      Therefore introduce a runnable_weight which for task entities is the
      same as the regular weight, but for group entities is a fraction of
      the entity weight and represents the runnable part of the group
      runqueue.
      
      Then propagate this load through the PELT hierarchy to arrive at an
      effective runnable load average -- which we should not confuse with
      the canonical runnable load average.
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Rewrite PELT migration propagation · 0e2d2aaa
      Peter Zijlstra authored
      When an entity migrates in (or out) of a runqueue, we need to add (or
      remove) its contribution from the entire PELT hierarchy, because even
      non-runnable entities are included in the load average sums.
      
      In order to do this we have some propagation logic that updates the
      PELT tree, however the way it 'propagates' the runnable (or load)
      change is (more or less):
      
                           tg->weight * grq->avg.load_avg
        ge->avg.load_avg = ------------------------------
                                     tg->load_avg
      
      But that is the expression for ge->weight, and per the definition of
      load_avg:
      
        ge->avg.load_avg := ge->weight * ge->avg.runnable_avg
      
      That destroys the runnable_avg (by setting it to 1) we wanted to
      propagate.
      
      Instead directly propagate runnable_sum.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
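The argument in this commit message can be spelled out in a few lines. The symbols are shorthand introduced here and are not part of the source: $w$ stands for a weight, $L$ for a load_avg and $r$ for a runnable_avg, with subscripts ge (group entity), grq (group runqueue) and tg (task group).

```latex
\begin{align*}
  w_{ge} &= \frac{w_{tg}\, L_{grq}}{L_{tg}}
    && \text{(expression for ge->weight)} \\
  L_{ge} &:= w_{ge}\, r_{ge}
    && \text{(definition of load\_avg)} \\
  L_{ge} = w_{ge} &\;\Longrightarrow\; r_{ge} = \frac{L_{ge}}{w_{ge}} = 1
    && \text{(effect of the old propagation)}
\end{align*}
```

Assigning the weight expression to ge->avg.load_avg therefore forces the runnable_avg factor to 1, which is the information the propagation was supposed to carry.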
    • sched/fair: Rewrite cfs_rq->removed_*avg · 2a2f5d4e
      Peter Zijlstra authored
      Since on wakeup migration we don't hold the rq->lock for the old CPU
      we cannot update its state. Instead we add the removed 'load' to an
      atomic variable and have the next update on that CPU collect and
      process it.
      
      Currently we have 2 atomic variables; which already have the issue
      that they can be read out-of-sync. Also, two atomic ops on a single
      cacheline is already more expensive than an uncontended lock.
      
      Since we want to add more, convert the thing over to an explicit
      cacheline with a lock in.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
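The idea of replacing several independent atomic counters with one lock-protected record can be sketched in userspace C11. This is a rough illustration under stated assumptions: `struct toy_removed`, its field names, and the `toy_` helpers are hypothetical, and a simple `atomic_flag` spinlock stands in for whatever lock type the kernel uses.

```c
#include <stdatomic.h>

/* All removed contributions live together and are guarded by one lock. */
struct toy_removed {
    atomic_flag lock;
    int nr;                     /* entities removed since the last collect */
    unsigned long load_avg;
    unsigned long util_avg;
};

static void toy_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire)) {
        /* spin until the holder releases the lock */
    }
}

static void toy_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

void toy_removed_init(struct toy_removed *r)
{
    atomic_flag_clear(&r->lock);
    r->nr = 0;
    r->load_avg = 0;
    r->util_avg = 0;
}

/* Called by the migrating side, which does not hold the old CPU's rq lock. */
void toy_remove_entity(struct toy_removed *r, unsigned long load, unsigned long util)
{
    toy_lock(&r->lock);
    r->nr++;
    r->load_avg += load;
    r->util_avg += util;
    toy_unlock(&r->lock);
}

/* Called by the next update on the old CPU: read and reset as one unit. */
int toy_collect(struct toy_removed *r, unsigned long *load, unsigned long *util)
{
    int nr;

    toy_lock(&r->lock);
    nr = r->nr;
    *load = r->load_avg;
    *util = r->util_avg;
    r->nr = 0;
    r->load_avg = 0;
    r->util_avg = 0;
    toy_unlock(&r->lock);
    return nr;
}
```

Because both sums are read and cleared under the same lock, the collector can never observe one value updated and the other stale, and adding a further field costs no additional atomic operation.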
    • sched/fair: Use reweight_entity() for set_user_nice() · 9059393e
      Vincent Guittot authored
      Now that we directly change load_avg and propagate that change into
      the sums, sys_nice() and co should do the same, otherwise it's possible
      to confuse load accounting when we migrate near the weight change.
      Fixes-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      [ Added changelog, fixed the call condition. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170517095045.GA8420@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: More accurate reweight_entity() · 840c5abc
      Peter Zijlstra authored
      When a (group) entity changes its weight we should instantly change
      its load_avg and propagate that change into the sums it is part of,
      because we use these values to predict future behaviour and are not
      interested in its historical value.
      
      Without this change, the change in load would need to propagate
      through the average, by which time it could again have changed etc..
      always chasing itself.
      
      With this change, the cfs_rq load_avg sum will more accurately reflect
      the current runnable and expected return of blocked load.
      Reported-by: NPaul Turner <pjt@google.com>
      [josef: compile fix !SMP || !FAIR_GROUP]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      840c5abc
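
The idea described in the commit above can be sketched as follows. This is a minimal illustration, not the kernel code: the types `toy_se` and `toy_rq`, the function `toy_reweight()`, and the constant `TOY_LOAD_DIVIDER` are hypothetical stand-ins, and the sketch assumes `load_sum` is already weight-free so that `load_avg` can be recomputed directly from the new weight.

```c
#include <stdint.h>

/* Toy stand-ins for kernel structures; all names here are hypothetical. */
#define TOY_LOAD_DIVIDER 1024ULL

struct toy_se {
	uint64_t weight;   /* current weight of the entity */
	uint64_t load_sum; /* weight-free, decayed runnable time */
	uint64_t load_avg; /* weight * load_sum / TOY_LOAD_DIVIDER */
};

struct toy_rq {
	uint64_t load_avg; /* sum of load_avg over attached entities */
};

/* Change the weight and refresh load_avg in the same step. */
static void toy_reweight(struct toy_rq *rq, struct toy_se *se, uint64_t weight)
{
	/* Drop the old contribution, clamping at zero to avoid underflow. */
	rq->load_avg = rq->load_avg > se->load_avg ? rq->load_avg - se->load_avg : 0;

	se->weight = weight;
	se->load_avg = se->weight * se->load_sum / TOY_LOAD_DIVIDER;

	/* Add the new contribution so the sum reflects the new weight at once. */
	rq->load_avg += se->load_avg;
}
```

Because the old contribution is removed and the new one added in a single call, the runqueue sum never has to wait for the average to catch up with the weight change.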
    • P
      sched/fair: Introduce {en,de}queue_load_avg() · 8d5b9025
      Peter Zijlstra committed
      Analogous to the existing {en,de}queue_runnable_load_avg() add helpers
      for {en,de}queue_load_avg(). More users will follow.
      
      Includes some code movement to avoid fwd declarations.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8d5b9025
    • P
      sched/fair: Rename {en,de}queue_entity_load_avg() · b5b3e35f
      Peter Zijlstra committed
      Since they're now purely about runnable_load, rename them.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b5b3e35f
    • P
      sched/fair: Move enqueue migrate handling · b382a531
      Peter Zijlstra committed
      Move the entity migrate handling from enqueue_entity_load_avg() to
      update_load_avg(). This has two benefits:
      
       - {en,de}queue_entity_load_avg() will become purely about managing
         runnable_load
      
       - we can avoid a double update_tg_load_avg() and reduce pressure on
         the global tg->shares cacheline
      
      The reason we do this is so that we can change update_cfs_shares() to
      change both weight and (future) runnable_weight. For this to work we
      need to have the cfs_rq averages up-to-date (which means having done
      the attach), but we need the cfs_rq->avg.runnable_avg to not yet
      include the se's contribution (since se->on_rq == 0).
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b382a531
    • P
      sched/fair: Change update_load_avg() arguments · 88c0616e
      Peter Zijlstra committed
      Most call sites of update_load_avg() already have cfs_rq_of(se)
      available, pass it down instead of recomputing it.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      88c0616e
    • P
      sched/fair: Remove se->load.weight from se->avg.load_sum · c7b50216
      Peter Zijlstra committed
      Remove the load from the load_sum for sched_entities, basically
      turning load_sum into runnable_sum.  This prepares for better
      reweighting of group entities.
      
      Since we now have different rules for computing load_avg, split
      ___update_load_avg() into two parts, ___update_load_sum() and
      ___update_load_avg().
      
      So for se:
      
        ___update_load_sum(.weight = 1)
        ___update_load_avg(.weight = se->load.weight)
      
      and for cfs_rq:
      
        ___update_load_sum(.weight = cfs_rq->load.weight)
        ___update_load_avg(.weight = 1)
      
      Since the primary consumable is load_avg, most things will not be
      affected. Only those few sites that initialize/modify load_sum need
      attention.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c7b50216
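
The two weighting rules in the commit above can be illustrated with a small sketch. It is a simplification under stated assumptions: the geometric decay of the real averages is omitted, and `toy_avg`, `toy_update_load_sum()`, `toy_update_load_avg()`, and `TOY_DIVIDER` are hypothetical names rather than kernel symbols. The point is only to show where the weight is applied in each case.

```c
#include <stdint.h>

#define TOY_DIVIDER 1024ULL /* stand-in for the real averaging divider */

struct toy_avg {
	uint64_t load_sum;
	uint64_t load_avg;
};

/* Accumulate running time scaled by weight; geometric decay is omitted. */
static void toy_update_load_sum(struct toy_avg *sa, uint64_t weight, uint64_t delta)
{
	sa->load_sum += weight * delta;
}

/* Derive the average from the sum, applying weight at this stage. */
static void toy_update_load_avg(struct toy_avg *sa, uint64_t weight)
{
	sa->load_avg = weight * sa->load_sum / TOY_DIVIDER;
}
```

For an entity, the sum would be accumulated with weight 1 and the average computed with the entity's weight; for a runqueue, the sum is accumulated with the runqueue's weight and the average computed with weight 1. With the same weight and running time, both paths arrive at the same `load_avg`, but only the entity's `load_sum` stays independent of its weight.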
    • P
      sched/fair: Cure calc_cfs_shares() vs. reweight_entity() · 3d4b60d3
      Peter Zijlstra committed
      Vincent reported that when running in a cgroup, his root
      cfs_rq->avg.load_avg dropped to 0 on task idle.
      
      This is because reweight_entity() will now immediately propagate the
      weight change of the group entity to its cfs_rq, and as it happens,
      our approximation (5) for calc_cfs_shares() results in 0 when the group
      is idle.
      
      Avoid this by using the correct (3) as a lower bound on (5). This way
      the empty cgroup will slowly decay instead of instantly drop to 0.
      Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3d4b60d3
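
The effect of the lower bound in the commit above can be sketched numerically. The equations (3) and (5) it refers to live in a source comment that is not reproduced on this page, so the sketch rests on an assumption: that the bound amounts to using the larger of the group runqueue's instantaneous weight and its decayed load average. The function `toy_calc_shares()` and its parameters are hypothetical, and clamping to a minimum share value is left out.

```c
#include <stdint.h>

static uint64_t toy_max(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/*
 * tg_shares:    configured shares of the task group
 * tg_load_avg:  group-wide load average, including this runqueue's part
 *               (assumed to be >= grq_load_avg)
 * grq_weight:   instantaneous weight of the group runqueue (0 when idle)
 * grq_load_avg: decayed load average of the group runqueue
 * lower_bound:  nonzero to bound the load from below by grq_load_avg
 */
static uint64_t toy_calc_shares(uint64_t tg_shares, uint64_t tg_load_avg,
				uint64_t grq_weight, uint64_t grq_load_avg,
				int lower_bound)
{
	uint64_t load = lower_bound ? toy_max(grq_weight, grq_load_avg) : grq_weight;
	/* Swap this runqueue's averaged part for the load used above. */
	uint64_t tg_weight = tg_load_avg - grq_load_avg + load;

	return tg_weight ? tg_shares * load / tg_weight : 0;
}
```

With an idle group runqueue (`grq_weight == 0`) the unbounded variant returns 0 immediately, while the bounded variant keeps returning a value proportional to the still-decaying `grq_load_avg`.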
    • P
      sched/fair: Add comment to calc_cfs_shares() · cef27403
      Peter Zijlstra committed
      Explain the magic equation in calc_cfs_shares() a bit better.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cef27403
    • P
      sched/fair: Clean up calc_cfs_shares() · 7c80cfc9
      Peter Zijlstra committed
      For consistency's sake, we should have only a single reading of tg->shares.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7c80cfc9
  4. 12 Sep, 2017 2 commits
  5. 11 Sep, 2017 1 commit
  6. 09 Sep, 2017 1 commit
  7. 07 Sep, 2017 1 commit
  8. 10 Aug, 2017 3 commits
    • P
      sched/fair: Fix wake_affine() for !NUMA_BALANCING · 90001d67
      Peter Zijlstra committed
      In commit:
      
        3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      
      Rik changed wake_affine to consider NUMA information when balancing
      between LLC domains.
      
      There are a number of problems here which this patch tries to address:
      
       - LLC < NODE: in this case we'd use the wrong information to balance
       - !NUMA_BALANCING: in this case, the new code doesn't do any
         balancing at all
       - re-computes the NUMA data for every wakeup; this can mean iterating
         up to 64 CPUs for every wakeup
       - default affine wakeups inside a cache
      
      We address these by saving the load/capacity values for each
      sched_domain during regular load-balance and using these values in
      wake_affine_llc(). The obvious down-side to using cached values is
      that they can be too old and poorly reflect reality.
      
      But this way we can use LLC wide information and thus not rely on
      assuming LLC matches NODE. We also don't rely on NUMA_BALANCING nor do
      we have to aggregate two nodes (or even cache domains) worth of CPUs
      for each wakeup.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Fixes: 3fed382b ("sched/numa: Implement NUMA node level wake_affine()")
      [ Minor readability improvements. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      90001d67
    • R
      sched/numa: Scale scan period with tasks in group and shared/private · b5dd77c8
      Rik van Riel committed
      Running 80 tasks in the same group, or as threads of the same process,
      results in the memory getting scanned 80x as fast as it would be if a
      single task was using the memory.
      
      This really hurts some workloads.
      
      Scale the scan period by the number of tasks in the numa group, and
      the shared / private ratio, so the average rate at which memory in
      the group is scanned corresponds roughly to the rate at which a single
      task would scan its memory.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jhladky@redhat.com
      Cc: lvenanci@redhat.com
      Link: http://lkml.kernel.org/r/20170731192847.23050-3-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b5dd77c8
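
The scaling described in the commit above can be sketched as a small function. This is an illustration under assumptions, not the patch itself: `toy_scan_start()` and its parameters are hypothetical, and the exact arithmetic (multiply by the task count, then by the shared fraction of faults, never dropping below the minimum period) is one plausible reading of the description.

```c
/*
 * min_period: per-task minimum scan period
 * nr_tasks:   number of tasks in the NUMA group
 * shared:     shared fault count for the group
 * priv:       private fault count for the group
 */
static unsigned long toy_scan_start(unsigned long min_period,
				    unsigned long nr_tasks,
				    unsigned long shared,
				    unsigned long priv)
{
	unsigned long period = min_period;

	/* More tasks scanning the same memory: each one scans less often. */
	period *= nr_tasks;

	/* Weight by the shared fraction; the +1 avoids division by zero. */
	period *= shared + 1;
	period /= priv + shared + 1;

	/* Never scan faster than a single task would on its own. */
	return period > min_period ? period : min_period;
}
```

With 80 tasks and almost entirely shared faults, the period grows by roughly a factor of 80, so the group as a whole scans at about the rate of one task. With mostly private faults, the result falls back to the minimum period.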
    • R
      sched/numa: Slow down scan rate if shared faults dominate · 37ec97de
      Rik van Riel committed
      The comment above update_task_scan_period() says the scan period should
      be increased (scanning slows down) if the majority of memory accesses
      are on the local node, or if the majority of the page accesses are
      shared with other tasks.
      
      However, with the current code, all a high ratio of shared accesses
      does is slow down the rate at which scanning is made faster.
      
      This patch changes things so either lots of shared accesses or
      lots of local accesses will slow down scanning, and numa scanning
      is sped up only when there are lots of private faults on remote
      memory pages.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jhladky@redhat.com
      Cc: lvenanci@redhat.com
      Link: http://lkml.kernel.org/r/20170731192847.23050-2-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      37ec97de
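
The decision rule described in the commit above can be sketched as follows. The function `toy_scan_period_diff()`, the slot count, and the threshold are hypothetical and chosen only for illustration; the sketch follows the prose (either mostly local or mostly shared accesses lengthen the period, otherwise it is shortened) rather than reproducing the kernel's implementation.

```c
#define TOY_SLOTS     10UL /* illustrative granularity of the ratios */
#define TOY_THRESHOLD 7UL  /* illustrative cut-off, in slots */

/* Returns the signed change to apply to the scan period. */
static long toy_scan_period_diff(unsigned long period,
				 unsigned long local, unsigned long remote,
				 unsigned long priv, unsigned long shared)
{
	unsigned long slot_len, local_ratio, shared_ratio, ratio;

	/* No fault data recorded: leave the period unchanged. */
	if (local + remote == 0 || priv + shared == 0)
		return 0;

	slot_len = (period + TOY_SLOTS - 1) / TOY_SLOTS;
	local_ratio = local * TOY_SLOTS / (local + remote);
	shared_ratio = shared * TOY_SLOTS / (priv + shared);
	ratio = local_ratio > shared_ratio ? local_ratio : shared_ratio;

	if (ratio >= TOY_THRESHOLD) {
		/* Mostly local or mostly shared: slow scanning down. */
		unsigned long slots = ratio - TOY_THRESHOLD;

		if (slots == 0)
			slots = 1;
		return (long)(slots * slot_len);
	}

	/* Mostly private faults on remote memory: speed scanning up. */
	return -(long)((TOY_THRESHOLD - ratio) * slot_len);
}
```

A positive return value lengthens the period (slower scanning) and a negative one shortens it. Taking the maximum of the two ratios is what makes either condition sufficient to slow scanning down.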