1. 27 3月, 2015 5 次提交
    • V
      sched: Make scale_rt invariant with frequency · b5b4860d
      Vincent Guittot 提交于
      The average running time of RT tasks is used to estimate the remaining compute
      capacity for CFS tasks. This remaining capacity is the original capacity scaled
      down by a factor (aka scale_rt_capacity). This estimation of available capacity
      must also be invariant with frequency scaling.
      
      A frequency scaling factor is applied on the running time of the RT tasks for
      computing scale_rt_capacity.
      
      In sched_rt_avg_update(), we now scale the RT execution time like below:
      
        rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT
      
      Then, scale_rt_capacity can be summarized by:
      
        scale_rt_capacity = SCHED_CAPACITY_SCALE * available / total
      
      with available = total - rq->rt_avg
      
      This has been been optimized in current code by:
      
        scale_rt_capacity = available / (total >> SCHED_CAPACITY_SHIFT)
      
      But we can also developed the equation like below:
      
        scale_rt_capacity = SCHED_CAPACITY_SCALE - ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / total)
      
      and we can optimize the equation by removing SCHED_CAPACITY_SHIFT shift in
      the computation of rq->rt_avg and scale_rt_capacity().
      
      so rq->rt_avg += rt_delta * arch_scale_freq_capacity()
      and
      scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total)
      
      arch_scale_frequency_capacity() will be called in the hot path of the scheduler
      which implies to have a short and efficient function.
      
      As an example, arch_scale_frequency_capacity() should return a cached value that
      is updated periodically outside of the hot path.
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-6-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b5b4860d
    • M
      sched: Make sched entity usage tracking scale-invariant · 0c1dc6b2
      Morten Rasmussen 提交于
      Apply frequency scale-invariance correction factor to usage tracking.
      
      Each segment of the running_avg_sum geometric series is now scaled by the
      current frequency so the utilization_avg_contrib of each entity will be
      invariant with frequency scaling.
      
      As a result, utilization_load_avg which is the sum of utilization_avg_contrib,
      becomes invariant too. So the usage level that is returned by get_cpu_usage(),
      stays relative to the max frequency as the cpu_capacity which is is compared against.
      
      Then, we want the keep the load tracking values in a 32-bit type, which implies
      that the max value of {runnable|running}_avg_sum must be lower than
      2^32/88761=48388 (88761 is the max weigth of a task). As LOAD_AVG_MAX = 47742,
      arch_scale_freq_capacity() must return a value less than
      (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_SCALE_CAPACITY = 1024).
      So we define the range to [0..SCHED_SCALE_CAPACITY] in order to avoid overflow.
      Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425455186-13451-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      0c1dc6b2
    • V
      sched: Remove frequency scaling from cpu_capacity · a8faa8f5
      Vincent Guittot 提交于
      Now that arch_scale_cpu_capacity has been introduced to scale the original
      capacity, the arch_scale_freq_capacity is no longer used (it was
      previously used by ARM arch).
      
      Remove arch_scale_freq_capacity from the computation of cpu_capacity.
      The frequency invariance will be handled in the load tracking and not in
      the CPU capacity. arch_scale_freq_capacity will be revisited for scaling
      load with the current frequency of the CPUs in a later patch.
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-4-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a8faa8f5
    • M
      sched: Track group sched_entity usage contributions · 21f44866
      Morten Rasmussen 提交于
      Add usage contribution tracking for group entities. Unlike
      se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
      entities is the sum of se->avg.utilization_avg_contrib for all entities on the
      group runqueue.
      
      It is _not_ influenced in any way by the task group h_load. Hence it is
      representing the actual cpu usage of the group, not its intended load
      contribution which may differ significantly from the utilization on
      lightly utilized systems.
      Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-3-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      21f44866
    • V
      sched: Add sched_avg::utilization_avg_contrib · 36ee28e4
      Vincent Guittot 提交于
      Add new statistics which reflect the average time a task is running on the CPU
      and the sum of these running time of the tasks on a runqueue. The latter is
      named utilization_load_avg.
      
      This patch is based on the usage metric that was proposed in the 1st
      versions of the per-entity load tracking patchset by Paul Turner
      <pjt@google.com> but that has be removed afterwards. This version differs from
      the original one in the sense that it's not linked to task_group.
      
      The rq's utilization_load_avg will be used to check if a rq is overloaded or
      not instead of trying to compute how many tasks a group of CPUs can handle.
      
      Rename runnable_avg_period into avg_period as it is now used with both
      runnable_avg_sum and running_avg_sum.
      
      Add some descriptions of the variables to explain their differences.
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-2-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      36ee28e4
  2. 18 2月, 2015 2 次提交
  3. 28 1月, 2015 1 次提交
    • J
      sched/fair: Avoid using uninitialized variable in preferred_group_nid() · 81907478
      Jan Beulich 提交于
      At least some gcc versions - validly afaict - warn about potentially
      using max_group uninitialized: There's no way the compiler can prove
      that the body of the conditional where it and max_faults get set/
      updated gets executed; in fact, without knowing all the details of
      other scheduler code, I can't prove this either.
      
      Generally the necessary change would appear to be to clear max_group
      prior to entering the inner loop, and break out of the outer loop when
      it ends up being all clear after the inner one. This, however, seems
      inefficient, and afaict the same effect can be achieved by exiting the
      outer loop when max_faults is still zero after the inner loop.
      
      [ mingo: changed the solution to zero initialization: uninitialized_var()
        needs to die, as it's an actively dangerous construct: if in the future
        a known-proven-good piece of code is changed to have a true, buggy
        uninitialized variable, the compiler warning is then supressed...
      
        The better long term solution is to clean up the code flow, so that
        even simple minded compilers (and humans!) are able to read it without
        getting a headache.  ]
      Signed-off-by: NJan Beulich <jbeulich@suse.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/54C2139202000078000588F7@mail.emea.novell.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      81907478
  4. 14 1月, 2015 4 次提交
  5. 09 1月, 2015 2 次提交
    • T
      sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group() · 7f1a169b
      Tetsuo Handa 提交于
      When alloc_fair_sched_group() in sched_create_group() fails,
      free_sched_group() is called, and free_fair_sched_group() is called by
      free_sched_group(). Since destroy_cfs_bandwidth() is called by
      free_fair_sched_group() without calling init_cfs_bandwidth(),
      RCU stall occurs at hrtimer_cancel():
      
        INFO: rcu_sched self-detected stall on CPU { 1}  (t=60000 jiffies g=13074 c=13073 q=0)
        Task dump for CPU 1:
        (fprintd)       R  running task        0  6249      1 0x00000088
        ...
        Call Trace:
         <IRQ>  [<ffffffff81094988>] sched_show_task+0xa8/0x110
         [<ffffffff81097acd>] dump_cpu_task+0x3d/0x50
         [<ffffffff810c3a80>] rcu_dump_cpu_stacks+0x90/0xd0
         [<ffffffff810c7751>] rcu_check_callbacks+0x491/0x700
         [<ffffffff810cbf2b>] update_process_times+0x4b/0x80
         [<ffffffff810db046>] tick_sched_handle.isra.20+0x36/0x50
         [<ffffffff810db0a2>] tick_sched_timer+0x42/0x70
         [<ffffffff810ccb19>] __run_hrtimer+0x69/0x1a0
         [<ffffffff810db060>] ? tick_sched_handle.isra.20+0x50/0x50
         [<ffffffff810ccedf>] hrtimer_interrupt+0xef/0x230
         [<ffffffff810452cb>] local_apic_timer_interrupt+0x3b/0x70
         [<ffffffff8164a465>] smp_apic_timer_interrupt+0x45/0x60
         [<ffffffff816485bd>] apic_timer_interrupt+0x6d/0x80
         <EOI>  [<ffffffff810cc588>] ? lock_hrtimer_base.isra.23+0x18/0x50
         [<ffffffff81193cf1>] ? __kmalloc+0x211/0x230
         [<ffffffff810cc9d2>] hrtimer_try_to_cancel+0x22/0xd0
         [<ffffffff81193cf1>] ? __kmalloc+0x211/0x230
         [<ffffffff810ccaa2>] hrtimer_cancel+0x22/0x30
         [<ffffffff810a3cb5>] free_fair_sched_group+0x25/0xd0
         [<ffffffff8108df46>] free_sched_group+0x16/0x40
         [<ffffffff810971bb>] sched_create_group+0x4b/0x80
         [<ffffffff810aa383>] sched_autogroup_create_attach+0x43/0x1c0
         [<ffffffff8107dc9c>] sys_setsid+0x7c/0x110
         [<ffffffff81647729>] system_call_fastpath+0x12/0x17
      
      Check whether init_cfs_bandwidth() was called before calling
      destroy_cfs_bandwidth().
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      [ Move the check into destroy_cfs_bandwidth() to aid compilability. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jpSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7f1a169b
    • Y
      sched: Fix odd values in effective_load() calculations · 32a8df4e
      Yuyang Du 提交于
      In effective_load, we have (long w * unsigned long tg->shares) / long W,
      when w is negative, it is cast to unsigned long and hence the product is
      insanely large. Fix this by casting tg->shares to long.
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NYuyang Du <yuyang.du@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      32a8df4e
  6. 16 11月, 2014 5 次提交
    • W
      sched/fair: Fix stale overloaded status in the busiest group finding logic · cb0b9f24
      Wanpeng Li 提交于
      Commit caeb178c ("sched/fair: Make update_sd_pick_busiest() return
      'true' on a busier sd") changes groups to be ranked in the order of
      overloaded > imbalance > other, and busiest group is picked according
      to this order.
      
      sgs->group_capacity_factor is used to check if the group is overloaded.
      
      When the child domain prefers tasks to go to siblings first, the
      sgs->group_capacity_factor will be set lower than one in order to
      move all the excess tasks away.
      
      However, group overloaded status is not updated when
      sgs->group_capacity_factor is set to lower than one, which leads to us
      missing to find the busiest group.
      
      This patch fixes it by updating group overloaded status when sg capacity
      factor is set to one, in order to find the busiest group accurately.
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng.li@linux.intel.com
      [ Fixed the changelog. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      cb0b9f24
    • W
      sched: Move p->nr_cpus_allowed check to select_task_rq() · 6c1d9410
      Wanpeng Li 提交于
      Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
      This change will make fair.c, rt.c, and deadline.c all start with the
      same logic.
      Suggested-and-Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "pang.xunlei" <pang.xunlei@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6c1d9410
    • K
      sched/fair: Kill task_struct::numa_entry and numa_group::task_list · 75389918
      Kirill Tkhai 提交于
      Nobody iterates over numa_group::task_list, this just confuses the readers.
      Signed-off-by: NKirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1415358456.28592.17.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>
      75389918
    • S
      sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency · 6e998916
      Stanislaw Gruszka 提交于
      Commit d670ec13 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
      test case in cost of breaking another one. After that commit, calling
      clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
      of Y time being smaller than X time.
      
      Reproducer/tester can be found further below, it can be compiled and ran by:
      
      	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
      	while ./tst-cpuclock2 ; do : ; done
      
      This reproducer, when running on a buggy kernel, will complain
      about "clock_gettime difference too small".
      
      Issue happens because on start in thread_group_cputimer() we initialize
      sum_exec_runtime of cputimer with threads runtime not yet accounted and
      then add the threads runtime to running cputimer again on scheduler
      tick, making it's sum_exec_runtime bigger than actual threads runtime.
      
      KOSAKI Motohiro posted a fix for this problem, but that patch was never
      applied: https://lkml.org/lkml/2013/5/26/191 .
      
      This patch takes different approach to cure the problem. It calls
      update_curr() when cputimer starts, that assure we will have updated
      stats of running threads and on the next schedule tick we will account
      only the runtime that elapsed from cputimer start. That also assure we
      have consistent state between cpu times of individual threads and cpu
      time of the process consisted by those threads.
      
      Full reproducer (tst-cpuclock2.c):
      
      	#define _GNU_SOURCE
      	#include <unistd.h>
      	#include <sys/syscall.h>
      	#include <stdio.h>
      	#include <time.h>
      	#include <pthread.h>
      	#include <stdint.h>
      	#include <inttypes.h>
      
      	/* Parameters for the Linux kernel ABI for CPU clocks.  */
      	#define CPUCLOCK_SCHED          2
      	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
      		((~(clockid_t) (pid) << 3) | (clockid_t) (clock))
      
      	static pthread_barrier_t barrier;
      
      	/* Help advance the clock.  */
      	static void *chew_cpu(void *arg)
      	{
      		pthread_barrier_wait(&barrier);
      		while (1) ;
      
      		return NULL;
      	}
      
      	/* Don't use the glibc wrapper.  */
      	static int do_nanosleep(int flags, const struct timespec *req)
      	{
      		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);
      
      		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
      	}
      
      	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
      	{
      		int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
      		int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;
      
      		return after_i - before_i;
      	}
      
      	int main(void)
      	{
      		int result = 0;
      		pthread_t th;
      
      		pthread_barrier_init(&barrier, NULL, 2);
      
      		if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
      			perror("pthread_create");
      			return 1;
      		}
      
      		pthread_barrier_wait(&barrier);
      
      		/* The test.  */
      		struct timespec before, after, sleeptimeabs;
      		int64_t sleepdiff, diffabs;
      		const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };
      
      		/* The relative nanosleep.  Not sure why this is needed, but its presence
      		   seems to make it easier to reproduce the problem.  */
      		if (do_nanosleep(0, &sleeptime) != 0) {
      			perror("clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the current time.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
      			perror("clock_gettime[2]");
      			return 1;
      		}
      
      		/* Compute the absolute sleep time based on the current time.  */
      		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
      		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
      		sleeptimeabs.tv_nsec = nsec % 1000000000;
      
      		/* Sleep for the computed time.  */
      		if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
      			perror("absolute clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the time after the sleep.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
      			perror("clock_gettime[3]");
      			return 1;
      		}
      
      		/* The time after sleep should always be equal to or after the absolute sleep
      		   time passed to clock_nanosleep.  */
      		sleepdiff = tsdiff(&sleeptimeabs, &after);
      		if (sleepdiff < 0) {
      			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
      			result = 1;
      
      			printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
      			printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
      			printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
      		}
      
      		/* The difference between the timestamps taken before and after the
      		   clock_nanosleep call should be equal to or more than the duration of the
      		   sleep.  */
      		diffabs = tsdiff(&before, &after);
      		if (diffabs < sleeptime.tv_nsec) {
      			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
      			result = 1;
      		}
      
      		pthread_cancel(th);
      
      		return result;
      	}
      Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6e998916
    • P
      sched/numa: Avoid selecting oneself as swap target · 7af68335
      Peter Zijlstra 提交于
      Because the whole numa task selection stuff runs with preemption
      enabled (its long and expensive) we can end up migrating and selecting
      oneself as a swap target. This doesn't really work out well -- we end
      up trying to acquire the same lock twice for the swap migrate -- so
      avoid this.
      Reported-and-Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7af68335
  7. 04 11月, 2014 2 次提交
  8. 28 10月, 2014 7 次提交
    • R
      sched/numa: Check all nodes when placing a pseudo-interleaved group · 9de05d48
      Rik van Riel 提交于
      In pseudo-interleaved numa_groups, all tasks try to relocate to
      the group's preferred_nid.  When a group is spread across multiple
      NUMA nodes, this can lead to tasks swapping their location with
      other tasks inside the same group, instead of swapping location with
      tasks from other NUMA groups. This can keep NUMA groups from converging.
      
      Examining all nodes, when dealing with a task in a pseudo-interleaved
      NUMA group, avoids this problem. Note that only CPUs in nodes that
      improve the task or group score are examined, so the loop isn't too
      bad.
      Tested-by: NVinod Chegu <chegu_vinod@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "Vinod Chegu" <chegu_vinod@hp.com>
      Cc: mgorman@suse.de
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20141009172747.0d97c38c@annuminas.surriel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9de05d48
    • R
      sched/numa: Find the preferred nid with complex NUMA topology · 54009416
      Rik van Riel 提交于
      On systems with complex NUMA topologies, the node scoring is adjusted
      to allow workloads to converge on nodes that are near each other.
      
      The way a task group's preferred nid is determined needs to be adjusted,
      in order for the preferred_nid to be consistent with group_weight scoring.
      This ensures that we actually try to converge workloads on adjacent nodes.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Tested-by: NChegu Vinod <chegu_vinod@hp.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1413530994-9732-6-git-send-email-riel@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      54009416
    • R
      sched/numa: Calculate node scores in complex NUMA topologies · 6c6b1193
      Rik van Riel 提交于
      In order to do task placement on systems with complex NUMA topologies,
      it is necessary to count the faults on nodes nearby the node that is
      being examined for a potential move.
      
      In case of a system with a backplane interconnect, we are dealing with
      groups of NUMA nodes; each of the nodes within a group is the same number
      of hops away from nodes in other groups in the system. Optimal placement
      on this topology is achieved by counting all nearby nodes equally. When
      comparing nodes A and B at distance N, nearby nodes are those at distances
      smaller than N from nodes A or B.
      
      Placement strategy on a system with a glueless mesh NUMA topology needs
      to be different, because there are no natural groups of nodes determined
      by the hardware. Instead, when dealing with two nodes A and B at distance
      N, N >= 2, there will be intermediate nodes at distance < N from both nodes
      A and B. Good placement can be achieved by right shifting the faults on
      nearby nodes by the number of hops from the node being scored. In this
      context, a nearby node is any node less than the maximum distance in the
      system away from the node. Those nodes are skipped for efficiency reasons,
      there is no real policy reason to do so.
      
      Placement policy on directly connected NUMA systems is not affected.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Tested-by: NChegu Vinod <chegu_vinod@hp.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Link: http://lkml.kernel.org/r/1413530994-9732-5-git-send-email-riel@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6c6b1193
    • R
      sched/numa: Prepare for complex topology placement · 7bd95320
      Rik van Riel 提交于
      Preparatory patch for adding NUMA placement on systems with
      complex NUMA topology. Also fix a potential divide by zero
      in group_weight()
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Tested-by: NChegu Vinod <chegu_vinod@hp.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: mgorman@suse.de
      Cc: chegu_vinod@hp.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1413530994-9732-4-git-send-email-riel@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7bd95320
    • K
      sched/fair: Fix division by zero sysctl_numa_balancing_scan_size · 64192658
      Kirill Tkhai 提交于
      File /proc/sys/kernel/numa_balancing_scan_size_mb allows writing of zero.
      
      This bash command reproduces problem:
      
      $ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
      	   echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done
      
      	divide error: 0000 [#1] SMP
      	Modules linked in:
      	CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      	task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
      	RIP: 0010:[<ffffffff81074191>]  [<ffffffff81074191>] task_scan_min+0x21/0x50
      	RSP: 0000:ffff880037a6bce0  EFLAGS: 00010246
      	RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
      	RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
      	RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
      	R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
      	R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
      	FS:  00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      	CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
      	Stack:
      	 ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
      	 ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
      	 ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
      	Call Trace:
      	 [<ffffffff810741d1>] task_scan_max+0x11/0x40
      	 [<ffffffff81077ef7>] task_numa_fault+0x1f7/0xae0
      	 [<ffffffff8115a896>] ? migrate_misplaced_page+0x276/0x300
      	 [<ffffffff81134a4d>] handle_mm_fault+0x62d/0xba0
      	 [<ffffffff8103e2f1>] __do_page_fault+0x191/0x510
      	 [<ffffffff81030122>] ? native_smp_send_reschedule+0x42/0x60
      	 [<ffffffff8106dc00>] ? check_preempt_curr+0x80/0xa0
      	 [<ffffffff8107092c>] ? wake_up_new_task+0x11c/0x1a0
      	 [<ffffffff8104887d>] ? do_fork+0x14d/0x340
      	 [<ffffffff811799bb>] ? get_unused_fd_flags+0x2b/0x30
      	 [<ffffffff811799df>] ? __fd_install+0x1f/0x60
      	 [<ffffffff8103e67c>] do_page_fault+0xc/0x10
      	 [<ffffffff8150d322>] page_fault+0x22/0x30
      	RIP  [<ffffffff81074191>] task_scan_min+0x21/0x50
      	RSP <ffff880037a6bce0>
      	---[ end trace 9a826d16936c04de ]---
      
      Also fix race in task_scan_min (it depends on compiler behaviour).
      Signed-off-by: NKirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>
      64192658
    • Y
      sched/fair: Care divide error in update_task_scan_period() · 2847c90e
      Yasuaki Ishimatsu 提交于
      While offling node by hot removing memory, the following divide error
      occurs:
      
        divide error: 0000 [#1] SMP
        [...]
        Call Trace:
         [...] handle_mm_fault
         [...] ? try_to_wake_up
         [...] ? wake_up_state
         [...] __do_page_fault
         [...] ? do_futex
         [...] ? put_prev_entity
         [...] ? __switch_to
         [...] do_page_fault
         [...] page_fault
        [...]
        RIP  [<ffffffff810a7081>] task_numa_fault
         RSP <ffff88084eb2bcb0>
      
      The issue occurs as follows:
        1. When page fault occurs and page is allocated from node 1,
           task_struct->numa_faults_buffer_memory[] of node 1 is
           incremented and p->numa_faults_locality[] is also incremented
           as follows:
      
           o numa_faults_buffer_memory[]       o numa_faults_locality[]
                    NR_NUMA_HINT_FAULT_TYPES
                   |      0     |     1     |
           ----------------------------------  ----------------------
            node 0 |      0     |     0     |   remote |      0     |
            node 1 |      0     |     1     |   locale |      1     |
           ----------------------------------  ----------------------
      
        2. node 1 is offlined by hot removing memory.
      
        3. When page fault occurs, fault_types[] is calculated by using
           p->numa_faults_buffer_memory[] of all online nodes in
           task_numa_placement(). But node 1 was offline by step 2. So
           the fault_types[] is calculated by using only
           p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
           are set to 0.
      
        4. The values(0) of fault_types[] pass to update_task_scan_period().
      
        5. numa_faults_locality[1] is set to 1. So the following division is
           calculated.
      
              static void update_task_scan_period(struct task_struct *p,
                                      unsigned long shared, unsigned long private){
              ...
                      ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
              }
      
        6. But both of private and shared are set to 0. So divide error
           occurs here.
      
      The divide error is rare case because the trigger is node offline.
      This patch always increments denominator for avoiding divide error.
      Signed-off-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2847c90e
    • K
      sched/numa: Fix unsafe get_task_struct() in task_numa_assign() · 1effd9f1
      Kirill Tkhai 提交于
      Unlocked access to dst_rq->curr in task_numa_compare() is racy.
      If curr task is exiting this may be a reason of use-after-free:
      
      task_numa_compare()                    do_exit()
          ...                                        current->flags |= PF_EXITING;
          ...                                    release_task()
          ...                                        ~~delayed_put_task_struct()~~
          ...                                    schedule()
          rcu_read_lock()                        ...
          cur = ACCESS_ONCE(dst_rq->curr)        ...
              ...                                rq->curr = next;
              ...                                    context_switch()
              ...                                        finish_task_switch()
              ...                                            put_task_struct()
              ...                                                __put_task_struct()
              ...                                                    free_task_struct()
              task_numa_assign()                                     ...
                  get_task_struct()                                  ...
      
      As noted by Oleg:
      
        <<The lockless get_task_struct(tsk) is only safe if tsk == current
          and didn't pass exit_notify(), or if this tsk was found on a rcu
          protected list (say, for_each_process() or find_task_by_vpid()).
          IOW, it is only safe if release_task() was not called before we
          take rcu_read_lock(), in this case we can rely on the fact that
          delayed_put_pid() can not drop the (potentially) last reference
          until rcu_read_unlock().
      
          And as Kirill pointed out task_numa_compare()->task_numa_assign()
          path does get_task_struct(dst_rq->curr) and this is not safe. The
          task_struct itself can't go away, but rcu_read_lock() can't save
          us from the final put_task_struct() in finish_task_switch(); this
          reference goes away without rcu gp>>
      
      The patch provides simple check of PF_EXITING flag. If it's not set,
      this guarantees that call_rcu() of delayed_put_task_struct() callback
      hasn't happened yet, so we can safely do get_task_struct() in
      task_numa_assign().
      
      Locked dst_rq->lock protects from concurrency with the last schedule().
      Reusing or unmapping of cur's memory may happen without it.
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NKirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1effd9f1
  9. 10 10月, 2014 1 次提交
    • O
      mempolicy: remove the "task" arg of vma_policy_mof() and simplify it · 6b6482bb
      Oleg Nesterov 提交于
      1. vma_policy_mof(task) is simply not safe unless task == current,
         it can race with do_exit()->mpol_put(). Remove this arg and update
         its single caller.
      
      2. vma can not be NULL, remove this check and simplify the code.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b6482bb
  10. 03 10月, 2014 2 次提交
    • K
      sched/fair: Delete resched_cpu() from idle_balance() · 10a12983
      Kirill Tkhai 提交于
      We already reschedule env.dst_cpu in attach_tasks()->check_preempt_curr()
      if this is necessary.
      
      Furthermore, a higher priority class task may be current on dest rq,
      we shouldn't disturb it.
      Signed-off-by: NKirill Tkhai <ktkhai@parallels.com>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140930210441.5258.55054.stgit@localhostSigned-off-by: NIngo Molnar <mingo@kernel.org>
      10a12983
    • V
      sched: Improve sysbench performance by fixing spurious active migration · 43f4d666
      Vincent Guittot 提交于
      Since commit caeb178c ("sched/fair: Make update_sd_pick_busiest() ...")
      sd_pick_busiest returns a group that can be neither imbalanced nor overloaded
      but is only more loaded than others. This change has been introduced to ensure
      a better load balance in system that are not overloaded but as a side effect,
      it can also generate useless active migration between groups.
      
      Let take the example of 3 tasks on a quad cores system. We will always have an
      idle core so the load balance will find a busiest group (core) whenever an ILB
      is triggered and it will force an active migration (once above
      nr_balance_failed threshold) so the idle core becomes busy but another core
      will become idle. With the next ILB, the freshly idle core will try to pull the
      task of a busy CPU.
      The number of spurious active migration is not so huge in quad core system
      because the ILB is not triggered so much. But it becomes significant as soon as
      you have more than one sched_domain level like on a dual cluster of quad cores
      where the ILB is triggered every tick when you have more than 1 busy_cpu
      
      We need to ensure that the migration generate a real improveùent and will not
      only move the avg_load imbalance on another CPU.
      
      Before caeb178c, the filtering of such use
      case was ensured by the following test in f_b_g:
      
        if ((local->idle_cpus < busiest->idle_cpus) &&
      		    busiest->sum_nr_running  <= busiest->group_weight)
      
      This patch modified the condition to take into account situation where busiest
      group is not overloaded: If the diff between the number of idle cpus in 2
      groups is less than or equal to 1 and the busiest group is not overloaded,
      moving a task will not improve the load balance but just move it.
      
      A test with sysbench on a dual clusters of quad cores gives the following
      results:
      
        command: sysbench --test=cpu --num-threads=5 --max-time=5 run
      
      The HZ is 200 which means that 1000 ticks has fired during the test.
      
      With Mainline, perf gives the following figures:
      
       Samples: 727  of event 'sched:sched_migrate_task'
       Event count (approx.): 727
        Overhead  Command          Shared Object  Symbol
        ........  ...............  .............  ..............
          12.52%  migration/1      [unknown]      [.] 00000000
          12.52%  migration/5      [unknown]      [.] 00000000
          12.52%  migration/7      [unknown]      [.] 00000000
          12.10%  migration/6      [unknown]      [.] 00000000
          11.83%  migration/0      [unknown]      [.] 00000000
          11.83%  migration/3      [unknown]      [.] 00000000
          11.14%  migration/4      [unknown]      [.] 00000000
          10.87%  migration/2      [unknown]      [.] 00000000
           2.75%  sysbench         [unknown]      [.] 00000000
           0.83%  swapper          [unknown]      [.] 00000000
           0.55%  ktps65090charge  [unknown]      [.] 00000000
           0.41%  mmcqd/1          [unknown]      [.] 00000000
           0.14%  perf             [unknown]      [.] 00000000
      
      With this patch, perf gives the following figures
      
       Samples: 20  of event 'sched:sched_migrate_task'
       Event count (approx.): 20
        Overhead  Command          Shared Object  Symbol
        ........  ...............  .............  ..............
          80.00%  sysbench         [unknown]      [.] 00000000
          10.00%  swapper          [unknown]      [.] 00000000
           5.00%  ktps65090charge  [unknown]      [.] 00000000
           5.00%  migration/1      [unknown]      [.] 00000000
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1412170735-5356-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      43f4d666
  11. 24 9月, 2014 3 次提交
  12. 21 9月, 2014 1 次提交
  13. 19 9月, 2014 5 次提交