1. 28 October 2016 (1 commit)
    • mm: remove per-zone hashtable of bitlock waitqueues · 9dcb8b68
      Committed by Linus Torvalds
      The per-zone waitqueues exist because of a scalability issue with the
      page waitqueues on some NUMA machines, but it turns out that they hurt
      normal loads, and now with the vmalloced stacks they also end up
      breaking gfs2 that uses a bit_wait on a stack object:
      
           wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
      
      where 'gh' can be a reference to the local variable 'mount_gh' on the
      stack of fill_super().
      
      The reason the per-zone hash table breaks for this case is that there is
      no "zone" for virtual allocations, and trying to look up the physical
      page to get at it will fail (with a BUG_ON()).
      
      It turns out that I actually complained to the mm people about the
      per-zone hash table for another reason just a month ago: the zone lookup
      also hurts the regular use of "unlock_page()" a lot, because the zone
      lookup ends up forcing several unnecessary cache misses and generates
      horrible code.
      
      As part of that earlier discussion, we had a much better solution for
      the NUMA scalability issue - by just making the page lock have a
      separate contention bit, the waitqueue doesn't even have to be looked at
      for the normal case.
      
      Peter Zijlstra already has a patch for that, but let's see if anybody
      even notices.  In the meantime, let's fix the actual gfs2 breakage by
      simplifying the bitlock waitqueues and removing the per-zone issue.
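      
      For illustration, a minimal sketch of the simplified, zone-free lookup,
      assuming a single global bit_wait_table hashed by word address and bit
      number (table size and exact hashing are illustrative):
      
      	#define WAIT_TABLE_BITS 8
      	#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
      
      	static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;
      
      	wait_queue_head_t *bit_waitqueue(void *word, int bit)
      	{
      		const int shift = BITS_PER_LONG == 32 ? 5 : 6;
      		unsigned long val = (unsigned long)word << shift | bit;
      
      		/* No zone lookup: works for stack and vmalloc'ed objects too. */
      		return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
      	}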
      Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
      Tested-by: Bob Peterson <rpeterso@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 30 September 2016 (11 commits)
    • sched/core: Fix set_user_nice() · 49bd21ef
      Committed by Peter Zijlstra
      Almost all scheduler functions update state with the following
      pattern:
      
      	if (queued)
      		dequeue_task(rq, p, DEQUEUE_SAVE);
      	if (running)
      		put_prev_task(rq, p);
      
      	/* update state */
      
      	if (queued)
      		enqueue_task(rq, p, ENQUEUE_RESTORE);
      	if (running)
      		set_curr_task(rq, p);
      
      set_user_nice() however misses the running part, cure this.
      
      This was found by asserting we never enqueue 'current'.
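      
      A sketch of how the fix slots into set_user_nice(), following the
      pattern above (surrounding code elided, so the exact context is
      illustrative):
      
      	queued = task_on_rq_queued(p);
      	running = task_current(rq, p);
      	if (queued)
      		dequeue_task(rq, p, DEQUEUE_SAVE);
      	if (running)
      		put_prev_task(rq, p);		/* this half was missing */
      
      	p->static_prio = NICE_TO_PRIO(nice);
      	set_load_weight(p);
      
      	if (queued)
      		enqueue_task(rq, p, ENQUEUE_RESTORE);
      	if (running)
      		set_curr_task(rq, p);		/* and this half too */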
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Introduce set_curr_task() helper · b2bf6c31
      Committed by Peter Zijlstra
      Now that the ia64-only set_curr_task() symbol is gone, provide a
      helper just like put_prev_task().
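      
      The helper is small enough to show in full; a sketch of the wrapper,
      mirroring put_prev_task():
      
      	static inline void set_curr_task(struct rq *rq, struct task_struct *p)
      	{
      		p->sched_class->set_curr_task(rq);
      	}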
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core, ia64: Rename set_curr_task() · a458ae2e
      Committed by Peter Zijlstra
      Rename the ia64-only set_curr_task() function to free up the name.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix incorrect utilization accounting when switching to fair class · a399d233
      Committed by Vincent Guittot
      When a task switches to the fair scheduling class, the period between now
      and the last update of its utilization is accounted as running time
      whatever happened during this period. This incorrect accounting applies
      to the task and also to the task group branch.
      
      When changing the property of a running task like its list of allowed
      CPUs or its scheduling class, we follow the sequence:
      
       - dequeue task
       - put task
       - change the property
       - set task as current task
       - enqueue task
      
      The end of this sequence doesn't follow the normal ordering (as per
      __schedule()), which is:
      
       - enqueue a task
       - then set the task as current task.
      
      This incorrect ordering is the root cause of the incorrect utilization accounting.
      Update the sequence to follow the right one:
      
       - dequeue task
       - put task
       - change the property
       - enqueue task
       - set task as current task
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: pjt@google.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1473666472-13749-8-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Optimize SCHED_SMT · 1b568f0a
      Committed by Peter Zijlstra
      Avoid pointless SCHED_SMT code when running on !SMT hardware.
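      
      One way to do this is a static key that is only enabled once a CPU with
      SMT siblings actually comes up, so the SMT-only paths patch out to a NOP
      on !SMT machines (hook placement is illustrative):
      
      	DEFINE_STATIC_KEY_FALSE(sched_smt_present);
      
      	/* CPU online path (exact hook is illustrative): */
      	if (cpumask_weight(cpu_smt_mask(cpu)) > 1)
      		static_branch_enable(&sched_smt_present);
      
      	/* SMT-specific code then vanishes on !SMT hardware: */
      	if (static_branch_likely(&sched_smt_present)) {
      		/* idle-core scanning, etc. */
      	}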
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Rewrite and improve select_idle_siblings() · 10e2f1ac
      Committed by Peter Zijlstra
      select_idle_siblings() is a known pain point for a number of
      workloads; it either does too much or not enough, and sometimes it
      gets things plain wrong.
      
      This rewrite attempts to address a number of issues (but sadly not
      all).
      
      The current code does an unconditional sched_domain iteration; with
      the intent of finding an idle core (on SMT hardware). The problems
      which this patch tries to address are:
      
       - it's pointless to look for idle cores if the machine is really busy;
         at that point you're just wasting cycles.
      
       - its behaviour is inconsistent between SMT and !SMT hardware, in
         that !SMT hardware ends up doing a scan for any idle CPU in the LLC
         domain, while SMT hardware does a scan for idle cores and, if that
         fails, falls back to a scan for idle threads on the 'target' core.
      
      The new code replaces the sched_domain scan with 3 explicit scans:
      
       1) search for an idle core in the LLC
       2) search for an idle CPU in the LLC
       3) search for an idle thread in the 'target' core
      
      where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
      heuristics to skip the step.
      
      Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
      goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
      siblings of the CPU going idle. Similarly, we clear
      sd_llc_shared->has_idle_cores when we fail to find an idle core.
      
      Step 2) tracks the average cost of the scan and compares this to the
      average idle time guesstimate for the CPU doing the wakeup. There is a
      significant fudge factor involved to deal with the variability of the
      averages. Esp. hackbench was sensitive to this.
      
      Step 3) is unconditional; we assume (also per step 1) that scanning
      all SMT siblings in a core is 'cheap'.
      
      With this; SMT systems gain step 2, which cures a few benchmarks --
      notably one from Facebook.
      
      One 'feature' of the sched_domain iteration, which we preserve in the
      new code, is that it would start scanning from the 'target' CPU,
      instead of scanning the cpumask in CPU id order. This reduces the
      chance that multiple CPUs in the LLC, all scanning for an idle CPU at
      the same time, gang up on the same one. The downside is that tasks can end up hopping across
      the LLC for no apparent reason.
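      
      The resulting shape of the wakeup path, in sketch form (helper names
      follow the description above; the helper bodies are elided):
      
      	static int select_idle_sibling(struct task_struct *p, int target)
      	{
      		struct sched_domain *sd;
      		int i;
      
      		if (idle_cpu(target))
      			return target;
      
      		sd = rcu_dereference(per_cpu(sd_llc, target));
      		if (!sd)
      			return target;
      
      		i = select_idle_core(p, sd, target);	/* 1) idle core in the LLC */
      		if ((unsigned)i < nr_cpumask_bits)
      			return i;
      
      		i = select_idle_cpu(p, sd, target);	/* 2) idle CPU in the LLC  */
      		if ((unsigned)i < nr_cpumask_bits)
      			return i;
      
      		i = select_idle_smt(p, sd, target);	/* 3) idle 'target' sibling */
      		if ((unsigned)i < nr_cpumask_bits)
      			return i;
      
      		return target;
      	}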
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared · 0e369d75
      Committed by Peter Zijlstra
      Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
      location into the much more natural sched_domain_shared location.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Introduce 'struct sched_domain_shared' · 24fc7edb
      Committed by Peter Zijlstra
      Since struct sched_domain is strictly per CPU, introduce a structure
      that is shared between all 'identical' sched_domains.
      
      Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it
      for shared cache state; if another use comes up later we can easily
      relax this.
      
      While the sched_groups are normally shared between CPUs, these are
      not natural to use when we need some shared state on a domain level --
      since that would require the domain to have a parent, which is not a
      given.
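      
      For reference, a sketch of the structure as it ends up looking once the
      follow-up patches in this series (nr_busy_cpus, has_idle_cores) land on
      top of it:
      
      	struct sched_domain_shared {
      		atomic_t	ref;		/* lifetime, this patch            */
      		atomic_t	nr_busy_cpus;	/* added by the sd_busy conversion */
      		int		has_idle_cores;	/* added by the idle-core tracking */
      	};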
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Restructure destroy_sched_domain() · 16f3ef46
      Committed by Peter Zijlstra
      There is no point in doing a call_rcu() for each domain; only do a
      callback for the root sched domain and clean up the entire set in one
      go.
      
      Also make the entire call chain be called destroy_sched_domain*() to
      remove confusion with the free_sched_domains() call, which does an
      entirely different thing.
      
      Both cpu_attach_domain() callers of destroy_sched_domain() can live
      without the call_rcu() because at those points the sched_domain hasn't
      been published yet.
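      
      A sketch of the restructured teardown, with a single RCU callback
      walking the parent chain:
      
      	static void destroy_sched_domains_rcu(struct rcu_head *rcu)
      	{
      		struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
      
      		while (sd) {
      			struct sched_domain *parent = sd->parent;
      
      			destroy_sched_domain(sd);
      			sd = parent;
      		}
      	}
      
      	static void destroy_sched_domains(struct sched_domain *sd)
      	{
      		if (sd)
      			call_rcu(&sd->rcu, destroy_sched_domains_rcu);
      	}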
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Remove unused @cpu argument from destroy_sched_domain*() · f39180ef
      Committed by Peter Zijlstra
      Small cleanup; nothing uses the @cpu argument so make it go away.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core, x86/topology: Fix NUMA in package topology bug · 8f37961c
      Committed by Tim Chen
      Current code can call set_cpu_sibling_map() and invoke sched_set_topology()
      more than once (e.g. on CPU hot plug).  When this happens after
      sched_init_smp() has been called, we lose the NUMA topology extension to
      sched_domain_topology in sched_init_numa().  This results in incorrect
      topology when the sched domain is rebuilt.
      
      This patch fixes the bug and issues a warning if we call sched_set_topology()
      after sched_init_smp().
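      
      One plausible form of that warning is a guard in set_sched_topology()
      itself (a sketch, assuming a sched_smp_initialized flag set at the end
      of sched_init_smp()):
      
      	void set_sched_topology(struct sched_domain_topology_level *tl)
      	{
      		if (WARN_ON_ONCE(sched_smp_initialized))
      			return;
      
      		sched_domain_topology = tl;
      	}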
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bp@suse.de
      Cc: jolsa@redhat.com
      Cc: rjw@rjwysocki.net
      Link: http://lkml.kernel.org/r/1474485552-141429-2-git-send-email-srinivas.pandruvada@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 22 September 2016 (5 commits)
  4. 16 September 2016 (1 commit)
    • sched/core: Free the stack early if CONFIG_THREAD_INFO_IN_TASK · 68f24b08
      Committed by Andy Lutomirski
      We currently keep every task's stack around until the task_struct
      itself is freed.  This means that we keep the stack allocation alive
      for longer than necessary and that, under load, we free stacks in
      big batches whenever RCU drops the last task reference.  Neither of
      these is good for reuse of cache-hot memory, and freeing in batches
      prevents us from usefully caching small numbers of vmalloced stacks.
      
      On architectures that have thread_info on the stack, we can't easily
      change this, but on architectures that set THREAD_INFO_IN_TASK, we
      can free it as soon as the task is dead.
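      
      A sketch of the idea, assuming a small release helper in kernel/fork.c
      (the helper names and accounting details are illustrative):
      
      	static void release_task_stack(struct task_struct *tsk)
      	{
      		account_kernel_stack(tsk, -1);	/* undo stack accounting now   */
      		free_thread_stack(tsk);		/* give the pages/vmap area back */
      		tsk->stack = NULL;		/* nothing dangles on the task */
      	}
      
      	/* Called once the task is dead, long before the task_struct itself
      	 * is freed by RCU; only safe with CONFIG_THREAD_INFO_IN_TASK. */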
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jann Horn <jann@thejh.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/08ca06cde00ebed0046c5d26cbbf3fbb7ef5b812.1474003868.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 05 September 2016 (4 commits)
    • sched/debug: Remove several CONFIG_SCHEDSTATS guards · 4fa8d299
      Committed by Josh Poimboeuf
      Clean up the sched code by removing several of the CONFIG_SCHEDSTATS
      guards, using schedstat_*() macros where needed.
      
      Code size:
      
        !CONFIG_SCHEDSTATS defconfig:
      
            text	   data	    bss	     dec	    hex	filename
        10209818	4368184	1105920	15683922	 ef5152	vmlinux.before.nostats
        10209818	4368184	1105920	15683922	 ef5152	vmlinux.after.nostats
      
        CONFIG_SCHEDSTATS defconfig:
      
            text	   data	    bss	    dec	    hex	filename
        10214210	4370040	1105920	15690170	 ef69ba	vmlinux.before.stats
        10214210	4370680	1105920	15690810	 ef6c3a	vmlinux.after.stats
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/e51e0ebe5af95ac295de720dd252e7c0d2142e4a.1466184592.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Clean up schedstat macros · ae92882e
      Committed by Josh Poimboeuf
      The schedstat_*() macros are inconsistent: most of them take a pointer
      and a field which the macro combines, whereas schedstat_set() takes the
      already combined ptr->field.
      
      The already combined ptr->field argument is actually more intuitive and
      easier to use, and there's no reason to require the user to split the
      variable up, so convert the macros to use the combined argument.
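      
      A sketch of the before/after at a call site, plus the shape the macros
      take with the combined argument (the exact expansions are illustrative):
      
      	/* before: pointer and field passed separately, combined by the macro */
      	schedstat_inc(rq, yld_count);
      
      	/* after: the caller passes the already combined expression */
      	schedstat_inc(rq->yld_count);
      
      	#ifdef CONFIG_SCHEDSTATS
      	#define schedstat_inc(var)	do { if (schedstat_enabled()) { var++; } } while (0)
      	#define schedstat_set(var, val)	do { if (schedstat_enabled()) { var = (val); } } while (0)
      	#else
      	#define schedstat_inc(var)	do { } while (0)
      	#define schedstat_set(var, val)	do { } while (0)
      	#endif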
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/54953ca25bb579f3a5946432dee409b0e05222c6.1466184592.git.jpoimboe@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Remove duplicated init_task's preempt_notifiers init · efca03ec
      Committed by seokhoon.yoon
      init_task's preempt_notifiers is initialized twice:
      
      1) sched_init()
         -> INIT_HLIST_HEAD(&init_task.preempt_notifiers)
      
      2) sched_init()
         -> init_idle(current,) <--- current task is init_task at this time
          -> __sched_fork(,current)
           -> INIT_HLIST_HEAD(&p->preempt_notifiers)
      
      I think the first one is unnecessary, so remove it.
      Signed-off-by: seokhoon.yoon <iamyooon@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1471339568-5790-1-git-send-email-iamyooon@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix a race between try_to_wake_up() and a woken up task · 135e8c92
      Committed by Balbir Singh
      The origin of the issue I've seen is a missing memory barrier
      between the check for task->state and the check for task->on_rq.
      
      The task being woken up is already awake from a schedule()
      and is doing the following:
      
      	do {
      		schedule()
      		set_current_state(TASK_(UN)INTERRUPTIBLE);
      	} while (!cond);
      
      The waker actually gets stuck doing the following in
      try_to_wake_up():
      
      	while (p->on_cpu)
      		cpu_relax();
      
      Analysis:
      
      The instance I've seen involves the following race:
      
       CPU1					CPU2
      
       while () {
         if (cond)
           break;
         do {
           schedule();
           set_current_state(TASK_UN..)
         } while (!cond);
      					wakeup_routine()
      					  spin_lock_irqsave(wait_lock)
         raw_spin_lock_irqsave(wait_lock)	  wake_up_process()
       }					  try_to_wake_up()
       set_current_state(TASK_RUNNING);	  ..
       list_del(&waiter.list);
      
      CPU2 wakes up CPU1, but before it can get the wait_lock and set
      current state to TASK_RUNNING the following occurs:
      
       CPU3
       wakeup_routine()
       raw_spin_lock_irqsave(wait_lock)
       if (!list_empty)
         wake_up_process()
         try_to_wake_up()
         raw_spin_lock_irqsave(p->pi_lock)
         ..
         if (p->on_rq && ttwu_wakeup())
         ..
         while (p->on_cpu)
           cpu_relax()
         ..
      
      CPU3 tries to wake up the task on CPU1 again, since it finds it on
      the wait_queue; CPU1 is spinning on wait_lock, but immediately after
      CPU2, CPU3 got it.
      
      CPU3 checks the state of p on CPU1: it is TASK_UNINTERRUPTIBLE and
      the task is spinning on the wait_lock. Interestingly, since p->on_rq
      is checked under pi_lock, I've noticed that try_to_wake_up() finds
      p->on_rq to be 0. This was the most confusing bit of the analysis,
      but p->on_rq is changed under the runqueue lock, rq_lock, so the
      p->on_rq check is not reliable without this fix IMHO. The race is
      visible (based on the analysis) only when ttwu_queue() does a remote
      wakeup via ttwu_queue_remote(), in which case the p->on_rq change is
      not done under the pi_lock.
      
      The result is that after a while the entire system locks up on
      raw_spin_lock_irqsave(wait_lock) and the holder spins infinitely.
      
      Reproduction of the issue:
      
      The issue can be reproduced after a long run on my system with 80
      threads, with available memory tweaked very low, running the
      stress-ng mmapfork memory test. It usually takes a long time to
      reproduce. I am trying to work on a test case that can reproduce
      the issue faster, but that's work in progress. I am still testing
      the changes on my system in a loop and the tests seem OK thus far.
      
      Big thanks to Benjamin and Nick for helping debug this as well.
      Ben helped catch the missing barrier, Nick caught every missing
      bit in my theory.
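      
      The fix, in sketch form, is an explicit barrier in try_to_wake_up() so
      that p->on_rq is loaded after p->state (context abbreviated):
      
      	/* try_to_wake_up(), after the p->state check: */
      	smp_rmb();	/* order the p->on_rq load after the p->state load;
      			 * pairs with the barriers on the schedule() side   */
      	if (p->on_rq && ttwu_remote(p, wake_flags))
      		goto stat;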
      Signed-off-by: Balbir Singh <bsingharora@gmail.com>
      [ Updated comment to clarify matching barriers. Many
        architectures do not have a full barrier in switch_to()
        so that cannot be relied upon. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Nicholas Piggin <nicholas.piggin@gmail.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/e02cce7b-d9ca-1ad0-7a61-ea97c7582b37@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  6. 24 August 2016 (1 commit)
  7. 23 August 2016 (1 commit)
    • sched: Make wake_up_nohz_cpu() handle CPUs going offline · 379d9ecb
      Committed by Paul E. McKenney
      Both timers and hrtimers are maintained on the outgoing CPU until
      CPU_DEAD time, at which point they are migrated to a surviving CPU.  If a
      mod_timer() executes between CPU_DYING and CPU_DEAD time, x86 systems
      will splat in native_smp_send_reschedule() when attempting to wake up
      the just-now-offlined CPU, as shown below from a NO_HZ_FULL kernel:
      
      [ 7976.741556] WARNING: CPU: 0 PID: 661 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:125 native_smp_send_reschedule+0x39/0x40
      [ 7976.741595] Modules linked in:
      [ 7976.741595] CPU: 0 PID: 661 Comm: rcu_torture_rea Not tainted 4.7.0-rc2+ #1
      [ 7976.741595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      [ 7976.741595]  0000000000000000 ffff88000002fcc8 ffffffff8138ab2e 0000000000000000
      [ 7976.741595]  0000000000000000 ffff88000002fd08 ffffffff8105cabc 0000007d1fd0ee18
      [ 7976.741595]  0000000000000001 ffff88001fd16d40 ffff88001fd0ee00 ffff88001fd0ee00
      [ 7976.741595] Call Trace:
      [ 7976.741595]  [<ffffffff8138ab2e>] dump_stack+0x67/0x99
      [ 7976.741595]  [<ffffffff8105cabc>] __warn+0xcc/0xf0
      [ 7976.741595]  [<ffffffff8105cb98>] warn_slowpath_null+0x18/0x20
      [ 7976.741595]  [<ffffffff8103cba9>] native_smp_send_reschedule+0x39/0x40
      [ 7976.741595]  [<ffffffff81089bc2>] wake_up_nohz_cpu+0x82/0x190
      [ 7976.741595]  [<ffffffff810d275a>] internal_add_timer+0x7a/0x80
      [ 7976.741595]  [<ffffffff810d3ee7>] mod_timer+0x187/0x2b0
      [ 7976.741595]  [<ffffffff810c89dd>] rcu_torture_reader+0x33d/0x380
      [ 7976.741595]  [<ffffffff810c66f0>] ? sched_torture_read_unlock+0x30/0x30
      [ 7976.741595]  [<ffffffff810c86a0>] ? rcu_bh_torture_read_lock+0x80/0x80
      [ 7976.741595]  [<ffffffff8108068f>] kthread+0xdf/0x100
      [ 7976.741595]  [<ffffffff819dd83f>] ret_from_fork+0x1f/0x40
      [ 7976.741595]  [<ffffffff810805b0>] ? kthread_create_on_node+0x200/0x200
      
      However, in this case, the wakeup is redundant, because the timer
      migration will reprogram timer hardware as needed.  Note that the fact
      that preemption is disabled does not avoid the splat, as the offline
      operation has already passed both the synchronize_sched() and the
      stop_machine() that would be blocked by disabled preemption.
      
      This commit therefore modifies wake_up_nohz_cpu() to avoid attempting
      to wake up offline CPUs.  It also adds a comment stating that the
      caller must tolerate lost wakeups when the target CPU is going offline,
      and suggesting the CPU_DEAD notifier as a recovery mechanism.
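      
      In sketch form, the wakeup path simply bails out for offline CPUs (the
      exact placement within the nohz helpers is illustrative):
      
      	static bool wake_up_full_nohz_cpu(int cpu)
      	{
      		/*
      		 * Callers must tolerate a lost wakeup here; the CPU_DEAD
      		 * notifier migrates the timers and reprograms the hardware.
      		 */
      		if (cpu_is_offline(cpu))
      			return true;
      
      		if (tick_nohz_full_cpu(cpu)) {
      			if (cpu != smp_processor_id() || tick_nohz_tick_stopped())
      				tick_nohz_full_kick_cpu(cpu);
      			return true;
      		}
      
      		return false;
      	}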
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
  8. 18 August 2016 (6 commits)
  9. 10 August 2016 (5 commits)
    • sched/debug: Add taint on "BUG: Sleeping function called from invalid context" · f0b22e39
      Committed by Vegard Nossum
      Seeing this, it occurs to me that we should probably add a taint here:
      
          BUG: sleeping function called from invalid context at mm/slab.h:388
          in_atomic(): 0, irqs_disabled(): 0, pid: 32211, name: trinity-c3
          Preemption disabled at:[<ffffffff811aaa37>] console_unlock+0x2f7/0x930
      
          CPU: 3 PID: 32211 Comm: trinity-c3 Not tainted 4.7.0-rc7+ #19
                                             ^^^^^^^^^^^
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
           0000000000000000 ffff8800b8a17160 ffffffff81971441 ffff88011a3c4c80
           ffff88011a3c4c80 ffff8800b8a17198 ffffffff81158067 0000000000000de6
           ffff88011a3c4c80 ffffffff8390e07c 0000000000000184 0000000000000000
          Call Trace:
          [...]
      
          BUG: sleeping function called from invalid context at arch/x86/mm/fault.c:1309
          in_atomic(): 0, irqs_disabled(): 0, pid: 32211, name: trinity-c3
          Preemption disabled at:[<ffffffff8119db33>] down_trylock+0x13/0x80
      
          CPU: 3 PID: 32211 Comm: trinity-c3 Not tainted 4.7.0-rc7+ #19
                                             ^^^^^^^^^^^
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
           0000000000000000 ffff8800b8a17e08 ffffffff81971441 ffff88011a3c4c80
           ffff88011a3c4c80 ffff8800b8a17e40 ffffffff81158067 0000000000000000
           ffff88011a3c4c80 ffffffff83437b20 000000000000051d 0000000000000000
          Call Trace:
          [...]
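      
      The change itself is essentially a one-liner in ___might_sleep(), in
      sketch form:
      
      	/* after dumping the "BUG: sleeping function called ..." report: */
      	add_taint(TAINT_WARN, LOCKDEP_STILL_OK);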
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russel <rusty@rustcorp.com.au>
      Link: http://lkml.kernel.org/r/1469216762-19626-1-git-send-email-vegard.nossum@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Make the "Preemption disabled at ..." message more useful · d1c6d149
      Committed by Vegard Nossum
      This message is currently really useless since it always prints a value
      that comes from the printk() we just did, e.g.:
      
          BUG: sleeping function called from invalid context at mm/slab.h:388
          in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
          Preemption disabled at:[<ffffffff8119db33>] down_trylock+0x13/0x80
      
          BUG: sleeping function called from invalid context at include/linux/freezer.h:56
          in_atomic(): 0, irqs_disabled(): 0, pid: 31996, name: trinity-c1
          Preemption disabled at:[<ffffffff811aaa37>] console_unlock+0x2f7/0x930
      
      Here, both down_trylock() and console_unlock() are somewhere in the
      printk() path.
      
      We should save the value before calling printk() and use the saved value
      instead. That immediately reveals the offending callsite:
      
          BUG: sleeping function called from invalid context at mm/slab.h:388
          in_atomic(): 0, irqs_disabled(): 0, pid: 14971, name: trinity-c2
          Preemption disabled at:[<ffffffff819bcd46>] rhashtable_walk_start+0x46/0x150
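      
      In sketch form, ___might_sleep() snapshots the value before any
      printing (the helper name is illustrative):
      
      	unsigned long preempt_disable_ip;
      
      	/* Save this before calling printk(), which can itself disable and
      	 * re-enable preemption and so overwrite the recorded caller. */
      	preempt_disable_ip = get_preempt_disable_ip(current);
      
      	/* ... the existing "BUG: sleeping function called ..." printks ... */
      
      	pr_err("Preemption disabled at:");
      	print_ip_sym(preempt_disable_ip);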
      
      Bug report:
      
        http://marc.info/?l=linux-netdev&m=146925979821849&w=2
      Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russel <rusty@rustcorp.com.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Add documentation for 'cookie' argument · 9279e0d2
      Committed by Luis de Bethencourt
      Add documentation for the cookie argument in try_to_wake_up_local().
      
      This caused the following warning when building documentation:
      
        kernel/sched/core.c:2088: warning: No description found for parameter 'cookie'
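      
      The added kerneldoc line, shown in context (the wording of the
      surrounding description is illustrative):
      
      	/**
      	 * try_to_wake_up_local - try to wake up a local task with rq lock held
      	 * @p: the thread to be awakened
      	 * @cookie: context's cookie for pinning
      	 *
      	 * Put @p on the run-queue if it's not already there. The caller must
      	 * ensure that this_rq() is locked, @p is bound to this_rq() and not
      	 * the current task.
      	 */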
      Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Fixes: e7904a28 ("locking/lockdep, sched/core: Implement a better lock pinning scheme")
      Link: http://lkml.kernel.org/r/1468159226-17674-1-git-send-email-luisbg@osg.samsung.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix one typo · a1fd4656
      Committed by Leo Yan
      Fix one minor typo in the comment: s/targer/target/.
      Signed-off-by: Leo Yan <leo.yan@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1470378758-15066-1-git-send-email-leo.yan@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/cputime: Mitigate performance regression in times()/clock_gettime() · 6075620b
      Committed by Giovanni Gherdovich
      Commit:
      
        6e998916 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")
      
      fixed a problem whereby clock_nanosleep() followed by clock_gettime() could
      allow a task to wake early. It addressed the problem by calling the scheduling
      classes update_curr() when the cputimer starts.
      
      Said change induced a considerable performance regression on the syscalls
      times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some
      debuggers and applications that monitor their own performance that
      accidentally depend on the performance of these specific calls.
      
      This patch mitigates the performance loss by prefetching data in the CPU
      cache, as stalls due to cache misses appear to be where most time is spent
      in our benchmarks.
      
      Here are the performance gains of this patch over v4.7-rc7 on a Sandy Bridge
      box with 32 logical cores and 2 NUMA nodes. The test is repeated with a
      variable number of threads, from 2 to 4*num_cpus; the results are in
      seconds and correspond to the average of 10 runs; the percentage gain is
      computed with (before-after)/before so a positive value is an improvement
      (it's faster). The improvement varies between a few percent for 5-20
      threads and more than 10% for 2 or >20 threads.
      
      pound_clock_gettime:
      
          threads       4.7-rc7     patched 4.7-rc7
          [num]         [secs]      [secs (percent)]
            2           3.48        3.06 ( 11.83%)
            5           3.33        3.25 (  2.40%)
            8           3.37        3.26 (  3.30%)
           12           3.32        3.37 ( -1.60%)
           21           4.01        3.90 (  2.74%)
           30           3.63        3.36 (  7.41%)
           48           3.71        3.11 ( 16.27%)
           79           3.75        3.16 ( 15.74%)
          110           3.81        3.25 ( 14.80%)
          128           3.88        3.31 ( 14.76%)
      
      pound_times:
      
          threads       4.7-rc7     patched 4.7-rc7
          [num]         [secs]      [secs (percent)]
            2           3.65        3.25 ( 11.03%)
            5           3.45        3.17 (  7.92%)
            8           3.52        3.22 (  8.69%)
           12           3.29        3.36 ( -2.04%)
           21           4.07        3.92 (  3.78%)
           30           3.87        3.40 ( 12.17%)
           48           3.79        3.16 ( 16.61%)
           79           3.88        3.28 ( 15.42%)
          110           3.90        3.38 ( 13.35%)
          128           4.00        3.38 ( 15.45%)
      
      pound_clock_gettime and pound_times are two benchmarks included in
      the MMTests framework. They launch a given number of threads which
      repeatedly call times() or clock_gettime(). The results above can be
      reproduced by cloning MMTests from github.com and running the "poundtime"
      workload:
      
        $ git clone https://github.com/gormanm/mmtests.git
        $ cd mmtests
        $ cp configs/config-global-dhp__workload_poundtime config
        $ ./run-mmtests.sh --run-monitor $(uname -r)
      
      The above will run "poundtime" measuring the kernel currently running on
      the machine; once a new kernel is installed and the machine rebooted,
      running again
      
        $ cd mmtests
        $ ./run-mmtests.sh --run-monitor $(uname -r)
      
      will produce results to compare with. A comparison table will be output
      with:
      
        $ cd mmtests/work/log
        $ ../../compare-kernels.sh
      
      the table will contain a lot of entries; grepping for "Amean" (as in
      "arithmetic mean") will give the tables presented above. The source code
      for the two benchmarks is reported at the end of this changelog for
      clarity.
      
      The cache misses addressed by this patch were found using a combination of
      `perf top`, `perf record` and `perf annotate`. The incriminated lines were
      found to be
      
          struct sched_entity *curr = cfs_rq->curr;
      
      and
      
          delta_exec = now - curr->exec_start;
      
      in the function update_curr() from kernel/sched/fair.c. This patch
      prefetches the data from memory just before update_curr is called in the
      interested execution path.
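      
      The patch boils down to a small helper used right before update_curr()
      runs on those paths; a sketch:
      
      	static inline void prefetch_curr_exec_start(struct task_struct *p)
      	{
      	#ifdef CONFIG_FAIR_GROUP_SCHED
      		struct sched_entity *curr = (&p->se)->cfs_rq->curr;
      	#else
      		struct sched_entity *curr = (&task_rq(p)->cfs)->curr;
      	#endif
      		prefetch(curr);			/* cfs_rq->curr itself          */
      		prefetch(&curr->exec_start);	/* field read by update_curr()  */
      	}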
      
      A comparison of the total number of cycles before and after the patch
      follows; the data is obtained using `perf stat -r 10 -ddd <program>`
      running over the same sequence of number of threads used above (a positive
      gain is an improvement):
      
        threads   cycles before                 cycles after                gain
      
          2      19,699,563,964  +-1.19%      17,358,917,517  +-1.85%      11.88%
          5      47,401,089,566  +-2.96%      45,103,730,829  +-0.97%       4.85%
          8      80,923,501,004  +-3.01%      71,419,385,977  +-0.77%      11.74%
         12     112,326,485,473  +-0.47%     110,371,524,403  +-0.47%       1.74%
         21     193,455,574,299  +-0.72%     180,120,667,904  +-0.36%       6.89%
         30     315,073,519,013  +-1.64%     271,222,225,950  +-1.29%      13.92%
         48     321,969,515,332  +-1.48%     273,353,977,321  +-1.16%      15.10%
         79     337,866,003,422  +-0.97%     289,462,481,538  +-1.05%      14.33%
        110     338,712,691,920  +-0.78%     290,574,233,170  +-0.77%      14.21%
        128     348,384,794,006  +-0.50%     292,691,648,206  +-0.66%      15.99%
      
      A comparison of cache miss vs total cache loads ratios, before and after
      the patch (again from the `perf stat -r 10 -ddd <program>` tables):
      
        threads   L1 misses/total*100     L1 misses/total*100            gain
      		         before                   after
            2           7.43  +-4.90%           7.36  +-4.70%           0.94%
            5          13.09  +-4.74%          13.52  +-3.73%          -3.28%
            8          13.79  +-5.61%          12.90  +-3.27%           6.45%
           12          11.57  +-2.44%           8.71  +-1.40%          24.72%
           21          12.39  +-3.92%           9.97  +-1.84%          19.53%
           30          13.91  +-2.53%          11.73  +-2.28%          15.67%
           48          13.71  +-1.59%          12.32  +-1.97%          10.14%
           79          14.44  +-0.66%          13.40  +-1.06%           7.20%
          110          15.86  +-0.50%          14.46  +-0.59%           8.83%
          128          16.51  +-0.32%          15.06  +-0.78%           8.78%
      
      As a final note, the following shows the evolution of performance figures
      in the "poundtime" benchmark and pinpoints commit 6e998916
      ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a
      major source of degradation, mostly unaddressed to this day (figures
      expressed in seconds).
      
      pound_clock_gettime:
      
        threads    parent of 6e998916    6e998916 itself         4.7-rc7
          2        2.23          3.68 ( -64.56%)        3.48 (-55.48%)
          5        2.83          3.78 ( -33.42%)        3.33 (-17.43%)
          8        2.84          4.31 ( -52.12%)        3.37 (-18.76%)
          12       3.09          3.61 ( -16.74%)        3.32 ( -7.17%)
          21       3.14          4.63 ( -47.36%)        4.01 (-27.71%)
          30       3.28          5.75 ( -75.37%)        3.63 (-10.80%)
          48       3.02          6.05 (-100.56%)        3.71 (-22.99%)
          79       2.88          6.30 (-118.90%)        3.75 (-30.26%)
          110      2.95          6.46 (-119.00%)        3.81 (-29.24%)
          128      3.05          6.42 (-110.08%)        3.88 (-27.04%)
      
      pound_times:
      
        threads    parent of 6e998916    6e998916 itself         4.7-rc7
          2        2.27          3.73 ( -64.71%)        3.65 (-61.14%)
          5        2.78          3.77 ( -35.56%)        3.45 (-23.98%)
          8        2.79          4.41 ( -57.71%)        3.52 (-26.05%)
          12       3.02          3.56 ( -17.94%)        3.29 ( -9.08%)
          21       3.10          4.61 ( -48.74%)        4.07 (-31.34%)
          30       3.33          5.75 ( -72.53%)        3.87 (-16.01%)
          48       2.96          6.06 (-105.04%)        3.79 (-28.10%)
          79       2.88          6.24 (-116.83%)        3.88 (-34.81%)
          110      2.98          6.37 (-114.08%)        3.90 (-31.12%)
          128      3.10          6.35 (-104.61%)        4.00 (-28.87%)
      
      The source code of the two benchmarks follows. To compile the two:
      
        NR_THREADS=42
        for FILE in pound_times pound_clock_gettime; do
            gcc -lrt -O2 -lpthread -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE
        done
      
      ==== BEGIN pound_times.c ====
      
      #include <stdio.h>
      #include <pthread.h>
      #include <sys/times.h>
      
      struct tms start;
      
      void *pound (void *threadid)
      {
        struct tms end;
        int oldutime = 0;
        int utime;
        int i;
        for (i = 0; i < 5000000 / NUM_THREADS; i++) {
                times(&end);
                utime = ((int)end.tms_utime - (int)start.tms_utime);
                if (oldutime > utime) {
                  printf("utime decreased, was %d, now %d!\n", oldutime, utime);
                }
                oldutime = utime;
        }
        pthread_exit(NULL);
      }
      
      int main()
      {
        pthread_t th[NUM_THREADS];
        long i;
        times(&start);
        for (i = 0; i < NUM_THREADS; i++) {
          pthread_create (&th[i], NULL, pound, (void *)i);
        }
        pthread_exit(NULL);
        return 0;
      }
      ==== END pound_times.c ====
      
      ==== BEGIN pound_clock_gettime.c ====
      
      #include <stdio.h>
      #include <pthread.h>
      #include <time.h>
      #include <sys/types.h>
      
      void *pound (void *threadid)
      {
      	struct timespec ts;
      	int rc, i;
      	unsigned long prev = 0, this = 0;
      
      	for (i = 0; i < 5000000 / NUM_THREADS; i++) {
      		rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
      		if (rc < 0)
      			perror("clock_gettime");
      		this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
      		if (0 && this < prev)
      			printf("%lu ns timewarp at iteration %d\n", prev - this, i);
      		prev = this;
      	}
      	pthread_exit(NULL);
      }
      
      int main()
      {
      	pthread_t th[NUM_THREADS];
      	long rc, i;
      	pid_t pgid;
      
      	for (i = 0; i < NUM_THREADS; i++) {
      		rc = pthread_create(&th[i], NULL, pound, (void *)i);
      		if (rc < 0)
      			perror("pthread_create");
      	}
      
      	pthread_exit(NULL);
      	return 0;
      }
      ==== END pound_clock_gettime.c ====
      Suggested-by: Mike Galbraith <mgalbraith@suse.de>
      Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 13 July 2016 (1 commit)
  11. 11 July 2016 (1 commit)
  12. 27 June 2016 (3 commits)
    • sched/core: Fix sched_getaffinity() return value kerneldoc comment · 599b4840
      Committed by Zev Weiss
      The previous version was probably written referencing the man page for
      glibc's wrapper, but the wrapper's behavior differs from that of the
      syscall itself in this case.
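      
      The corrected comment, in sketch form, documents the syscall's actual
      return value (surrounding kerneldoc abbreviated):
      
      	/**
      	 * sys_sched_getaffinity - get the CPU affinity of a process
      	 * @pid: pid of the process
      	 * @len: length in bytes of the bitmask pointed to by user_mask_ptr
      	 * @user_mask_ptr: user-space pointer to hold the current CPU mask
      	 *
      	 * Return: size of CPU mask copied to user_mask_ptr on success. An
      	 * error code otherwise.
      	 */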
      Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1466975603-25408-1-git-send-email-zev@bewilderbeest.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Reorder cgroup creation code · 8663e24d
      Committed by Peter Zijlstra
      A future patch needs rq->lock held _after_ we link the task_group into
      the hierarchy. In order to avoid taking every rq->lock twice, reorder
      things a little and create online_fair_sched_group() to be called
      after we link the task_group.
      
      All this code is still run from css_alloc(), so css_online() isn't in
      fact used for this.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix PELT integrity for new tasks · 7dc603c9
      Committed by Peter Zijlstra
      Vincent and Yuyang found another few scenarios in which entity
      tracking goes wobbly.
      
      The scenarios are basically due to the fact that new tasks are not
      immediately attached and thereby differ from the normal situation -- a
      task is always attached to a cfs_rq load average (such that it
      includes its blocked contribution) and is explicitly
      detached/attached on migration to another cfs_rq.
      
      Scenario 1: switch to fair class
      
        p->sched_class = fair_class;
        if (queued)
          enqueue_task(p);
            ...
              enqueue_entity()
      	  enqueue_entity_load_avg()
      	    migrated = !sa->last_update_time (true)
      	    if (migrated)
      	      attach_entity_load_avg()
        check_class_changed()
          switched_from() (!fair)
          switched_to()   (fair)
            switched_to_fair()
              attach_entity_load_avg()
      
      If @p is a new task that hasn't been fair before, it will have
      !last_update_time and, per the above, end up in
      attach_entity_load_avg() _twice_.
      
      Scenario 2: change between cgroups
      
        sched_move_group(p)
          if (queued)
            dequeue_task()
          task_move_group_fair()
            detach_task_cfs_rq()
              detach_entity_load_avg()
            set_task_rq()
            attach_task_cfs_rq()
              attach_entity_load_avg()
          if (queued)
            enqueue_task();
              ...
                enqueue_entity()
      	    enqueue_entity_load_avg()
      	      migrated = !sa->last_update_time (true)
      	      if (migrated)
      	        attach_entity_load_avg()
      
      As with scenario 1, if @p is a new task, it will have
      !last_update_time and we'll end up in attach_entity_load_avg()
      _twice_.
      
      Furthermore, notice how we do a detach_entity_load_avg() on something
      that wasn't attached to begin with.
      
      As stated above, the problem is that the new task isn't yet attached
      to the load tracking and thereby violates the invariant assumption.
      
      This patch remedies this by ensuring a new task is indeed properly
      attached to the load tracking on creation, through
      post_init_entity_util_avg().
      
      Of course, this isn't entirely as straightforward as one might think,
      since the task is hashed before we call wake_up_new_task() and thus
      can be poked at. We avoid this by adding TASK_NEW and teaching
      cpu_cgroup_can_attach() to refuse such tasks.
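      
      A sketch of the TASK_NEW handling (the state value and surrounding code
      are illustrative):
      
      	/* include/linux/sched.h */
      	#define TASK_NEW		2048
      
      	/* kernel/sched/core.c */
      	static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
      	{
      		struct task_struct *task;
      		struct cgroup_subsys_state *css;
      		int ret = 0;
      
      		cgroup_taskset_for_each(task, css, tset) {
      			/*
      			 * A freshly forked task isn't attached to the load
      			 * tracking yet; refuse to move it between groups until
      			 * wake_up_new_task() has made it TASK_RUNNING.
      			 */
      			if (READ_ONCE(task->state) == TASK_NEW)
      				ret = -EINVAL;
      		}
      		return ret;
      	}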
      Reported-by: Yuyang Du <yuyang.du@intel.com>
      Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>