1. 11 December 2016, 2 commits
    • sched/core: Use load_avg for selecting idlest group · 6b94780e
      Committed by Vincent Guittot
      find_idlest_group() only compares the runnable_load_avg when looking
      for the least loaded group. But on fork-intensive use cases like
      hackbench, where tasks block quickly after the fork, this can lead to
      selecting the same CPU instead of other CPUs, which have similar
      runnable load but a lower load_avg.
      
      When the runnable_load_avg values of 2 CPUs are close, we now take
      into account the amount of blocked load as a 2nd selection factor.
      There are now 3 zones for the runnable_load of the rq:
      
       - [0 .. (runnable_load - imbalance)]:
      	Select the new rq which has significantly less runnable_load
      
       - [(runnable_load - imbalance) .. (runnable_load + imbalance)]:
      	The runnable loads are close so we use load_avg to chose
      	between the 2 rq
      
       - [(runnable_load + imbalance) .. ULONG_MAX]:
      	Keep the current rq which has significantly less runnable_load
      
      The scale factor that is currently used for comparing runnable_load
      doesn't work well with small values. As an example, the use of a
      scaling factor fails as soon as this_runnable_load == 0, because we
      then always select the local rq even if min_runnable_load is only 1,
      which doesn't really make sense because they are practically the same.
      So instead of a scaling factor, we use an absolute margin for
      runnable_load to detect CPUs with similar runnable_load, and we keep
      using a scaling factor for blocked load, as sketched below.
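
      A minimal sketch of the three-zone comparison described above (the
      function and parameter names here are illustrative assumptions, not
      the exact variables used in find_idlest_group()):

      	/*
      	 * Decide between the current (local) group and the least loaded
      	 * remote group, given an absolute margin for runnable load and a
      	 * relative scale factor (e.g. 125 == 25%) for blocked load.
      	 */
      	static int prefer_new_group(unsigned long this_runnable, unsigned long this_avg,
      				    unsigned long min_runnable, unsigned long min_avg,
      				    unsigned long imbalance, unsigned long imbalance_scale)
      	{
      		/* Zone 1: the new group has significantly less runnable load. */
      		if (min_runnable + imbalance < this_runnable)
      			return 1;

      		/* Zone 3: the current group has significantly less runnable load. */
      		if (this_runnable + imbalance < min_runnable)
      			return 0;

      		/*
      		 * Zone 2: the runnable loads are close, so use the blocked
      		 * load (load_avg) with a relative scale factor as the
      		 * tie-breaker: prefer the new group only if its blocked
      		 * load is clearly lower.
      		 */
      		return min_avg * imbalance_scale < this_avg * 100;
      	}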
      
      For use cases like hackbench, this enables the scheduler to select
      different CPUs during the fork sequence and to spread tasks across
      the system.
      
      Tests have been done on a Hikey board (ARM-based octo core) for
      several kernels. The results below give the min, max, avg and stdev
      values of 18 runs for each configuration.
      
      The patches depend on the "no missing update_rq_clock()" work.
      
      hackbench -P -g 1
      
               ea86cb4b      7dc603c9      v4.8        v4.8+patches
        min    0.049         0.050         0.051       0.048
        avg    0.057         0.057(0%)     0.057(0%)   0.055(+5%)
        max    0.066         0.068         0.070       0.063
        stdev  +/-9%         +/-9%         +/-8%       +/-9%
      
      More performance numbers here:
      
        https://lkml.kernel.org/r/20161203214707.GI20785@codeblueprint.co.uk
      Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: kernellwp@gmail.com
      Cc: umgwanakikbuti@gmail.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1481216215-24651-3-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6b94780e
    • sched/core: Fix find_idlest_group() for fork · f519a3f1
      Committed by Vincent Guittot
      During fork, the utilization of a task is initialized only once the rq
      has been selected, because the current utilization level of the rq is
      used to set the utilization of the forked task. As the task's
      utilization is still 0 at this step of the fork sequence, it doesn't
      make sense to look for spare capacity that can fit the task's
      utilization.
      Furthermore, I can see perf regressions for the test:
      
         hackbench -P -g 1
      
      because the least-loaded policy is always bypassed and tasks are not
      spread during fork.
      
      With this patch and the fix below, we are back to the same performance
      as v4.8. The fix below is only a temporary one, used for the test
      until a smarter solution is found, because we can't simply remove the
      check, which is useful for other benchmarks:
      
      | @@ -5708,13 +5708,6 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int t
      |
      |	avg_cost = this_sd->avg_scan_cost;
      |
      | -	/*
      | -	 * Due to large variance we need a large fuzz factor; hackbench in
      | -	 * particularly is sensitive here.
      | -	 */
      | -	if ((avg_idle / 512) < avg_cost)
      | -		return -1;
      | -
      |	time = local_clock();
      |
      |	for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
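      
      Separately from that temporary hack, a rough sketch of the guard this
      patch's description implies for the fork path (illustrative only: the
      identifiers are assumed from that era's find_idlest_group()/fair.c and
      this is not the literal diff):
      
      	/*
      	 * A freshly forked task still has task_util(p) == 0, so a
      	 * spare-capacity shortcut based on it would always win and bypass
      	 * the least-loaded selection; only take it when the task actually
      	 * has utilization to place.
      	 */
      	if (task_util(p) && most_spare > task_util(p) / 2)
      		return most_spare_sg;
      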
      Tested-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: kernellwp@gmail.com
      Cc: umgwanakikbuti@gmail.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1481216215-24651-2-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f519a3f1
  2. 24 November 2016, 2 commits
  3. 23 November 2016, 2 commits
  4. 22 November 2016, 2 commits
    • sched/autogroup: Do not use autogroup->tg in zombie threads · 8e5bfa8c
      Committed by Oleg Nesterov
      This is exactly because for_each_thread() in autogroup_move_group() can't
      see a zombie thread and update its ->sched_task_group before the _put()
      and possible free().
      
      So the exiting task needs another sched_move_task() before exit_notify()
      and we need to re-introduce the PF_EXITING (or similar) check removed by
      the previous change for another reason.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: hartsjc@redhat.com
      Cc: vbendel@redhat.com
      Cc: vlovejoy@redhat.com
      Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8e5bfa8c
    • sched/autogroup: Fix autogroup_move_group() to never skip sched_move_task() · 18f649ef
      Committed by Oleg Nesterov
      The PF_EXITING check in task_wants_autogroup() is no longer needed. Remove
      it, but see the next patch.
      
      However, the comment is correct in that autogroup_move_group() must
      always change task_group() for every thread, so the sysctl_ check is
      very wrong; we can race with cgroups, and even sys_setsid() is not safe
      because a task running with task_group() == ag->tg must participate in
      the refcounting:
      
      	#include <assert.h>
      	#include <fcntl.h>
      	#include <signal.h>
      	#include <sys/wait.h>
      	#include <unistd.h>

      	int main(void)
      	{
      		int sctl = open("/proc/sys/kernel/sched_autogroup_enabled", O_WRONLY);
      
      		assert(sctl > 0);
      		if (fork()) {
      			wait(NULL); // destroy the child's ag/tg
      			pause();
      		}
      
      		assert(pwrite(sctl, "1\n", 2, 0) == 2);
      		assert(setsid() > 0);
      		if (fork())
      			pause();
      
      		kill(getppid(), SIGKILL);
      		sleep(1);
      
      		// The child has gone, the grandchild runs with kref == 1
      		assert(pwrite(sctl, "0\n", 2, 0) == 2);
      		assert(setsid() > 0);
      
      		// runs with the freed ag/tg
      		for (;;)
      			sleep(1);
      
      		return 0;
      	}
      
      crashes the kernel. It doesn't really need the sleep(1); it doesn't
      matter whether autogroup_move_group() actually frees the task_group or
      this happens later.
      Reported-by: Vern Lovejoy <vlovejoy@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: hartsjc@redhat.com
      Cc: vbendel@redhat.com
      Link: http://lkml.kernel.org/r/20161114184609.GA15965@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      18f649ef
  5. 16 November 2016, 12 commits
  6. 15 November 2016, 3 commits
  7. 11 November 2016, 1 commit
  8. 03 November 2016, 2 commits
  9. 28 October 2016, 1 commit
    • mm: remove per-zone hashtable of bitlock waitqueues · 9dcb8b68
      Committed by Linus Torvalds
      The per-zone waitqueues exist because of a scalability issue with the
      page waitqueues on some NUMA machines, but it turns out that they hurt
      normal loads, and now with the vmalloced stacks they also end up
      breaking gfs2 that uses a bit_wait on a stack object:
      
           wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
      
      where 'gh' can be a reference to the local variable 'mount_gh' on the
      stack of fill_super().
      
      The reason the per-zone hash table breaks for this case is that there is
      no "zone" for virtual allocations, and trying to look up the physical
      page to get at it will fail (with a BUG_ON()).
      
      It turns out that I actually complained to the mm people about the
      per-zone hash table for another reason just a month ago: the zone lookup
      also hurts the regular use of "unlock_page()" a lot, because the zone
      lookup ends up forcing several unnecessary cache misses and generates
      horrible code.
      
      As part of that earlier discussion, we had a much better solution for
      the NUMA scalability issue - by just making the page lock have a
      separate contention bit, the waitqueue doesn't even have to be looked at
      for the normal case.
      
      Peter Zijlstra already has a patch for that, but let's see if anybody
      even notices.  In the meantime, let's fix the actual gfs2 breakage by
      simplifying the bitlock waitqueues and removing the per-zone issue.
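      
      For reference, roughly the shape of the simplification: a single,
      statically sized hashed table of waitqueues keyed by the address of the
      word containing the bit, so no struct page/zone lookup is needed and
      objects on a vmalloc'ed stack work too. This is only a sketch; the
      table size, shift and exact hashing here are assumptions, not the
      committed code.
      
      	#include <linux/hash.h>
      	#include <linux/wait.h>
      
      	#define BIT_WAIT_TABLE_BITS	8
      	#define BIT_WAIT_TABLE_SIZE	(1 << BIT_WAIT_TABLE_BITS)
      
      	static wait_queue_head_t bit_wait_table[BIT_WAIT_TABLE_SIZE];
      
      	/* Hash the word address plus the bit number into one global table. */
      	static wait_queue_head_t *sketch_bit_waitqueue(void *word, int bit)
      	{
      		unsigned long val = (unsigned long)word << 6 | bit; /* shift leaves room for the bit index */
      
      		return bit_wait_table + hash_long(val, BIT_WAIT_TABLE_BITS);
      	}
      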
      Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
      Tested-by: Bob Peterson <rpeterso@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9dcb8b68
  10. 27 October 2016, 1 commit
  11. 25 October 2016, 2 commits
    • locking/mutex: Rework mutex::owner · 3ca0ff57
      Committed by Peter Zijlstra
      The current mutex implementation has an atomic lock word and a
      non-atomic owner field.
      
      This disparity leads to a number of issues with the current mutex code
      as it means that we can have a locked mutex without an explicit owner
      (because the owner field has not been set, or already cleared).
      
      This leads to a number of weird corner cases, esp. between the
      optimistic spinning and debug code. Where the optimistic spinning
      code needs the owner field updated inside the lock region, the debug
      code is more relaxed because the whole lock is serialized by the
      wait_lock.
      
      Also, the spinning code itself has a few corner cases where we need to
      deal with a held lock without an owner field.
      
      Furthermore, it becomes even more of a problem when trying to fix
      starvation cases in the current code. We end up stacking special case
      on special case.
      
      To solve this, rework the basic mutex implementation to be a single
      atomic word that contains the owner and uses the low bits for extra
      state.
      
      This matches how PI futexes and rt_mutex already work. By making the
      owner an integral part of the lock state, a lot of the problems
      disappear and we get a better option for dealing with starvation cases:
      direct owner handoff.
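      
      A minimal sketch of that idea, assuming an owner word of type
      atomic_long_t holding a task_struct pointer with state packed into its
      low bits (the flag names/values and helper names here are illustrative,
      not necessarily the exact kernel/locking/mutex.c definitions):
      
      	#define MUTEX_FLAG_WAITERS	0x01	/* waiter list is non-empty */
      	#define MUTEX_FLAG_HANDOFF	0x02	/* unlock must hand the lock to a waiter */
      	#define MUTEX_FLAGS		0x03
      
      	/* The owner is the pointer with the flag bits masked off. */
      	static inline struct task_struct *sketch_mutex_owner(struct mutex *lock)
      	{
      		return (struct task_struct *)
      			(atomic_long_read(&lock->owner) & ~MUTEX_FLAGS);
      	}
      
      	/* Fast path: acquire by cmpxchg-ing 0 -> current into the owner word. */
      	static inline bool sketch_mutex_trylock_fast(struct mutex *lock)
      	{
      		unsigned long curr = (unsigned long)current;
      
      		return atomic_long_cmpxchg_acquire(&lock->owner, 0UL, curr) == 0UL;
      	}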
      
      Changing the basic mutex does, however, invalidate all the arch-specific
      mutex code; this patch leaves that unused in place, and a later patch
      will remove it.
      Tested-by: Jason Low <jason.low2@hpe.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3ca0ff57
    • sched/core: Explain sleep/wakeup in a better way · a2250238
      Committed by Peter Zijlstra
      There were a few questions about how sleep/wakeup works. Try to explain
      it in more detail.
      Requested-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a2250238
  12. 20 October 2016, 1 commit
  13. 19 October 2016, 1 commit
    • sched/fair: Fix incorrect task group ->load_avg · b5a9b340
      Committed by Vincent Guittot
      A scheduler performance regression has been reported by Joseph Salisbury,
      which he bisected back to:
      
        3d30544f ("sched/fair: Apply more PELT fixes")
      
      The regression triggers when several levels of task groups are involved
      (read: SystemD) and cpu_possible_mask != cpu_present_mask.
      
      The root cause is that a group entity's load (tg_child->se[i]->avg.load_avg)
      is initialized to scale_load_down(se->load.weight). During the creation of
      a child task group, its group entities on possible CPUs are attached to
      parent's cfs_rq (tg_parent) and their loads are added to the parent's load
      (tg_parent->load_avg) with update_tg_load_avg().
      
      But only the load on online CPUs will then be updated to reflect real load,
      whereas load on other CPUs will stay at the initial value.
      
      The result is a tg_parent->load_avg that is higher than the real load,
      the weight of group entities (tg_parent->se[i]->load.weight) on online
      CPUs is smaller than it should be, and the task group gets less running
      time than it could expect.
      
      ( This situation can be detected with /proc/sched_debug. The ".tg_load_avg"
        of the task group will be much higher than sum of ".tg_load_avg_contrib"
        of online cfs_rqs of the task group. )
      
      The load of group entities doesn't have to be initialized to anything
      other than 0, because their load will increase when an entity is
      attached.
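      
      The shape of the fix implied by the above, in
      init_entity_runnable_average() where sa = &se->avg (a sketch under
      those assumptions, not the verbatim diff):
      
      	/*
      	 * Tasks start with their full weight as load so they are seen as
      	 * heavy until PELT stabilizes; group entities start at 0 because
      	 * nothing has been attached to the task group yet.
      	 */
      	if (entity_is_task(se))
      		sa->load_avg = scale_load_down(se->load.weight);
      	/* else: leave sa->load_avg at 0 for group entities */
      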
      Reported-by: Joseph Salisbury <joseph.salisbury@canonical.com>
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@vger.kernel.org> # 4.8.x
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: joonwoop@codeaurora.org
      Fixes: 3d30544f ("sched/fair: Apply more PELT fixes")
      Link: http://lkml.kernel.org/r/1476881123-10159-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b5a9b340
  14. 11 October 2016, 2 commits
    • sched/fair: Fix sched domains NULL dereference in select_idle_sibling() · 9cfb38a7
      Committed by Wanpeng Li
      Commit:
      
        10e2f1ac ("sched/core: Rewrite and improve select_idle_siblings()")
      
      ... improved select_idle_sibling(), but also triggered a regression (crash)
      during CPU-hotplug:
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
        IP: [<ffffffffb10cd332>] select_idle_sibling+0x1c2/0x4f0
        Call Trace:
         <IRQ>
          select_task_rq_fair+0x749/0x930
          ? select_task_rq_fair+0xb4/0x930
          ? __lock_is_held+0x54/0x70
          try_to_wake_up+0x19a/0x5b0
          default_wake_function+0x12/0x20
          autoremove_wake_function+0x12/0x40
          __wake_up_common+0x55/0x90
          __wake_up+0x39/0x50
          wake_up_klogd_work_func+0x40/0x60
          irq_work_run_list+0x57/0x80
          irq_work_run+0x2c/0x30
          smp_irq_work_interrupt+0x2e/0x40
          irq_work_interrupt+0x96/0xa0
         <EOI>
          ? _raw_spin_unlock_irqrestore+0x45/0x80
          try_to_wake_up+0x4a/0x5b0
          wake_up_state+0x10/0x20
          __kthread_unpark+0x67/0x70
          kthread_unpark+0x22/0x30
          cpuhp_online_idle+0x3e/0x70
          cpu_startup_entry+0x6a/0x450
          start_secondary+0x154/0x180
      
      This can be reproduced by running the ftrace test case of kselftest;
      the test case will hot-unplug the CPU, and the CPU will attach to the
      NULL sched-domain during scheduler teardown.
      
      Step 2 of the select_idle_siblings() rewrite:
      
        | Step 2) tracks the average cost of the scan and compares this to the
        | average idle time guestimate for the CPU doing the wakeup.
      
      If the CPU doing the wakeup is the CPU that is being hot-unplugged, then
      the NULL sched domain will be dereferenced to acquire the average cost
      of the scan.
      
      This patch fixes it by failing the search for an idle CPU in the LLC
      domain if this sched domain is NULL.
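      
      A sketch of that guard at the top of select_idle_cpu() (the placement
      and the sd_llc per-CPU pointer are assumed from that era's
      kernel/sched/fair.c; this is not the verbatim patch):
      
      	struct sched_domain *this_sd;
      
      	this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
      	if (!this_sd)		/* CPU already detached from its sched domains */
      		return -1;	/* bail out of the idle-CPU scan */
      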
      Tested-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1475971443-3187-1-git-send-email-wanpeng.li@hotmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9cfb38a7
    • latent_entropy: Mark functions with __latent_entropy · 0766f788
      Committed by Emese Revfy
      The __latent_entropy gcc attribute can be used only on functions and
      variables.  If it is on a function then the plugin will instrument it for
      gathering control-flow entropy. If the attribute is on a variable then
      the plugin will initialize it with random contents.  The variable must
      be an integer, an integer array type or a structure with integer fields.
      
      These specific functions have been selected because they are init
      functions (to help gather boot-time entropy), are called at
      unpredictable times, or have variable loops, each of which provides
      some level of latent entropy.
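      
      Illustrative usage of the attribute described above (the plugin must be
      enabled and the attribute defined; the function and variable names here
      are made-up examples, not code touched by this commit):
      
      	/* Variable form: the plugin fills it with random contents at build time. */
      	static u32 example_pool[4] __latent_entropy;
      
      	/* Function form: the plugin instruments it to gather control-flow entropy. */
      	static int __init __latent_entropy example_module_init(void)
      	{
      		return 0;
      	}
      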
      Signed-off-by: Emese Revfy <re.emese@gmail.com>
      [kees: expanded commit message]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      0766f788
  15. 08 October 2016, 1 commit
  16. 30 September 2016, 5 commits