1. 13 November 2019, 1 commit
    • sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices · 502bd151
      Committed by Dave Chiluk
      commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.
      
      It has been observed, that highly-threaded, non-cpu-bound applications
      running under cpu.cfs_quota_us constraints can hit a high percentage of
      periods throttled while simultaneously not consuming the allocated
      amount of quota. This use case is typical of user-interactive non-cpu
      bound applications, such as those running in kubernetes or mesos when
      run on multiple cpu cores.
      
      This has been root caused to the cpu-local run queues being allocated per-cpu
      bandwidth slices and then not fully using those slices within the period,
      at which point the slice and quota expire. This expiration of unused
      slices results in applications not being able to utilize the quota
      they were allocated.
      
      The non-expiration of per-cpu slices was recently fixed by
      'commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift
      condition")'. Prior to that it appears that this had been broken since
      at least 'commit 51f2176d ("sched/fair: Fix unlocked reads of some
      cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
      added the following conditional which resulted in slices never being
      expired.
      
      if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
      	/* extend local deadline, drift is bounded above by 2 ticks */
      	cfs_rq->runtime_expires += TICK_NSEC;
      
      Because this was broken for nearly 5 years, and has recently been fixed
      and is now being noticed by many users running kubernetes
      (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
      that the mechanisms around expiring runtime should be removed
      altogether.
      
      This allows quota already allocated to per-cpu run-queues to live longer
      than the period boundary. This allows threads on runqueues that do not
      use much CPU to continue to use their remaining slice over a longer
      period of time than cpu.cfs_period_us. This helps prevent the above
      condition of hitting throttling while not fully utilizing the allotted
      cpu quota.
      
      This theoretically allows a machine to use slightly more than its
      allotted quota in some periods. This overflow would be bounded by the
      remaining quota left on each per-cpu runqueue. This is typically no
      more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
      change nothing, as they should theoretically fully utilize all of their
      quota in each period. For user-interactive tasks as described above this
      provides a much better user/application experience as their cpu
      utilization will more closely match the amount they requested when they
      hit throttling. This means that cpu limits no longer strictly apply per
      period for non-cpu bound applications, but that they are still accurate
      over longer timeframes.
      
      This greatly improves performance of high-thread-count, non-cpu bound
      applications with low cfs_quota_us allocation on high-core-count
      machines. In the case of an artificial testcase (10ms/100ms of quota on
      80 CPU machine), this commit resulted in almost 30x performance
      improvement, while still maintaining correct cpu quota restrictions.
      That testcase is available at https://github.com/indeedeng/fibtest.
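
      For reference, the 10ms/100ms constraint used by that testcase corresponds to
      writing 10000 into cpu.cfs_quota_us and 100000 into cpu.cfs_period_us of a cpu
      cgroup. A minimal user-space sketch follows; it assumes a cgroup v1 hierarchy
      mounted at /sys/fs/cgroup/cpu and an already-created group named "limited"
      (both are assumptions made for the example, not part of the commit):

      #include <stdio.h>
      #include <stdlib.h>

      /* Write a single value into a cgroup control file. */
      static void write_val(const char *path, const char *val)
      {
          FILE *f = fopen(path, "w");

          if (!f) {
              perror(path);
              exit(1);
          }
          fprintf(f, "%s\n", val);
          fclose(f);
      }

      int main(void)
      {
          write_val("/sys/fs/cgroup/cpu/limited/cpu.cfs_period_us", "100000");
          write_val("/sys/fs/cgroup/cpu/limited/cpu.cfs_quota_us", "10000");
          return 0;
      }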
      
      Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")
      Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: John Hammond <jhammond@indeed.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kyle Anderson <kwa@yelp.com>
      Cc: Gabriel Munos <gmunoz@netflix.com>
      Cc: Peter Oskolkov <posk@posk.io>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
  2. 06 November 2019, 1 commit
    • sched/vtime: Fix guest/system mis-accounting on task switch · 2cd003a8
      Committed by Frederic Weisbecker
      [ Upstream commit 68e7a4d66b0ce04bf18ff2ffded5596ab3618585 ]
      
      vtime_account_system() assumes that the target task to account cputime
      to is always the current task. This is most often true indeed except on
      task switch where we call:
      
      	vtime_common_task_switch(prev)
      		vtime_account_system(prev)
      
      Here, prev is the scheduling-out task to which we account the cputime. It
      does not match current, which is already the scheduling-in task at this
      stage of the context switch.
      
      So we end up checking the wrong task flags to determine if we are
      accounting guest or system time to the previous task.
      
      As a result the wrong task is used to check if the target is running in
      guest mode. We may then spuriously account or leak either system or
      guest time on task switch.
      
      Fix this assumption and also change vtime_guest_enter/exit() to use the
      task passed as a parameter as well, to avoid similar issues in the future.
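
      In sketch form, the accounting decision must look at the task passed in (prev
      on a context switch), not at current. The snippet below approximates the fixed
      vtime_account_system(); treat it as illustrative rather than the exact code:

      void vtime_account_system(struct task_struct *tsk)
      {
          struct vtime *vtime = &tsk->vtime;

          if (!vtime_delta(vtime))
              return;

          write_seqcount_begin(&vtime->seqcount);
          /* Was effectively: current->flags & PF_VCPU -- the wrong task on switch. */
          if (tsk->flags & PF_VCPU)
              vtime_account_guest(tsk, vtime);
          else
              __vtime_account_system(tsk, vtime);
          write_seqcount_end(&vtime->seqcount);
      }
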
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Fixes: 2a42eb95 ("sched/cputime: Accumulate vtime on top of nsec clocksource")
      Link: https://lkml.kernel.org/r/20190925214242.21873-1-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  3. 12 October 2019, 2 commits
    • sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr() · 46ff0e2f
      Committed by KeMeng Shi
      [ Upstream commit 714e501e16cd473538b609b3e351b2cc9f7f09ed ]
      
      An oops can be triggered in the scheduler when running qemu on arm64:
      
       Unable to handle kernel paging request at virtual address ffff000008effe40
       Internal error: Oops: 96000007 [#1] SMP
       Process migration/0 (pid: 12, stack limit = 0x00000000084e3736)
       pstate: 20000085 (nzCv daIf -PAN -UAO)
       pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
       lr : move_queued_task.isra.21+0x124/0x298
       ...
       Call trace:
        __ll_sc___cmpxchg_case_acq_4+0x4/0x20
        __migrate_task+0xc8/0xe0
        migration_cpu_stop+0x170/0x180
        cpu_stopper_thread+0xec/0x178
        smpboot_thread_fn+0x1ac/0x1e8
        kthread+0x134/0x138
        ret_from_fork+0x10/0x18
      
      __set_cpus_allowed_ptr() will choose an active dest_cpu in the affinity mask to
      migrate the process if the process is not currently running on any of the
      CPUs specified in the affinity mask. __set_cpus_allowed_ptr() will choose an
      invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if the
      CPUs in the affinity mask are deactivated by cpu_down after the
      cpumask_intersects check. The subsequent cpumask_test_cpu() check of dest_cpu
      then reads past the cpumask and may pass if the corresponding bit happens to
      be set. As a consequence, the kernel will access an invalid rq address
      associated with the invalid CPU in
      migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs.
      
      To reproduce the crash (step 1 is sketched below the list):
      
        1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
        sched_setaffinity.
      
        2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online"
        and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.
      
        3) An Oops appears if the bit for the invalid CPU happens to be set in the
        memory just past the tested cpumask.
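
      A minimal user-space sketch of repro step 1 (the CPU numbers follow the
      description above; run it alongside the online/offline script from step 2):

      #define _GNU_SOURCE
      #include <sched.h>
      #include <stdio.h>

      int main(void)
      {
          cpu_set_t set;

          /* Bounce this task's affinity between cpu0 and cpu1 forever. */
          for (;;) {
              for (int cpu = 0; cpu <= 1; cpu++) {
                  CPU_ZERO(&set);
                  CPU_SET(cpu, &set);
                  if (sched_setaffinity(0, sizeof(set), &set))
                      perror("sched_setaffinity");
              }
          }
          return 0;
      }
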
      Signed-off-by: KeMeng Shi <shikemeng@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/membarrier: Fix private expedited registration check · 6cb7aa1b
      Committed by Mathieu Desnoyers
      [ Upstream commit fc0d77387cb5ae883fd774fc559e056a8dde024c ]
      
      Fix a logic flaw in the way membarrier_register_private_expedited()
      handles ready state checks for private expedited sync core and private
      expedited registrations.
      
      If a private expedited membarrier registration is first performed, and
      then a private expedited sync_core registration is performed, the ready
      state check will skip the second registration when it really should not.
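
      A simplified sketch of the corrected ready-state check (the MEMBARRIER_* names
      follow the kernel's convention, but the snippet is illustrative, not the exact
      kernel code): the check must use the flavor-specific ready bits rather than
      only the generic private-expedited ready bit.

          int ready_state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY;

          if (flags & MEMBARRIER_FLAG_SYNC_CORE)
              ready_state = MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY;

          /*
           * Checking only the generic READY bit here would wrongly skip a later
           * sync_core registration; check the bits for the requested flavor.
           */
          if ((atomic_read(&mm->membarrier_state) & ready_state) == ready_state)
              return 0;
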
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-2-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  4. 05 October 2019, 7 commits
    • sched/cpufreq: Align trace event behavior of fast switching · 01e8f487
      Committed by Douglas RAILLARD
      [ Upstream commit 77c84dd1881d0f0176cb678d770bfbda26c54390 ]
      
      The fast switching path only emits an event for the CPU of interest, whereas the
      regular path emits an event for all the CPUs that had their frequency changed,
      i.e. all the CPUs sharing the same policy.
      
      With the current behavior, looking at the cpu_frequency event for a given CPU
      that is using the fast switching path will not give the correct frequency signal.
      Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • idle: Prevent late-arriving interrupts from disrupting offline · 51111023
      Committed by Peter Zijlstra
      [ Upstream commit e78a7614f3876ac649b3df608789cb6ef74d0480 ]
      
      Scheduling-clock interrupts can arrive late in the CPU-offline process,
      after idle entry and the subsequent call to cpuhp_report_idle_dead().
      Once execution passes the call to rcu_report_dead(), RCU is ignoring
      the CPU, which results in lockdep complaints when the interrupt handler
      uses RCU:
      
      ------------------------------------------------------------------------
      
      =============================
      WARNING: suspicious RCU usage
      5.2.0-rc1+ #681 Not tainted
      -----------------------------
      kernel/sched/fair.c:9542 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      RCU used illegally from offline CPU!
      rcu_scheduler_active = 2, debug_locks = 1
      no locks held by swapper/5/0.
      
      stack backtrace:
      CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.2.0-rc1+ #681
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
      Call Trace:
       <IRQ>
       dump_stack+0x5e/0x8b
       trigger_load_balance+0xa8/0x390
       ? tick_sched_do_timer+0x60/0x60
       update_process_times+0x3b/0x50
       tick_sched_handle+0x2f/0x40
       tick_sched_timer+0x32/0x70
       __hrtimer_run_queues+0xd3/0x3b0
       hrtimer_interrupt+0x11d/0x270
       ? sched_clock_local+0xc/0x74
       smp_apic_timer_interrupt+0x79/0x200
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      RIP: 0010:delay_tsc+0x22/0x50
      Code: ff 0f 1f 80 00 00 00 00 65 44 8b 05 18 a7 11 48 0f ae e8 0f 31 48 89 d6 48 c1 e6 20 48 09 c6 eb 0e f3 90 65 8b 05 fe a6 11 48 <41> 39 c0 75 18 0f ae e8 0f 31 48 c1 e2 20 48 09 c2 48 89 d0 48 29
      RSP: 0000:ffff8f92c0157ed0 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff13
      RAX: 0000000000000005 RBX: ffff8c861f356400 RCX: ffff8f92c0157e64
      RDX: 000000321214c8cc RSI: 00000032120daa7f RDI: 0000000000260f15
      RBP: 0000000000000005 R08: 0000000000000005 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff8c861ee18000 R15: ffff8c861ee18000
       cpuhp_report_idle_dead+0x31/0x60
       do_idle+0x1d5/0x200
       ? _raw_spin_unlock_irqrestore+0x2d/0x40
       cpu_startup_entry+0x14/0x20
       start_secondary+0x151/0x170
       secondary_startup_64+0xa4/0xb0
      
      ------------------------------------------------------------------------
      
      This happens rarely, but can be forced to happen more often by
      placing delays in cpuhp_report_idle_dead() following the call to
      rcu_report_dead().  With this in place, the following rcutorture
      scenario reproduces the problem within a few minutes:
      
      tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TREE04"
      
      This commit uses the crude but effective expedient of moving the disabling
      of interrupts within the idle loop to precede the cpu_is_offline()
      check.  It also invokes tick_nohz_idle_stop_tick() instead of
      tick_nohz_idle_stop_tick_protected() to shut off the scheduling-clock
      interrupt.
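
      The resulting ordering in do_idle() looks roughly like the following
      (a simplified sketch of the relevant lines only, not the full idle loop):

          while (!need_resched()) {
              rmb();

              /* Interrupts are now off *before* the offline check. */
              local_irq_disable();

              if (cpu_is_offline(smp_processor_id())) {
                  tick_nohz_idle_stop_tick();
                  cpuhp_report_idle_dead();
                  arch_cpu_idle_dead();
              }

              /* normal idle entry continues here */
          }
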
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      [ paulmck: Revert tick_nohz_idle_stop_tick_protected() removal, new callers. ]
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/fair: Use rq_lock/unlock in online_fair_sched_group · 9addfbd4
      Committed by Phil Auld
      [ Upstream commit a46d14eca7b75fffe35603aa8b81df654353d80f ]
      
      Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes a
      warning to fire in update_rq_clock(). This seems to be caused by onlining
      a new fair sched group without using the rq lock wrappers.
      
        [] rq->clock_update_flags & RQCF_UPDATED
        [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 update_rq_clock+0xec/0x150
      
        [] Call Trace:
        []  online_fair_sched_group+0x53/0x100
        []  cpu_cgroup_css_online+0x16/0x20
        []  online_css+0x1c/0x60
        []  cgroup_apply_control_enable+0x231/0x3b0
        []  cgroup_mkdir+0x41b/0x530
        []  kernfs_iop_mkdir+0x61/0xa0
        []  vfs_mkdir+0x108/0x1a0
        []  do_mkdirat+0x77/0xe0
        []  do_syscall_64+0x55/0x1d0
        []  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Using the wrappers in online_fair_sched_group instead of the raw locking
      removes this warning.
      
      [ tglx: Use rq_*lock_irq() ]
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20190801133749.11033-1-pauld@redhat.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/deadline: Fix bandwidth accounting at all levels after offline migration · 0f308569
      Committed by Juri Lelli
      [ Upstream commit 59d06cea1198d665ba11f7e8c5f45b00ff2e4812 ]
      
      If a task happens to be throttled while the CPU it was running on gets
      hotplugged off, the bandwidth associated with the task is not correctly
      migrated with it when the replenishment timer fires (offline_migration).
      
      Fix things up, for this_bw, running_bw and total_bw, when the replenishment
      timer fires and the task is migrated (dl_task_offline_migration()).
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bristot@redhat.com
      Cc: claudio@evidence.eu.com
      Cc: lizefan@huawei.com
      Cc: longman@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: mathieu.poirier@linaro.org
      Cc: rostedt@goodmis.org
      Cc: tj@kernel.org
      Cc: tommaso.cucinotta@santannapisa.it
      Link: https://lkml.kernel.org/r/20190719140000.31694-5-juri.lelli@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/core: Fix CPU controller for !RT_GROUP_SCHED · f381d3d2
      Committed by Juri Lelli
      [ Upstream commit a07db5c0865799ebed1f88be0df50c581fb65029 ]
      
      On !CONFIG_RT_GROUP_SCHED configurations it is currently not possible to
      move RT tasks between cgroups to which the CPU controller has been attached;
      but it is oddly possible to first move tasks around and then make them
      RT (sched_setscheduler() to FIFO/RR).
      
      E.g.:
      
        # mkdir /sys/fs/cgroup/cpu,cpuacct/group1
        # chrt -fp 10 $$
        # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks
        bash: echo: write error: Invalid argument
        # chrt -op 0 $$
        # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks
        # chrt -fp 10 $$
        # cat /sys/fs/cgroup/cpu,cpuacct/group1/tasks
        2345
        2598
        # chrt -p 2345
        pid 2345's current scheduling policy: SCHED_FIFO
        pid 2345's current scheduling priority: 10
      
      Also, as Michal noted, it is currently not possible to enable CPU
      controller on unified hierarchy with !CONFIG_RT_GROUP_SCHED (if there
      are any kernel RT threads in root cgroup, they can't be migrated to the
      newly created CPU controller's root in cgroup_update_dfl_csses()).
      
      The existing code comes with a comment saying that "we don't support
      RT-tasks being in separate groups". That comment is however stale and
      belongs to pre-RT_GROUP_SCHED times. It also doesn't make much sense for
      !RT_GROUP_SCHED configurations, since checks related to RT bandwidth
      are not performed at all in these cases.
      
      Make moving RT tasks between CPU controller groups viable by removing the
      special-case check for RT (and DEADLINE) tasks (sketched below).
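
      A simplified sketch of the check being removed (modeled on the shape of
      cpu_cgroup_can_attach() in kernel/sched/core.c; treat it as illustrative):

          cgroup_taskset_for_each(task, css, tset) {
          #ifdef CONFIG_RT_GROUP_SCHED
              if (!sched_rt_can_attach(css_tg(css), task))
                  return -EINVAL;
          #else
              /* Removed: "we don't support RT-tasks being in separate groups" */
              if (task->sched_class != &fair_sched_class)
                  return -EINVAL;
          #endif
          }
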
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: lizefan@huawei.com
      Cc: longman@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: rostedt@goodmis.org
      Link: https://lkml.kernel.org/r/20190719063455.27328-1-juri.lelli@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/fair: Fix imbalance due to CPU affinity · 417cf53b
      Committed by Vincent Guittot
      [ Upstream commit f6cad8df6b30a5d2bbbd2e698f74b4cafb9fb82b ]
      
      load_balance() has a dedicated mechanism to detect when an imbalance
      is due to CPU affinity and must be handled at the parent level. In this case,
      the imbalance field of the parent's sched_group is set.
      
      The description of sg_imbalanced() gives a typical example of two groups
      of 4 CPUs each and 4 tasks each with a cpumask covering 1 CPU of the first
      group and 3 CPUs of the second group. Something like:
      
      	{ 0 1 2 3 } { 4 5 6 7 }
      	        *     * * *
      
      But load_balance() fails to fix this use case on my octo-core system
      made of 2 clusters of quad cores.
      
      While load_balance() is able to detect that the imbalance is due to
      CPU affinity, it fails to fix it because the imbalance field is cleared
      before the parent level gets a chance to run. In fact, when the imbalance is
      detected, load_balance() reruns without the CPU with pinned tasks. But
      there are no other running tasks in the situation described above and
      everything looks balanced this time, so the imbalance field is immediately
      cleared.
      
      The imbalance field should not be cleared if there is no other task to move
      when the imbalance is detected.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1561996022-28829-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint · 7cebdfa6
      Committed by Paul E. McKenney
      [ Upstream commit 84ec3a0787086fcd25f284f59b3aa01fd6fc0a5d ]
      
      The TASKS03 and TREE04 rcutorture scenarios produce the following
      lockdep complaint:
      
      	WARNING: inconsistent lock state
      	5.2.0-rc1+ #513 Not tainted
      	--------------------------------
      	inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
      	migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
      	(____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
      	{IN-HARDIRQ-W} state was registered at:
      	  lock_acquire+0xb0/0x1c0
      	  _raw_spin_lock_irqsave+0x3c/0x50
      	  tick_broadcast_switch_to_oneshot+0xd/0x40
      	  tick_switch_to_oneshot+0x4f/0xd0
      	  hrtimer_run_queues+0xf3/0x130
      	  run_local_timers+0x1c/0x50
      	  update_process_times+0x1c/0x50
      	  tick_periodic+0x26/0xc0
      	  tick_handle_periodic+0x1a/0x60
      	  smp_apic_timer_interrupt+0x80/0x2a0
      	  apic_timer_interrupt+0xf/0x20
      	  _raw_spin_unlock_irqrestore+0x4e/0x60
      	  rcu_nocb_gp_kthread+0x15d/0x590
      	  kthread+0xf3/0x130
      	  ret_from_fork+0x3a/0x50
      	irq event stamp: 171
      	hardirqs last  enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
      	hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
      	softirqs last  enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
      	softirqs last disabled at (0): [<0000000000000000>] 0x0
      
              [...]
      
      To reproduce, run the following rcutorture test:
      
       $ tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04"
      
      It turns out that tick_broadcast_offline() was an innocent bystander.
      After all, interrupts are supposed to be disabled throughout
      take_cpu_down(), and therefore should have been disabled upon entry to
      tick_offline_cpu() and thus to tick_broadcast_offline().  This suggests
      that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
      and leaving them enabled on return.
      
      Some debugging code showed that the culprit was sched_cpu_dying().
      It had irqs enabled after return from sched_tick_stop().  Which in turn
      had irqs enabled after return from cancel_delayed_work_sync().  Which is a
      wrapper around __cancel_work_timer().  Which can sleep in the case where
      something else is concurrently trying to cancel the same delayed work,
      and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
      idea when you are invoked from take_cpu_down(), regardless of the state
      you leave interrupts in upon return.
      
      Code inspection located no reason why the delayed work absolutely
      needed to be canceled from sched_tick_stop():  The work is not
      bound to the outgoing CPU by design, given that the whole point is
      to collect statistics without disturbing the outgoing CPU.
      
      This commit therefore simply drops the cancel_delayed_work_sync() from
      sched_tick_stop().  Instead, a new ->state field is added to the tick_work
      structure so that the delayed-work handler function sched_tick_remote()
      can avoid reposting itself.  A cpu_is_offline() check is also added to
      sched_tick_remote() to avoid mucking with the state of an offlined CPU
      (though it does appear safe to do so).  The sched_tick_start() and
      sched_tick_stop() functions also update ->state, and sched_tick_start()
      also schedules the delayed work if ->state indicates that it is not
      already in flight.
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      [ paulmck: Apply Peter Zijlstra and Frederic Weisbecker atomics feedback. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190625165238.GJ26519@linux.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  5. 16 September 2019, 1 commit
  6. 25 August 2019, 1 commit
  7. 04 August 2019, 2 commits
  8. 26 July 2019, 2 commits
  9. 04 June 2019, 1 commit
  10. 31 May 2019, 4 commits
  11. 26 May 2019, 1 commit
  12. 02 May 2019, 2 commits
  13. 27 April 2019, 1 commit
    • sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup · c3edd427
      Committed by Phil Auld
      [ Upstream commit 2e8e19226398db8265a8e675fcc0118b9e80c9e8 ]
      
      With an extremely short cfs_period_us setting on a parent task group with a
      large number of children, the for loop in sched_cfs_period_timer() can run until the
      watchdog fires. There is no guarantee that the call to hrtimer_forward_now()
      will ever return 0.  The large number of children can make
      do_sched_cfs_period_timer() take longer than the period.
      
       NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
       RIP: 0010:tg_nop+0x0/0x10
        <IRQ>
        walk_tg_tree_from+0x29/0xb0
        unthrottle_cfs_rq+0xe0/0x1a0
        distribute_cfs_runtime+0xd3/0xf0
        sched_cfs_period_timer+0xcb/0x160
        ? sched_cfs_slack_timer+0xd0/0xd0
        __hrtimer_run_queues+0xfb/0x270
        hrtimer_interrupt+0x122/0x270
        smp_apic_timer_interrupt+0x6a/0x140
        apic_timer_interrupt+0xf/0x20
        </IRQ>
      
      To prevent this we add protection to the loop that detects when the loop has run
      too many times and scales the period and quota up, proportionally, so that the timer
      can complete before the next period expires.  This preserves the relative runtime
      quota while preventing the hard lockup.
      
      A warning is issued reporting this state and the new values.
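
      The key point is that period and quota are scaled by the same factor, so the
      effective bandwidth (quota/period) is unchanged. A small stand-alone
      illustration (the 25% growth factor is an arbitrary choice for the example,
      not the kernel's exact value):

      #include <stdio.h>
      #include <stdint.h>

      /* Grow period and quota by the same factor: relative bandwidth is preserved. */
      static void scale_bandwidth(uint64_t *period_us, uint64_t *quota_us)
      {
          *period_us += *period_us / 4;    /* +25%, illustrative only */
          *quota_us  += *quota_us  / 4;
      }

      int main(void)
      {
          uint64_t period = 100000, quota = 10000;    /* 10% bandwidth */

          scale_bandwidth(&period, &quota);
          printf("period=%llu quota=%llu ratio=%.2f\n",
                 (unsigned long long)period, (unsigned long long)quota,
                 (double)quota / (double)period);     /* still 0.10 */
          return 0;
      }
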
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  14. 20 April 2019, 2 commits
  15. 17 April 2019, 1 commit
    • sched/fair: Do not re-read ->h_load_next during hierarchical load calculation · cb75a0c5
      Committed by Mel Gorman
      commit 0e9f02450da07fc7b1346c8c32c771555173e397 upstream.
      
      A NULL pointer dereference bug was reported on a distribution kernel but
      the same issue should be present in the mainline kernel. It occurred on s390
      but should not be arch-specific.  A partial oops looks like:
      
        Unable to handle kernel pointer dereference in virtual kernel address space
        ...
        Call Trace:
          ...
          try_to_wake_up+0xfc/0x450
          vhost_poll_wakeup+0x3a/0x50 [vhost]
          __wake_up_common+0xbc/0x178
          __wake_up_common_lock+0x9e/0x160
          __wake_up_sync_key+0x4e/0x60
          sock_def_readable+0x5e/0x98
      
      The bug hits any time between 1 hour and 3 days. The dereference occurs
      in update_cfs_rq_h_load() when accumulating h_load. The problem is that
      cfs_rq->h_load_next is not protected by any locking and can be updated
      by parallel calls to task_h_load(). Depending on the compiler, code may be
      generated that re-reads cfs_rq->h_load_next after the check for NULL and
      then oopses when reading se->avg.load_avg. The disassembly showed that it
      was possible to re-read h_load_next after the check for NULL.
      
      While this does not appear to be an issue for later compilers, it's still
      an accident if the correct code is generated. Full locking in this path
      would have high overhead so this patch uses READ_ONCE to read h_load_next
      only once and check for NULL before dereferencing. It was confirmed that
      there were no further oops after 10 days of testing.
      
      As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any
      potential problems with store tearing.
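
      The pattern is to load the pointer exactly once into a local and only test and
      dereference that local, so the compiler cannot legally re-read the shared
      location between the NULL check and the use. A stand-alone illustration with
      kernel-style macros defined locally (the struct is made up for the example):

      #include <stddef.h>
      #include <stdio.h>

      #define READ_ONCE(x)     (*(const volatile typeof(x) *)&(x))
      #define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

      struct node {
          struct node *next;    /* stands in for cfs_rq->h_load_next */
          long load;
      };

      static long walk(struct node *n)
      {
          long sum = 0;
          struct node *p;

          /* Read ->next once; never re-read it after the NULL check. */
          while ((p = READ_ONCE(n->next)) != NULL) {
              sum += p->load;
              n = p;
          }
          return sum;
      }

      int main(void)
      {
          struct node b = { NULL, 2 }, a = { NULL, 1 };

          WRITE_ONCE(a.next, &b);    /* paired write, avoids store tearing */
          printf("%ld\n", walk(&a));
          return 0;
      }
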
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@vger.kernel.org>
      Fixes: 68520796 ("sched: Move h_load calculation to task_h_load()")
      Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  16. 06 April 2019, 3 commits
  17. 06 March 2019, 1 commit
    • sched/wake_q: Fix wakeup ordering for wake_q · 653a1dbc
      Committed by Peter Zijlstra
      [ Upstream commit 4c4e3731564c8945ac5ac90fc2a1e1f21cb79c92 ]
      
      Notably, cmpxchg() does not provide ordering when it fails; however,
      wake_q_add() requires ordering in this specific case too. Without it,
      a concurrent wakeup could fail to observe our prior state.
      
      Andrea Parri provided:
      
        C wake_up_q-wake_q_add
      
        {
      	int next = 0;
      	int y = 0;
        }
      
        P0(int *next, int *y)
        {
      	int r0;
      
      	/* in wake_up_q() */
      
      	WRITE_ONCE(*next, 1);   /* node->next = NULL */
      	smp_mb();               /* implied by wake_up_process() */
      	r0 = READ_ONCE(*y);
        }
      
        P1(int *next, int *y)
        {
      	int r1;
      
      	/* in wake_q_add() */
      
      	WRITE_ONCE(*y, 1);      /* wake_cond = true */
      	smp_mb__before_atomic();
      	r1 = cmpxchg_relaxed(next, 1, 2);
        }
      
        exists (0:r0=0 /\ 1:r1=0)
      
        This "exists" clause cannot be satisfied according to the LKMM:
      
        Test wake_up_q-wake_q_add Allowed
        States 3
        0:r0=0; 1:r1=1;
        0:r0=1; 1:r1=0;
        0:r0=1; 1:r1=1;
        No
        Witnesses
        Positive: 0 Negative: 3
        Condition exists (0:r0=0 /\ 1:r1=0)
        Observation wake_up_q-wake_q_add Never 0 3
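
      The same requirement can be expressed with C11 atomics: a full fence (the
      analogue of smp_mb__before_atomic()) must precede the relaxed compare-exchange
      so the earlier store is ordered even when the exchange fails. A stand-alone
      sketch mirroring the two litmus threads (names are invented for the example):

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdio.h>

      static atomic_int next_node = 0, y = 0;
      static int r0, r1;

      static void *p0(void *arg)    /* wake_up_q() side */
      {
          (void)arg;
          atomic_store_explicit(&next_node, 1, memory_order_relaxed);
          atomic_thread_fence(memory_order_seq_cst);   /* implied by wake_up_process() */
          r0 = atomic_load_explicit(&y, memory_order_relaxed);
          return NULL;
      }

      static void *p1(void *arg)    /* wake_q_add() side */
      {
          int expected = 1;

          (void)arg;
          atomic_store_explicit(&y, 1, memory_order_relaxed);   /* wake_cond = true */
          atomic_thread_fence(memory_order_seq_cst);   /* smp_mb__before_atomic() */
          atomic_compare_exchange_strong_explicit(&next_node, &expected, 2,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed);
          r1 = expected;    /* the old value the exchange observed */
          return NULL;
      }

      int main(void)
      {
          pthread_t a, b;

          pthread_create(&a, NULL, p0, NULL);
          pthread_create(&b, NULL, p1, NULL);
          pthread_join(a, NULL);
          pthread_join(b, NULL);
          /* With both fences in place, r0 == 0 && r1 == 0 is forbidden. */
          printf("r0=%d r1=%d\n", r0, r1);
          return 0;
      }
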
      Reported-by: Yongji Xie <elohimes@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  18. 13 February 2019, 1 commit
    • cpu/hotplug: Fix "SMT disabled by BIOS" detection for KVM · 97a7fa90
      Committed by Josh Poimboeuf
      commit b284909abad48b07d3071a9fc9b5692b3e64914b upstream.
      
      With the following commit:
      
        73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      
      ... the hotplug code attempted to detect when SMT was disabled by BIOS,
      in which case it reported SMT as permanently disabled.  However, that
      code broke a virt hotplug scenario, where the guest is booted with only
      primary CPU threads, and a sibling is brought online later.
      
      The problem is that there doesn't seem to be a way to reliably
      distinguish between the HW "SMT disabled by BIOS" case and the virt
      "sibling not yet brought online" case.  So the above-mentioned commit
      was a bit misguided, as it permanently disabled SMT for both cases,
      preventing future virt sibling hotplugs.
      
      Going back and reviewing the original problems which were attempted to
      be solved by that commit, when SMT was disabled in BIOS:
      
        1) /sys/devices/system/cpu/smt/control showed "on" instead of
           "notsupported"; and
      
        2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.
      
      I'd propose that we instead consider #1 above to not actually be a
      problem.  Because, at least in the virt case, it's possible that SMT
      wasn't disabled by BIOS and a sibling thread could be brought online
      later.  So it makes sense to just always default the smt control to "on"
      to allow for that possibility (assuming cpuid indicates that the CPU
      supports SMT).
      
      The real problem is #2, which has a simple fix: change vmx_vm_init() to
      query the actual current SMT state -- i.e., whether any siblings are
      currently online -- instead of looking at the SMT "control" sysfs value.
      
      So fix it by:
      
        a) reverting the original "fix" and its followup fix:
      
           73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
           bc2d8d26 ("cpu/hotplug: Fix SMT supported evaluation")
      
           and
      
        b) changing vmx_vm_init() to query the actual current SMT state --
           instead of the sysfs control value -- to determine whether the L1TF
           warning is needed.  This also requires the 'sched_smt_present'
         variable to be exported, instead of 'cpu_smt_control'.
      
      Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS")
      Reported-by: Igor Mammedov <imammedo@redhat.com>
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: kvm@vger.kernel.org
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
  19. 13 January 2019, 1 commit
    • sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f654 · dc8408ea
      Committed by Linus Torvalds
      commit c40f7d74c741a907cfaeb73a7697081881c497d0 upstream.
      
      Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
      scheduler under high loads, starting at around the v4.18 time frame,
      and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
      manipulation.
      
      Do a (manual) revert of:
      
        a9e7f654 ("sched/fair: Fix O(nr_cgroups) in load balance path")
      
      It turns out that the list_del_leaf_cfs_rq() introduced by this commit
      has a surprising property that was not considered in followup commits
      such as:
      
        9c2791f9 ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")
      
      As Vincent Guittot explains:
      
       "I think that there is a bigger problem with commit a9e7f654 and
        cfs_rq throttling:
      
        Let take the example of the following topology TG2 --> TG1 --> root:
      
         1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
            cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
            one path because it has never been used and can't be throttled so
            tmp_alone_branch will point to leaf_cfs_rq_list at the end.
      
         2) Then TG1 is throttled
      
         3) and we add TG3 as a new child of TG1.
      
         4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
            cfs_rq and tmp_alone_branch will stay  on rq->leaf_cfs_rq_list.
      
        With commit a9e7f654, we can del a cfs_rq from rq->leaf_cfs_rq_list.
        So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1
        cfs_rq is removed from the list.
        Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
        but tmp_alone_branch still points to TG3 cfs_rq because its throttled
        parent can't be enqueued when the lock is released.
        tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.
      
        So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
        points on another TG cfs_rq, the next TG cfs_rq that will be added,
        will be linked outside rq->leaf_cfs_rq_list - which is bad.
      
        In addition, we can break the ordering of the cfs_rq in
        rq->leaf_cfs_rq_list but this ordering is used to update and
        propagate the update from leaf down to root."
      
      Instead of trying to work through all these cases and trying to reproduce
      the very high loads that produced the lockup to begin with, simplify
      the code temporarily by reverting a9e7f654 - which change was clearly
      not thought through completely.
      
      This (hopefully) gives us a kernel that doesn't lock up so people
      can continue to enjoy their holidays without worrying about regressions. ;-)
      
      [ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]
      Analyzed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Analyzed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reported-by: Zhipeng Xie <xiezhipeng1@huawei.com>
      Reported-by: Sargun Dhillon <sargun@sargun.me>
      Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Zhipeng Xie <xiezhipeng1@huawei.com>
      Tested-by: Sargun Dhillon <sargun@sargun.me>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: <stable@vger.kernel.org> # v4.13+
      Cc: Bin Li <huawei.libin@huawei.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: a9e7f654 ("sched/fair: Fix O(nr_cgroups) in load balance path")
      Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  20. 20 December 2018, 1 commit
  21. 06 December 2018, 2 commits
    • sched/smt: Expose sched_smt_present static key · 340693ee
      Committed by Thomas Gleixner
      commit 321a874a7ef85655e93b3206d0f36b4a6097f948 upstream
      
      Make the scheduler's 'sched_smt_present' static key globally available, so
      it can be used in the x86 speculation control code.
      
      Provide a query function and a stub for the CONFIG_SMP=n case.
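
      The resulting interface is roughly the following (a sketch of the query
      function and the CONFIG_SMP=n stub added by this commit; treat the exact
      shape as approximate):

      #ifdef CONFIG_SMP
      extern struct static_key_false sched_smt_present;

      static __always_inline bool sched_smt_active(void)
      {
          return static_branch_likely(&sched_smt_present);
      }
      #else
      static inline bool sched_smt_active(void) { return false; }
      #endif
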
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched/smt: Make sched_smt_present track topology · a2c09481
      Committed by Peter Zijlstra (Intel)
      commit c5511d03ec090980732e929c318a7a6374b5550e upstream
      
      Currently the 'sched_smt_present' static key is enabled when SMT topology
      is observed at CPU bringup, but it is never disabled. However there is demand
      to also disable the key when the topology changes such that there is no SMT
      present anymore.
      
      Implement this by making the key count the number of cores that have SMT
      enabled.
      
      In particular, the SMT topology bits are set before interrupts are enabled
      and, similarly, are cleared after interrupts are disabled for the last time
      and the CPU dies.
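
      In sketch form, the CPU bringup/teardown paths then do something like the
      following (simplified; the real code lives in sched_cpu_activate() and
      sched_cpu_deactivate()):

          /* sched_cpu_activate(): this CPU just made its core SMT-active. */
          if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
              static_branch_inc_cpuslocked(&sched_smt_present);

          /* sched_cpu_deactivate(): the core drops back to a single thread. */
          if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
              static_branch_dec_cpuslocked(&sched_smt_present);
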
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185004.246110444@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  22. 01 December 2018, 1 commit
    • sched/fair: Fix cpu_util_wake() for 'execl' type workloads · 08fbd4e0
      Committed by Patrick Bellasi
      [ Upstream commit c469933e772132aad040bd6a2adc8edf9ad6f825 ]
      
      A ~10% regression has been reported for UnixBench's execl throughput
      test by Aaron Lu and Ye Xiaolong:
      
        https://lkml.org/lkml/2018/10/30/765
      
      That test is pretty simple, it does a "recursive" execve() syscall on the
      same binary. Starting from the syscall, this sequence is possible:
      
         do_execve()
           do_execveat_common()
             __do_execve_file()
               sched_exec()
                 select_task_rq_fair()          <==| Task already enqueued
                   find_idlest_cpu()
                     find_idlest_group()
                       capacity_spare_wake()    <==| Functions not called from
      		   cpu_util_wake()           | the wakeup path
      
      which means we can end up calling cpu_util_wake() not only from the
      "wakeup path", as its name would suggest. Indeed, the task doing an
      execve() syscall is already enqueued on the CPU we want to get the
      cpu_util_wake() for.
      
      The estimated utilization for a CPU computed in cpu_util_wake() was
      written under the assumption that the function can be called only from the
      wakeup path. If instead the task is already enqueued, we end up with a
      utilization which does not remove the current task's contribution from
      the estimated utilization of the CPU.
      This will wrongly assume a reduced spare capacity on the current CPU and
      increase the chances to migrate the task on execve.
      
      The regression is tracked down to:
      
       commit d519329f ("sched/fair: Update util_est only on util_avg updates")
      
      because in that patch we turn on by default the UTIL_EST sched feature.
      However, the real issue is introduced by:
      
       commit f9be3e59 ("sched/fair: Use util_est in LB and WU paths")
      
      Let's fix this by ensuring to always discount the task estimated
      utilization from the CPU's estimated utilization when the task is also
      the current one. The same benchmark of the bug report, executed on a
      dual socket 40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine,
      reports these "Execl Throughput" figures (higher the better):
      
         mainline     : 48136.5 lps
         mainline+fix : 55376.5 lps
      
      which correspond to a 15% speedup.
      
      Moreover, since {cpu_util,capacity_spare}_wake() are not really only
      used from the wakeup path, let's remove this ambiguity by using a better
      matching name: {cpu_util,capacity_spare}_without().
      
      While we are at it, let's also improve the existing documentation.
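
      Numerically, the fix amounts to subtracting the task's own utilization from
      the CPU's utilization whenever the task is already enqueued there (execve()
      being the non-wakeup caller). A stand-alone illustration with invented values:

      #include <stdio.h>

      /* Utilization of a CPU with task p's contribution discounted. */
      static unsigned long util_without(unsigned long cpu_util,
                                        unsigned long task_util,
                                        int task_on_this_cpu)
      {
          if (!task_on_this_cpu)
              return cpu_util;
          return cpu_util - (task_util < cpu_util ? task_util : cpu_util);
      }

      int main(void)
      {
          /* The exec'ing task contributes 300 of the CPU's 700 units of util. */
          printf("without discount: %lu\n", util_without(700, 300, 0)); /* 700: looks busy */
          printf("with discount:    %lu\n", util_without(700, 300, 1)); /* 400: the fix */
          return 0;
      }
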
      Reported-by: Aaron Lu <aaron.lu@intel.com>
      Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
      Tested-by: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: f9be3e59 (sched/fair: Use util_est in LB and WU paths)
      Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  23. 27 November 2018, 1 commit
    • sched/core: Take the hotplug lock in sched_init_smp() · b1e814e4
      Committed by Valentin Schneider
      [ Upstream commit 40fa3780bac2b654edf23f6b13f4e2dd550aea10 ]
      
      When running on linux-next (8c60c36d0b8c ("Add linux-next specific files
      for 20181019")) + CONFIG_PROVE_LOCKING=y on a big.LITTLE system (e.g.
      Juno or HiKey960), we get the following report:
      
       [    0.748225] Call trace:
       [    0.750685]  lockdep_assert_cpus_held+0x30/0x40
       [    0.755236]  static_key_enable_cpuslocked+0x20/0xc8
       [    0.760137]  build_sched_domains+0x1034/0x1108
       [    0.764601]  sched_init_domains+0x68/0x90
       [    0.768628]  sched_init_smp+0x30/0x80
       [    0.772309]  kernel_init_freeable+0x278/0x51c
       [    0.776685]  kernel_init+0x10/0x108
       [    0.780190]  ret_from_fork+0x10/0x18
      
      The static_key in question is 'sched_asym_cpucapacity' introduced by
      commit:
      
        df054e8445a4 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")
      
      In this particular case, we enable it because smp_prepare_cpus() will
      end up fetching the capacity-dmips-mhz entry from the devicetree,
      so we already have some asymmetry detected when entering sched_init_smp().
      
      This didn't get detected in tip/sched/core because we were missing:
      
        commit cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")
      
      Calls to build_sched_domains() post sched_init_smp() will hold the
      hotplug lock, it just so happens that this very first call is a
      special case. As stated by a comment in sched_init_smp(), "There's no
      userspace yet to cause hotplug operations" so this is a harmless
      warning.
      
      However, to both respect the semantics of underlying
      callees and make lockdep happy, take the hotplug lock in
      sched_init_smp(). This also satisfies the comment atop
      sched_init_domains() that says "Callers must hold the hotplug lock".
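
      A sketch of the resulting shape of sched_init_smp() (simplified; the exact
      placement relative to sched_domains_mutex may differ in the actual patch):

          /*
           * There's no userspace yet to cause hotplug operations, but take the
           * hotplug lock anyway so that lockdep_assert_cpus_held() in the
           * callees is satisfied.
           */
          cpus_read_lock();
          mutex_lock(&sched_domains_mutex);
          sched_init_domains(cpu_active_mask);
          mutex_unlock(&sched_domains_mutex);
          cpus_read_unlock();
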
      Reported-by: Sudeep Holla <sudeep.holla@arm.com>
      Tested-by: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar.Eggemann@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: morten.rasmussen@arm.com
      Cc: quentin.perret@arm.com
      Link: http://lkml.kernel.org/r/1540301851-3048-1-git-send-email-valentin.schneider@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>