1. 25 Dec 2019 (1 commit)
  2. 01 Nov 2019 (3 commits)
    • sched/fair: Fix -Wunused-but-set-variable warnings · 793ddb52
      Authored by Qian Cai
      commit 763a9ec06c409dcde2a761aac4bb83ff3938e0b3 upstream.
      
      Commit:
      
         de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
      
      introduced a few compilation warnings:
      
        kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
        kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
        kernel/sched/fair.c: In function 'start_cfs_bandwidth':
        kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]
      
      Also, __refill_cfs_bandwidth_runtime() no longer updates the
      expiration time, so fix the comments accordingly.
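
      As a minimal user-space sketch (not the kernel code; names are purely
      illustrative) of the warning class being silenced here: a local
      variable is assigned but its value is never read, so GCC's
      -Wunused-but-set-variable fires, and the fix is simply to drop the
      now-dead variable.

        /* Build with: gcc -Wall warn-demo.c */
        #include <stdio.h>

        static long refill_runtime(long quota)
        {
                long now;       /* warning: variable 'now' set but not used */

                now = 42;       /* assigned, but the value is never consumed */
                return quota;   /* after expiration removal, 'now' is dead   */
        }

        int main(void)
        {
                printf("%ld\n", refill_runtime(100000));
                return 0;
        }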
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Reviewed-by: Dave Chiluk <chiluk+linux@indeed.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: pauld@redhat.com
      Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
      Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
      793ddb52
    • sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices · 192fa322
      Authored by Dave Chiluk
      commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.
      
      It has been observed that highly-threaded, non-CPU-bound applications
      running under cpu.cfs_quota_us constraints can hit a high percentage of
      periods throttled while simultaneously not consuming the allocated
      amount of quota. This use case is typical of user-interactive,
      non-CPU-bound applications, such as those running in Kubernetes or
      Mesos on multiple CPU cores.
      
      This has been root-caused to CPU-local run queues being allocated
      per-CPU bandwidth slices and then not fully using those slices within
      the period, at which point the slice and quota expire. This expiration
      of unused slices results in applications not being able to utilize the
      quota they have been allocated.
      
      The non-expiration of per-CPU slices was recently fixed by
      commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift
      condition"). Prior to that, it appears that this had been broken since
      at least commit 51f2176d ("sched/fair: Fix unlocked reads of some
      cfs_b->quota/period"), which was introduced in v3.16-rc1 in 2014. That
      commit added the following conditional, which resulted in slices never
      being expired:
      
      if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
              /* extend local deadline, drift is bounded above by 2 ticks */
              cfs_rq->runtime_expires += TICK_NSEC;
      
      Because this was broken for nearly 5 years, has only recently been
      fixed, and is now being noticed by many users running Kubernetes
      (https://github.com/kubernetes/kubernetes/issues/67577), it is my
      opinion that the mechanisms around expiring runtime should be removed
      altogether.
      
      This allows quota already allocated to per-CPU run queues to live
      longer than the period boundary, so threads on run queues that do not
      use much CPU can continue to use their remaining slice over a longer
      period of time than cpu.cfs_period_us. This helps prevent the above
      condition of hitting throttling while not fully utilizing the CPU
      quota.
      
      This theoretically allows a machine to use slightly more than its
      allotted quota in some periods. This overflow is bounded by the
      remaining quota left on each per-CPU run queue, which is typically no
      more than min_cfs_rq_runtime=1ms per CPU; on an 80-CPU machine, for
      example, that is a worst case of roughly 80 * 1ms = 80ms of extra
      runtime per period. For CPU-bound tasks this changes nothing, as they
      should theoretically fully utilize all of their quota in each period.
      For user-interactive tasks as described above, this provides a much
      better user/application experience, as their CPU utilization will more
      closely match the amount they requested when they hit throttling. This
      means that CPU limits no longer strictly apply per period for
      non-CPU-bound applications, but they are still accurate over longer
      timeframes.
      
      This greatly improves performance of high-thread-count, non-CPU-bound
      applications with low cfs_quota_us allocations on high-core-count
      machines. In the case of an artificial testcase (10ms/100ms of quota on
      an 80-CPU machine), this commit resulted in an almost 30x performance
      improvement, while still maintaining correct CPU quota restrictions.
      That testcase is available at https://github.com/indeedeng/fibtest.
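
      As a deliberately simplified, user-space toy model (not the CFS code;
      the constants mirror the 10ms/100ms testcase above, and all names are
      illustrative), the following sketch shows how handing out fixed slices
      that then expire can throttle a bursty task within a single period long
      before it has consumed its quota:

        #include <stdio.h>

        int main(void)
        {
                const int quota_us = 10000;  /* 10 ms quota per 100 ms period      */
                const int slice_us = 5000;   /* per-CPU slice pulled from the pool */
                const int run_us   = 1000;   /* CPU actually burned per wakeup     */

                int pool = quota_us, used = 0, wakeups = 0;

                /* Each wakeup lands on a "fresh" CPU with no local runtime left. */
                while (pool >= slice_us) {
                        pool -= slice_us;  /* slice moves to that CPU's runqueue */
                        used += run_us;    /* only a fraction is actually used   */
                        wakeups++;
                        /* With expiration, the unused (slice_us - run_us) is lost
                         * at the period edge instead of serving later wakeups. */
                }

                printf("throttled after %d wakeups: used %d us of %d us quota\n",
                       wakeups, used, quota_us);  /* 2 wakeups, 2000 us of 10000 us */
                return 0;
        }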
      
      Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")
      Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: John Hammond <jhammond@indeed.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kyle Anderson <kwa@yelp.com>
      Cc: Gabriel Munos <gmunoz@netflix.com>
      Cc: Peter Oskolkov <posk@posk.io>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
      Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
      192fa322
    • sched/fair: Don't push cfs_bandwith slack timers forward · 9a99f90a
      Authored by bsegall@google.com
      commit 66567fcbaecac455caa1b13643155d686b51ce63 upstream.
      
      When a cfs_rq sleeps and returns its quota, we delay for 5ms before
      waking any throttled cfs_rqs to coalesce with other cfs_rqs going to
      sleep, as this has to be done outside of the rq lock we hold.
      
      The current code waits for 5ms without any sleeps, instead of waiting
      for 5ms from the first sleep, which can delay the unthrottle more than
      we want. Switch this around so that we can't push this forward forever.
      
      This requires an extra flag rather than using hrtimer_active, since we
      need to start a new timer if the current one is in the process of
      finishing.
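
      A small user-space sketch (an analogy, not the kernel code; names are
      illustrative) of the scheme: an explicit "slack timer already started"
      flag means later requests do not re-arm, and therefore do not push
      forward, the pending timer.

        #include <stdbool.h>
        #include <stdio.h>

        struct slack_timer {
                bool started;      /* analogous to the new flag in cfs_bandwidth */
                long expires_ms;   /* expiry of the pending timer, if any        */
        };

        /* Called whenever a runqueue returns unused quota. */
        static void request_slack(struct slack_timer *t, long now_ms, long delay_ms)
        {
                if (t->started)
                        return;                    /* pending: do not push expiry forward */
                t->started = true;
                t->expires_ms = now_ms + delay_ms; /* 5 ms in the real code */
        }

        static void timer_fired(struct slack_timer *t)
        {
                t->started = false;                /* next request may re-arm */
        }

        int main(void)
        {
                struct slack_timer t = { false, 0 };

                request_slack(&t, 0, 5);  /* arms: expires at 5 ms             */
                request_slack(&t, 3, 5);  /* ignored: expiry stays at 5, not 8 */
                printf("expires at %ld ms\n", t.expires_ms);
                timer_fired(&t);
                return 0;
        }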
      Signed-off-by: Ben Segall <bsegall@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Acked-by: Phil Auld <pauld@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/xm26a7euy6iq.fsf_-_@bsegall-linux.svl.corp.google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
      9a99f90a
  3. 30 Oct 2019 (22 commits)
  4. 12 Oct 2019 (2 commits)
    • sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr() · 46ff0e2f
      Authored by KeMeng Shi
      [ Upstream commit 714e501e16cd473538b609b3e351b2cc9f7f09ed ]
      
      An oops can be triggered in the scheduler when running qemu on arm64:
      
       Unable to handle kernel paging request at virtual address ffff000008effe40
       Internal error: Oops: 96000007 [#1] SMP
       Process migration/0 (pid: 12, stack limit = 0x00000000084e3736)
       pstate: 20000085 (nzCv daIf -PAN -UAO)
       pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
       lr : move_queued_task.isra.21+0x124/0x298
       ...
       Call trace:
        __ll_sc___cmpxchg_case_acq_4+0x4/0x20
        __migrate_task+0xc8/0xe0
        migration_cpu_stop+0x170/0x180
        cpu_stopper_thread+0xec/0x178
        smpboot_thread_fn+0x1ac/0x1e8
        kthread+0x134/0x138
        ret_from_fork+0x10/0x18
      
      __set_cpus_allowed_ptr() will choose an active dest_cpu in the affinity
      mask to migrate the process if the process is not currently running on
      any of the CPUs specified in the affinity mask. However, it will choose
      an invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual
      machine) if the CPUs in the affinity mask are deactivated by cpu_down
      after the cpumask_intersects check. The subsequent cpumask_test_cpu()
      of dest_cpu then reads past the end of the mask and may pass if the
      corresponding bit happens to be set. As a consequence, the kernel will
      access an invalid rq address associated with the invalid CPU in
      migration_cpu_stop->__migrate_task->move_queued_task and the oops
      occurs.
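
      Schematically (a user-space sketch, not the scheduler code; names are
      illustrative), the bug is the classic pattern of using a "find any CPU
      in both masks" result without range-checking it first, since such
      helpers return nr_cpu_ids when the intersection is empty:

        #include <stdio.h>

        #define NR_CPU_IDS 4

        /* Returns the first CPU set in both masks, or NR_CPU_IDS if none. */
        static int find_any_and(unsigned int a, unsigned int b)
        {
                unsigned int both = a & b;

                for (int cpu = 0; cpu < NR_CPU_IDS; cpu++)
                        if (both & (1u << cpu))
                                return cpu;
                return NR_CPU_IDS;
        }

        int main(void)
        {
                int per_cpu_data[NR_CPU_IDS] = { 0 };
                unsigned int affinity = 0x2;  /* task allowed on CPU 1 only  */
                unsigned int online   = 0x1;  /* ...but CPU 1 just went down */

                int dest_cpu = find_any_and(affinity, online);
                if (dest_cpu >= NR_CPU_IDS) { /* the validity check that matters */
                        fprintf(stderr, "no valid destination CPU\n");
                        return 1;
                }
                per_cpu_data[dest_cpu]++;     /* safe only after the check */
                return 0;
        }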
      
      To reproduce the crash:
      
        1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
        sched_setaffinity.
      
        2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online"
        and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.
      
        3) The oops appears if the invalid CPU is set in memory after the cpumask has been tested.
      Signed-off-by: KeMeng Shi <shikemeng@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      46ff0e2f
    • sched/membarrier: Fix private expedited registration check · 6cb7aa1b
      Authored by Mathieu Desnoyers
      [ Upstream commit fc0d77387cb5ae883fd774fc559e056a8dde024c ]
      
      Fix a logic flaw in the way membarrier_register_private_expedited()
      handles ready state checks for private expedited sync core and private
      expedited registrations.
      
      If a private expedited membarrier registration is first performed, and
      then a private expedited sync_core registration is performed, the ready
      state check will skip the second registration when it really should not.
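
      A toy sketch (a simplified bitmask model, not the membarrier code; all
      names and bit values are illustrative) of the flaw: the buggy check
      tests a ready bit that the earlier registration already set, instead of
      the bit belonging to the flavor being registered now, so the second
      registration is skipped.

        #include <stdio.h>

        #define READY_PLAIN      0x1  /* private expedited registered           */
        #define READY_SYNC_CORE  0x2  /* private expedited sync_core registered */

        static unsigned int state;

        /* Buggy pattern: always tests the PLAIN bit, whatever is requested. */
        static int register_buggy(unsigned int ready_bit)
        {
                if (state & READY_PLAIN)
                        return 0;             /* wrongly "already ready" */
                state |= ready_bit;
                return 1;
        }

        /* Fixed pattern: test the bit that matches this registration. */
        static int register_fixed(unsigned int ready_bit)
        {
                if (state & ready_bit)
                        return 0;
                state |= ready_bit;
                return 1;
        }

        int main(void)
        {
                state = 0;
                register_buggy(READY_PLAIN);
                printf("buggy sync_core ran: %d\n", register_buggy(READY_SYNC_CORE));  /* 0 */

                state = 0;
                register_fixed(READY_PLAIN);
                printf("fixed sync_core ran: %d\n", register_fixed(READY_SYNC_CORE));  /* 1 */
                return 0;
        }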
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-2-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      6cb7aa1b
  5. 05 Oct 2019 (7 commits)
    • sched/cpufreq: Align trace event behavior of fast switching · 01e8f487
      Authored by Douglas RAILLARD
      [ Upstream commit 77c84dd1881d0f0176cb678d770bfbda26c54390 ]
      
      The fast switching path only emits an event for the CPU of interest,
      whereas the regular path emits an event for all the CPUs that had their
      frequency changed, i.e. all the CPUs sharing the same policy.
      
      With the current behavior, looking at the cpu_frequency event for a
      given CPU that is using the fast switching path will not give the
      correct frequency signal.
      Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      01e8f487
    • idle: Prevent late-arriving interrupts from disrupting offline · 51111023
      Authored by Peter Zijlstra
      [ Upstream commit e78a7614f3876ac649b3df608789cb6ef74d0480 ]
      
      Scheduling-clock interrupts can arrive late in the CPU-offline process,
      after idle entry and the subsequent call to cpuhp_report_idle_dead().
      Once execution passes the call to rcu_report_dead(), RCU is ignoring
      the CPU, which results in lockdep complaints when the interrupt handler
      uses RCU:
      
      ------------------------------------------------------------------------
      
      =============================
      WARNING: suspicious RCU usage
      5.2.0-rc1+ #681 Not tainted
      -----------------------------
      kernel/sched/fair.c:9542 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      RCU used illegally from offline CPU!
      rcu_scheduler_active = 2, debug_locks = 1
      no locks held by swapper/5/0.
      
      stack backtrace:
      CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.2.0-rc1+ #681
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
      Call Trace:
       <IRQ>
       dump_stack+0x5e/0x8b
       trigger_load_balance+0xa8/0x390
       ? tick_sched_do_timer+0x60/0x60
       update_process_times+0x3b/0x50
       tick_sched_handle+0x2f/0x40
       tick_sched_timer+0x32/0x70
       __hrtimer_run_queues+0xd3/0x3b0
       hrtimer_interrupt+0x11d/0x270
       ? sched_clock_local+0xc/0x74
       smp_apic_timer_interrupt+0x79/0x200
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      RIP: 0010:delay_tsc+0x22/0x50
      Code: ff 0f 1f 80 00 00 00 00 65 44 8b 05 18 a7 11 48 0f ae e8 0f 31 48 89 d6 48 c1 e6 20 48 09 c6 eb 0e f3 90 65 8b 05 fe a6 11 48 <41> 39 c0 75 18 0f ae e8 0f 31 48 c1 e2 20 48 09 c2 48 89 d0 48 29
      RSP: 0000:ffff8f92c0157ed0 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff13
      RAX: 0000000000000005 RBX: ffff8c861f356400 RCX: ffff8f92c0157e64
      RDX: 000000321214c8cc RSI: 00000032120daa7f RDI: 0000000000260f15
      RBP: 0000000000000005 R08: 0000000000000005 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff8c861ee18000 R15: ffff8c861ee18000
       cpuhp_report_idle_dead+0x31/0x60
       do_idle+0x1d5/0x200
       ? _raw_spin_unlock_irqrestore+0x2d/0x40
       cpu_startup_entry+0x14/0x20
       start_secondary+0x151/0x170
       secondary_startup_64+0xa4/0xb0
      
      ------------------------------------------------------------------------
      
      This happens rarely, but can be forced to happen more often by
      placing delays in cpuhp_report_idle_dead() following the call to
      rcu_report_dead().  With this in place, the following rcutorture
      scenario reproduces the problem within a few minutes:
      
      tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TREE04"
      
      This commit uses the crude but effective expedient of moving the disabling
      of interrupts within the idle loop to precede the cpu_is_offline()
      check.  It also invokes tick_nohz_idle_stop_tick() instead of
      tick_nohz_idle_stop_tick_protected() to shut off the scheduling-clock
      interrupt.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      [ paulmck: Revert tick_nohz_idle_stop_tick_protected() removal, new callers. ]
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      51111023
    • sched/fair: Use rq_lock/unlock in online_fair_sched_group · 9addfbd4
      Authored by Phil Auld
      [ Upstream commit a46d14eca7b75fffe35603aa8b81df654353d80f ]
      
      Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes a
      warning to fire in update_rq_clock. This seems to be caused by onlining
      a new fair sched group without using the rq lock wrappers.
      
        [] rq->clock_update_flags & RQCF_UPDATED
        [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 update_rq_clock+0xec/0x150
      
        [] Call Trace:
        []  online_fair_sched_group+0x53/0x100
        []  cpu_cgroup_css_online+0x16/0x20
        []  online_css+0x1c/0x60
        []  cgroup_apply_control_enable+0x231/0x3b0
        []  cgroup_mkdir+0x41b/0x530
        []  kernfs_iop_mkdir+0x61/0xa0
        []  vfs_mkdir+0x108/0x1a0
        []  do_mkdirat+0x77/0xe0
        []  do_syscall_64+0x55/0x1d0
        []  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Using the wrappers in online_fair_sched_group instead of the raw locking
      removes this warning.
      
      [ tglx: Use rq_*lock_irq() ]
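
      As a user-space analogy (a simplified model, not the scheduler code;
      names are illustrative), the point of the wrappers is that they reset
      the "clock already updated" bookkeeping when the lock is taken, so a
      later clock update in that locked section does not trip the
      double-update warning:

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct rq_model {
                pthread_mutex_t lock;
                bool clock_updated;              /* models RQCF_UPDATED */
        };

        static void rq_lock_model(struct rq_model *rq)
        {
                pthread_mutex_lock(&rq->lock);
                rq->clock_updated = false;       /* wrapper resets the bookkeeping */
        }

        static void rq_unlock_model(struct rq_model *rq)
        {
                pthread_mutex_unlock(&rq->lock);
        }

        static void update_clock_model(struct rq_model *rq)
        {
                if (rq->clock_updated)
                        fprintf(stderr, "WARN: clock updated twice in one section\n");
                rq->clock_updated = true;
        }

        int main(void)
        {
                struct rq_model rq = { PTHREAD_MUTEX_INITIALIZER, false };

                /* Raw locking: stale bookkeeping from the first section makes
                 * the second update warn spuriously. */
                pthread_mutex_lock(&rq.lock);
                update_clock_model(&rq);
                pthread_mutex_unlock(&rq.lock);
                pthread_mutex_lock(&rq.lock);
                update_clock_model(&rq);         /* warns */
                pthread_mutex_unlock(&rq.lock);

                /* Wrapper locking: flag is reset on acquire, no warning. */
                rq_lock_model(&rq);
                update_clock_model(&rq);
                rq_unlock_model(&rq);
                return 0;
        }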
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20190801133749.11033-1-pauld@redhat.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9addfbd4
    • sched/deadline: Fix bandwidth accounting at all levels after offline migration · 0f308569
      Authored by Juri Lelli
      [ Upstream commit 59d06cea1198d665ba11f7e8c5f45b00ff2e4812 ]
      
      If a task happens to be throttled while the CPU it was running on gets
      hotplugged off, the bandwidth associated with the task is not correctly
      migrated with it when the replenishment timer fires (offline_migration).
      
      Fix things up for this_bw, running_bw and total_bw when the
      replenishment timer fires and the task is migrated
      (dl_task_offline_migration()).
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bristot@redhat.com
      Cc: claudio@evidence.eu.com
      Cc: lizefan@huawei.com
      Cc: longman@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: mathieu.poirier@linaro.org
      Cc: rostedt@goodmis.org
      Cc: tj@kernel.org
      Cc: tommaso.cucinotta@santannapisa.it
      Link: https://lkml.kernel.org/r/20190719140000.31694-5-juri.lelli@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      0f308569
    • sched/core: Fix CPU controller for !RT_GROUP_SCHED · f381d3d2
      Authored by Juri Lelli
      [ Upstream commit a07db5c0865799ebed1f88be0df50c581fb65029 ]
      
      On !CONFIG_RT_GROUP_SCHED configurations it is currently not possible
      to move RT tasks between cgroups to which the CPU controller has been
      attached, but it is oddly possible to first move tasks around and then
      make them RT (sched_setscheduler to FIFO/RR).
      
      E.g.:
      
        # mkdir /sys/fs/cgroup/cpu,cpuacct/group1
        # chrt -fp 10 $$
        # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks
        bash: echo: write error: Invalid argument
        # chrt -op 0 $$
        # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks
        # chrt -fp 10 $$
        # cat /sys/fs/cgroup/cpu,cpuacct/group1/tasks
        2345
        2598
        # chrt -p 2345
        pid 2345's current scheduling policy: SCHED_FIFO
        pid 2345's current scheduling priority: 10
      
      Also, as Michal noted, it is currently not possible to enable the CPU
      controller on the unified hierarchy with !CONFIG_RT_GROUP_SCHED (if
      there are any kernel RT threads in the root cgroup, they can't be
      migrated to the newly created CPU controller's root in
      cgroup_update_dfl_csses()).
      
      The existing code comes with a comment saying "we don't support
      RT-tasks being in separate groups". That comment is, however, stale and
      belongs to pre-RT_GROUP_SCHED times. Also, it doesn't make much sense
      for !RT_GROUP_SCHED configurations, since checks related to RT
      bandwidth are not performed at all in these cases.
      
      Make moving RT tasks between CPU controller groups viable by removing
      the special-case check for RT (and DEADLINE) tasks.
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: lizefan@huawei.com
      Cc: longman@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: rostedt@goodmis.org
      Link: https://lkml.kernel.org/r/20190719063455.27328-1-juri.lelli@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f381d3d2
    • sched/fair: Fix imbalance due to CPU affinity · 417cf53b
      Authored by Vincent Guittot
      [ Upstream commit f6cad8df6b30a5d2bbbd2e698f74b4cafb9fb82b ]
      
      load_balance() has a dedicated mechanism to detect when an imbalance is
      due to CPU affinity and must be handled at the parent level. In this
      case, the imbalance field of the parent's sched_group is set.
      
      The description of sg_imbalanced() gives a typical example of two groups
      of 4 CPUs each and 4 tasks each with a cpumask covering 1 CPU of the first
      group and 3 CPUs of the second group. Something like:
      
      	{ 0 1 2 3 } { 4 5 6 7 }
      	        *     * * *
      
      But load_balance() fails to fix this use case on my octo-core system
      made of 2 clusters of quad cores.
      
      While load_balance() is able to detect that the imbalance is due to
      CPU affinity, it fails to fix it because the imbalance field is cleared
      before the parent level gets a chance to run. When the imbalance is
      detected, load_balance() reruns without the CPU with pinned tasks. But
      there are no other running tasks in the situation described above, so
      everything looks balanced this time and the imbalance field is
      immediately cleared.
      
      The imbalance field should not be cleared if there is no other task to move
      when the imbalance is detected.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1561996022-28829-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      417cf53b
    • time/tick-broadcast: Fix tick_broadcast_offline() lockdep complaint · 7cebdfa6
      Authored by Paul E. McKenney
      [ Upstream commit 84ec3a0787086fcd25f284f59b3aa01fd6fc0a5d ]
      
      The TASKS03 and TREE04 rcutorture scenarios produce the following
      lockdep complaint:
      
      	WARNING: inconsistent lock state
      	5.2.0-rc1+ #513 Not tainted
      	--------------------------------
      	inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
      	migration/1/14 [HC0[0]:SC0[0]:HE1:SE1] takes:
      	(____ptrval____) (tick_broadcast_lock){?...}, at: tick_broadcast_offline+0xf/0x70
      	{IN-HARDIRQ-W} state was registered at:
      	  lock_acquire+0xb0/0x1c0
      	  _raw_spin_lock_irqsave+0x3c/0x50
      	  tick_broadcast_switch_to_oneshot+0xd/0x40
      	  tick_switch_to_oneshot+0x4f/0xd0
      	  hrtimer_run_queues+0xf3/0x130
      	  run_local_timers+0x1c/0x50
      	  update_process_times+0x1c/0x50
      	  tick_periodic+0x26/0xc0
      	  tick_handle_periodic+0x1a/0x60
      	  smp_apic_timer_interrupt+0x80/0x2a0
      	  apic_timer_interrupt+0xf/0x20
      	  _raw_spin_unlock_irqrestore+0x4e/0x60
      	  rcu_nocb_gp_kthread+0x15d/0x590
      	  kthread+0xf3/0x130
      	  ret_from_fork+0x3a/0x50
      	irq event stamp: 171
      	hardirqs last  enabled at (171): [<ffffffff8a201a37>] trace_hardirqs_on_thunk+0x1a/0x1c
      	hardirqs last disabled at (170): [<ffffffff8a201a53>] trace_hardirqs_off_thunk+0x1a/0x1c
      	softirqs last  enabled at (0): [<ffffffff8a264ee0>] copy_process.part.56+0x650/0x1cb0
      	softirqs last disabled at (0): [<0000000000000000>] 0x0
      
              [...]
      
      To reproduce, run the following rcutorture test:
      
       $ tools/testing/selftests/rcutorture/bin/kvm.sh --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TASKS03 TREE04"
      
      It turns out that tick_broadcast_offline() was an innocent bystander.
      After all, interrupts are supposed to be disabled throughout
      take_cpu_down(), and therefore should have been disabled upon entry to
      tick_offline_cpu() and thus to tick_broadcast_offline().  This suggests
      that one of the CPU-hotplug notifiers was incorrectly enabling interrupts,
      and leaving them enabled on return.
      
      Some debugging code showed that the culprit was sched_cpu_dying().
      It had irqs enabled after return from sched_tick_stop().  Which in turn
      had irqs enabled after return from cancel_delayed_work_sync().  Which is a
      wrapper around __cancel_work_timer().  Which can sleep in the case where
      something else is concurrently trying to cancel the same delayed work,
      and as Thomas Gleixner pointed out on IRC, sleeping is a decidedly bad
      idea when you are invoked from take_cpu_down(), regardless of the state
      you leave interrupts in upon return.
      
      Code inspection located no reason why the delayed work absolutely
      needed to be canceled from sched_tick_stop():  The work is not
      bound to the outgoing CPU by design, given that the whole point is
      to collect statistics without disturbing the outgoing CPU.
      
      This commit therefore simply drops the cancel_delayed_work_sync() from
      sched_tick_stop().  Instead, a new ->state field is added to the tick_work
      structure so that the delayed-work handler function sched_tick_remote()
      can avoid reposting itself.  A cpu_is_offline() check is also added to
      sched_tick_remote() to avoid mucking with the state of an offlined CPU
      (though it does appear safe to do so).  The sched_tick_start() and
      sched_tick_stop() functions also update ->state, and sched_tick_start()
      also schedules the delayed work if ->state indicates that it is not
      already in flight.
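
      A user-space model (a simplification, not the kernel tick_work code;
      names are illustrative) of that scheme: the handler re-queues itself
      only while ->state still says the tick is wanted, so the stop path
      merely flips the state and never has to cancel (and hence never has to
      sleep) from the hotplug path.

        #include <stdio.h>

        enum tick_state { TICK_OFFLINE, TICK_RUNNING, TICK_OFFLINING };

        struct tick_work_model {
                enum tick_state state;
                int queued;                    /* models "delayed work in flight" */
        };

        static void tick_start(struct tick_work_model *tw)
        {
                tw->state = TICK_RUNNING;
                if (!tw->queued)               /* queue only if not already in flight */
                        tw->queued = 1;
        }

        static void tick_stop(struct tick_work_model *tw)
        {
                tw->state = TICK_OFFLINING;    /* no cancel, nothing that can sleep */
        }

        static void tick_handler(struct tick_work_model *tw)
        {
                tw->queued = 0;
                if (tw->state == TICK_RUNNING)
                        tw->queued = 1;        /* repost ourselves */
                else
                        tw->state = TICK_OFFLINE;  /* acknowledge the stop request */
        }

        int main(void)
        {
                struct tick_work_model tw = { TICK_OFFLINE, 0 };

                tick_start(&tw);
                tick_handler(&tw);             /* reposts: still running        */
                tick_stop(&tw);
                tick_handler(&tw);             /* does not repost: goes offline */
                printf("queued=%d state=%d\n", tw.queued, tw.state);  /* 0 0 */
                return 0;
        }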
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      [ paulmck: Apply Peter Zijlstra and Frederic Weisbecker atomics feedback. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190625165238.GJ26519@linux.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7cebdfa6
  6. 16 Sep 2019 (1 commit)
  7. 25 Aug 2019 (1 commit)
  8. 04 Aug 2019 (2 commits)
  9. 26 Jul 2019 (1 commit)
    • sched/fair: Fix "runnable_avg_yN_inv" not used warnings · 7fc96cd2
      Authored by Qian Cai
      [ Upstream commit 509466b7d480bc5d22e90b9fbe6122ae0e2fbe39 ]
      
      runnable_avg_yN_inv[] is only used in kernel/sched/pelt.c, but it is
      included in several other places because they need other macros that
      all come from kernel/sched/sched-pelt.h, which is generated by
      Documentation/scheduler/sched-pelt. As a result, it causes a lot of
      compilation warnings:
      
        kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=]
        kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=]
        kernel/sched/sched-pelt.h:4:18: warning: 'runnable_avg_yN_inv' defined but not used [-Wunused-const-variable=]
        ...
      
      Silence them by appending the __maybe_unused attribute to it, so all
      generated variables and macros can still be kept in the same file.
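
      As a minimal user-space sketch of the idea (the array name and values
      below are dummy placeholders, not the real table), the attribute keeps
      the compiler quiet for translation units that never reference the
      array; in the kernel this attribute is spelled __maybe_unused:

        /* Build with e.g.: gcc -Wall -Wunused-const-variable demo.c */
        static const unsigned int runnable_avg_yN_inv_demo[]
                        __attribute__((unused)) = { 1, 2, 3 };  /* dummy values */

        int main(void)
        {
                return 0;   /* no -Wunused-const-variable warning is emitted */
        }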
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1559596304-31581-1-git-send-email-cai@lca.pw
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      7fc96cd2