1. 17 Jan 2020, 2 commits
  2. 27 Dec 2019, 23 commits
  3. 13 Dec 2019, 2 commits
    • sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision · 742f2319
      Xuewei Zhang committed
      commit 4929a4e6faa0f13289a67cae98139e727f0d4a97 upstream.
      
      The quota/period ratio is used to ensure a child task group won't get
      more bandwidth than the parent task group, and is calculated as:
      
        normalized_cfs_quota() = [(quota_us << 20) / period_us]
      
      If the quota/period ratio is changed by the scaling done in
      sched_cfs_period_timer() due to precision loss, it will cause an
      inconsistency between parent and child task groups.
      
      See below example:
      
      A userspace container manager (kubelet) does three operations:
      
       1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
       2) Create a few children cgroups.
       3) Set quota to 1,000us and period to 10,000us on a child cgroup.
      
      These operations are expected to succeed. However, if the scaling of
      147/128 happens before step 3, quota and period of the parent cgroup
      will be changed:
      
        new_quota: 1148437ns,   1148us
       new_period: 11484375ns, 11484us
      
      And when step 3 comes in, the ratio of the child cgroup will be
      104857, which will be larger than the parent cgroup ratio (104821),
      and will fail.
      
      Scaling them by a factor of 2 will fix the problem.
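      
      As a rough illustration (a small user-space sketch, not kernel code;
      the helper mirrors the normalized_cfs_quota() formula above and the
      147/128 factor is the scaling mentioned above, done in
      sched_cfs_period_timer()):
      
      #include <stdio.h>
      
      /* Illustrative only: (quota_us << 20) / period_us, as described above. */
      static unsigned long long normalized_ratio(unsigned long long quota_us,
                                                 unsigned long long period_us)
      {
              return (quota_us << 20) / period_us;
      }
      
      int main(void)
      {
              unsigned long long quota_ns = 1000000, period_ns = 10000000;
      
              /* Parent before scaling: 1,000us / 10,000us -> 104857. */
              printf("%llu\n", normalized_ratio(quota_ns / 1000, period_ns / 1000));
      
              /* Independent 147/128 integer scaling loses precision:
               * 1148437ns / 11484375ns -> 1148us / 11484us -> 104821. */
              printf("%llu\n", normalized_ratio(quota_ns * 147 / 128 / 1000,
                                                period_ns * 147 / 128 / 1000));
      
              /* Scaling both by exactly 2 keeps the ratio at 104857. */
              printf("%llu\n", normalized_ratio(2 * quota_ns / 1000,
                                                2 * period_ns / 1000));
              return 0;
      }
      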
      Tested-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Xuewei Zhang <xueweiz@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Phil Auld <pauld@redhat.com>
      Cc: Anton Blanchard <anton@ozlabs.org>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
      Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
    • sched/core: Avoid spurious lock dependencies · 870083b6
      Peter Zijlstra committed
      [ Upstream commit ff51ff84d82aea5a889b85f2b9fb3aa2b8691668 ]
      
      While seemingly harmless, __sched_fork() does hrtimer_init(), which,
      when DEBUG_OBJECTS is enabled, can end up doing allocations.
      
      This then results in the following lock order:
      
        rq->lock
          zone->lock.rlock
            batched_entropy_u64.lock
      
      Which in turn causes deadlocks when we do wakeups while holding that
      batched_entropy lock -- as the random code does.
      
      Solve this by moving __sched_fork() out from under rq->lock. This is
      safe because nothing there relies on rq->lock, as also evident from the
      other __sched_fork() callsite.
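      
      A rough sketch of the reordering (stub names, not the actual kernel
      diff): do the fork-time initialization, which may allocate when
      DEBUG_OBJECTS is enabled, before taking rq->lock rather than inside it:
      
      struct rq;
      struct task_struct;
      
      /* Stubs standing in for the real kernel helpers. */
      static void sched_fork_stub(struct task_struct *idle) { /* may allocate */ }
      static void rq_lock_stub(struct rq *rq) { }
      static void rq_unlock_stub(struct rq *rq) { }
      
      static void init_idle_sketch(struct task_struct *idle, struct rq *rq)
      {
              sched_fork_stub(idle);          /* moved out from under rq->lock */
      
              rq_lock_stub(rq);
              /* ... idle-task setup that actually needs rq->lock ... */
              rq_unlock_stub(rq);
      }
      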
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: bigeasy@linutronix.de
      Cc: cl@linux.com
      Cc: keescook@chromium.org
      Cc: penberg@kernel.org
      Cc: rientjes@google.com
      Cc: thgarnie@google.com
      Cc: tytso@mit.edu
      Cc: will@kernel.org
      Fixes: b7d5dc21072c ("random: add a spinlock_t to struct batched_entropy")
      Link: https://lkml.kernel.org/r/20191001091837.GK4536@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  4. 01 Dec 2019, 2 commits
  5. 21 Nov 2019, 1 commit
  6. 13 Nov 2019, 2 commits
    • sched/fair: Fix -Wunused-but-set-variable warnings · e9c0fc4a
      Qian Cai committed
      commit 763a9ec06c409dcde2a761aac4bb83ff3938e0b3 upstream.
      
      Commit:
      
         de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
      
      introduced a few compilation warnings:
      
        kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
        kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
        kernel/sched/fair.c: In function 'start_cfs_bandwidth':
        kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]
      
      Also, __refill_cfs_bandwidth_runtime() no longer updates the
      expiration time, so fix the comments accordingly.
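      
      In isolation, the warning class being fixed looks roughly like this
      (stand-alone sketch, not the kernel source; compile with
      -Wunused-but-set-variable):
      
      typedef unsigned long long u64;
      
      struct bw { u64 quota; u64 runtime; };
      
      static u64 clock_stub(void) { return 0; }
      
      /* Before: 'now' is assigned but never read once expiration is gone,
       * so -Wunused-but-set-variable fires. */
      void refill_before(struct bw *b)
      {
              u64 now;
      
              now = clock_stub();
              b->runtime = b->quota;
      }
      
      /* After: the variable (and the stale expiration logic it served)
       * is simply removed. */
      void refill_after(struct bw *b)
      {
              b->runtime = b->quota;
      }
      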
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Reviewed-by: Dave Chiluk <chiluk+linux@indeed.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: pauld@redhat.com
      Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
      Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
    • sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices · 502bd151
      Dave Chiluk committed
      commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.
      
      It has been observed that highly-threaded, non-CPU-bound applications
      running under cpu.cfs_quota_us constraints can hit a high percentage of
      throttled periods while simultaneously not consuming the allocated
      amount of quota. This use case is typical of user-interactive,
      non-CPU-bound applications, such as those running in kubernetes or
      mesos on machines with many CPU cores.
      
      This has been root-caused to per-cpu run queues being allocated per-cpu
      bandwidth slices and then not fully using those slices within the
      period, at which point the slice and quota expire. This expiration of
      unused slices results in applications not being able to utilize the
      quota they were allocated.
      
      The non-expiration of per-cpu slices was recently fixed by
      'commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift
      condition")'. Prior to that it appears that this had been broken since
      at least 'commit 51f2176d ("sched/fair: Fix unlocked reads of some
      cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
      added the following conditional which resulted in slices never being
      expired.
      
      if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
      	/* extend local deadline, drift is bounded above by 2 ticks */
      	cfs_rq->runtime_expires += TICK_NSEC;
      
      Because this was broken for nearly 5 years, has recently been fixed,
      and is now being noticed by many users running kubernetes
      (https://github.com/kubernetes/kubernetes/issues/67577), it is my
      opinion that the mechanisms around expiring runtime should be removed
      altogether.
      
      This allows quota already allocated to per-cpu run-queues to live longer
      than the period boundary. Threads on runqueues that do not use much CPU
      can then continue to use their remaining slice over a longer period of
      time than cpu.cfs_period_us. In turn, this helps prevent the above
      condition of hitting throttling while also not fully utilizing the cpu
      quota.
      
      This theoretically allows a machine to use slightly more than its
      allotted quota in some periods. The overflow is bounded by the quota
      remaining on each per-cpu runqueue, which is typically no more than
      min_cfs_rq_runtime=1ms per cpu. For CPU-bound tasks this will
      change nothing, as they should theoretically fully utilize all of their
      quota in each period. For user-interactive tasks as described above this
      provides a much better user/application experience as their cpu
      utilization will more closely match the amount they requested when they
      hit throttling. This means that cpu limits no longer strictly apply per
      period for non-cpu bound applications, but that they are still accurate
      over longer timeframes.
      
      This greatly improves performance of high-thread-count, non-cpu bound
      applications with low cfs_quota_us allocation on high-core-count
      machines. In the case of an artificial testcase (10ms/100ms of quota on
      80 CPU machine), this commit resulted in almost 30x performance
      improvement, while still maintaining correct cpu quota restrictions.
      That testcase is available at https://github.com/indeedeng/fibtest.
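      
      Conceptually, the refill path after this change reduces to something
      like the following sketch (stub types and names, not the actual patch):
      the global pool is simply topped up to the full quota each period and
      no per-slice expiration time is tracked, so unused per-cpu runtime
      survives period boundaries:
      
      typedef unsigned long long u64;
      #define RUNTIME_INF ((u64)~0ULL)
      
      struct cfs_bw_stub {
              u64 quota;      /* cpu.cfs_quota_us, in ns */
              u64 runtime;    /* remaining global runtime for this period */
      };
      
      /* Called when the period timer fires: refill, but do not expire the
       * runtime already handed out to per-cpu queues. */
      static void refill_runtime(struct cfs_bw_stub *cfs_b)
      {
              if (cfs_b->quota != RUNTIME_INF)
                      cfs_b->runtime = cfs_b->quota;
      }
      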
      
      Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")
      Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: John Hammond <jhammond@indeed.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kyle Anderson <kwa@yelp.com>
      Cc: Gabriel Munos <gmunoz@netflix.com>
      Cc: Peter Oskolkov <posk@posk.io>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
  7. 06 Nov 2019, 1 commit
    • sched/vtime: Fix guest/system mis-accounting on task switch · 2cd003a8
      Frederic Weisbecker committed
      [ Upstream commit 68e7a4d66b0ce04bf18ff2ffded5596ab3618585 ]
      
      vtime_account_system() assumes that the target task to account cputime
      to is always the current task. This is indeed most often true, except
      on task switch, where we call:
      
      	vtime_common_task_switch(prev)
      		vtime_account_system(prev)
      
      Here prev is the scheduling-out task to which we account the cputime.
      It doesn't match current, which is already the scheduling-in task at
      this stage of the context switch.
      
      So we end up checking the wrong task flags to determine if we are
      accounting guest or system time to the previous task.
      
      As a result the wrong task is used to check if the target is running in
      guest mode. We may then spuriously account or leak either system or
      guest time on task switch.
      
      Fix this assumption and also make vtime_guest_enter/exit() use the
      task passed as a parameter, to avoid similar issues in the future.
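      
      The intent can be sketched as follows (stub types, not the kernel
      implementation): the guest/system decision must look at the flags of
      the task being accounted, i.e. the prev task passed in, rather than at
      current:
      
      struct task_stub { unsigned int flags; };
      #define PF_VCPU_STUB 0x1        /* "task is running a vCPU" marker */
      
      static void account_guest(struct task_stub *tsk)  { /* ... */ }
      static void account_system(struct task_stub *tsk) { /* ... */ }
      
      static void vtime_account_sketch(struct task_stub *tsk)
      {
              /* was effectively: if (current->flags & PF_VCPU) ... */
              if (tsk->flags & PF_VCPU_STUB)
                      account_guest(tsk);
              else
                      account_system(tsk);
      }
      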
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Fixes: 2a42eb95 ("sched/cputime: Accumulate vtime on top of nsec clocksource")
      Link: https://lkml.kernel.org/r/20190925214242.21873-1-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  8. 12 Oct 2019, 2 commits
    • sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr() · 46ff0e2f
      KeMeng Shi committed
      [ Upstream commit 714e501e16cd473538b609b3e351b2cc9f7f09ed ]
      
      An oops can be triggered in the scheduler when running qemu on arm64:
      
       Unable to handle kernel paging request at virtual address ffff000008effe40
       Internal error: Oops: 96000007 [#1] SMP
       Process migration/0 (pid: 12, stack limit = 0x00000000084e3736)
       pstate: 20000085 (nzCv daIf -PAN -UAO)
       pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
       lr : move_queued_task.isra.21+0x124/0x298
       ...
       Call trace:
        __ll_sc___cmpxchg_case_acq_4+0x4/0x20
        __migrate_task+0xc8/0xe0
        migration_cpu_stop+0x170/0x180
        cpu_stopper_thread+0xec/0x178
        smpboot_thread_fn+0x1ac/0x1e8
        kthread+0x134/0x138
        ret_from_fork+0x10/0x18
      
      __set_cpus_allowed_ptr() will choose an active dest_cpu in the affinity
      mask to migrate the process to if the process is not currently running
      on any of the CPUs specified in the affinity mask. However, it will
      choose an invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual
      machine) if the CPUs in the affinity mask are deactivated by cpu_down
      after the cpumask_intersects check. The subsequent cpumask_test_cpu()
      of dest_cpu then reads past the end of the cpumask and may pass if the
      corresponding bit is coincidentally set. As a consequence, the kernel
      will access an invalid rq address associated with the invalid CPU in
      migration_cpu_stop->__migrate_task->move_queued_task and the Oops
      occurs.
      
      To reproduce the crash:
      
        1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
        sched_setaffinity.
      
        2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online"
        and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.
      
        3) The Oops appears if the bit for the invalid CPU happens to be set
        in the memory just past the tested cpumask.
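      
      The failure mode described above can be illustrated with a small
      user-space analogue (names are illustrative; NR_CPU_IDS and the "pick
      any CPU in both masks" helper mimic the kernel convention of returning
      the mask size when the intersection is empty):
      
      #include <stdio.h>
      
      #define NR_CPU_IDS 4    /* analogue of nr_cpu_ids */
      
      /* Returns the first CPU set in both masks, or NR_CPU_IDS if none. */
      static int any_and(unsigned int a, unsigned int b)
      {
              for (int cpu = 0; cpu < NR_CPU_IDS; cpu++)
                      if ((a & b) & (1u << cpu))
                              return cpu;
              return NR_CPU_IDS;
      }
      
      int main(void)
      {
              unsigned int affinity = 1u << 1;        /* task allowed on cpu1 only */
              unsigned int online   = 1u << 0;        /* cpu1 went offline meanwhile */
      
              int dest_cpu = any_and(online, affinity);
              if (dest_cpu >= NR_CPU_IDS) {           /* must bail out here ... */
                      printf("no valid dest_cpu, return -EINVAL\n");
                      return 1;
              }
              /* ... otherwise dest_cpu would index per-CPU data out of bounds. */
              printf("migrate to cpu%d\n", dest_cpu);
              return 0;
      }
      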
      Signed-off-by: KeMeng Shi <shikemeng@huawei.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/membarrier: Fix private expedited registration check · 6cb7aa1b
      Mathieu Desnoyers committed
      [ Upstream commit fc0d77387cb5ae883fd774fc559e056a8dde024c ]
      
      Fix a logic flaw in the way membarrier_register_private_expedited()
      handles ready state checks for private expedited sync core and private
      expedited registrations.
      
      If a private expedited membarrier registration is first performed, and
      then a private expedited sync_core registration is performed, the ready
      state check will skip the second registration when it really should not.
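      
      A minimal sketch of the intended behaviour (flag names are stand-ins,
      not the exact kernel constants): the "already registered" check must
      test the ready bit matching the flavour being requested, so a prior
      plain registration does not skip a later sync_core one:
      
      #include <stdio.h>
      
      #define READY_PLAIN     0x1
      #define READY_SYNC_CORE 0x2
      
      static unsigned int mm_membarrier_state;
      
      static void register_private_expedited(int sync_core)
      {
              unsigned int ready = sync_core ? READY_SYNC_CORE : READY_PLAIN;
      
              /* The buggy check effectively tested only the plain ready bit. */
              if (mm_membarrier_state & ready)
                      return;                 /* this flavour is already ready */
      
              /* ... perform the registration work ... */
              mm_membarrier_state |= ready;
              printf("registered %s\n", sync_core ? "sync_core" : "plain");
      }
      
      int main(void)
      {
              register_private_expedited(0);  /* plain private expedited */
              register_private_expedited(1);  /* sync_core must not be skipped */
              return 0;
      }
      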
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20190919173705.2181-2-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  9. 05 Oct 2019, 5 commits
    • sched/cpufreq: Align trace event behavior of fast switching · 01e8f487
      Douglas RAILLARD committed
      [ Upstream commit 77c84dd1881d0f0176cb678d770bfbda26c54390 ]
      
      The fast switching path only emits an event for the CPU of interest,
      whereas the regular path emits an event for all the CPUs that had
      their frequency changed, i.e. all the CPUs sharing the same policy.
      
      With the current behavior, looking at the cpu_frequency event for a
      given CPU that is using the fast switching path will not give the
      correct frequency signal.
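      
      The behaviour being aligned can be sketched as follows (stubbed, not
      the schedutil source): after a successful fast switch, emit the
      cpu_frequency event for every CPU sharing the policy, as the regular
      path already does, instead of only for the local CPU:
      
      /* Stub standing in for trace_cpu_frequency(). */
      static void trace_cpu_frequency_stub(unsigned int khz, int cpu) { /* ... */ }
      
      static void fast_switch_trace_sketch(unsigned int new_khz,
                                           const int *policy_cpus, int nr_cpus)
      {
              /* was: emit only for smp_processor_id() */
              for (int i = 0; i < nr_cpus; i++)
                      trace_cpu_frequency_stub(new_khz, policy_cpus[i]);
      }
      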
      Signed-off-by: Douglas RAILLARD <douglas.raillard@arm.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • idle: Prevent late-arriving interrupts from disrupting offline · 51111023
      Peter Zijlstra committed
      [ Upstream commit e78a7614f3876ac649b3df608789cb6ef74d0480 ]
      
      Scheduling-clock interrupts can arrive late in the CPU-offline process,
      after idle entry and the subsequent call to cpuhp_report_idle_dead().
      Once execution passes the call to rcu_report_dead(), RCU is ignoring
      the CPU, which results in lockdep complaints when the interrupt handler
      uses RCU:
      
      ------------------------------------------------------------------------
      
      =============================
      WARNING: suspicious RCU usage
      5.2.0-rc1+ #681 Not tainted
      -----------------------------
      kernel/sched/fair.c:9542 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      RCU used illegally from offline CPU!
      rcu_scheduler_active = 2, debug_locks = 1
      no locks held by swapper/5/0.
      
      stack backtrace:
      CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.2.0-rc1+ #681
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
      Call Trace:
       <IRQ>
       dump_stack+0x5e/0x8b
       trigger_load_balance+0xa8/0x390
       ? tick_sched_do_timer+0x60/0x60
       update_process_times+0x3b/0x50
       tick_sched_handle+0x2f/0x40
       tick_sched_timer+0x32/0x70
       __hrtimer_run_queues+0xd3/0x3b0
       hrtimer_interrupt+0x11d/0x270
       ? sched_clock_local+0xc/0x74
       smp_apic_timer_interrupt+0x79/0x200
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      RIP: 0010:delay_tsc+0x22/0x50
      Code: ff 0f 1f 80 00 00 00 00 65 44 8b 05 18 a7 11 48 0f ae e8 0f 31 48 89 d6 48 c1 e6 20 48 09 c6 eb 0e f3 90 65 8b 05 fe a6 11 48 <41> 39 c0 75 18 0f ae e8 0f 31 48 c1 e2 20 48 09 c2 48 89 d0 48 29
      RSP: 0000:ffff8f92c0157ed0 EFLAGS: 00000212 ORIG_RAX: ffffffffffffff13
      RAX: 0000000000000005 RBX: ffff8c861f356400 RCX: ffff8f92c0157e64
      RDX: 000000321214c8cc RSI: 00000032120daa7f RDI: 0000000000260f15
      RBP: 0000000000000005 R08: 0000000000000005 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff8c861ee18000 R15: ffff8c861ee18000
       cpuhp_report_idle_dead+0x31/0x60
       do_idle+0x1d5/0x200
       ? _raw_spin_unlock_irqrestore+0x2d/0x40
       cpu_startup_entry+0x14/0x20
       start_secondary+0x151/0x170
       secondary_startup_64+0xa4/0xb0
      
      ------------------------------------------------------------------------
      
      This happens rarely, but can be forced to happen more often by
      placing delays in cpuhp_report_idle_dead() following the call to
      rcu_report_dead().  With this in place, the following rcutorture
      scenario reproduces the problem within a few minutes:
      
      tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 8 --duration 5 --kconfig "CONFIG_DEBUG_LOCK_ALLOC=y CONFIG_PROVE_LOCKING=y" --configs "TREE04"
      
      This commit uses the crude but effective expedient of moving the disabling
      of interrupts within the idle loop to precede the cpu_is_offline()
      check.  It also invokes tick_nohz_idle_stop_tick() instead of
      tick_nohz_idle_stop_tick_protected() to shut off the scheduling-clock
      interrupt.
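      
      The resulting ordering in the idle loop looks roughly like this (stub
      helpers, not the actual do_idle() source): interrupts are disabled
      before the offline check, so a late scheduling-clock interrupt can no
      longer slip in between the check and the offline report:
      
      static int  cpu_is_offline_stub(int cpu)  { return 0; }
      static void local_irq_disable_stub(void)  { }
      static void tick_stop_stub(void)          { }
      static void report_idle_dead_stub(void)   { }
      static void arch_cpu_idle_dead_stub(void) { }
      static void arch_cpu_idle_stub(void)      { }
      
      static void idle_iteration_sketch(int cpu)
      {
              local_irq_disable_stub();               /* moved before the check */
      
              if (cpu_is_offline_stub(cpu)) {
                      tick_stop_stub();               /* tick_nohz_idle_stop_tick() */
                      report_idle_dead_stub();
                      arch_cpu_idle_dead_stub();
              }
      
              arch_cpu_idle_stub();                   /* idle with irqs disabled */
      }
      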
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      [ paulmck: Revert tick_nohz_idle_stop_tick_protected() removal, new callers. ]
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/fair: Use rq_lock/unlock in online_fair_sched_group · 9addfbd4
      Phil Auld committed
      [ Upstream commit a46d14eca7b75fffe35603aa8b81df654353d80f ]
      
      Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes a
      warning to fire in update_rq_clock(). This seems to be caused by
      onlining a new fair sched group without using the rq lock wrappers.
      
        [] rq->clock_update_flags & RQCF_UPDATED
        [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 update_rq_clock+0xec/0x150
      
        [] Call Trace:
        []  online_fair_sched_group+0x53/0x100
        []  cpu_cgroup_css_online+0x16/0x20
        []  online_css+0x1c/0x60
        []  cgroup_apply_control_enable+0x231/0x3b0
        []  cgroup_mkdir+0x41b/0x530
        []  kernfs_iop_mkdir+0x61/0xa0
        []  vfs_mkdir+0x108/0x1a0
        []  do_mkdirat+0x77/0xe0
        []  do_syscall_64+0x55/0x1d0
        []  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Using the wrappers in online_fair_sched_group instead of the raw locking
      removes this warning.
      
      [ tglx: Use rq_*lock_irq() ]
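      
      A rough sketch of the locking pattern (stub types; the real
      rq_lock_irq()/rq_unlock_irq() wrappers live in kernel/sched/sched.h and
      also maintain the clock/pinning state that update_rq_clock() checks):
      
      struct rq_stub { int lock; };
      struct rq_flags_stub { int flags; };
      
      static void rq_lock_irq_stub(struct rq_stub *rq, struct rq_flags_stub *rf)   { }
      static void rq_unlock_irq_stub(struct rq_stub *rq, struct rq_flags_stub *rf) { }
      static void update_rq_clock_stub(struct rq_stub *rq) { }
      
      static void online_group_sketch(struct rq_stub *rq)
      {
              struct rq_flags_stub rf;
      
              rq_lock_irq_stub(rq, &rf);      /* was: raw_spin_lock_irq(&rq->lock) */
              update_rq_clock_stub(rq);
              /* ... attach the new cfs_rq / update curr ... */
              rq_unlock_irq_stub(rq, &rf);
      }
      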
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/20190801133749.11033-1-pauld@redhat.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/deadline: Fix bandwidth accounting at all levels after offline migration · 0f308569
      Juri Lelli committed
      [ Upstream commit 59d06cea1198d665ba11f7e8c5f45b00ff2e4812 ]
      
      If a task happens to be throttled while the CPU it was running on gets
      hotplugged off, the bandwidth associated with the task is not correctly
      migrated with it when the replenishment timer fires (offline_migration).
      
      Fix things up for this_bw, running_bw and total_bw when the
      replenishment timer fires and the task is migrated
      (dl_task_offline_migration()).
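      
      At a high level, the accounting fix-up has this shape (a heavily
      simplified sketch with stub types; the real work happens in
      dl_task_offline_migration(), and total_bw on the root domains is
      adjusted analogously): the task's deadline bandwidth is removed from
      the old, now-offline CPU's accounting and added to the CPU it is
      migrated to:
      
      struct dl_rq_stub { unsigned long long running_bw, this_bw; };
      
      static void sub_bw(struct dl_rq_stub *dl, unsigned long long bw)
      {
              dl->running_bw -= bw;
              dl->this_bw    -= bw;
      }
      
      static void add_bw(struct dl_rq_stub *dl, unsigned long long bw)
      {
              dl->running_bw += bw;
              dl->this_bw    += bw;
      }
      
      static void offline_migration_sketch(struct dl_rq_stub *old_dl,
                                           struct dl_rq_stub *new_dl,
                                           unsigned long long task_bw)
      {
              sub_bw(old_dl, task_bw);
              add_bw(new_dl, task_bw);
      }
      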
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bristot@redhat.com
      Cc: claudio@evidence.eu.com
      Cc: lizefan@huawei.com
      Cc: longman@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: mathieu.poirier@linaro.org
      Cc: rostedt@goodmis.org
      Cc: tj@kernel.org
      Cc: tommaso.cucinotta@santannapisa.it
      Link: https://lkml.kernel.org/r/20190719140000.31694-5-juri.lelli@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/core: Fix CPU controller for !RT_GROUP_SCHED · f381d3d2
      Juri Lelli committed
      [ Upstream commit a07db5c0865799ebed1f88be0df50c581fb65029 ]
      
      On !CONFIG_RT_GROUP_SCHED configurations it is currently not possible
      to move RT tasks between cgroups to which the CPU controller has been
      attached; but it is oddly possible to first move tasks around and then
      make them RT (sched_setscheduler() to FIFO/RR).
      
      E.g.:
      
        # mkdir /sys/fs/cgroup/cpu,cpuacct/group1
        # chrt -fp 10 $$
        # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks
        bash: echo: write error: Invalid argument
        # chrt -op 0 $$
        # echo $$ > /sys/fs/cgroup/cpu,cpuacct/group1/tasks
        # chrt -fp 10 $$
        # cat /sys/fs/cgroup/cpu,cpuacct/group1/tasks
        2345
        2598
        # chrt -p 2345
        pid 2345's current scheduling policy: SCHED_FIFO
        pid 2345's current scheduling priority: 10
      
      Also, as Michal noted, it is currently not possible to enable CPU
      controller on unified hierarchy with !CONFIG_RT_GROUP_SCHED (if there
      are any kernel RT threads in root cgroup, they can't be migrated to the
      newly created CPU controller's root in cgroup_update_dfl_csses()).
      
      The existing code comes with a comment saying "we don't support
      RT-tasks being in separate groups". That comment is however stale and
      belongs to pre-RT_GROUP_SCHED times. Also, it doesn't make much sense
      for !RT_GROUP_SCHED configurations, since checks related to RT
      bandwidth are not performed at all in these cases.
      
      Make moving RT tasks between CPU controller groups viable by removing
      the special-case check for RT (and DEADLINE) tasks.
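      
      The shape of the change, per the description above (a sketch only, with
      the cgroup taskset iteration elided and stub names used): the
      fair-class-only check that rejected RT/DEADLINE tasks on
      !RT_GROUP_SCHED kernels is dropped, leaving only the RT-bandwidth check
      that RT_GROUP_SCHED configurations perform:
      
      #include <errno.h>
      
      struct task_struct;
      
      static int sched_rt_can_attach_stub(struct task_struct *task) { return 1; }
      
      static int can_attach_task_sketch(struct task_struct *task)
      {
      #ifdef CONFIG_RT_GROUP_SCHED
              if (!sched_rt_can_attach_stub(task))
                      return -EINVAL;
      #endif
              /* Removed: the !RT_GROUP_SCHED special case that rejected any
               * task not in the fair sched class with -EINVAL. */
              return 0;
      }
      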
      Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: lizefan@huawei.com
      Cc: longman@redhat.com
      Cc: luca.abeni@santannapisa.it
      Cc: rostedt@goodmis.org
      Link: https://lkml.kernel.org/r/20190719063455.27328-1-juri.lelli@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>