1. 20 Jun 2023: 2 commits
  2. 15 Jun 2023: 1 commit
• sched: Fix possible deadlock in tg_set_dynamic_affinity_mode · 21e5d85e
  Authored by Hui Tang
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGD0
      CVE: NA
      
      ----------------------------------------
      
      Deadlock occurs in two situations as follows:
      
      The first case:
      
      tg_set_dynamic_affinity_mode    --- raw_spin_lock_irq(&auto_affi->lock);
	->start_auto_affinity   --- trigger timer
      		->tg_update_task_prefer_cpus
			->css_task_iter_next
      				->raw_spin_unlock_irq
      
hrtimer_run_queues
        ->sched_auto_affi_period_timer --- try spin lock (&auto_affi->lock)
      
The second case is as follows:
      
      [  291.470810] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
      [  291.472715] rcu:     1-...0: (0 ticks this GP) idle=a6a/1/0x4000000000000002 softirq=78516/78516 fqs=5249
      [  291.475268] rcu:     (detected by 6, t=21006 jiffies, g=202169, q=9862)
      [  291.477038] Sending NMI from CPU 6 to CPUs 1:
      [  291.481268] NMI backtrace for cpu 1
      [  291.481273] CPU: 1 PID: 1923 Comm: sh Kdump: loaded Not tainted 4.19.90+ #150
      [  291.481278] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
      [  291.481281] RIP: 0010:queued_spin_lock_slowpath+0x136/0x9a0
      [  291.481289] Code: c0 74 3f 49 89 dd 48 89 dd 48 b8 00 00 00 00 00 fc ff df 49 c1 ed 03 83 e5 07 49 01 c5 83 c5 03 48 83 05 c4 66 b9 05 01 f3 90 <41> 0f b6 45 00 40 38 c5 7c 08 84 c0 0f 85 ad 07 00 00 0
      [  291.481292] RSP: 0018:ffff88801de87cd8 EFLAGS: 00000002
      [  291.481297] RAX: 0000000000000101 RBX: ffff888001be0a28 RCX: ffffffffb8090f7d
      [  291.481301] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff888001be0a28
      [  291.481304] RBP: 0000000000000003 R08: ffffed100037c146 R09: ffffed100037c146
      [  291.481307] R10: 000000001106b143 R11: ffffed100037c145 R12: 1ffff11003bd0f9c
      [  291.481311] R13: ffffed100037c145 R14: fffffbfff7a38dee R15: dffffc0000000000
      [  291.481315] FS:  00007fac4f306740(0000) GS:ffff88801de80000(0000) knlGS:0000000000000000
      [  291.481318] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  291.481321] CR2: 00007fac4f4bb650 CR3: 00000000046b6000 CR4: 00000000000006e0
      [  291.481323] Call Trace:
      [  291.481324]  <IRQ>
      [  291.481326]  ? osq_unlock+0x2a0/0x2a0
      [  291.481329]  ? check_preemption_disabled+0x4c/0x290
      [  291.481331]  ? rcu_accelerate_cbs+0x33/0xed0
      [  291.481333]  _raw_spin_lock_irqsave+0x83/0xa0
      [  291.481336]  sched_auto_affi_period_timer+0x251/0x820
      [  291.481338]  ? __remove_hrtimer+0x151/0x200
      [  291.481340]  __hrtimer_run_queues+0x39d/0xa50
      [  291.481343]  ? tg_update_affinity_domain_down+0x460/0x460
      [  291.481345]  ? enqueue_hrtimer+0x2e0/0x2e0
      [  291.481348]  ? ktime_get_update_offsets_now+0x1d7/0x2c0
      [  291.481350]  hrtimer_run_queues+0x243/0x470
      [  291.481352]  run_local_timers+0x5e/0x150
      [  291.481354]  update_process_times+0x36/0xb0
      [  291.481357]  tick_sched_handle.isra.4+0x7c/0x180
      [  291.481359]  tick_nohz_handler+0xd1/0x1d0
      [  291.481365]  smp_apic_timer_interrupt+0x12c/0x4e0
      [  291.481368]  apic_timer_interrupt+0xf/0x20
      [  291.481370]  </IRQ>
      [  291.481372]  ? smp_call_function_many+0x68c/0x840
      [  291.481375]  ? smp_call_function_many+0x6ab/0x840
      [  291.481377]  ? arch_unregister_cpu+0x60/0x60
      [  291.481379]  ? native_set_fixmap+0x100/0x180
      [  291.481381]  ? arch_unregister_cpu+0x60/0x60
      [  291.481384]  ? set_task_select_cpus+0x116/0x940
      [  291.481386]  ? smp_call_function+0x53/0xc0
      [  291.481388]  ? arch_unregister_cpu+0x60/0x60
      [  291.481390]  ? on_each_cpu+0x49/0xf0
      [  291.481393]  ? set_task_select_cpus+0x115/0x940
      [  291.481395]  ? text_poke_bp+0xff/0x180
      [  291.481397]  ? poke_int3_handler+0xc0/0xc0
      [  291.481400]  ? __set_prefer_cpus_ptr.constprop.4+0x1cd/0x900
      [  291.481402]  ? hrtick+0x1b0/0x1b0
      [  291.481404]  ? set_task_select_cpus+0x115/0x940
      [  291.481407]  ? __jump_label_transform.isra.0+0x3a1/0x470
      [  291.481409]  ? kernel_init+0x280/0x280
      [  291.481411]  ? kasan_check_read+0x1d/0x30
      [  291.481413]  ? mutex_lock+0x96/0x100
      [  291.481415]  ? __mutex_lock_slowpath+0x30/0x30
      [  291.481418]  ? arch_jump_label_transform+0x52/0x80
      [  291.481420]  ? set_task_select_cpus+0x115/0x940
      [  291.481422]  ? __jump_label_update+0x1a1/0x1e0
      [  291.481424]  ? jump_label_update+0x2ee/0x3b0
      [  291.481427]  ? static_key_slow_inc_cpuslocked+0x1c8/0x2d0
      [  291.481430]  ? start_auto_affinity+0x190/0x200
      [  291.481432]  ? tg_set_dynamic_affinity_mode+0xad/0xf0
      [  291.481435]  ? cpu_affinity_mode_write_u64+0x22/0x30
      [  291.481437]  ? cgroup_file_write+0x46f/0x660
      [  291.481439]  ? cgroup_init_cftypes+0x300/0x300
      [  291.481441]  ? __mutex_lock_slowpath+0x30/0x30
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
      21e5d85e
3. 09 Jun 2023: 1 commit
• sched: Introduce smart grid scheduling strategy for cfs · 713cfd26
  Authored by Hui Tang
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7BQZ0
      CVE: NA
      
      ----------------------------------------
      
      We want to dynamically expand or shrink the affinity range of tasks
      based on the CPU topology level while meeting the minimum resource
      requirements of tasks.
      
We divide several levels of affinity domains according to the sched domains:
      
      level4   * SOCKET  [                                                  ]
      level3   * DIE     [                             ]
      level2   * MC      [             ] [             ]
      level1   * SMT     [     ] [     ] [     ] [     ]
      level0   * CPU      0   1   2   3   4   5   6   7
      
Whether users tend toward power saving or performance affects the strategy
for adjusting affinity. When the power-saving mode is selected, we choose a
more appropriate affinity based on the energy model to reduce power
consumption, while still considering the QoS of resources such as CPU and
memory. For instance, if the current CPU load of a task is lower than
required, smart grid uses the energy model to decide whether to aggregate
tasks into a smaller range.
      
The main difference from EAS is that we pay more attention to the impact on
power consumption from mechanisms such as cpuidle and DVFS, and we classify
tasks to reduce interference and ensure resource QoS within each divided
unit, which is more suitable for general-purpose workloads on
non-heterogeneous CPUs.
      
              --------        --------        --------
             | group0 |      | group1 |      | group2 |
              --------        --------        --------
      	   |                |              |
      	   v                |              v
             ---------------------+-----     -----------------
            |                  ---v--   |   |
            |       DIE0      |  MC1 |  |   |   DIE1
            |                  ------   |   |
             ---------------------------     -----------------
      
We regularly measure how well each group's resource requirements are
satisfied and adjust its affinity; scheduling balance and memory migration
are also considered based on memory locality, to better meet resource
requirements.
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
      713cfd26
4. 08 May 2023: 1 commit
• sched_getaffinity: don't assume 'cpumask_size()' is fully initialized · e7b1f698
  Authored by Linus Torvalds
      stable inclusion
      from stable-v4.19.280
      commit 178ff87d2a0c2d3d74081e1c2efbb33b3487267d
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I715PM
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 6015b1ac ]
      
      The getaffinity() system call uses 'cpumask_size()' to decide how big
      the CPU mask is - so far so good.  It is indeed the allocation size of a
      cpumask.
      
      But the code also assumes that the whole allocation is initialized
      without actually doing so itself.  That's wrong, because we might have
      fixed-size allocations (making copying and clearing more efficient), but
      not all of it is then necessarily used if 'nr_cpu_ids' is smaller.
      
      Having checked other users of 'cpumask_size()', they all seem to be ok,
      either using it purely for the allocation size, or explicitly zeroing
      the cpumask before using the size in bytes to copy it.
      
      See for example the ublk_ctrl_get_queue_affinity() function that uses
      the proper 'zalloc_cpumask_var()' to make sure that the whole mask is
      cleared, whether the storage is on the stack or if it was an external
      allocation.
      
      Fix this by just zeroing the allocation before using it.  Do the same
      for the compat version of sched_getaffinity(), which had the same logic.
      
      Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to
      access the bits.  For a cpumask_var_t, it ends up being a pointer to the
      same data either way, but it's just a good idea to treat it like you
      would a 'cpumask_t'.  The compat case already did that.
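
A minimal sketch of the resulting pattern (a hypothetical helper, not the exact upstream diff; zalloc_cpumask_var(), cpumask_bits(), cpumask_size() and the kernel-internal sched_getaffinity() are the real APIs being used):

static long sketch_sched_getaffinity_user(pid_t pid, unsigned int len,
					  unsigned long __user *user_mask_ptr)
{
	cpumask_var_t mask;
	long ret;

	/* Zero the whole allocation up front: sched_getaffinity() only
	 * fills in bits below nr_cpu_ids, yet cpumask_size() bytes are
	 * copied back to user space. */
	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;

	ret = sched_getaffinity(pid, mask);
	if (ret == 0) {
		unsigned int retlen = min(len, cpumask_size());

		/* Access the bits via cpumask_bits(), as for a cpumask_t. */
		if (copy_to_user(user_mask_ptr, cpumask_bits(mask), retlen))
			ret = -EFAULT;
		else
			ret = retlen;
	}
	free_cpumask_var(mask);

	return ret;
}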
Reported-by: Ryan Roberts <ryan.roberts@arm.com>
      Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/
      Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      e7b1f698
5. 06 Apr 2023: 1 commit
  6. 21 Jul 2022: 2 commits
  7. 15 Jun 2022: 2 commits
  8. 19 Apr 2022: 1 commit
  9. 11 Mar 2022: 1 commit
  10. 28 Dec 2021: 1 commit
  11. 29 Nov 2021: 7 commits
  12. 17 Sep 2021: 2 commits
• tasks, sched/core: RCUify the assignment of rq->curr · 8e099519
  Authored by Eric W. Biederman
      mainline inclusion
      from mainline-5.4-rc1
      commit 5311a98f
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I3UKOW
      CVE: NA
      
      -------------------------------------------------
      
      The current task on the runqueue is currently read with rcu_dereference().
      
      To obtain ordinary RCU semantics for an rcu_dereference() of rq->curr it needs
      to be paired with rcu_assign_pointer() of rq->curr.  Which provides the
      memory barrier necessary to order assignments to the task_struct
      and the assignment to rq->curr.
      
      Unfortunately the assignment of rq->curr in __schedule is a hot path,
and it has already been shown that additional barriers in that code
      will reduce the performance of the scheduler.  So I will attempt to
      describe below why you can effectively have ordinary RCU semantics
      without any additional barriers.
      
      The assignment of rq->curr in init_idle is a slow path called once
      per cpu and that can use rcu_assign_pointer() without any concerns.
      
      As I write this there are effectively two users of rcu_dereference() on
      rq->curr.  There is the membarrier code in kernel/sched/membarrier.c
      that only looks at "->mm" after the rcu_dereference().  Then there is
      task_numa_compare() in kernel/sched/fair.c.  My best reading of the
code shows that task_numa_compare only accesses: "->flags",
      "->cpus_ptr", "->numa_group", "->numa_faults[]",
      "->total_numa_faults", and "->se.cfs_rq".
      
      The code in __schedule() essentially does:
      	rq_lock(...);
      	smp_mb__after_spinlock();
      
      	next = pick_next_task(...);
      	rq->curr = next;
      
      	context_switch(prev, next);
      
      At the start of the function the rq_lock/smp_mb__after_spinlock
      pair provides a full memory barrier.  Further there is a full memory barrier
      in context_switch().
      
      This means that any task that has already run and modified itself (the
      common case) has already seen two memory barriers before __schedule()
      runs and begins executing.  A task that modifies itself then sees a
third full memory barrier paired with the rq_lock().
      
For a brand new task that is enqueued with wake_up_new_task() there
are the memory barriers from taking and releasing the pi_lock and the
rq_lock as the process is enqueued, as well as the full memory barrier
at the start of __schedule(), assuming __schedule() happens on the
same cpu.
      
This means that by the time we reach the assignment of rq->curr,
except for values on the task_struct modified in pick_next_task(),
the code has the same guarantees as if it used rcu_assign_pointer().
      
Reading through all of the implementations of pick_next_task, it
appears pick_next_task is limited to modifying the task_struct fields
"->se", "->rt", and "->dl".  These fields are the sched_entity
structures of the various schedulers.
      
      Further "->se.cfs_rq" is only changed in cgroup attach/move operations
      initialized by userspace.
      
Unless I have missed something, this means that in practice the
users of "rcu_dereference(rq->curr)" get normal RCU semantics of
rcu_dereference() for the fields they care about, despite the
assignment of rq->curr in __schedule() not using rcu_assign_pointer().
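
In code, the change argued for above amounts to roughly the following sketch (simplified, not the exact diff):

	/* Hot path, __schedule(): the rq_lock()/smp_mb__after_spinlock() pair
	 * and the barrier in context_switch() already order the publication,
	 * so an RCU-annotated store without an extra barrier is enough. */
	RCU_INIT_POINTER(rq->curr, next);

	/* Slow path, init_idle(): runs once per cpu, so the full barrier in
	 * rcu_assign_pointer() is acceptable here. */
	rcu_assign_pointer(rq->curr, idle);

	/* Readers keep using rcu_dereference(rq->curr) as before. */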
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Li Hua <hucool.lihua@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8e099519
• tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue · af5294bd
  Authored by Eric W. Biederman
      mainline inclusion
      from mainline-5.4-rc1
      commit 0ff7b2cf
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I3UKOW
      CVE: NA
      
      -------------------------------------------------
      
      In the ordinary case today the RCU grace period for a task_struct is
triggered when another process waits for its zombie and causes the
kernel to call release_task().  As the waiting task has to receive a
signal and then act upon it before this happens, typically this will
occur after the original task has been removed from the runqueue.
      
Unfortunately in some cases such as self reaping tasks it can be shown
      that release_task() will be called starting the grace period for
      task_struct long before the task leaves the runqueue.
      
      Therefore use put_task_struct_rcu_user() in finish_task_switch() to
guarantee that there is an RCU lifetime after the task
      leaves the runqueue.
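
A sketch of the change in finish_task_switch() (simplified; the surrounding TASK_DEAD handling is unchanged):

	if (unlikely(prev_state == TASK_DEAD)) {
		if (prev->sched_class->task_dead)
			prev->sched_class->task_dead(prev);

		/* was: put_task_struct(prev);
		 * put_task_struct_rcu_user() keeps the task_struct alive for
		 * an RCU grace period after it has left the runqueue. */
		put_task_struct_rcu_user(prev);
	}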
      
      Besides the change in the start of the RCU grace period for the
task_struct, this change may delay perf_event_delayed_put and
trace_sched_process_free.  The function perf_event_delayed_put boils
down to just a WARN_ON for cases that I assume never happen.  So
      I don't see any problem with delaying it.
      
      The function trace_sched_process_free is a trace point and thus
visible to user space.  Occasionally userspace has the strangest
dependencies so this has a minuscule chance of causing a regression.
      This change only changes the timing of when the tracepoint is called.
      The change in timing arguably gives userspace a more accurate picture
      of what is going on.  So I don't expect there to be a regression.
      
      In the case where a task self reaps we are pretty much guaranteed that
      the RCU grace period is delayed.  So we should get quite a bit of
coverage of this worst case for the change in a normal threaded
      workload.  So I expect any issues to turn up quickly or not at all.
      
      I have lightly tested this change and everything appears to work
      fine.
Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87r24jdpl5.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Li Hua <hucool.lihua@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
      
       Conflicts:
               kernel/fork.c
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      af5294bd
13. 01 Sep 2021: 1 commit
• sched: Fix sched_fork() access an invalid sched_task_group · 74bd9b82
  Authored by Zhang Qiao
      hulk inclusion
      category: bugfix
      bugzilla: 177205, https://gitee.com/openeuler/kernel/issues/I484Y1
      CVE: NA
      
      --------------------------------
      
There is a small race between copy_process() and sched_fork()
where child->sched_task_group points to an already freed pointer.
      
      parent doing fork()      | someone moving the parent
      				to another cgroup
      -------------------------------+-------------------------------
      copy_process()
           + dup_task_struct()<1>
                                      parent move to another cgroup,
                                      and free the old cgroup. <2>
           + sched_fork()
             + __set_task_cpu()<3>
               + task_fork_fair()
                 + sched_slice()<4>
      
In the worst case, this bug can lead to "use-after-free" and
cause a panic as shown below:
(1) parent copies its sched_task_group to the child at <1>;
(2) someone moves the parent to another cgroup and frees the old
   cgroup at <2>;
(3) the sched_task_group and cfs_rq that belong to the old cgroup
   will be accessed at <3> and <4>, which causes a panic:
      
      [89249.732198] BUG: unable to handle kernel NULL pointer
      dereference at 0000000000000000
      [89249.732701] PGD 8000001fa0a86067 P4D 8000001fa0a86067 PUD
      2029955067 PMD 0
      [89249.733005] Oops: 0000 [#1] SMP PTI
      [89249.733288] CPU: 7 PID: 648398 Comm: ebizzy Kdump: loaded
      Tainted: G           OE    --------- -  - 4.18.0.x86_64+ #1
      [89249.734318] RIP: 0010:sched_slice+0x84/0xc0
      ....
      [89249.737910] Call Trace:
      [89249.738181]  task_fork_fair+0x81/0x120
      [89249.738457]  sched_fork+0x132/0x240
      [89249.738732]  copy_process.part.5+0x675/0x20e0
      [89249.739010]  ? __handle_mm_fault+0x63f/0x690
      [89249.739286]  _do_fork+0xcd/0x3b0
      [89249.739558]  do_syscall_64+0x5d/0x1d0
      [89249.739830]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      [89249.740107] RIP: 0033:0x7f04418cd7e1
      
When a new process is forked, cgroup_post_fork() associates it
with the cgroup of its parent. Therefore this commit moves
__set_task_cpu() and task_fork(), which access cgroup-related
fields (sched_task_group and cfs_rq), into sched_post_fork() and
calls sched_post_fork() after cgroup_post_fork(), as sketched below.
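
A sketch of the resulting flow (simplified; the actual patch may differ in detail):

	/* copy_process():
	 *	sched_fork(clone_flags, p);	// no longer touches cgroup state
	 *	...
	 *	cgroup_post_fork(p);		// child attached to its final cgroup
	 *	sched_post_fork(p);		// now safe to use p->sched_task_group
	 */
	void sched_post_fork(struct task_struct *p)
	{
		unsigned long flags;

		raw_spin_lock_irqsave(&p->pi_lock, flags);
		__set_task_cpu(p, smp_processor_id());
		if (p->sched_class->task_fork)
			p->sched_class->task_fork(p);
		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
	}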
      
      Fixes: 8323f26c ("sched: Fix race in task_group")
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      74bd9b82
14. 07 May 2021: 2 commits
  15. 30 Apr 2021: 2 commits
  16. 15 Apr 2021: 2 commits
  17. 14 Apr 2021: 3 commits
  18. 22 Feb 2021: 1 commit
  19. 22 Sep 2020: 3 commits
• sched: Fix unreliable rseq cpu_id for new tasks · 7f1d4048
  Authored by Mathieu Desnoyers
      stable inclusion
      from linux-4.19.134
      commit d2fc2e5774eb1911829ae761bc1569a05b72ebdc
      
      --------------------------------
      
      commit ce3614da upstream.
      
      While integrating rseq into glibc and replacing glibc's sched_getcpu
      implementation with rseq, glibc's tests discovered an issue with
      incorrect __rseq_abi.cpu_id field value right after the first time
      a newly created process issues sched_setaffinity.
      
      For the records, it triggers after building glibc and running tests, and
      then issuing:
      
        for x in {1..2000} ; do posix/tst-affinity-static  & done
      
      and shows up as:
      
      error: Unexpected CPU 2, expected 0
      error: Unexpected CPU 2, expected 0
      error: Unexpected CPU 2, expected 0
      error: Unexpected CPU 2, expected 0
      error: Unexpected CPU 138, expected 0
      error: Unexpected CPU 138, expected 0
      error: Unexpected CPU 138, expected 0
      error: Unexpected CPU 138, expected 0
      
      This is caused by the scheduler invoking __set_task_cpu() directly from
      sched_fork() and wake_up_new_task(), thus bypassing rseq_migrate() which
      is done by set_task_cpu().
      
      Add the missing rseq_migrate() to both functions. The only other direct
      use of __set_task_cpu() is done by init_idle(), which does not involve a
      user-space task.
      
      Based on my testing with the glibc test-case, just adding rseq_migrate()
      to wake_up_new_task() is sufficient to fix the observed issue. Also add
      it to sched_fork() to keep things consistent.
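
A condensed sketch of the fix (both call sites get the same treatment):

	/* sched_fork() and wake_up_new_task() place the new task on a CPU
	 * with __set_task_cpu(), which bypasses set_task_cpu() and therefore
	 * rseq_migrate(); add the call explicitly: */
	__set_task_cpu(p, cpu);
	rseq_migrate(p);	/* keep __rseq_abi.cpu_id consistent for the child */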
      
      The reason why this never triggered so far with the rseq/basic_test
      selftest is unclear.
      
      The current use of sched_getcpu(3) does not typically require it to be
      always accurate. However, use of the __rseq_abi.cpu_id field within rseq
      critical sections requires it to be accurate. If it is not accurate, it
      can cause corruption in the per-cpu data targeted by rseq critical
      sections in user-space.
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Florian Weimer <fweimer@redhat.com>
      Cc: stable@vger.kernel.org # v4.18+
Link: https://lkml.kernel.org/r/20200707201505.2632-1-mathieu.desnoyers@efficios.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      7f1d4048
• sched/core: Fix PI boosting between RT and DEADLINE tasks · 7bab2bb2
  Authored by Juri Lelli
      stable inclusion
      from linux-4.19.131
      commit e852bdcce9e41c26127e4b919210e3445590a1a4
      
      --------------------------------
      
      [ Upstream commit 740797ce ]
      
      syzbot reported the following warning:
      
       WARNING: CPU: 1 PID: 6351 at kernel/sched/deadline.c:628
       enqueue_task_dl+0x22da/0x38a0 kernel/sched/deadline.c:1504
      
      At deadline.c:628 we have:
      
       623 static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
       624 {
       625 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
       626 	struct rq *rq = rq_of_dl_rq(dl_rq);
       627
       628 	WARN_ON(dl_se->dl_boosted);
       629 	WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
              [...]
           }
      
      Which means that setup_new_dl_entity() has been called on a task
      currently boosted. This shouldn't happen though, as setup_new_dl_entity()
      is only called when the 'dynamic' deadline of the new entity
      is in the past w.r.t. rq_clock and boosted tasks shouldn't verify this
      condition.
      
Digging through the PI code I noticed that the above might in fact happen
if an RT task blocks on an rt_mutex held by a DEADLINE task. In the
      first branch of boosting conditions we check only if a pi_task 'dynamic'
      deadline is earlier than mutex holder's and in this case we set mutex
      holder to be dl_boosted. However, since RT 'dynamic' deadlines are only
      initialized if such tasks get boosted at some point (or if they become
      DEADLINE of course), in general RT 'dynamic' deadlines are usually equal
      to 0 and this verifies the aforementioned condition.
      
      Fix it by checking that the potential donor task is actually (even if
      temporary because in turn boosted) running at DEADLINE priority before
      using its 'dynamic' deadline value.
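
A sketch of the tightened condition in rt_mutex_setprio() (condensed from the fix referenced above):

	if (dl_prio(prio)) {
		struct task_struct *pi_task = rt_mutex_get_top_task(p);

		/* Only trust pi_task's 'dynamic' deadline if the donor really
		 * runs at DEADLINE priority (natively, or because it is itself
		 * boosted); an RT donor's dl.deadline is usually just 0. */
		if (!dl_prio(p->normal_prio) ||
		    (pi_task && dl_prio(pi_task->prio) &&
		     dl_entity_preempt(&pi_task->dl, &p->dl))) {
			p->dl.dl_boosted = 1;
			queue_flag |= ENQUEUE_REPLENISH;
		} else {
			p->dl.dl_boosted = 0;
		}
	}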
      
      Fixes: 2d3d891d ("sched/deadline: Add SCHED_DEADLINE inheritance logic")
      Reported-by: syzbot+119ba87189432ead09b4@syzkaller.appspotmail.com
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Tested-by: Daniel Wagner <dwagner@suse.de>
Link: https://lkml.kernel.org/r/20181119153201.GB2119@localhost.localdomain
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      7bab2bb2
• sched/core: Fix illegal RCU from offline CPUs · 6cf457d1
  Authored by Peter Zijlstra
      stable inclusion
      from linux-4.19.129
      commit 373491f1f41896241864b527b584856d8a510946
      
      --------------------------------
      
      [ Upstream commit bf2c59fc ]
      
      In the CPU-offline process, it calls mmdrop() after idle entry and the
      subsequent call to cpuhp_report_idle_dead(). Once execution passes the
      call to rcu_report_dead(), RCU is ignoring the CPU, which results in
      lockdep complaining when mmdrop() uses RCU from either memcg or
      debugobjects below.
      
      Fix it by cleaning up the active_mm state from BP instead. Every arch
      which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
      from AP. The only exception is parisc because it switches them to
      &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
      but the patch will still work there because it calls mmgrab(&init_mm) in
      smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().
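
A sketch of the BP-side cleanup (simplified from the finish_cpu() hotplug step the patch adds):

	static int finish_cpu(unsigned int cpu)
	{
		struct task_struct *idle = idle_thread_get(cpu);
		struct mm_struct *mm = idle->active_mm;

		/* idle_task_exit() already ran on the dying AP; drop the mm
		 * reference here on the BP, where RCU is still watching. */
		if (mm != &init_mm)
			idle->active_mm = &init_mm;
		mmdrop(mm);
		return 0;
	}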
      
        WARNING: suspicious RCU usage
        -----------------------------
        kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!
      
        other info that might help us debug this:
      
        RCU used illegally from offline CPU!
        Call Trace:
         dump_stack+0xf4/0x164 (unreliable)
         lockdep_rcu_suspicious+0x140/0x164
         get_work_pool+0x110/0x150
         __queue_work+0x1bc/0xca0
         queue_work_on+0x114/0x120
         css_release+0x9c/0xc0
         percpu_ref_put_many+0x204/0x230
         free_pcp_prepare+0x264/0x570
         free_unref_page+0x38/0xf0
         __mmdrop+0x21c/0x2c0
         idle_task_exit+0x170/0x1b0
         pnv_smp_cpu_kill_self+0x38/0x2e0
         cpu_die+0x48/0x64
         arch_cpu_idle_dead+0x30/0x50
         do_idle+0x2f4/0x470
         cpu_startup_entry+0x38/0x40
         start_secondary+0x7a8/0xa80
         start_secondary_resume+0x10/0x14
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Link: https://lkml.kernel.org/r/20200401214033.8448-1-cai@lca.pw
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6cf457d1
20. 05 Mar 2020: 2 commits
• sched/membarrier: Fix p->mm->membarrier_state racy load · 08946ecc
  Authored by Mathieu Desnoyers
      mainline inclusion
      from mainline-5.4-rc1
      commit 227a4aad
      category: bugfix
      bugzilla: 28332
      CVE: NA
      
      -------------------------------------------------
      
      The membarrier_state field is located within the mm_struct, which
      is not guaranteed to exist when used from runqueue-lock-free iteration
      on runqueues by the membarrier system call.
      
      Copy the membarrier_state from the mm_struct into the scheduler runqueue
      when the scheduler switches between mm.
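
A sketch of that snapshot step (close to the membarrier_switch_mm() helper the patch introduces):

	static inline void membarrier_switch_mm(struct rq *rq,
						struct mm_struct *prev_mm,
						struct mm_struct *next_mm)
	{
		int membarrier_state;

		if (prev_mm == next_mm)
			return;

		membarrier_state = atomic_read(&next_mm->membarrier_state);
		if (READ_ONCE(rq->membarrier_state) == membarrier_state)
			return;

		/* The runqueue now carries its own copy, so membarrier can
		 * read it without dereferencing a possibly-gone mm_struct. */
		WRITE_ONCE(rq->membarrier_state, membarrier_state);
	}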
      
      When registering membarrier for mm, after setting the registration bit
      in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without swapping to
another mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue whose current
mm is the one whose state has just been modified.
      
      Move the mm membarrier_state field closer to pgd in mm_struct to use
      a cache line already touched by the scheduler switch_mm.
      
      The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
      clear the runqueue's membarrier state in addition to clear the mm
      membarrier state, so move its implementation into the scheduler
      membarrier code so it can access the runqueue structure.
      
      Add memory barrier in membarrier_exec_mmap() prior to clearing
      the membarrier state, ensuring memory accesses executed prior to exec
      are not reordered with the stores clearing the membarrier state.
      
      As suggested by Linus, move all membarrier.c RCU read-side locks outside
      of the for each cpu loops.
      
      [Cheng Jian: use task_rcu_dereference in sync_runqueues_membarrier_state]
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      08946ecc
• sched: Clean up active_mm reference counting · cfd49aa0
  Authored by Peter Zijlstra
      mainline inclusion
      from mainline-5.4-rc1
      commit 139d025c
      category: bugfix
      bugzilla: 28332 [preparation]
      CVE: NA
      
      -------------------------------------------------
      
      The current active_mm reference counting is confusing and sub-optimal.
      
      Rewrite the code to explicitly consider the 4 separate cases:
      
          user -> user
      
      	When switching between two user tasks, all we need to consider
      	is switch_mm().
      
          user -> kernel
      
      	When switching from a user task to a kernel task (which
      	doesn't have an associated mm) we retain the last mm in our
      	active_mm. Increment a reference count on active_mm.
      
        kernel -> kernel
      
      	When switching between kernel threads, all we need to do is
      	pass along the active_mm reference.
      
        kernel -> user
      
      	When switching between a kernel and user task, we must switch
      	from the last active_mm to the next mm, hoping of course that
      	these are the same. Decrement a reference on the active_mm.
      
      The code keeps a different order, because as you'll note, both 'to
      user' cases require switch_mm().
      
      And where the old code would increment/decrement for the 'kernel ->
      kernel' case, the new code observes this is a neutral operation and
      avoids touching the reference count.
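
A condensed sketch of the restructured mm handling in context_switch() following those four cases:

	if (!next->mm) {				/* to kernel */
		enter_lazy_tlb(prev->active_mm, next);
		next->active_mm = prev->active_mm;

		if (prev->mm)				/* from user */
			mmgrab(prev->active_mm);
		else					/* kernel -> kernel */
			prev->active_mm = NULL;		/* just pass it along */
	} else {					/* to user */
		switch_mm_irqs_off(prev->active_mm, next->mm, next);

		if (!prev->mm) {			/* from kernel */
			/* mmdrop() happens later, in finish_task_switch(). */
			rq->prev_mm = prev->active_mm;
			prev->active_mm = NULL;
		}
	}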
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: luto@kernel.org
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      cfd49aa0
21. 27 Dec 2019: 2 commits
• sched/core: Avoid spurious lock dependencies · 011e08fb
  Authored by Peter Zijlstra
      [ Upstream commit ff51ff84 ]
      
      While seemingly harmless, __sched_fork() does hrtimer_init(), which,
when DEBUG_OBJECTS is enabled, can end up doing allocations.
      
      This then results in the following lock order:
      
        rq->lock
          zone->lock.rlock
            batched_entropy_u64.lock
      
      Which in turn causes deadlocks when we do wakeups while holding that
      batched_entropy lock -- as the random code does.
      
      Solve this by moving __sched_fork() out from under rq->lock. This is
      safe because nothing there relies on rq->lock, as also evident from the
      other __sched_fork() callsite.
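
A sketch of the reordering in init_idle() (simplified):

	/* Do the (possibly allocating, under DEBUG_OBJECTS) init first ... */
	__sched_fork(0, idle);

	/* ... and only then take the locks, so rq->lock never ends up
	 * nesting above zone->lock / batched_entropy_u64.lock. */
	raw_spin_lock_irqsave(&idle->pi_lock, flags);
	raw_spin_lock(&rq->lock);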
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: bigeasy@linutronix.de
      Cc: cl@linux.com
      Cc: keescook@chromium.org
      Cc: penberg@kernel.org
      Cc: rientjes@google.com
      Cc: thgarnie@google.com
      Cc: tytso@mit.edu
      Cc: will@kernel.org
      Fixes: b7d5dc21 ("random: add a spinlock_t to struct batched_entropy")
Link: https://lkml.kernel.org/r/20191001091837.GK4536@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      011e08fb
• sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr() · ff4666ed
  Authored by KeMeng Shi
      [ Upstream commit 714e501e ]
      
      An oops can be triggered in the scheduler when running qemu on arm64:
      
       Unable to handle kernel paging request at virtual address ffff000008effe40
       Internal error: Oops: 96000007 [#1] SMP
       Process migration/0 (pid: 12, stack limit = 0x00000000084e3736)
       pstate: 20000085 (nzCv daIf -PAN -UAO)
       pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
       lr : move_queued_task.isra.21+0x124/0x298
       ...
       Call trace:
        __ll_sc___cmpxchg_case_acq_4+0x4/0x20
        __migrate_task+0xc8/0xe0
        migration_cpu_stop+0x170/0x180
        cpu_stopper_thread+0xec/0x178
        smpboot_thread_fn+0x1ac/0x1e8
        kthread+0x134/0x138
        ret_from_fork+0x10/0x18
      
__set_cpus_allowed_ptr() will choose an active dest_cpu in the affinity mask
to migrate the process if the process is not currently running on any of the
CPUs specified in the affinity mask. __set_cpus_allowed_ptr() will choose an
invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if the
CPUs in the affinity mask are deactivated by cpu_down after the
cpumask_intersects check. The subsequent cpumask_test_cpu() of dest_cpu reads
past the mask and may pass if the corresponding bit is coincidentally set. As
a consequence, the kernel will access an invalid rq address associated with
the invalid CPU in migration_cpu_stop->__migrate_task->move_queued_task and
the Oops occurs.
      
To reproduce the crash:
      
        1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
        sched_setaffinity.
      
        2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online"
        and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.
      
        3) Oops appears if the invalid CPU is set in memory after tested cpumask.
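
A condensed sketch of the fix in __set_cpus_allowed_ptr():

	/* Pick dest_cpu while still holding the runqueue lock, and bail out
	 * if every CPU in the new mask has gone offline in the meantime,
	 * rather than relying on an earlier cpumask_intersects() check. */
	dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
	if (dest_cpu >= nr_cpu_ids) {
		ret = -EINVAL;
		goto out;
	}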
Signed-off-by: KeMeng Shi <shikemeng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ff4666ed