1. 30 June 2023, 1 commit
    • sched: Fix null pointer dereference for sd->span · 70dc4628
      Hui Tang authored
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7HFZV
      CVE: NA
      
      ----------------------------------------
      
      There may be a NULL pointer dereference when CPU hotplug and
      task-group creation run concurrently.
      
      sched_autogroup_create_attach
        -> sched_create_group
          -> alloc_fair_sched_group
            -> init_auto_affinity
              -> init_affinity_domains
                 -> cpumask_copy(xx, sched_domain_span(tmp))
                    { tmp may be freed due to missing rcu lock }
      
      { hotplug will rebuild sched domain }
      sched_cpu_activate
        -> build_sched_domains
          -> cpuset_cpu_active
            -> partition_sched_domains
              -> build_sched_domains
                -> cpu_attach_domain
                  -> destroy_sched_domains
                    -> call_rcu(&sd->rcu, destroy_sched_domains_rcu)
      
      So sd should be protected by the rcu lock across the entire critical section.
      
      [  599.811593] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
      [  600.112821] pc : init_affinity_domains+0xf4/0x200
      [  600.125918] lr : init_affinity_domains+0xd4/0x200
      [  600.331355] Call trace:
      [  600.338734]  init_affinity_domains+0xf4/0x200
      [  600.347955]  init_auto_affinity+0x78/0xc0
      [  600.356622]  alloc_fair_sched_group+0xd8/0x210
      [  600.365594]  sched_create_group+0x48/0xc0
      [  600.373970]  sched_autogroup_create_attach+0x54/0x190
      [  600.383311]  ksys_setsid+0x110/0x130
      [  600.391014]  __arm64_sys_setsid+0x18/0x24
      [  600.399156]  el0_svc_common+0x118/0x170
      [  600.406818]  el0_svc_handler+0x3c/0x80
      [  600.414188]  el0_svc+0x8/0x640
      [  600.420719] Code: b40002c0 9104e002 f9402061 a9401444 (a9001424)
      [  600.430504] SMP: stopping secondary CPUs
      [  600.441751] Starting crashdump kernel...
      
      Fixes: 713cfd26 ("sched: Introduce smart grid scheduling strategy for cfs")
      Signed-off-by: Hui Tang <tanghui20@huawei.com>
      Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
  2. 25 June 2023, 2 commits
  3. 20 June 2023, 3 commits
  4. 15 June 2023, 5 commits
    • sched: Fix negative count for jump label · cde6dbb8
      Hui Tang authored
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7DA63
      CVE: NA
      
      --------------------------------
      
      Add a mutex lock to prevent a negative count for the jump label.
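      A minimal userspace sketch of the idea (illustrative names, not the actual patch): serializing mode changes with a mutex, and treating a same-mode write as a no-op, keeps the static-key reference count from going negative.

```c
#include <pthread.h>

static pthread_mutex_t mode_mutex = PTHREAD_MUTEX_INITIALIZER;
static int key_count;          /* models the jump-label reference count */
static int cur_mode;           /* models the task group's affinity mode */

/* models tg_set_dynamic_affinity_mode(): returns 1 if the mode changed */
int set_mode(int new_mode)
{
    int changed = 0;

    pthread_mutex_lock(&mode_mutex);
    if (cur_mode != new_mode) {
        if (new_mode)
            key_count++;       /* static_key_slow_inc() */
        else
            key_count--;       /* static_key_slow_dec() */
        cur_mode = new_mode;
        changed = 1;
    }
    pthread_mutex_unlock(&mode_mutex);
    return changed;
}

int key_count_now(void) { return key_count; }
```

      Without the lock (and the no-op check), two racing writers could pair one inc with two decs, which is exactly the "jump label: negative count!" warning below.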
      
      [28612.530675] ------------[ cut here ]------------
      [28612.532708] jump label: negative count!
      [28612.535031] WARNING: CPU: 4 PID: 3899 at kernel/jump_label.c:202
      	__static_key_slow_dec_cpuslocked+0x204/0x240
      [28612.538216] Kernel panic - not syncing: panic_on_warn set ...
      [28612.538216]
      [28612.540487] CPU: 4 PID: 3899 Comm: sh Kdump: loaded Not tainted
      [28612.542788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
      [28612.546455] Call Trace:
      [28612.547339]  dump_stack+0xc6/0x11e
      [28612.548546]  ? __static_key_slow_dec_cpuslocked+0x200/0x240
      [28612.550352]  panic+0x1d6/0x46b
      [28612.551375]  ? refcount_error_report+0x2a5/0x2a5
      [28612.552915]  ? kmsg_dump_rewind_nolock+0xde/0xde
      [28612.554358]  ? sched_clock_cpu+0x18/0x1b0
      [28612.555699]  ? __warn+0x1d1/0x210
      [28612.556799]  ? __static_key_slow_dec_cpuslocked+0x204/0x240
      [28612.558548]  __warn+0x1ec/0x210
      [28612.559621]  ? __static_key_slow_dec_cpuslocked+0x204/0x240
      [28612.561536]  report_bug+0x1ee/0x2b0
      [28612.562706]  fixup_bug.part.4+0x37/0x80
      [28612.563937]  do_error_trap+0x21c/0x260
      [28612.565109]  ? fixup_bug.part.4+0x80/0x80
      [28612.566453]  ? check_preemption_disabled+0x34/0x1f0
      [28612.567991]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [28612.569534]  ? lockdep_hardirqs_off+0x1cb/0x2b0
      [28612.570993]  ? error_entry+0x9a/0x130
      [28612.572138]  ? trace_hardirqs_off_caller+0x59/0x1a0
      [28612.573710]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [28612.575232]  invalid_op+0x14/0x20
      [28612.576387]  ? vprintk_func+0x68/0x1a0
      [28612.577827]  ? __static_key_slow_dec_cpuslocked+0x204/0x240
      [28612.579662]  ? __static_key_slow_dec_cpuslocked+0x204/0x240
      [28612.581781]  ? static_key_disable+0x30/0x30
      [28612.583248]  ? static_key_slow_dec+0x57/0x90
      [28612.584997]  ? tg_set_dynamic_affinity_mode+0x42/0x70
      [28612.586714]  ? cgroup_file_write+0x471/0x6a0
      [28612.588162]  ? cgroup_css.part.4+0x100/0x100
      [28612.589579]  ? cgroup_css.part.4+0x100/0x100
      [28612.591031]  ? kernfs_fop_write+0x2af/0x430
      [28612.592625]  ? kernfs_vma_page_mkwrite+0x230/0x230
      [28612.594274]  ? __vfs_write+0xef/0x680
      [28612.595590]  ? kernel_read+0x110/0x110
      [28612.596899]  ? check_preemption_disabled+0x34/0x1f0
      Signed-off-by: Hui Tang <tanghui20@huawei.com>
      Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
    • sched: Fix possible deadlock in tg_set_dynamic_affinity_mode · 21e5d85e
      Hui Tang authored
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGD0
      CVE: NA
      
      ----------------------------------------
      
      Deadlock occurs in two situations as follows:
      
      The first case:
      
      tg_set_dynamic_affinity_mode    --- raw_spin_lock_irq(&auto_affi->lock);
      	->start_auto_affinity   --- trigger timer
      		->tg_update_task_prefer_cpus
      			->css_task_iter_next
      				->raw_spin_unlock_irq
      
      hr_timer_run_queues
        ->sched_auto_affi_period_timer --- try spin lock (&auto_affi->lock)
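      One way to break this self-deadlock can be sketched as follows (an illustrative userspace model, not necessarily the exact patch applied here): the timer handler takes the lock with trylock and simply skips the period if the writer path already holds it.

```c
#include <pthread.h>

static pthread_mutex_t affi_lock = PTHREAD_MUTEX_INITIALIZER;
static int periods_run, periods_skipped;

/* models sched_auto_affi_period_timer() */
void period_timer(void)
{
    if (pthread_mutex_trylock(&affi_lock) != 0) {
        periods_skipped++;     /* writer holds the lock: retry next period */
        return;
    }
    periods_run++;             /* normal periodic work under the lock */
    pthread_mutex_unlock(&affi_lock);
}

/* models the tg_set_dynamic_affinity_mode() side holding the lock */
void writer_lock(void)   { pthread_mutex_lock(&affi_lock); }
void writer_unlock(void) { pthread_mutex_unlock(&affi_lock); }

int runs_done(void)    { return periods_run; }
int runs_skipped(void) { return periods_skipped; }
```

      The key property is that the timer never spins on a lock its own CPU already holds; it backs off and lets the next expiry retry.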
      
      The second case is as follows:
      
      [  291.470810] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
      [  291.472715] rcu:     1-...0: (0 ticks this GP) idle=a6a/1/0x4000000000000002 softirq=78516/78516 fqs=5249
      [  291.475268] rcu:     (detected by 6, t=21006 jiffies, g=202169, q=9862)
      [  291.477038] Sending NMI from CPU 6 to CPUs 1:
      [  291.481268] NMI backtrace for cpu 1
      [  291.481273] CPU: 1 PID: 1923 Comm: sh Kdump: loaded Not tainted 4.19.90+ #150
      [  291.481278] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
      [  291.481281] RIP: 0010:queued_spin_lock_slowpath+0x136/0x9a0
      [  291.481289] Code: c0 74 3f 49 89 dd 48 89 dd 48 b8 00 00 00 00 00 fc ff df 49 c1 ed 03 83 e5 07 49 01 c5 83 c5 03 48 83 05 c4 66 b9 05 01 f3 90 <41> 0f b6 45 00 40 38 c5 7c 08 84 c0 0f 85 ad 07 00 00 0
      [  291.481292] RSP: 0018:ffff88801de87cd8 EFLAGS: 00000002
      [  291.481297] RAX: 0000000000000101 RBX: ffff888001be0a28 RCX: ffffffffb8090f7d
      [  291.481301] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff888001be0a28
      [  291.481304] RBP: 0000000000000003 R08: ffffed100037c146 R09: ffffed100037c146
      [  291.481307] R10: 000000001106b143 R11: ffffed100037c145 R12: 1ffff11003bd0f9c
      [  291.481311] R13: ffffed100037c145 R14: fffffbfff7a38dee R15: dffffc0000000000
      [  291.481315] FS:  00007fac4f306740(0000) GS:ffff88801de80000(0000) knlGS:0000000000000000
      [  291.481318] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  291.481321] CR2: 00007fac4f4bb650 CR3: 00000000046b6000 CR4: 00000000000006e0
      [  291.481323] Call Trace:
      [  291.481324]  <IRQ>
      [  291.481326]  ? osq_unlock+0x2a0/0x2a0
      [  291.481329]  ? check_preemption_disabled+0x4c/0x290
      [  291.481331]  ? rcu_accelerate_cbs+0x33/0xed0
      [  291.481333]  _raw_spin_lock_irqsave+0x83/0xa0
      [  291.481336]  sched_auto_affi_period_timer+0x251/0x820
      [  291.481338]  ? __remove_hrtimer+0x151/0x200
      [  291.481340]  __hrtimer_run_queues+0x39d/0xa50
      [  291.481343]  ? tg_update_affinity_domain_down+0x460/0x460
      [  291.481345]  ? enqueue_hrtimer+0x2e0/0x2e0
      [  291.481348]  ? ktime_get_update_offsets_now+0x1d7/0x2c0
      [  291.481350]  hrtimer_run_queues+0x243/0x470
      [  291.481352]  run_local_timers+0x5e/0x150
      [  291.481354]  update_process_times+0x36/0xb0
      [  291.481357]  tick_sched_handle.isra.4+0x7c/0x180
      [  291.481359]  tick_nohz_handler+0xd1/0x1d0
      [  291.481365]  smp_apic_timer_interrupt+0x12c/0x4e0
      [  291.481368]  apic_timer_interrupt+0xf/0x20
      [  291.481370]  </IRQ>
      [  291.481372]  ? smp_call_function_many+0x68c/0x840
      [  291.481375]  ? smp_call_function_many+0x6ab/0x840
      [  291.481377]  ? arch_unregister_cpu+0x60/0x60
      [  291.481379]  ? native_set_fixmap+0x100/0x180
      [  291.481381]  ? arch_unregister_cpu+0x60/0x60
      [  291.481384]  ? set_task_select_cpus+0x116/0x940
      [  291.481386]  ? smp_call_function+0x53/0xc0
      [  291.481388]  ? arch_unregister_cpu+0x60/0x60
      [  291.481390]  ? on_each_cpu+0x49/0xf0
      [  291.481393]  ? set_task_select_cpus+0x115/0x940
      [  291.481395]  ? text_poke_bp+0xff/0x180
      [  291.481397]  ? poke_int3_handler+0xc0/0xc0
      [  291.481400]  ? __set_prefer_cpus_ptr.constprop.4+0x1cd/0x900
      [  291.481402]  ? hrtick+0x1b0/0x1b0
      [  291.481404]  ? set_task_select_cpus+0x115/0x940
      [  291.481407]  ? __jump_label_transform.isra.0+0x3a1/0x470
      [  291.481409]  ? kernel_init+0x280/0x280
      [  291.481411]  ? kasan_check_read+0x1d/0x30
      [  291.481413]  ? mutex_lock+0x96/0x100
      [  291.481415]  ? __mutex_lock_slowpath+0x30/0x30
      [  291.481418]  ? arch_jump_label_transform+0x52/0x80
      [  291.481420]  ? set_task_select_cpus+0x115/0x940
      [  291.481422]  ? __jump_label_update+0x1a1/0x1e0
      [  291.481424]  ? jump_label_update+0x2ee/0x3b0
      [  291.481427]  ? static_key_slow_inc_cpuslocked+0x1c8/0x2d0
      [  291.481430]  ? start_auto_affinity+0x190/0x200
      [  291.481432]  ? tg_set_dynamic_affinity_mode+0xad/0xf0
      [  291.481435]  ? cpu_affinity_mode_write_u64+0x22/0x30
      [  291.481437]  ? cgroup_file_write+0x46f/0x660
      [  291.481439]  ? cgroup_init_cftypes+0x300/0x300
      [  291.481441]  ? __mutex_lock_slowpath+0x30/0x30
      Signed-off-by: Hui Tang <tanghui20@huawei.com>
      Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
    • sched: fix WARN found by deadlock detect · 217edab9
      Hui Tang authored
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7BQZ0
      CVE: NA
      
      ----------------------------------------
      
      The WARNING is reported when running:
      echo 1 > /sys/fs/cgroup/cpu/cpu.dynamic_affinity_mode
      
      [  147.276757] WARNING: CPU: 5 PID: 1770 at kernel/cpu.c:326 \
      	lockdep_assert_cpus_held+0xac/0xd0
      [  147.279670] Kernel panic - not syncing: panic_on_warn set ...
      [  147.279670]
      [  147.282211] CPU: 5 PID: 1770 Comm: bash Kdump: loaded Not tainted 4.19
      [  147.284796] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)..
      [  147.290963] Call Trace:
      [  147.292459]  dump_stack+0xc6/0x11e
      [  147.294295]  ? lockdep_assert_cpus_held+0xa0/0xd0
      [  147.296876]  panic+0x1d6/0x46b
      [  147.298591]  ? refcount_error_report+0x2a5/0x2a5
      [  147.301131]  ? kmsg_dump_rewind_nolock+0xde/0xde
      [  147.303738]  ? sched_clock_cpu+0x18/0x1b0
      [  147.305943]  ? __warn+0x1d1/0x210
      [  147.307831]  ? lockdep_assert_cpus_held+0xac/0xd0
      [  147.310469]  __warn+0x1ec/0x210
      [  147.312271]  ? lockdep_assert_cpus_held+0xac/0xd0
      [  147.314838]  report_bug+0x1ee/0x2b0
      [  147.316798]  fixup_bug.part.4+0x37/0x80
      [  147.318946]  do_error_trap+0x21c/0x260
      [  147.321062]  ? fixup_bug.part.4+0x80/0x80
      [  147.323253]  ? check_preemption_disabled+0x34/0x1f0
      [  147.324886]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [  147.326277]  ? lockdep_hardirqs_off+0x1cb/0x2b0
      [  147.327505]  ? error_entry+0x9a/0x130
      [  147.328523]  ? trace_hardirqs_off_caller+0x59/0x1a0
      [  147.329844]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [  147.331124]  invalid_op+0x14/0x20
      [  147.332057]  ? vprintk_func+0x68/0x1a0
      [  147.333082]  ? lockdep_assert_cpus_held+0xac/0xd0
      [  147.334355]  ? lockdep_assert_cpus_held+0xac/0xd0
      [  147.335624]  ? static_key_slow_inc_cpuslocked+0x5a/0x230
      [  147.337079]  ? tg_set_dynamic_affinity_mode+0x4f/0x70
      [  147.338444]  ? cgroup_file_write+0x471/0x6a0
      [  147.339604]  ? cgroup_css.part.4+0x100/0x100
      [  147.340782]  ? cgroup_css.part.4+0x100/0x100
      [  147.341943]  ? kernfs_fop_write+0x2af/0x430
      [  147.343083]  ? kernfs_vma_page_mkwrite+0x230/0x230
      [  147.344401]  ? __vfs_write+0xef/0x680
      [  147.345404]  ? kernel_read+0x110/0x110
      Signed-off-by: Hui Tang <tanghui20@huawei.com>
      Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
    • sched: fix smart grid usage count · d9099163
      Hui Tang authored
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7D98G
      CVE: NA
      
      ----------------------------------------
      
      smart_grid_usage_dec() should be called when freeing a task group
      if the mode is auto.
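      The balance this restores can be sketched like so (a toy model with invented names, not the real code): whatever increments the global usage count when a group enters auto mode must be matched on the free path.

```c
#define MODE_AUTO 1

static int smart_grid_usage;        /* models the global usage count */

struct demo_tg { int mode; };

/* models switching a task group into auto mode */
void tg_enable_auto(struct demo_tg *tg)
{
    tg->mode = MODE_AUTO;
    smart_grid_usage++;             /* smart_grid_usage_inc() */
}

/* models freeing the task group: the previously missing dec */
void tg_free(struct demo_tg *tg)
{
    if (tg->mode == MODE_AUTO)
        smart_grid_usage--;         /* smart_grid_usage_dec() */
    tg->mode = 0;
}

int usage_now(void) { return smart_grid_usage; }
```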
      Signed-off-by: Hui Tang <tanghui20@huawei.com>
      Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
    • sched: Add static key to reduce noise · 373fd236
      Hui Tang authored
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7A718
      
      --------------------------------
      
      Add a static key to reduce noise when dynamic affinity is not enabled.
      This gives better performance in some cases, such as lmbench.
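      The shape of the optimization can be modeled as follows (a sketch with a plain flag standing in for the kernel's static_branch machinery; names are illustrative): when the feature is off, the hot path is a single predictable branch and skips all affinity bookkeeping.

```c
static int dynamic_affinity_used;   /* models the static key */
static int affinity_work_done;      /* counts the slow-path bookkeeping */

/* models the scheduler hot path */
void select_task_cpu(void)
{
    if (!dynamic_affinity_used)     /* static_branch_unlikely(&key) */
        return;                     /* fast path: no extra noise */
    affinity_work_done++;           /* dynamic-affinity work happens here */
}

void enable_dynamic_affinity(void) { dynamic_affinity_used = 1; }
int work_done(void) { return affinity_work_done; }
```

      In the real kernel a static key patches the branch out of the instruction stream entirely, so disabled users pay essentially nothing on this path.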
      
      Fixes: 243865da ("cpuset: Introduce new interface for scheduler ...")
      Signed-off-by: Hui Tang <tanghui20@huawei.com>
      Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
  5. 09 June 2023, 2 commits
    • sched: smart grid: init sched_grid_qos structure on QOS purpose · ce35ded5
      Wang ShaoBo authored
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7BQZ0
      CVE: NA
      
      ----------------------------------------
      
      As smart grid scheduling (SGS) may shrink resources and affect task QOS,
      we provide methods for evaluating task QOS in each divided grid, mainly
      focusing on the following two aspects:
      
         1. Evaluate whether resources (such as CPU or memory) meet our demand
         2. Ensure the least impact when working with (cpufreq and cpuidle) governors
      
      To tackle these questions, we have summarized several sampling methods
      to obtain tasks' characteristics while reducing scheduling noise as
      much as possible:
      
        1. We detect the key factors of how sensitive a process is to cpufreq
           or cpuidle adjustment, to guide the cpufreq/cpuidle governor
        2. We dynamically monitor process memory bandwidth and adjust memory
           allocation to minimize cross-remote memory access
        3. We provide a variety of load tracking mechanisms to adapt to different
           types of task load change
      
           ---------------------------------     -----------------
          |            class A              |   |     class B     |
          |    --------        --------     |   |     --------    |
          |   | group0 |      | group1 |    |---|    | group2 |   |----------+
          |    --------        --------     |   |     --------    |          |
          |    CPU/memory sensitive type    |   |   balance type  |          |
           ----------------+----------------     --------+--------           |
                           v                             v                   | (target cpufreq)
           -------------------------------------------------------           | (sensitivity)
          |              Not satisfied with QOS?                  |          |
           --------------------------+----------------------------           |
                                     v                                       v
           -------------------------------------------------------     ----------------
          |              expand or shrink resource                |<--|  energy model  |
           ----------------------------+--------------------------     ----------------
                                       v                                     |
           -----------          -----------          ------------            v
          |           |        |           |        |            |     ---------------
          |   GRID0   +--------+   GRID1   +--------+   GRID2    |<-- |   governor    |
          |           |        |           |        |            |     ---------------
           -----------          -----------          ------------
                         \            |            /
                          \  -------------------  /
                            |  pages migration  |
                             -------------------
      
      We will introduce the energy model in a follow-up implementation, and will
      consider dynamic affinity adjustment between the divided grids at runtime.
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
    • sched: Introduce smart grid scheduling strategy for cfs · 713cfd26
      Hui Tang authored
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7BQZ0
      CVE: NA
      
      ----------------------------------------
      
      We want to dynamically expand or shrink the affinity range of tasks
      based on the CPU topology level while meeting the minimum resource
      requirements of tasks.
      
      We divide several levels of affinity domains according to the sched domains:
      
      level4   * SOCKET  [                                                  ]
      level3   * DIE     [                             ]
      level2   * MC      [             ] [             ]
      level1   * SMT     [     ] [     ] [     ] [     ]
      level0   * CPU      0   1   2   3   4   5   6   7
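      The hierarchy drawn above can be made concrete with a small worked example (a sketch assuming 2-way SMT, 4-CPU MC, and an 8-CPU DIE/SOCKET exactly as in the diagram; the functions are illustrative, not kernel API):

```c
/* Width of the affinity span at each level for the 8-CPU example:
 * level0 = single CPU, level1 = SMT pair, level2 = MC cluster of 4,
 * level3/level4 = all 8 CPUs (DIE/SOCKET). */
unsigned int level_span_width(int level)
{
    static const unsigned int width[] = { 1, 2, 4, 8, 8 };
    return width[level];
}

/* Bitmask of CPUs in the domain containing `cpu` at `level`,
 * modeling cpumask-style spans. */
unsigned long level_span_mask(int cpu, int level)
{
    unsigned int w = level_span_width(level);
    int first = cpu - (cpu % w);           /* align down to the domain */

    return ((1UL << w) - 1) << first;
}
```

      For CPU 5, expanding the affinity one level at a time yields {5}, then the SMT pair {4,5}, then the MC cluster {4..7}, then the whole die {0..7}.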
      
      Whether users tend to choose power saving or performance affects the
      strategy for adjusting affinity. When the power saving mode is selected,
      we choose a more appropriate affinity based on the energy model to
      reduce power consumption, while considering the QOS of resources such
      as CPU and memory consumption; for instance, if the current task CPU
      load is less than required, smart grid judges according to the energy
      model whether to aggregate tasks into a smaller range.
      
      The main difference from EAS is that we pay more attention to the
      power-consumption impact of mechanisms such as cpuidle and DVFS, and
      classify tasks to reduce interference and ensure resource QOS in each
      divided unit, which is more suitable for general-purpose workloads on
      non-heterogeneous CPUs.
      
              --------        --------        --------
             | group0 |      | group1 |      | group2 |
              --------        --------        --------
      	   |                |              |
      	   v                |              v
             ---------------------+-----     -----------------
            |                  ---v--   |   |
            |       DIE0      |  MC1 |  |   |   DIE1
            |                  ------   |   |
             ---------------------------     -----------------
      
      We regularly count the resource satisfaction of groups and adjust the
      affinity; scheduling balance and memory migration are considered based
      on memory location to better meet resource requirements.
      Signed-off-by: Hui Tang <tanghui20@huawei.com>
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
      Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
      Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
  6. 06 April 2023, 3 commits
  7. 24 December 2022, 1 commit
  8. 16 December 2022, 1 commit
  9. 15 August 2022, 1 commit
    • sched: Fix null-ptr-deref in free_fair_sched_group · 0d2df28e
      Hui Tang authored
      hulk inclusion
      category: bugfix
      bugzilla: 187419, https://gitee.com/openeuler/kernel/issues/I5LIPL
      CVE: NA
      
      -------------------------------
      
      
      ==================================================================
      BUG: KASAN: null-ptr-deref in rq_of kernel/sched/sched.h:1118 [inline]
      BUG: KASAN: null-ptr-deref in unthrottle_qos_sched_group kernel/sched/fair.c:7619 [inline]
      BUG: KASAN: null-ptr-deref in free_fair_sched_group+0x124/0x320 kernel/sched/fair.c:12131
      Read of size 8 at addr 0000000000000130 by task syz-executor100/223
      
      CPU: 3 PID: 223 Comm: syz-executor100 Not tainted 5.10.0 #6
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace+0x0/0x40c arch/arm64/kernel/stacktrace.c:132
       show_stack+0x30/0x40 arch/arm64/kernel/stacktrace.c:196
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1b4/0x248 lib/dump_stack.c:118
       __kasan_report mm/kasan/report.c:551 [inline]
       kasan_report+0x18c/0x210 mm/kasan/report.c:564
       check_memory_region_inline mm/kasan/generic.c:187 [inline]
       __asan_load8+0x98/0xc0 mm/kasan/generic.c:253
       rq_of kernel/sched/sched.h:1118 [inline]
       unthrottle_qos_sched_group kernel/sched/fair.c:7619 [inline]
       free_fair_sched_group+0x124/0x320 kernel/sched/fair.c:12131
       sched_free_group kernel/sched/core.c:7767 [inline]
       sched_create_group+0x48/0xc0 kernel/sched/core.c:7798
       cpu_cgroup_css_alloc+0x18/0x40 kernel/sched/core.c:7930
       css_create+0x7c/0x4a0 kernel/cgroup/cgroup.c:5328
       cgroup_apply_control_enable+0x288/0x340 kernel/cgroup/cgroup.c:3135
       cgroup_apply_control kernel/cgroup/cgroup.c:3217 [inline]
       cgroup_subtree_control_write+0x668/0x8b0 kernel/cgroup/cgroup.c:3375
       cgroup_file_write+0x1a8/0x37c kernel/cgroup/cgroup.c:3909
       kernfs_fop_write_iter+0x220/0x2f4 fs/kernfs/file.c:296
       call_write_iter include/linux/fs.h:1960 [inline]
       new_sync_write+0x260/0x370 fs/read_write.c:515
       vfs_write+0x3dc/0x4ac fs/read_write.c:602
       ksys_write+0xfc/0x200 fs/read_write.c:655
       __do_sys_write fs/read_write.c:667 [inline]
       __se_sys_write fs/read_write.c:664 [inline]
       __arm64_sys_write+0x50/0x60 fs/read_write.c:664
       __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
       invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
       el0_svc_common.constprop.0+0xf4/0x414 arch/arm64/kernel/syscall.c:155
       do_el0_svc+0x50/0x11c arch/arm64/kernel/syscall.c:217
       el0_svc+0x20/0x30 arch/arm64/kernel/entry-common.c:353
       el0_sync_handler+0xe4/0x1e0 arch/arm64/kernel/entry-common.c:369
       el0_sync+0x148/0x180 arch/arm64/kernel/entry.S:683
      
      So add a check for tg->cfs_rq[i] before unthrottle_qos_sched_group() is called.
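      The guard can be sketched as follows (a userspace model with invented demo types: when group allocation fails half-way, some per-CPU slots are still NULL, and the free path must skip them):

```c
#include <stddef.h>

#define NR_CPUS_DEMO 4

struct cfs_rq_demo { int throttled; };

static int unthrottled;

/* models unthrottle_qos_sched_group() */
void unthrottle_one(struct cfs_rq_demo *rq)
{
    rq->throttled = 0;
    unthrottled++;
}

/* models free_fair_sched_group(): tolerate partially built groups */
void free_group(struct cfs_rq_demo **cfs_rq)
{
    int i;

    for (i = 0; i < NR_CPUS_DEMO; i++) {
        if (cfs_rq && cfs_rq[i])       /* the added NULL check */
            unthrottle_one(cfs_rq[i]);
    }
}

int unthrottled_count(void) { return unthrottled; }
```

      Without the check, the first NULL slot reproduces exactly the null-ptr-deref in rq_of() shown in the KASAN report above.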
      Signed-off-by: Hui Tang <tanghui20@huawei.com>
      Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
  10. 21 July 2022, 3 commits
  11. 01 June 2022, 2 commits
  12. 23 May 2022, 2 commits
  13. 19 April 2022, 1 commit
  14. 02 April 2022, 1 commit
  15. 29 November 2021, 8 commits
  16. 17 September 2021, 1 commit
  17. 02 August 2021, 1 commit
  18. 19 July 2021, 1 commit
  19. 30 June 2021, 1 commit
    • sched/fair: Fix unfairness caused by missing load decay · 909ad0a4
      Odin Ugedal authored
      stable inclusion
      from linux-4.19.191
      commit 434ea8c1d1bf296a2597aeb28f6ccf62ae82f235
      
      --------------------------------
      
      [ Upstream commit 0258bdfa ]
      
      This fixes an issue where old load on a cfs_rq is not properly decayed,
      resulting in strange behavior where fairness can decrease drastically.
      Real workloads with equally weighted control groups have ended up
      getting a respective 99% and 1%(!!) of cpu time.
      
      When an idle task is attached to a cfs_rq by attaching a pid to a cgroup,
      the old load of the task is attached to the new cfs_rq and sched_entity by
      attach_entity_cfs_rq. If the task is then moved to another cpu (and
      therefore cfs_rq) before being enqueued/woken up, the load will be moved
      to cfs_rq->removed from the sched_entity. Such a move will happen when
      enforcing a cpuset on the task (e.g. via a cgroup) that forces it to move.
      
      The load will however not be removed from the task_group itself, making
      it look like there is a constant load on that cfs_rq. This causes the
      vruntime of tasks on other sibling cfs_rq's to increase faster than they
      are supposed to; causing severe fairness issues. If no other task is
      started on the given cfs_rq, and due to the cpuset it would not happen,
      this load would never be properly unloaded. With this patch the load
      will be properly removed inside update_blocked_averages. This also
      applies to tasks moved to the fair scheduling class and moved to another
      cpu, and this path will also fix that. For fork, the entity is queued
      right away, so this problem does not affect that.
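      The accounting problem can be illustrated with a toy model (invented names, not the real PELT code): load detached into a "removed" bucket is only correct if something later flushes that bucket out of the parent group, which is what the update_blocked_averages() path does after this fix.

```c
struct demo_cfs_rq { long load, removed; };
struct demo_tg     { long load; };

/* models attach_entity_cfs_rq(): load propagates up to the group */
void attach_load(struct demo_tg *tg, struct demo_cfs_rq *rq, long w)
{
    rq->load += w;
    tg->load += w;
}

/* the task migrates away before ever running: load parked in ->removed,
 * while the group total is left stale */
void detach_to_removed(struct demo_cfs_rq *rq, long w)
{
    rq->load    -= w;
    rq->removed += w;
}

/* models the flush performed via update_blocked_averages() */
void flush_removed(struct demo_tg *tg, struct demo_cfs_rq *rq)
{
    tg->load    -= rq->removed;
    rq->removed  = 0;
}
```

      Before the fix the flush never happened for this cfs_rq, so the stale group load persisted and skewed the vruntime of sibling groups.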
      
      This applies to cases where the new process is the first in the cfs_rq,
      issue introduced 3d30544f ("sched/fair: Apply more PELT fixes"), and
      when there has previously been load on the cgroup but the cgroup was
      removed from the leaf list due to having null PELT load, introduced
      in 039ae8bc ("sched/fair: Fix O(nr_cgroups) in the load balancing
      path").
      
      For a simple cgroup hierarchy (as seen below) with two equally weighted
      groups, that in theory should get 50/50 of cpu time each, it often leads
      to a load of 60/40 or 70/30.
      
      parent/
        cg-1/
          cpu.weight: 100
          cpuset.cpus: 1
        cg-2/
          cpu.weight: 100
          cpuset.cpus: 1
      
      If the hierarchy is deeper (as seen below), while keeping cg-1 and cg-2
      equally weighted, they should still get a 50/50 balance of cpu time.
      This however sometimes results in a balance of 10/90 or 1/99(!!) between
      the task groups.
      
      $ ps u -C stress
      USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
      root       18568  1.1  0.0   3684   100 pts/12   R+   13:36   0:00 stress --cpu 1
      root       18580 99.3  0.0   3684   100 pts/12   R+   13:36   0:09 stress --cpu 1
      
      parent/
        cg-1/
          cpu.weight: 100
          sub-group/
            cpu.weight: 1
            cpuset.cpus: 1
        cg-2/
          cpu.weight: 100
          sub-group/
            cpu.weight: 10000
            cpuset.cpus: 1
      
      This can be reproduced by attaching an idle process to a cgroup and
      moving it to a given cpuset before it wakes up. The issue is evident in
      many (if not most) container runtimes, and has been reproduced
      with both crun and runc (and therefore docker and all its "derivatives"),
      and with both cgroup v1 and v2.
      
      Fixes: 3d30544f ("sched/fair: Apply more PELT fixes")
      Fixes: 039ae8bc ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
      Signed-off-by: Odin Ugedal <odin@uged.al>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
      Link: https://lkml.kernel.org/r/20210501141950.23622-2-odin@uged.al
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>