1. 20 Jun 2023: 2 commits
  2. 15 Jun 2023: 1 commit
• sched: Fix possible deadlock in tg_set_dynamic_affinity_mode · 21e5d85e
  Authored by Hui Tang
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7CGD0
      CVE: NA
      
      ----------------------------------------
      
      Deadlock occurs in two situations as follows:
      
      The first case:
      
      tg_set_dynamic_affinity_mode    --- raw_spin_lock_irq(&auto_affi->lock);
	->start_auto_affinity   --- trigger timer
      		->tg_update_task_prefer_cpus
			->css_task_iter_next
      				->raw_spin_unlock_irq
      
hrtimer_run_queues
        ->sched_auto_affi_period_timer --- try spin lock (&auto_affi->lock)
      
The second case is as follows:
      
      [  291.470810] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
      [  291.472715] rcu:     1-...0: (0 ticks this GP) idle=a6a/1/0x4000000000000002 softirq=78516/78516 fqs=5249
      [  291.475268] rcu:     (detected by 6, t=21006 jiffies, g=202169, q=9862)
      [  291.477038] Sending NMI from CPU 6 to CPUs 1:
      [  291.481268] NMI backtrace for cpu 1
      [  291.481273] CPU: 1 PID: 1923 Comm: sh Kdump: loaded Not tainted 4.19.90+ #150
      [  291.481278] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
      [  291.481281] RIP: 0010:queued_spin_lock_slowpath+0x136/0x9a0
      [  291.481289] Code: c0 74 3f 49 89 dd 48 89 dd 48 b8 00 00 00 00 00 fc ff df 49 c1 ed 03 83 e5 07 49 01 c5 83 c5 03 48 83 05 c4 66 b9 05 01 f3 90 <41> 0f b6 45 00 40 38 c5 7c 08 84 c0 0f 85 ad 07 00 00 0
      [  291.481292] RSP: 0018:ffff88801de87cd8 EFLAGS: 00000002
      [  291.481297] RAX: 0000000000000101 RBX: ffff888001be0a28 RCX: ffffffffb8090f7d
      [  291.481301] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff888001be0a28
      [  291.481304] RBP: 0000000000000003 R08: ffffed100037c146 R09: ffffed100037c146
      [  291.481307] R10: 000000001106b143 R11: ffffed100037c145 R12: 1ffff11003bd0f9c
      [  291.481311] R13: ffffed100037c145 R14: fffffbfff7a38dee R15: dffffc0000000000
      [  291.481315] FS:  00007fac4f306740(0000) GS:ffff88801de80000(0000) knlGS:0000000000000000
      [  291.481318] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  291.481321] CR2: 00007fac4f4bb650 CR3: 00000000046b6000 CR4: 00000000000006e0
      [  291.481323] Call Trace:
      [  291.481324]  <IRQ>
      [  291.481326]  ? osq_unlock+0x2a0/0x2a0
      [  291.481329]  ? check_preemption_disabled+0x4c/0x290
      [  291.481331]  ? rcu_accelerate_cbs+0x33/0xed0
      [  291.481333]  _raw_spin_lock_irqsave+0x83/0xa0
      [  291.481336]  sched_auto_affi_period_timer+0x251/0x820
      [  291.481338]  ? __remove_hrtimer+0x151/0x200
      [  291.481340]  __hrtimer_run_queues+0x39d/0xa50
      [  291.481343]  ? tg_update_affinity_domain_down+0x460/0x460
      [  291.481345]  ? enqueue_hrtimer+0x2e0/0x2e0
      [  291.481348]  ? ktime_get_update_offsets_now+0x1d7/0x2c0
      [  291.481350]  hrtimer_run_queues+0x243/0x470
      [  291.481352]  run_local_timers+0x5e/0x150
      [  291.481354]  update_process_times+0x36/0xb0
      [  291.481357]  tick_sched_handle.isra.4+0x7c/0x180
      [  291.481359]  tick_nohz_handler+0xd1/0x1d0
      [  291.481365]  smp_apic_timer_interrupt+0x12c/0x4e0
      [  291.481368]  apic_timer_interrupt+0xf/0x20
      [  291.481370]  </IRQ>
      [  291.481372]  ? smp_call_function_many+0x68c/0x840
      [  291.481375]  ? smp_call_function_many+0x6ab/0x840
      [  291.481377]  ? arch_unregister_cpu+0x60/0x60
      [  291.481379]  ? native_set_fixmap+0x100/0x180
      [  291.481381]  ? arch_unregister_cpu+0x60/0x60
      [  291.481384]  ? set_task_select_cpus+0x116/0x940
      [  291.481386]  ? smp_call_function+0x53/0xc0
      [  291.481388]  ? arch_unregister_cpu+0x60/0x60
      [  291.481390]  ? on_each_cpu+0x49/0xf0
      [  291.481393]  ? set_task_select_cpus+0x115/0x940
      [  291.481395]  ? text_poke_bp+0xff/0x180
      [  291.481397]  ? poke_int3_handler+0xc0/0xc0
      [  291.481400]  ? __set_prefer_cpus_ptr.constprop.4+0x1cd/0x900
      [  291.481402]  ? hrtick+0x1b0/0x1b0
      [  291.481404]  ? set_task_select_cpus+0x115/0x940
      [  291.481407]  ? __jump_label_transform.isra.0+0x3a1/0x470
      [  291.481409]  ? kernel_init+0x280/0x280
      [  291.481411]  ? kasan_check_read+0x1d/0x30
      [  291.481413]  ? mutex_lock+0x96/0x100
      [  291.481415]  ? __mutex_lock_slowpath+0x30/0x30
      [  291.481418]  ? arch_jump_label_transform+0x52/0x80
      [  291.481420]  ? set_task_select_cpus+0x115/0x940
      [  291.481422]  ? __jump_label_update+0x1a1/0x1e0
      [  291.481424]  ? jump_label_update+0x2ee/0x3b0
      [  291.481427]  ? static_key_slow_inc_cpuslocked+0x1c8/0x2d0
      [  291.481430]  ? start_auto_affinity+0x190/0x200
      [  291.481432]  ? tg_set_dynamic_affinity_mode+0xad/0xf0
      [  291.481435]  ? cpu_affinity_mode_write_u64+0x22/0x30
      [  291.481437]  ? cgroup_file_write+0x46f/0x660
      [  291.481439]  ? cgroup_init_cftypes+0x300/0x300
      [  291.481441]  ? __mutex_lock_slowpath+0x30/0x30
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
      21e5d85e
3. 09 Jun 2023: 1 commit
• sched: Introduce smart grid scheduling strategy for cfs · 713cfd26
  Authored by Hui Tang
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7BQZ0
      CVE: NA
      
      ----------------------------------------
      
      We want to dynamically expand or shrink the affinity range of tasks
      based on the CPU topology level while meeting the minimum resource
      requirements of tasks.
      
We divide several levels of affinity domains according to the sched domains:
      
      level4   * SOCKET  [                                                  ]
      level3   * DIE     [                             ]
      level2   * MC      [             ] [             ]
      level1   * SMT     [     ] [     ] [     ] [     ]
      level0   * CPU      0   1   2   3   4   5   6   7
      
Whether users tend toward power saving or performance affects the strategy
for adjusting affinity. When the power-saving mode is selected, we choose a
more appropriate affinity based on the energy model to reduce power
consumption, while still considering the QoS of resources such as CPU and
memory. For instance, if the current CPU load of a task is lower than
required, smart grid uses the energy model to decide whether to aggregate
tasks into a smaller range.
      
The main difference from EAS is that we pay more attention to the impact on
power consumption from mechanisms such as cpuidle and DVFS, and we classify
tasks to reduce interference and ensure resource QoS within each divided
unit, which is more suitable for general-purpose workloads on
non-heterogeneous CPUs.
      
              --------        --------        --------
             | group0 |      | group1 |      | group2 |
              --------        --------        --------
      	   |                |              |
      	   v                |              v
             ---------------------+-----     -----------------
            |                  ---v--   |   |
            |       DIE0      |  MC1 |  |   |   DIE1
            |                  ------   |   |
             ---------------------------     -----------------
      
We regularly measure how well each group's resource requirements are
satisfied and adjust its affinity; scheduling balance and memory migration
are also considered based on memory locality, to better meet resource
requirements.
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
      713cfd26
4. 08 May 2023: 1 commit
• sched_getaffinity: don't assume 'cpumask_size()' is fully initialized · e7b1f698
  Authored by Linus Torvalds
      stable inclusion
      from stable-v4.19.280
      commit 178ff87d2a0c2d3d74081e1c2efbb33b3487267d
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I715PM
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 6015b1ac ]
      
      The getaffinity() system call uses 'cpumask_size()' to decide how big
      the CPU mask is - so far so good.  It is indeed the allocation size of a
      cpumask.
      
      But the code also assumes that the whole allocation is initialized
      without actually doing so itself.  That's wrong, because we might have
      fixed-size allocations (making copying and clearing more efficient), but
      not all of it is then necessarily used if 'nr_cpu_ids' is smaller.
      
      Having checked other users of 'cpumask_size()', they all seem to be ok,
      either using it purely for the allocation size, or explicitly zeroing
      the cpumask before using the size in bytes to copy it.
      
      See for example the ublk_ctrl_get_queue_affinity() function that uses
      the proper 'zalloc_cpumask_var()' to make sure that the whole mask is
      cleared, whether the storage is on the stack or if it was an external
      allocation.
      
      Fix this by just zeroing the allocation before using it.  Do the same
      for the compat version of sched_getaffinity(), which had the same logic.
      
      Also, for consistency, make sched_getaffinity() use 'cpumask_bits()' to
      access the bits.  For a cpumask_var_t, it ends up being a pointer to the
      same data either way, but it's just a good idea to treat it like you
      would a 'cpumask_t'.  The compat case already did that.
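
A minimal sketch of the resulting pattern (a hypothetical helper, not the exact upstream diff; zalloc_cpumask_var(), cpumask_bits(), cpumask_size() and the kernel-internal sched_getaffinity() are the real APIs being used):

static long sketch_sched_getaffinity_user(pid_t pid, unsigned int len,
					  unsigned long __user *user_mask_ptr)
{
	cpumask_var_t mask;
	long ret;

	/* Zero the whole allocation up front: sched_getaffinity() only
	 * fills in bits below nr_cpu_ids, yet cpumask_size() bytes are
	 * copied back to user space. */
	if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
		return -ENOMEM;

	ret = sched_getaffinity(pid, mask);
	if (ret == 0) {
		unsigned int retlen = min(len, cpumask_size());

		/* Access the bits via cpumask_bits(), as for a cpumask_t. */
		if (copy_to_user(user_mask_ptr, cpumask_bits(mask), retlen))
			ret = -EFAULT;
		else
			ret = retlen;
	}
	free_cpumask_var(mask);

	return ret;
}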
Reported-by: Ryan Roberts <ryan.roberts@arm.com>
      Link: https://lore.kernel.org/lkml/7d026744-6bd6-6827-0471-b5e8eae0be3f@arm.com/
      Cc: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      e7b1f698
5. 06 Apr 2023: 1 commit
  6. 21 Jul 2022: 2 commits
  7. 15 Jun 2022: 2 commits
  8. 19 Apr 2022: 1 commit
  9. 11 Mar 2022: 1 commit
  10. 28 Dec 2021: 1 commit
  11. 29 Nov 2021: 7 commits
  12. 17 Sep 2021: 2 commits
• tasks, sched/core: RCUify the assignment of rq->curr · 8e099519
  Authored by Eric W. Biederman
      mainline inclusion
      from mainline-5.4-rc1
      commit 5311a98f
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I3UKOW
      CVE: NA
      
      -------------------------------------------------
      
      The current task on the runqueue is currently read with rcu_dereference().
      
      To obtain ordinary RCU semantics for an rcu_dereference() of rq->curr it needs
      to be paired with rcu_assign_pointer() of rq->curr.  Which provides the
      memory barrier necessary to order assignments to the task_struct
      and the assignment to rq->curr.
      
      Unfortunately the assignment of rq->curr in __schedule is a hot path,
and it has already been shown that additional barriers in that code
      will reduce the performance of the scheduler.  So I will attempt to
      describe below why you can effectively have ordinary RCU semantics
      without any additional barriers.
      
      The assignment of rq->curr in init_idle is a slow path called once
      per cpu and that can use rcu_assign_pointer() without any concerns.
      
      As I write this there are effectively two users of rcu_dereference() on
      rq->curr.  There is the membarrier code in kernel/sched/membarrier.c
      that only looks at "->mm" after the rcu_dereference().  Then there is
      task_numa_compare() in kernel/sched/fair.c.  My best reading of the
code shows that task_numa_compare only accesses: "->flags",
      "->cpus_ptr", "->numa_group", "->numa_faults[]",
      "->total_numa_faults", and "->se.cfs_rq".
      
      The code in __schedule() essentially does:
      	rq_lock(...);
      	smp_mb__after_spinlock();
      
      	next = pick_next_task(...);
      	rq->curr = next;
      
      	context_switch(prev, next);
      
      At the start of the function the rq_lock/smp_mb__after_spinlock
      pair provides a full memory barrier.  Further there is a full memory barrier
      in context_switch().
      
      This means that any task that has already run and modified itself (the
      common case) has already seen two memory barriers before __schedule()
      runs and begins executing.  A task that modifies itself then sees a
third full memory barrier paired with the rq_lock().
      
For a brand new task that is enqueued with wake_up_new_task() there
are the memory barriers from taking and releasing the pi_lock and the
rq_lock as the process is enqueued, as well as the full memory barrier
at the start of __schedule(), assuming __schedule() happens on the
same cpu.
      
This means that by the time we reach the assignment of rq->curr,
except for values on the task_struct modified in pick_next_task(),
the code has the same guarantees as if it used rcu_assign_pointer().
      
Reading through all of the implementations of pick_next_task, it
appears pick_next_task is limited to modifying the task_struct fields
"->se", "->rt", and "->dl".  These fields are the sched_entity
structures of the various schedulers.
      
      Further "->se.cfs_rq" is only changed in cgroup attach/move operations
      initialized by userspace.
      
Unless I have missed something, this means that in practice the
users of "rcu_dereference(rq->curr)" get normal RCU semantics of
rcu_dereference() for the fields they care about, despite the
assignment of rq->curr in __schedule() not using rcu_assign_pointer().
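
In code, the change argued for above amounts to roughly the following sketch (simplified, not the exact diff):

	/* Hot path, __schedule(): the rq_lock()/smp_mb__after_spinlock() pair
	 * and the barrier in context_switch() already order the publication,
	 * so an RCU-annotated store without an extra barrier is enough. */
	RCU_INIT_POINTER(rq->curr, next);

	/* Slow path, init_idle(): runs once per cpu, so the full barrier in
	 * rcu_assign_pointer() is acceptable here. */
	rcu_assign_pointer(rq->curr, idle);

	/* Readers keep using rcu_dereference(rq->curr) as before. */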
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Li Hua <hucool.lihua@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8e099519
• tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue · af5294bd
  Authored by Eric W. Biederman
      mainline inclusion
      from mainline-5.4-rc1
      commit 0ff7b2cf
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I3UKOW
      CVE: NA
      
      -------------------------------------------------
      
      In the ordinary case today the RCU grace period for a task_struct is
triggered when another process waits for its zombie and causes the
kernel to call release_task().  As the waiting task has to receive a
signal and then act upon it before this happens, typically this will
occur after the original task has been removed from the runqueue.
      
Unfortunately in some cases such as self reaping tasks it can be shown
      that release_task() will be called starting the grace period for
      task_struct long before the task leaves the runqueue.
      
      Therefore use put_task_struct_rcu_user() in finish_task_switch() to
guarantee that there is an RCU lifetime after the task
      leaves the runqueue.
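
A sketch of the change in finish_task_switch() (simplified; the surrounding TASK_DEAD handling is unchanged):

	if (unlikely(prev_state == TASK_DEAD)) {
		if (prev->sched_class->task_dead)
			prev->sched_class->task_dead(prev);

		/* was: put_task_struct(prev);
		 * put_task_struct_rcu_user() keeps the task_struct alive for
		 * an RCU grace period after it has left the runqueue. */
		put_task_struct_rcu_user(prev);
	}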
      
      Besides the change in the start of the RCU grace period for the
task_struct, this change may delay perf_event_delayed_put and
trace_sched_process_free.  The function perf_event_delayed_put boils
down to just a WARN_ON for cases that I assume never happen.  So
      I don't see any problem with delaying it.
      
      The function trace_sched_process_free is a trace point and thus
visible to user space.  Occasionally userspace has the strangest
dependencies so this has a minuscule chance of causing a regression.
      This change only changes the timing of when the tracepoint is called.
      The change in timing arguably gives userspace a more accurate picture
      of what is going on.  So I don't expect there to be a regression.
      
      In the case where a task self reaps we are pretty much guaranteed that
      the RCU grace period is delayed.  So we should get quite a bit of
coverage of this worst case for the change in a normal threaded
      workload.  So I expect any issues to turn up quickly or not at all.
      
      I have lightly tested this change and everything appears to work
      fine.
Inspired-by: Linus Torvalds <torvalds@linux-foundation.org>
Inspired-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/87r24jdpl5.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Li Hua <hucool.lihua@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
      
       Conflicts:
               kernel/fork.c
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      af5294bd
13. 01 Sep 2021: 1 commit
• sched: Fix sched_fork() access an invalid sched_task_group · 74bd9b82
  Authored by Zhang Qiao
      hulk inclusion
      category: bugfix
      bugzilla: 177205, https://gitee.com/openeuler/kernel/issues/I484Y1
      CVE: NA
      
      --------------------------------
      
There is a small race between copy_process() and sched_fork()
where child->sched_task_group points to an already freed pointer.
      
      parent doing fork()      | someone moving the parent
      				to another cgroup
      -------------------------------+-------------------------------
      copy_process()
           + dup_task_struct()<1>
                                      parent move to another cgroup,
                                      and free the old cgroup. <2>
           + sched_fork()
             + __set_task_cpu()<3>
               + task_fork_fair()
                 + sched_slice()<4>
      
In the worst case, this bug can lead to "use-after-free" and
cause a panic as shown below:
(1) parent copies its sched_task_group to the child at <1>;
(2) someone moves the parent to another cgroup and frees the old
   cgroup at <2>;
(3) the sched_task_group and cfs_rq that belong to the old cgroup
   will be accessed at <3> and <4>, which causes a panic:
      
      [89249.732198] BUG: unable to handle kernel NULL pointer
      dereference at 0000000000000000
      [89249.732701] PGD 8000001fa0a86067 P4D 8000001fa0a86067 PUD
      2029955067 PMD 0
      [89249.733005] Oops: 0000 [#1] SMP PTI
      [89249.733288] CPU: 7 PID: 648398 Comm: ebizzy Kdump: loaded
      Tainted: G           OE    --------- -  - 4.18.0.x86_64+ #1
      [89249.734318] RIP: 0010:sched_slice+0x84/0xc0
      ....
      [89249.737910] Call Trace:
      [89249.738181]  task_fork_fair+0x81/0x120
      [89249.738457]  sched_fork+0x132/0x240
      [89249.738732]  copy_process.part.5+0x675/0x20e0
      [89249.739010]  ? __handle_mm_fault+0x63f/0x690
      [89249.739286]  _do_fork+0xcd/0x3b0
      [89249.739558]  do_syscall_64+0x5d/0x1d0
      [89249.739830]  entry_SYSCALL_64_after_hwframe+0x65/0xca
      [89249.740107] RIP: 0033:0x7f04418cd7e1
      
When a new process is forked, cgroup_post_fork() associates it
with the cgroup of its parent. Therefore this commit moves
__set_task_cpu() and task_fork(), which access cgroup-related
fields (sched_task_group and cfs_rq), into sched_post_fork() and
calls sched_post_fork() after cgroup_post_fork(), as sketched below.
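
A sketch of the resulting flow (simplified; the actual patch may differ in detail):

	/* copy_process():
	 *	sched_fork(clone_flags, p);	// no longer touches cgroup state
	 *	...
	 *	cgroup_post_fork(p);		// child attached to its final cgroup
	 *	sched_post_fork(p);		// now safe to use p->sched_task_group
	 */
	void sched_post_fork(struct task_struct *p)
	{
		unsigned long flags;

		raw_spin_lock_irqsave(&p->pi_lock, flags);
		__set_task_cpu(p, smp_processor_id());
		if (p->sched_class->task_fork)
			p->sched_class->task_fork(p);
		raw_spin_unlock_irqrestore(&p->pi_lock, flags);
	}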
      
      Fixes: 8323f26c ("sched: Fix race in task_group")
Signed-off-by: Zhang Qiao <zhangqiao22@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      74bd9b82
14. 07 May 2021: 2 commits
  15. 30 Apr 2021: 2 commits
  16. 15 Apr 2021: 2 commits
  17. 14 Apr 2021: 3 commits
  18. 22 Feb 2021: 1 commit
  19. 22 Sep 2020: 3 commits
• sched: Fix unreliable rseq cpu_id for new tasks · 7f1d4048
  Authored by Mathieu Desnoyers
      stable inclusion
      from linux-4.19.134
      commit d2fc2e5774eb1911829ae761bc1569a05b72ebdc
      
      --------------------------------
      
      commit ce3614da upstream.
      
      While integrating rseq into glibc and replacing glibc's sched_getcpu
      implementation with rseq, glibc's tests discovered an issue with
      incorrect __rseq_abi.cpu_id field value right after the first time
      a newly created process issues sched_setaffinity.
      
      For the records, it triggers after building glibc and running tests, and
      then issuing:
      
        for x in {1..2000} ; do posix/tst-affinity-static  & done
      
      and shows up as:
      
      error: Unexpected CPU 2, expected 0
      error: Unexpected CPU 2, expected 0
      error: Unexpected CPU 2, expected 0
      error: Unexpected CPU 2, expected 0
      error: Unexpected CPU 138, expected 0
      error: Unexpected CPU 138, expected 0
      error: Unexpected CPU 138, expected 0
      error: Unexpected CPU 138, expected 0
      
      This is caused by the scheduler invoking __set_task_cpu() directly from
      sched_fork() and wake_up_new_task(), thus bypassing rseq_migrate() which
      is done by set_task_cpu().
      
      Add the missing rseq_migrate() to both functions. The only other direct
      use of __set_task_cpu() is done by init_idle(), which does not involve a
      user-space task.
      
      Based on my testing with the glibc test-case, just adding rseq_migrate()
      to wake_up_new_task() is sufficient to fix the observed issue. Also add
      it to sched_fork() to keep things consistent.
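
A condensed sketch of the fix (both call sites get the same treatment):

	/* sched_fork() and wake_up_new_task() place the new task on a CPU
	 * with __set_task_cpu(), which bypasses set_task_cpu() and therefore
	 * rseq_migrate(); add the call explicitly: */
	__set_task_cpu(p, cpu);
	rseq_migrate(p);	/* keep __rseq_abi.cpu_id consistent for the child */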
      
      The reason why this never triggered so far with the rseq/basic_test
      selftest is unclear.
      
      The current use of sched_getcpu(3) does not typically require it to be
      always accurate. However, use of the __rseq_abi.cpu_id field within rseq
      critical sections requires it to be accurate. If it is not accurate, it
      can cause corruption in the per-cpu data targeted by rseq critical
      sections in user-space.
Reported-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Florian Weimer <fweimer@redhat.com>
      Cc: stable@vger.kernel.org # v4.18+
Link: https://lkml.kernel.org/r/20200707201505.2632-1-mathieu.desnoyers@efficios.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      7f1d4048
• sched/core: Fix PI boosting between RT and DEADLINE tasks · 7bab2bb2
  Authored by Juri Lelli
      stable inclusion
      from linux-4.19.131
      commit e852bdcce9e41c26127e4b919210e3445590a1a4
      
      --------------------------------
      
      [ Upstream commit 740797ce ]
      
      syzbot reported the following warning:
      
       WARNING: CPU: 1 PID: 6351 at kernel/sched/deadline.c:628
       enqueue_task_dl+0x22da/0x38a0 kernel/sched/deadline.c:1504
      
      At deadline.c:628 we have:
      
       623 static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
       624 {
       625 	struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
       626 	struct rq *rq = rq_of_dl_rq(dl_rq);
       627
       628 	WARN_ON(dl_se->dl_boosted);
       629 	WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
              [...]
           }
      
      Which means that setup_new_dl_entity() has been called on a task
      currently boosted. This shouldn't happen though, as setup_new_dl_entity()
      is only called when the 'dynamic' deadline of the new entity
      is in the past w.r.t. rq_clock and boosted tasks shouldn't verify this
      condition.
      
Digging through the PI code I noticed that the above might in fact happen
if an RT task blocks on an rt_mutex held by a DEADLINE task. In the
      first branch of boosting conditions we check only if a pi_task 'dynamic'
      deadline is earlier than mutex holder's and in this case we set mutex
      holder to be dl_boosted. However, since RT 'dynamic' deadlines are only
      initialized if such tasks get boosted at some point (or if they become
      DEADLINE of course), in general RT 'dynamic' deadlines are usually equal
      to 0 and this verifies the aforementioned condition.
      
      Fix it by checking that the potential donor task is actually (even if
      temporary because in turn boosted) running at DEADLINE priority before
      using its 'dynamic' deadline value.
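
A sketch of the tightened condition in rt_mutex_setprio() (condensed from the fix referenced above):

	if (dl_prio(prio)) {
		struct task_struct *pi_task = rt_mutex_get_top_task(p);

		/* Only trust pi_task's 'dynamic' deadline if the donor really
		 * runs at DEADLINE priority (natively, or because it is itself
		 * boosted); an RT donor's dl.deadline is usually just 0. */
		if (!dl_prio(p->normal_prio) ||
		    (pi_task && dl_prio(pi_task->prio) &&
		     dl_entity_preempt(&pi_task->dl, &p->dl))) {
			p->dl.dl_boosted = 1;
			queue_flag |= ENQUEUE_REPLENISH;
		} else {
			p->dl.dl_boosted = 0;
		}
	}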
      
      Fixes: 2d3d891d ("sched/deadline: Add SCHED_DEADLINE inheritance logic")
      Reported-by: syzbot+119ba87189432ead09b4@syzkaller.appspotmail.com
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Tested-by: Daniel Wagner <dwagner@suse.de>
Link: https://lkml.kernel.org/r/20181119153201.GB2119@localhost.localdomain
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      7bab2bb2
• sched/core: Fix illegal RCU from offline CPUs · 6cf457d1
  Authored by Peter Zijlstra
      stable inclusion
      from linux-4.19.129
      commit 373491f1f41896241864b527b584856d8a510946
      
      --------------------------------
      
      [ Upstream commit bf2c59fc ]
      
      In the CPU-offline process, it calls mmdrop() after idle entry and the
      subsequent call to cpuhp_report_idle_dead(). Once execution passes the
      call to rcu_report_dead(), RCU is ignoring the CPU, which results in
      lockdep complaining when mmdrop() uses RCU from either memcg or
      debugobjects below.
      
      Fix it by cleaning up the active_mm state from BP instead. Every arch
      which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
      from AP. The only exception is parisc because it switches them to
      &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
      but the patch will still work there because it calls mmgrab(&init_mm) in
      smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().
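
A sketch of the BP-side cleanup (simplified from the finish_cpu() hotplug step the patch adds):

	static int finish_cpu(unsigned int cpu)
	{
		struct task_struct *idle = idle_thread_get(cpu);
		struct mm_struct *mm = idle->active_mm;

		/* idle_task_exit() already ran on the dying AP; drop the mm
		 * reference here on the BP, where RCU is still watching. */
		if (mm != &init_mm)
			idle->active_mm = &init_mm;
		mmdrop(mm);
		return 0;
	}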
      
        WARNING: suspicious RCU usage
        -----------------------------
        kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!
      
        other info that might help us debug this:
      
        RCU used illegally from offline CPU!
        Call Trace:
         dump_stack+0xf4/0x164 (unreliable)
         lockdep_rcu_suspicious+0x140/0x164
         get_work_pool+0x110/0x150
         __queue_work+0x1bc/0xca0
         queue_work_on+0x114/0x120
         css_release+0x9c/0xc0
         percpu_ref_put_many+0x204/0x230
         free_pcp_prepare+0x264/0x570
         free_unref_page+0x38/0xf0
         __mmdrop+0x21c/0x2c0
         idle_task_exit+0x170/0x1b0
         pnv_smp_cpu_kill_self+0x38/0x2e0
         cpu_die+0x48/0x64
         arch_cpu_idle_dead+0x30/0x50
         do_idle+0x2f4/0x470
         cpu_startup_entry+0x38/0x40
         start_secondary+0x7a8/0xa80
         start_secondary_resume+0x10/0x14
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Link: https://lkml.kernel.org/r/20200401214033.8448-1-cai@lca.pw
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6cf457d1
20. 05 Mar 2020: 2 commits
• sched/membarrier: Fix p->mm->membarrier_state racy load · 08946ecc
  Authored by Mathieu Desnoyers
      mainline inclusion
      from mainline-5.4-rc1
      commit 227a4aad
      category: bugfix
      bugzilla: 28332
      CVE: NA
      
      -------------------------------------------------
      
      The membarrier_state field is located within the mm_struct, which
      is not guaranteed to exist when used from runqueue-lock-free iteration
      on runqueues by the membarrier system call.
      
      Copy the membarrier_state from the mm_struct into the scheduler runqueue
      when the scheduler switches between mm.
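
A sketch of that snapshot step (close to the membarrier_switch_mm() helper the patch introduces):

	static inline void membarrier_switch_mm(struct rq *rq,
						struct mm_struct *prev_mm,
						struct mm_struct *next_mm)
	{
		int membarrier_state;

		if (prev_mm == next_mm)
			return;

		membarrier_state = atomic_read(&next_mm->membarrier_state);
		if (READ_ONCE(rq->membarrier_state) == membarrier_state)
			return;

		/* The runqueue now carries its own copy, so membarrier can
		 * read it without dereferencing a possibly-gone mm_struct. */
		WRITE_ONCE(rq->membarrier_state, membarrier_state);
	}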
      
      When registering membarrier for mm, after setting the registration bit
      in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without swapping to
another mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue whose current
mm is the one whose state has just been modified.
      
      Move the mm membarrier_state field closer to pgd in mm_struct to use
      a cache line already touched by the scheduler switch_mm.
      
      The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
      clear the runqueue's membarrier state in addition to clear the mm
      membarrier state, so move its implementation into the scheduler
      membarrier code so it can access the runqueue structure.
      
      Add memory barrier in membarrier_exec_mmap() prior to clearing
      the membarrier state, ensuring memory accesses executed prior to exec
      are not reordered with the stores clearing the membarrier state.
      
      As suggested by Linus, move all membarrier.c RCU read-side locks outside
      of the for each cpu loops.
      
      [Cheng Jian: use task_rcu_dereference in sync_runqueues_membarrier_state]
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King - ARM Linux admin <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      08946ecc
• sched: Clean up active_mm reference counting · cfd49aa0
  Authored by Peter Zijlstra
      mainline inclusion
      from mainline-5.4-rc1
      commit 139d025c
      category: bugfix
      bugzilla: 28332 [preparation]
      CVE: NA
      
      -------------------------------------------------
      
      The current active_mm reference counting is confusing and sub-optimal.
      
      Rewrite the code to explicitly consider the 4 separate cases:
      
          user -> user
      
      	When switching between two user tasks, all we need to consider
      	is switch_mm().
      
          user -> kernel
      
      	When switching from a user task to a kernel task (which
      	doesn't have an associated mm) we retain the last mm in our
      	active_mm. Increment a reference count on active_mm.
      
        kernel -> kernel
      
      	When switching between kernel threads, all we need to do is
      	pass along the active_mm reference.
      
        kernel -> user
      
      	When switching between a kernel and user task, we must switch
      	from the last active_mm to the next mm, hoping of course that
      	these are the same. Decrement a reference on the active_mm.
      
      The code keeps a different order, because as you'll note, both 'to
      user' cases require switch_mm().
      
      And where the old code would increment/decrement for the 'kernel ->
      kernel' case, the new code observes this is a neutral operation and
      avoids touching the reference count.
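
A condensed sketch of the restructured mm handling in context_switch() following those four cases:

	if (!next->mm) {				/* to kernel */
		enter_lazy_tlb(prev->active_mm, next);
		next->active_mm = prev->active_mm;

		if (prev->mm)				/* from user */
			mmgrab(prev->active_mm);
		else					/* kernel -> kernel */
			prev->active_mm = NULL;		/* just pass it along */
	} else {					/* to user */
		switch_mm_irqs_off(prev->active_mm, next->mm, next);

		if (!prev->mm) {			/* from kernel */
			/* mmdrop() happens later, in finish_task_switch(). */
			rq->prev_mm = prev->active_mm;
			prev->active_mm = NULL;
		}
	}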
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: luto@kernel.org
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      cfd49aa0
21. 27 Dec 2019: 2 commits
• sched/core: Avoid spurious lock dependencies · 011e08fb
  Authored by Peter Zijlstra
      [ Upstream commit ff51ff84 ]
      
      While seemingly harmless, __sched_fork() does hrtimer_init(), which,
when DEBUG_OBJECTS is enabled, can end up doing allocations.
      
      This then results in the following lock order:
      
        rq->lock
          zone->lock.rlock
            batched_entropy_u64.lock
      
      Which in turn causes deadlocks when we do wakeups while holding that
      batched_entropy lock -- as the random code does.
      
      Solve this by moving __sched_fork() out from under rq->lock. This is
      safe because nothing there relies on rq->lock, as also evident from the
      other __sched_fork() callsite.
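
A sketch of the reordering in init_idle() (simplified):

	/* Do the (possibly allocating, under DEBUG_OBJECTS) init first ... */
	__sched_fork(0, idle);

	/* ... and only then take the locks, so rq->lock never ends up
	 * nesting above zone->lock / batched_entropy_u64.lock. */
	raw_spin_lock_irqsave(&idle->pi_lock, flags);
	raw_spin_lock(&rq->lock);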
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: bigeasy@linutronix.de
      Cc: cl@linux.com
      Cc: keescook@chromium.org
      Cc: penberg@kernel.org
      Cc: rientjes@google.com
      Cc: thgarnie@google.com
      Cc: tytso@mit.edu
      Cc: will@kernel.org
      Fixes: b7d5dc21 ("random: add a spinlock_t to struct batched_entropy")
Link: https://lkml.kernel.org/r/20191001091837.GK4536@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      011e08fb
• sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr() · ff4666ed
  Authored by KeMeng Shi
      [ Upstream commit 714e501e ]
      
      An oops can be triggered in the scheduler when running qemu on arm64:
      
       Unable to handle kernel paging request at virtual address ffff000008effe40
       Internal error: Oops: 96000007 [#1] SMP
       Process migration/0 (pid: 12, stack limit = 0x00000000084e3736)
       pstate: 20000085 (nzCv daIf -PAN -UAO)
       pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
       lr : move_queued_task.isra.21+0x124/0x298
       ...
       Call trace:
        __ll_sc___cmpxchg_case_acq_4+0x4/0x20
        __migrate_task+0xc8/0xe0
        migration_cpu_stop+0x170/0x180
        cpu_stopper_thread+0xec/0x178
        smpboot_thread_fn+0x1ac/0x1e8
        kthread+0x134/0x138
        ret_from_fork+0x10/0x18
      
__set_cpus_allowed_ptr() will choose an active dest_cpu in the affinity mask
to migrate the process if the process is not currently running on any of the
CPUs specified in the affinity mask. __set_cpus_allowed_ptr() will choose an
invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if the
CPUs in the affinity mask are deactivated by cpu_down after the
cpumask_intersects check. The subsequent cpumask_test_cpu() of dest_cpu reads
past the mask and may pass if the corresponding bit is coincidentally set. As
a consequence, the kernel will access an invalid rq address associated with
the invalid CPU in migration_cpu_stop->__migrate_task->move_queued_task and
the Oops occurs.
      
To reproduce the crash:
      
        1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
        sched_setaffinity.
      
        2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online"
        and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.
      
        3) Oops appears if the invalid CPU is set in memory after tested cpumask.
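
A condensed sketch of the fix in __set_cpus_allowed_ptr():

	/* Pick dest_cpu while still holding the runqueue lock, and bail out
	 * if every CPU in the new mask has gone offline in the meantime,
	 * rather than relying on an earlier cpumask_intersects() check. */
	dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
	if (dest_cpu >= nr_cpu_ids) {
		ret = -EINVAL;
		goto out;
	}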
Signed-off-by: KeMeng Shi <shikemeng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ff4666ed