- 13 8月, 2022 1 次提交
-
-
由 fanrui 提交于
**Content:** HCK(High-performance Computing Kit) is a feature focusing on CPU isolation to reduce OS noise. By providing newly added kernel starting parameters to seperate all cpus into two groups, one group is managed by conventional linux, on which tasks like link interrupt and kernel thread would run, the other group is managed by HCK, on which the user can specify HPC(High Performance Computing) task to run. The goal is to reduce the performance influence of interruputs to tasks running on HCK managed cores. This feature is not enabled by default, open CONFIG_PURPOSE_BUILT_KERNEL to enable this feature.
-
- 23 5月, 2022 1 次提交
-
-
由 Igor Pylypiv 提交于
stable inclusion from stable-v5.10.102 commit de55891e162cac0ae058e05c2527fd32cc435ac0 bugzilla: https://gitee.com/openeuler/kernel/issues/I567K6 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=de55891e162cac0ae058e05c2527fd32cc435ac0 -------------------------------- [ Upstream commit 67d6212a ] This reverts commit 774a1221. We need to finish all async code before the module init sequence is done. In the reverted commit the PF_USED_ASYNC flag was added to mark a thread that called async_schedule(). Then the PF_USED_ASYNC flag was used to determine whether or not async_synchronize_full() needs to be invoked. This works when modprobe thread is calling async_schedule(), but it does not work if module dispatches init code to a worker thread which then calls async_schedule(). For example, PCI driver probing is invoked from a worker thread based on a node where device is attached: if (cpu < nr_cpu_ids) error = work_on_cpu(cpu, local_pci_probe, &ddi); else error = local_pci_probe(&ddi); We end up in a situation where a worker thread gets the PF_USED_ASYNC flag set instead of the modprobe thread. As a result, async_synchronize_full() is not invoked and modprobe completes without waiting for the async code to finish. The issue was discovered while loading the pm80xx driver: (scsi_mod.scan=async) modprobe pm80xx worker ... do_init_module() ... pci_call_probe() work_on_cpu(local_pci_probe) local_pci_probe() pm8001_pci_probe() scsi_scan_host() async_schedule() worker->flags |= PF_USED_ASYNC; ... < return from worker > ... if (current->flags & PF_USED_ASYNC) <--- false async_synchronize_full(); Commit 21c3c5d2 ("block: don't request module during elevator init") fixed the deadlock issue which the reverted commit 774a1221 ("module, async: async_synchronize_full() on module init iff async is used") tried to fix. Since commit 0fdff3ec ("async, kmod: warn on synchronous request_module() from async workers") synchronous module loading from async is not allowed. Given that the original deadlock issue is fixed and it is no longer allowed to call synchronous request_module() from async we can remove PF_USED_ASYNC flag to make module init consistently invoke async_synchronize_full() unless async module probe is requested. Signed-off-by: NIgor Pylypiv <ipylypiv@google.com> Reviewed-by: NChangyuan Lyu <changyuanl@google.com> Reviewed-by: NLuis Chamberlain <mcgrof@kernel.org> Acked-by: NTejun Heo <tj@kernel.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYu Liao <liaoyu15@huawei.com> Reviewed-by: NWei Li <liwei391@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 10 5月, 2022 2 次提交
-
-
由 Guan Jing 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I52611 CVE: NA -------------------------------- We have added two statistics for qos smt expeller: a) nr_qos_smt_send_ipi:the times of ipi which online task expel offline tasks; b) nr_qos_smt_expelled:the statistics that offline task will not be picked times. Signed-off-by: NGuan Jing <guanjing6@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Guan Jing 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I52611 CVE: NA -------------------------------- We implement the function of qos smt expeller by this following two points: a)when online tasks and offline tasks are running on the same physical cpu, online tasks will send ipi to expel offline tasks on the smt sibling cpus. b)when online tasks are running, the smt sibling cpus will not allow offline tasks to be selected. Signed-off-by: NGuan Jing <guanjing6@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 23 2月, 2022 1 次提交
-
-
由 Peng Wu 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4PM01 CVE: NA ------------------------------------------ Adding reliable flag for user task. User task with reliable flag can only alloc memory from mirrored region. PF_RELIABLE is added to represent the task's reliable flag. - For init task, which is regarded as special task which alloc memory from mirrored region. - For normal user tasks, The reliable flag can be set via procfs interface shown as below and can be inherited via fork(). User can change a user task's reliable flag by $ echo [0/1] > /proc/<pid>/reliable and check a user task's reliable flag by $ cat /proc/<pid>/reliable Note, global init task's reliable file can not be accessed. Signed-off-by: NPeng Wu <wupeng58@huawei.com> Signed-off-by: NMa Wupeng <mawupeng1@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 29 1月, 2022 2 次提交
-
-
由 Guan Jing 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4KAP1?from=project-issue CVE: NA ------------------------------- We reserve some fields beforehand for sched structures prone to change, therefore, we can hot add/change features of sched with this enhancement. After reserving, normally cache does not matter as the reserved fields are not accessed at all. Signed-off-by: NGuan Jing <guanjing6@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Ard Biesheuvel 提交于
mainline inclusion from mainline-v5.16-rc1 commit bcf9033e category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4Q94A CVE: NA --------------------------- THREAD_INFO_IN_TASK moved the CPU field out of thread_info, but this causes some issues on architectures that define raw_smp_processor_id() in terms of this field, due to the fact that #include'ing linux/sched.h to get at struct task_struct is problematic in terms of circular dependencies. Given that thread_info and task_struct are the same data structure anyway when THREAD_INFO_IN_TASK=y, let's move it back so that having access to the type definition of struct thread_info is sufficient to reference the CPU number of the current task. Note that this requires THREAD_INFO_IN_TASK's definition of the task_thread_info() helper to be updated, as task_cpu() takes a pointer-to-const, whereas task_thread_info() (which is used to generate lvalues as well), needs a non-const pointer. So make it a macro instead. Signed-off-by: NArd Biesheuvel <ardb@kernel.org> Acked-by: NCatalin Marinas <catalin.marinas@arm.com> Acked-by: NMark Rutland <mark.rutland@arm.com> Acked-by: NMichael Ellerman <mpe@ellerman.id.au> Signed-off-by: NGuan Jing <guanjing6@huawei.com> Conflicts: arch/arm64/kernel/head.S Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 31 12月, 2021 1 次提交
-
-
由 Guan Jing 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4KAP1?from=project-issue CVE: NA -------- We reserve some fields beforehand for sched structures prone to change, therefore, we can hot add/change features of sched with this enhancement. After reserving, normally cache does not matter as the reserved fields are not accessed at all. Signed-off-by: NGuan Jing <guanjing6@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Reviewed-by: NCheng Jian <cj.chengjian@huawei.com> Reviewed-by: NCheng Jian <cj.chengjian@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 29 12月, 2021 1 次提交
-
-
由 Marcelo Tosatti 提交于
mainline inclusion from mainline-5.14-rc1 commit a1dfb631 category: feature feature: Deep isolation bugzilla: https://gitee.com/openeuler/kernel/issues/I4N00D CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a1dfb6311c7739e21e160bc4c5575a1b21b48c87 -------------------------------- When the tick dependency of a task is updated, we want it to aknowledge the new state and restart the tick if needed. If the task is not running, we don't need to kick it because it will observe the new dependency upon scheduling in. But if the task is running, we may need to send an IPI to it so that it gets notified. Unfortunately we don't have the means to check if a task is running in a race free way. Checking p->on_cpu in a synchronized way against p->tick_dep_mask would imply adding a full barrier between prepare_task_switch() and tick_nohz_task_switch(), which we want to avoid in this fast-path. Therefore we blindly fire an IPI to the task's CPU. Meanwhile we can check if the task is queued on the CPU rq because p->on_rq is always set to TASK_ON_RQ_QUEUED _before_ schedule() and its full barrier that precedes tick_nohz_task_switch(). And if the task is queued on a nohz_full CPU, it also has fair chances to be running as the isolation constraints prescribe running single tasks on full dynticks CPUs. So use this as a trick to check if we can spare an IPI toward a non-running task. NOTE: For the ordering to be correct, it is assumed that we never deactivate a task while it is running, the only exception being the task deactivating itself while scheduling out. Suggested-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NFrederic Weisbecker <frederic@kernel.org> Signed-off-by: NIngo Molnar <mingo@kernel.org> Acked-by: NPeter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/20210512232924.150322-9-frederic@kernel.orgSigned-off-by: NYunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: NChao Liu <liuchao173@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Acked-by: NXie XiuQi <xiexiuqi@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 10 12月, 2021 1 次提交
-
-
由 Zheng Zucheng 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4LH1X CVE: NA -------------------------------- When online tasks occupy cpu long time, offline task will not get cpu to run, the priority inversion issue may be triggered in this case. If the above case occurs, we will unthrottle offline tasks and let its get a chance to run. When online tasks occupy cpu over 5s(defaule value), we will unthrottle offline tasks and enter a msleep loop before exit to usermode util the cpu goto idle. Signed-off-by: NZhang Qiao <zhangqiao22@huawei.com> Signed-off-by: NZheng Zucheng <zhengzucheng@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Reviewed-by: NXiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 15 11月, 2021 2 次提交
-
-
由 Andrii Nakryiko 提交于
mainline inclusion from mainline-v5.15-rc1 commit c7603cfa category: bugfix bugzilla: 182996 https://gitee.com/openeuler/kernel/issues/I4DDEL Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c7603cfa04e7c3a435b31d065f7cbdc829428f6e -------------------------------- b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper") fixed the problem with cgroup-local storage use in BPF by pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate possible BPF program preemptions and nested executions. While this seems to work good in practice, it introduces new and unnecessary failure mode in which not all BPF programs might be executed if we fail to find an unused slot for cgroup storage, however unlikely it is. It might also not be so unlikely when/if we allow sleepable cgroup BPF programs in the future. Further, the way that cgroup storage is implemented as ambiently-available property during entire BPF program execution is a convenient way to pass extra information to BPF program and helpers without requiring user code to pass around extra arguments explicitly. So it would be good to have a generic solution that can allow implementing this without arbitrary restrictions. Ideally, such solution would work for both preemptable and sleepable BPF programs in exactly the same way. This patch introduces such solution, bpf_run_ctx. It adds one pointer field (bpf_ctx) to task_struct. This field is maintained by BPF_PROG_RUN family of macros in such a way that it always stays valid throughout BPF program execution. BPF program preemption is handled by remembering previous current->bpf_ctx value locally while executing nested BPF program and restoring old value after nested BPF program finishes. This is handled by two helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are supposed to be used before and after BPF program runs, respectively. Restoring old value of the pointer handles preemption, while bpf_run_ctx pointer being a property of current task_struct naturally solves this problem for sleepable BPF programs by "following" BPF program execution as it is scheduled in and out of CPU. It would even allow CPU migration of BPF programs, even though it's not currently allowed by BPF infra. This patch cleans up cgroup local storage handling as a first application. The design itself is generic, though, with bpf_run_ctx being an empty struct that is supposed to be embedded into a specific struct for a given BPF program type (bpf_cg_run_ctx in this case). Follow up patches are planned that will expand this mechanism for other uses within tracing BPF programs. To verify that this change doesn't revert the fix to the original cgroup storage issue, I ran the same repro as in the original report ([0]) and didn't get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work). [0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/ Fixes: b910eaaa ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper") Signed-off-by: NAndrii Nakryiko <andrii@kernel.org> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NYonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org Conflicts: include/linux/bpf.h include/linux/sched.h kernel/fork.c net/bpf/test_run.c Signed-off-by: NPu Lehui <pulehui@huawei.com> Reviewed-by: NKuohai Xu <xukuohai@huawei.com> Signed-off-by: NChen Jun <chenjun102@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Peter Zijlstra 提交于
stable inclusion from stable-5.10.74 commit bb893f075431e97dd72cde0721957a73d26578a8 bugzilla: 182986 https://gitee.com/openeuler/kernel/issues/I4I3MG Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=bb893f075431e97dd72cde0721957a73d26578a8 -------------------------------- [ Upstream commit 83d40a61 ] vmlinux.o: warning: objtool: check_preemption_disabled()+0x81: call to is_percpu_thread() leaves .noinstr.text section Reported-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20210928084218.063371959@infradead.orgSigned-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NChen Jun <chenjun102@huawei.com> Reviewed-by: NWeilong Chen <chenweilong@huawei.com> Acked-by: NWeilong Chen <chenweilong@huawei.com> Signed-off-by: NChen Jun <chenjun102@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 21 10月, 2021 1 次提交
-
-
由 Tony Luck 提交于
stable inclusion from stable-5.10.68 commit 619d747c1850bab61625ca9d8b4730f470a5947b bugzilla: 182671 https://gitee.com/openeuler/kernel/issues/I4EWUH Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=619d747c1850bab61625ca9d8b4730f470a5947b -------------------------------- commit 81065b35 upstream. There are two cases for machine check recovery: 1) The machine check was triggered by ring3 (application) code. This is the simpler case. The machine check handler simply queues work to be executed on return to user. That code unmaps the page from all users and arranges to send a SIGBUS to the task that triggered the poison. 2) The machine check was triggered in kernel code that is covered by an exception table entry. In this case the machine check handler still queues a work entry to unmap the page, etc. but this will not be called right away because the #MC handler returns to the fix up code address in the exception table entry. Problems occur if the kernel triggers another machine check before the return to user processes the first queued work item. Specifically, the work is queued using the ->mce_kill_me callback structure in the task struct for the current thread. Attempting to queue a second work item using this same callback results in a loop in the linked list of work functions to call. So when the kernel does return to user, it enters an infinite loop processing the same entry for ever. There are some legitimate scenarios where the kernel may take a second machine check before returning to the user. 1) Some code (e.g. futex) first tries a get_user() with page faults disabled. If this fails, the code retries with page faults enabled expecting that this will resolve the page fault. 2) Copy from user code retries a copy in byte-at-time mode to check whether any additional bytes can be copied. On the other side of the fence are some bad drivers that do not check the return value from individual get_user() calls and may access multiple user addresses without noticing that some/all calls have failed. Fix by adding a counter (current->mce_count) to keep track of repeated machine checks before task_work() is called. First machine check saves the address information and calls task_work_add(). Subsequent machine checks before that task_work call back is executed check that the address is in the same page as the first machine check (since the callback will offline exactly one page). Expected worst case is four machine checks before moving on (e.g. one user access with page faults disabled, then a repeat to the same address with page faults enabled ... repeat in copy tail bytes). Just in case there is some code that loops forever enforce a limit of 10. [ bp: Massage commit message, drop noinstr, fix typo, extend panic messages. ] Fixes: 5567d11c ("x86/mce: Send #MC singal from task work") Signed-off-by: NTony Luck <tony.luck@intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/YT/IJ9ziLqmtqEPu@agluck-desk2.amr.corp.intel.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NChen Jun <chenjun102@huawei.com> Acked-by: NWeilong Chen <chenweilong@huawei.com> Signed-off-by: NChen Jun <chenjun102@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 28 7月, 2021 1 次提交
-
-
由 Peter Zijlstra (Intel) 提交于
mainline inclusion from mainline-5.12-rc1 commit b965f1dd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I410UT CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b965f1ddb47daa5b8b2e2bc9c921431236830367 --------------------------- Provide static calls to control cond_resched() (called in !CONFIG_PREEMPT) and might_resched() (called in CONFIG_PREEMPT_VOLUNTARY) to that we can override their behaviour when preempt= is overriden. Since the default behaviour is full preemption, both their calls are ignored when preempt= isn't passed. [fweisbec: branch might_resched() directly to __cond_resched(), only define static calls when PREEMPT_DYNAMIC] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NFrederic Weisbecker <frederic@kernel.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NIngo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20210118141223.123667-6-frederic@kernel.orgSigned-off-by: NMa Junhai <majunhai2@huawei.com> Conflicts: include/linux/kernel.h Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 14 7月, 2021 1 次提交
-
-
由 Zheng Zucheng 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZX4D CVE: NA -------------------------------- If online tasks occupy 100% CPU resources, offline tasks can't be scheduled since offline tasks are throttled, as a result, offline task can't timely respond after receiving SIGKILL signal. Signed-off-by: NZheng Zucheng <zhengzucheng@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 03 7月, 2021 1 次提交
-
-
由 Dietmar Eggemann 提交于
stable inclusion from stable-5.10.44 commit 190a7f908993cc7f5e8bbe67aa88eeb0bdbbbaa5 bugzilla: 109295 CVE: NA -------------------------------- commit 68d7a190 upstream. The util_est internal UTIL_AVG_UNCHANGED flag which is used to prevent unnecessary util_est updates uses the LSB of util_est.enqueued. It is exposed via _task_util_est() (and task_util_est()). Commit 92a801e5 ("sched/fair: Mask UTIL_AVG_UNCHANGED usages") mentions that the LSB is lost for util_est resolution but find_energy_efficient_cpu() checks if task_util_est() returns 0 to return prev_cpu early. _task_util_est() returns the max value of util_est.ewma and util_est.enqueued or'ed w/ UTIL_AVG_UNCHANGED. So task_util_est() returning the max of task_util() and _task_util_est() will never return 0 under the default SCHED_FEAT(UTIL_EST, true). To fix this use the MSB of util_est.enqueued instead and keep the flag util_est internal, i.e. don't export it via _task_util_est(). The maximal possible util_avg value for a task is 1024 so the MSB of 'unsigned int util_est.enqueued' isn't used to store a util value. As a caveat the code behind the util_est_se trace point has to filter UTIL_AVG_UNCHANGED to see the real util_est.enqueued value which should be easy to do. This also fixes an issue report by Xuewen Yan that util_est_update() only used UTIL_AVG_UNCHANGED for the subtrahend of the equation: last_enqueued_diff = ue.enqueued - (task_util() | UTIL_AVG_UNCHANGED) Fixes: b89997aa sched/pelt: Fix task util_est update filtering Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NXuewen Yan <xuewen.yan@unisoc.com> Reviewed-by: NVincent Donnefort <vincent.donnefort@arm.com> Reviewed-by: NVincent Guittot <vincent.guittot@linaro.org> Link: https://lore.kernel.org/r/20210602145808.1562603-1-dietmar.eggemann@arm.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NChen Jun <chenjun102@huawei.com> Acked-by: NWeilong Chen <chenweilong@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 09 4月, 2021 1 次提交
-
-
由 Yang Yingliang 提交于
hulk inclusion category: feature feature: ARM MPAM support bugzilla: 48265 CVE: NA -------------------------------- Build basic framework for mpam. Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com> Reviewed-by: NHanjun Guo <guohanjun@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Signed-off-by: NWang ShaoBo <bobo.shaobowang@huawei.com> Reviewed-by: NCheng Jian <cj.chengjian@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 07 1月, 2021 1 次提交
-
-
由 Yury Norov 提交于
maillist inclusion category: feature bugzilla: 46790 CVE: NA Reference: https://github.com/norov/linux/commits/ilp32-5.2 -------------------------------- Thread bits may be accessed from low-level code, so isolating is a measure to avoid circular dependencies in header files. The exact reason for circular dependency is WARN_ON() macro added in patch edd63a27 "set_restore_sigmask() is never called without SIGPENDING (and never should be)" Signed-off-by: NYury Norov <ynorov@caviumnetworks.com> Signed-off-by: NYury Norov <ynorov@marvell.com> Conflicts: include/linux/sched.h [wangxiongfeng: small context conflicts because of inclusion of header file] Signed-off-by: NXiongfeng Wang <wangxiongfeng2@huawei.com> Acked-by: NXie XiuQi <xiexiuqi@huawei.com> Signed-off-by: NChen Jun <chenjun102@huawei.com>
-
- 17 11月, 2020 2 次提交
-
-
由 Juri Lelli 提交于
Glenn reported that "an application [he developed produces] a BUG in deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested PTHREAD_PRIO_INHERIT mutexes. I believe the bug is triggered when a CFS task that was boosted by a SCHED_DEADLINE task boosts another CFS task (nested priority inheritance). ------------[ cut here ]------------ kernel BUG at kernel/sched/deadline.c:1462! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ... Hardware name: ... RIP: 0010:enqueue_task_dl+0x335/0x910 Code: ... RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002 RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500 RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600 RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078 R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600 R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600 FS: 00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: ? intel_pstate_update_util_hwp+0x13/0x170 rt_mutex_setprio+0x1cc/0x4b0 task_blocks_on_rt_mutex+0x225/0x260 rt_spin_lock_slowlock_locked+0xab/0x2d0 rt_spin_lock_slowlock+0x50/0x80 hrtimer_grab_expiry_lock+0x20/0x30 hrtimer_cancel+0x13/0x30 do_nanosleep+0xa0/0x150 hrtimer_nanosleep+0xe1/0x230 ? __hrtimer_init_sleeper+0x60/0x60 __x64_sys_nanosleep+0x8d/0xa0 do_syscall_64+0x4a/0x100 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fa58b52330d ... ---[ end trace 0000000000000002 ]— He also provided a simple reproducer creating the situation below: So the execution order of locking steps are the following (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2 are mutexes that are enabled * with priority inheritance.) Time moves forward as this timeline goes down: N1 N2 D1 | | | | | | Lock(M1) | | | | | | Lock(M2) | | | | | | Lock(M2) | | | | Lock(M1) | | (!!bug triggered!) | Daniel reported a similar situation as well, by just letting ksoftirqd run with DEADLINE (and eventually block on a mutex). Problem is that boosted entities (Priority Inheritance) use static DEADLINE parameters of the top priority waiter. However, there might be cases where top waiter could be a non-DEADLINE entity that is currently boosted by a DEADLINE entity from a different lock chain (i.e., nested priority chains involving entities of non-DEADLINE classes). In this case, top waiter static DEADLINE parameters could be null (initialized to 0 at fork()) and replenish_dl_entity() would hit a BUG(). Fix this by keeping track of the original donor and using its parameters when a task is boosted. Reported-by: NGlenn Elliott <glenn@aurora.tech> Reported-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: NJuri Lelli <juri.lelli@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Tested-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com
-
由 Peter Zijlstra 提交于
Mel reported that on some ARM64 platforms loadavg goes bananas and Will tracked it down to the following race: CPU0 CPU1 schedule() prev->sched_contributes_to_load = X; deactivate_task(prev); try_to_wake_up() if (p->on_rq &&) // false if (smp_load_acquire(&p->on_cpu) && // true ttwu_queue_wakelist()) p->sched_remote_wakeup = Y; smp_store_release(prev->on_cpu, 0); where both p->sched_contributes_to_load and p->sched_remote_wakeup are in the same word, and thus the stores X and Y race (and can clobber one another's data). Whereas prior to commit c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu") the p->on_cpu handoff serialized access to p->sched_remote_wakeup (just as it still does with p->sched_contributes_to_load) that commit broke that by calling ttwu_queue_wakelist() with p->on_cpu != 0. However, due to p->XXX = X ttwu() schedule() if (p->on_rq && ...) // false smp_mb__after_spinlock() if (smp_load_acquire(&p->on_cpu) && deactivate_task() ttwu_queue_wakelist()) p->on_rq = 0; p->sched_remote_wakeup = Y; We can be sure any 'current' store is complete and 'current' is guaranteed asleep. Therefore we can move p->sched_remote_wakeup into the current flags word. Note: while the observed failure was loadavg accounting gone wrong due to ttwu() cobbering p->sched_contributes_to_load, the reverse problem is also possible where schedule() clobbers p->sched_remote_wakeup, this could result in enqueue_entity() wrecking ->vruntime and causing scheduling artifacts. Fixes: c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu") Reported-by: NMel Gorman <mgorman@techsingularity.net> Debugged-by: NWill Deacon <will@kernel.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20201117083016.GK3121392@hirez.programming.kicks-ass.net
-
- 17 10月, 2020 1 次提交
-
-
由 Elena Petrova 提交于
in_ubsan field of task_struct is only used in lib/ubsan.c, which in its turn is used only `ifneq ($(CONFIG_UBSAN_TRAP),y)`. Removing unnecessary field from a task_struct will help preserve the ABI between vanilla and CONFIG_UBSAN_TRAP'ed kernels. In particular, this will help enabling bounds sanitizer transparently for Android's GKI. Signed-off-by: NElena Petrova <lenaptr@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NKees Cook <keescook@chromium.org> Cc: Jann Horn <jannh@google.com> Link: https://lkml.kernel.org/r/20200910134802.3160311-1-lenaptr@google.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 10月, 2020 1 次提交
-
-
由 Patricia Alfonso 提交于
Patch series "KASAN-KUnit Integration", v14. This patchset contains everything needed to integrate KASAN and KUnit. KUnit will be able to: (1) Fail tests when an unexpected KASAN error occurs (2) Pass tests when an expected KASAN error occurs Convert KASAN tests to KUnit with the exception of copy_user_test because KUnit is unable to test those. Add documentation on how to run the KASAN tests with KUnit and what to expect when running these tests. This patch (of 5): In order to integrate debugging tools like KASAN into the KUnit framework, add KUnit struct to the current task to keep track of the current KUnit test. Signed-off-by: NPatricia Alfonso <trishalfonso@google.com> Signed-off-by: NDavid Gow <davidgow@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Tested-by: NAndrey Konovalov <andreyknvl@google.com> Reviewed-by: NBrendan Higgins <brendanhiggins@google.com> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Shuah Khan <shuah@kernel.org> Link: https://lkml.kernel.org/r/20200915035828.570483-1-davidgow@google.com Link: https://lkml.kernel.org/r/20200915035828.570483-2-davidgow@google.com Link: https://lkml.kernel.org/r/20200910070331.3358048-1-davidgow@google.com Link: https://lkml.kernel.org/r/20200910070331.3358048-2-davidgow@google.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 10月, 2020 1 次提交
-
-
由 Tony Luck 提交于
Existing kernel code can only recover from a machine check on code that is tagged in the exception table with a fault handling recovery path. Add two new fields in the task structure to pass information from machine check handler to the "task_work" that is queued to run before the task returns to user mode: + mce_vaddr: will be initialized to the user virtual address of the fault in the case where the fault occurred in the kernel copying data from a user address. This is so that kill_me_maybe() can provide that information to the user SIGBUS handler. + mce_kflags: copy of the struct mce.kflags needed by kill_me_maybe() to determine if mce_vaddr is applicable to this error. Add code to recover from a machine check while copying data from user space to the kernel. Action for this case is the same as if the user touched the poison directly; unmap the page and send a SIGBUS to the task. Use a new helper function to share common code between the "fault in user mode" case and the "fault while copying from user" case. New code paths will be activated by the next patch which sets MCE_IN_KERNEL_COPYIN. Suggested-by: NBorislav Petkov <bp@alien8.de> Signed-off-by: NTony Luck <tony.luck@intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20201006210910.21062-6-tony.luck@intel.com
-
- 03 10月, 2020 1 次提交
-
-
由 Vincent Donnefort 提交于
rq->cpu_capacity is a key element in several scheduler parts, such as EAS task placement and load balancing. Tracking this value enables testing and/or debugging by a toolkit. Signed-off-by: NVincent Donnefort <vincent.donnefort@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1598605249-72651-1-git-send-email-vincent.donnefort@arm.com
-
- 01 10月, 2020 1 次提交
-
-
由 Jens Axboe 提交于
Grab actual references to the files_struct. To avoid circular references issues due to this, we add a per-task note that keeps track of what io_uring contexts a task has used. When the tasks execs or exits its assigned files, we cancel requests based on this tracking. With that, we can grab proper references to the files table, and no longer need to rely on stashing away ring_fd and ring_file to check if the ring_fd may have been closed. Cc: stable@vger.kernel.org # v5.5+ Reviewed-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 26 8月, 2020 2 次提交
-
-
The bits PF_IO_WORKER and PF_WQ_WORKER are tested together in sched_submit_work() which is considered to be a hot path. If the two bits cross the 8 or 16 bit boundary then most architecture require multiple load instructions in order to create the constant value. Also, such a value can not be encoded within the compare opcode. By moving the bit definition within the same block, the compiler can create/use one immediate value. For some reason gcc-10 on ARM64 requires both bits to be next to each other in order to issue "tst reg, val; bne label". Otherwise the result is "mov reg1, val; tst reg, reg1; bne label". Move PF_VCPU out of the way so that PF_IO_WORKER can be next to PF_WQ_WORKER. Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200819195505.y3fxk72sotnrkczi@linutronix.de
-
由 Marco Elver 提交于
is_idle_task() may be used from noinstr functions such as irqentry_enter(). Since the compiler is free to not inline regular inline functions, switch to using __always_inline. Signed-off-by: NMarco Elver <elver@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200820172046.GA177701@elver.google.com
-
- 06 8月, 2020 2 次提交
-
-
由 Thomas Gleixner 提交于
Running posix CPU timers in hard interrupt context has a few downsides: - For PREEMPT_RT it cannot work as the expiry code needs to take sighand lock, which is a 'sleeping spinlock' in RT. The original RT approach of offloading the posix CPU timer handling into a high priority thread was clumsy and provided no real benefit in general. - For fine grained accounting it's just wrong to run this in context of the timer interrupt because that way a process specific CPU time is accounted to the timer interrupt. - Long running timer interrupts caused by a large amount of expiring timers which can be created and armed by unpriviledged user space. There is no hard requirement to expire them in interrupt context. If the signal is targeted at the task itself then it won't be delivered before the task returns to user space anyway. If the signal is targeted at a supervisor process then it might be slightly delayed, but posix CPU timers are inaccurate anyway due to the fact that they are tied to the tick. Provide infrastructure to schedule task work which allows splitting the posix CPU timer code into a quick check in interrupt context and a thread context expiry and signal delivery function. This has to be enabled by architectures as it requires that the architecture specific KVM implementation handles pending task work before exiting to guest mode. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NIngo Molnar <mingo@kernel.org> Reviewed-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20200730102337.783470146@linutronix.de
-
由 Peter Zijlstra 提交于
By using lockdep_assert_*() from seqlock.h, the spaghetti monster attacked. Attack back by reducing seqlock.h dependencies from two key high level headers: - <linux/seqlock.h>: -Remove <linux/ww_mutex.h> - <linux/time.h>: -Remove <linux/seqlock.h> - <linux/sched.h>: +Add <linux/seqlock.h> The price was to add it to sched.h ... Core header fallout, we add direct header dependencies instead of gaining them parasitically from higher level headers: - <linux/dynamic_queue_limits.h>: +Add <asm/bug.h> - <linux/hrtimer.h>: +Add <linux/seqlock.h> - <linux/ktime.h>: +Add <asm/bug.h> - <linux/lockdep.h>: +Add <linux/smp.h> - <linux/sched.h>: +Add <linux/seqlock.h> - <linux/videodev2.h>: +Add <linux/kernel.h> Arch headers fallout: - PARISC: <asm/timex.h>: +Add <asm/special_insns.h> - SH: <asm/io.h>: +Add <asm/page.h> - SPARC: <asm/timer_64.h>: +Add <uapi/asm/asi.h> - SPARC: <asm/vvar.h>: +Add <asm/processor.h>, <asm/barrier.h> -Remove <linux/seqlock.h> - X86: <asm/fixmap.h>: +Add <asm/pgtable_types.h> -Remove <asm/acpi.h> There's also a bunch of parasitic header dependency fallout in .c files, not listed separately. [ mingo: Extended the changelog, split up & fixed the original patch. ] Co-developed-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NIngo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200804133438.GK2674@hirez.programming.kicks-ass.net
-
- 31 7月, 2020 2 次提交
-
-
由 Marco Elver 提交于
To improve the general usefulness of the IRQ state trace events with KCSAN enabled, save and restore the trace information when entering and exiting the KCSAN runtime as well as when generating a KCSAN report. Without this, reporting the IRQ trace events (whether via a KCSAN report or outside of KCSAN via a lockdep report) is rather useless due to continuously being touched by KCSAN. This is because if KCSAN is enabled, every instrumented memory access causes changes to IRQ trace events (either by KCSAN disabling/enabling interrupts or taking report_lock when generating a report). Before "lockdep: Prepare for NMI IRQ state tracking", KCSAN avoided touching the IRQ trace events via raw_local_irq_save/restore() and lockdep_off/on(). Fixes: 248591f5 ("kcsan: Make KCSAN compatible with new IRQ state tracking") Signed-off-by: NMarco Elver <elver@google.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200729110916.3920464-2-elver@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Marco Elver 提交于
Refactor the IRQ trace events fields, used for printing information about the IRQ trace events, into a separate struct 'irqtrace_events'. This improves readability by separating the information only used in reporting, as well as enables (simplified) storing/restoring of irqtrace_events snapshots. No functional change intended. Signed-off-by: NMarco Elver <elver@google.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20200729110916.3920464-1-elver@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 29 7月, 2020 2 次提交
-
-
由 Ahmed S. Darwish 提交于
A sequence counter write side critical section must be protected by some form of locking to serialize writers. A plain seqcount_t does not contain the information of which lock must be held when entering a write side critical section. Use the new seqcount_spinlock_t data type, which allows to associate a spinlock with the sequence counter. This enables lockdep to verify that the spinlock used for writer serialization is held when the write side critical section is entered. If lockdep is disabled this lock association is compiled out and has neither storage size nor runtime overhead. Signed-off-by: NAhmed S. Darwish <a.darwish@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200720155530.1173732-14-a.darwish@linutronix.de
-
由 Qais Yousef 提交于
RT tasks by default run at the highest capacity/performance level. When uclamp is selected this default behavior is retained by enforcing the requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum value. This is also referred to as 'the default boost value of RT tasks'. See commit 1a00d999 ("sched/uclamp: Set default clamps for RT tasks"). On battery powered devices, it is desired to control this default (currently hardcoded) behavior at runtime to reduce energy consumed by RT tasks. For example, a mobile device manufacturer where big.LITTLE architecture is dominant, the performance of the little cores varies across SoCs, and on high end ones the big cores could be too power hungry. Given the diversity of SoCs, the new knob allows manufactures to tune the best performance/power for RT tasks for the particular hardware they run on. They could opt to further tune the value when the user selects a different power saving mode or when the device is actively charging. The runtime aspect of it further helps in creating a single kernel image that can be run on multiple devices that require different tuning. Keep in mind that a lot of RT tasks in the system are created by the kernel. On Android for instance I can see over 50 RT tasks, only a handful of which created by the Android framework. To control the default behavior globally by system admins and device integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default to change the default boost value of the RT tasks. I anticipate this to be mostly in the form of modifying the init script of a particular device. To avoid polluting the fast path with unnecessary code, the approach taken is to synchronously do the update by traversing all the existing tasks in the system. This could race with a concurrent fork(), which is dealt with by introducing sched_post_fork() function which will ensure the racy fork will get the right update applied. Tested on Juno-r2 in combination with the RT capacity awareness [1]. By default an RT task will go to the highest capacity CPU and run at the maximum frequency, which is particularly energy inefficient on high end mobile devices because the biggest core[s] are 'huge' and power hungry. With this patch the RT task can be controlled to run anywhere by default, and doesn't cause the frequency to be maximum all the time. Yet any task that really needs to be boosted can easily escape this default behavior by modifying its requested uclamp.min value (p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall. [1] 804d402f: ("sched/rt: Make RT capacity-aware") Signed-off-by: NQais Yousef <qais.yousef@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com
-
- 28 7月, 2020 1 次提交
-
-
由 王文虎 提交于
Change the comment typo: "direcly" -> "directly". Signed-off-by: NWang Wenhu <wenhu.wang@vivo.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/AAcAXwBTDSpsKN-5iyIOtaqk.1.1595857191899.Hmail.wenhu.wang@vivo.com
-
- 22 7月, 2020 1 次提交
-
-
由 Peter Zijlstra 提交于
Dave hit the problem fixed by commit: b6e13e85 ("sched/core: Fix ttwu() race") and failed to understand much of the code involved. Per his request a few comments to (hopefully) clarify things. Requested-by: NDave Chinner <david@fromorbit.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200702125211.GQ4800@hirez.programming.kicks-ass.net
-
- 10 7月, 2020 1 次提交
-
-
由 Peter Zijlstra 提交于
Currently all IRQ-tracking state is in task_struct, this means that task_struct needs to be defined before we use it. Especially for lockdep_assert_irq*() this can lead to header-hell. Move the hardirq state into per-cpu variables to avoid the task_struct dependency. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NIngo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20200623083721.512673481@infradead.org
-
- 08 7月, 2020 3 次提交
-
-
由 Phil Auld 提交于
Add a bare tracepoint trace_sched_update_nr_running_tp which tracks ->nr_running CPU's rq. This is used to accurately trace this data and provide a visualization of scheduler imbalances in, for example, the form of a heat map. The tracepoint is accessed by loading an external kernel module. An example module (forked from Qais' module and including the pelt related tracepoints) can be found at: https://github.com/auldp/tracepoints-helpers.git A script to turn the trace-cmd report output into a heatmap plot can be found at: https://github.com/jirvoz/plot-nr-running The tracepoints are added to add_nr_running() and sub_nr_running() which are in kernel/sched/sched.h. In order to avoid CREATE_TRACE_POINTS in the header a wrapper call is used and the trace/events/sched.h include is moved before sched.h in kernel/sched/core. Signed-off-by: NPhil Auld <pauld@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200629192303.GC120228@lorien.usersys.redhat.com
-
由 Peter Zijlstra 提交于
The recent commit: c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu") moved these lines in ttwu(): p->sched_contributes_to_load = !!task_contributes_to_load(p); p->state = TASK_WAKING; up before: smp_cond_load_acquire(&p->on_cpu, !VAL); into the 'p->on_rq == 0' block, with the thinking that once we hit schedule() the current task cannot change it's ->state anymore. And while this is true, it is both incorrect and flawed. It is incorrect in that we need at least an ACQUIRE on 'p->on_rq == 0' to avoid weak hardware from re-ordering things for us. This can fairly easily be achieved by relying on the control-dependency already in place. The second problem, which makes the flaw in the original argument, is that while schedule() will not change prev->state, it will read it a number of times (arguably too many times since it's marked volatile). The previous condition 'p->on_cpu == 0' was sufficient because that indicates schedule() has completed, and will no longer read prev->state. So now the trick is to make this same true for the (much) earlier 'prev->on_rq == 0' case. Furthermore, in order to make the ordering stick, the 'prev->on_rq = 0' assignment needs to he a RELEASE, but adding additional ordering to schedule() is an unwelcome proposition at the best of times, doubly so for mere accounting. Luckily we can push the prev->state load up before rq->lock, with the only caveat that we then have to re-read the state after. However, we know that if it changed, we no longer have to worry about the blocking path. This gives us the required ordering, if we block, we did the prev->state load before an (effective) smp_mb() and the p->on_rq store needs not change. With this we end up with the effective ordering: LOAD p->state LOAD-ACQUIRE p->on_rq == 0 MB STORE p->on_rq, 0 STORE p->state, TASK_WAKING which ensures the TASK_WAKING store happens after the prev->state load, and all is well again. Fixes: c6e7bd7a ("sched/core: Optimize ttwu() spinning on p->on_cpu") Reported-by: NDave Jones <davej@codemonkey.org.uk> Reported-by: NPaul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Tested-by: NDave Jones <davej@codemonkey.org.uk> Tested-by: NPaul Gortmaker <paul.gortmaker@windriver.com> Link: https://lkml.kernel.org/r/20200707102957.GN117543@hirez.programming.kicks-ass.net
-
由 Eric W. Biederman 提交于
The bpfilter code no longer uses the umd_info.cleanup callback. This callback is what exit_umh exists to call. So remove exit_umh and all of it's associated booking. v1: https://lkml.kernel.org/r/87bll6dlte.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/87y2o53abg.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-15-ebiederm@xmission.comReviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: NAlexei Starovoitov <ast@kernel.org> Tested-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-
- 04 7月, 2020 1 次提交
-
-
由 Eric W. Biederman 提交于
This makes it clear which code is part of the core user mode helper support and which code is needed to implement user mode drivers. This makes the kernel smaller for everyone who does not use a usermode driver. v1: https://lkml.kernel.org/r/87tuyyf0ln.fsf_-_@x220.int.ebiederm.org v2: https://lkml.kernel.org/r/87imf963s6.fsf_-_@x220.int.ebiederm.org Link: https://lkml.kernel.org/r/20200702164140.4468-5-ebiederm@xmission.comReviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: NAlexei Starovoitov <ast@kernel.org> Tested-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
-