- 30 Nov 2022, 40 commits
-
Submitted by Luo Meng

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5WBID
CVE: NA

--------------------------------

When dm_resume() and dm_destroy() run concurrently, a use-after-free (UAF)
can occur. One such race is shown below:

use                                     free
do_resume                             |
  __find_device_hash_cell             |
    dm_get                            |
      atomic_inc(&md->holders)        |
                                      | dm_destroy
                                      |   __dm_destroy
                                      |     if (!dm_suspended_md(md))
                                      |       atomic_read(&md->holders)
                                      |       msleep(1)
  dm_resume                           |
    __dm_resume                       |
      dm_table_resume_targets         |
        pool_resume                   |
          do_waker  # add delayed work|
                                      |     dm_table_destroy
                                      |       pool_dtr
                                      |         __pool_dec
                                      |           __pool_destroy
                                      |             destroy_workqueue
                                      |             kfree(pool)  # free pool
time out                              |
__do_softirq                          |
  run_timer_softirq  # pool has already been freed

This can be easily reproduced using:
  1. create thin-pool
  2. dmsetup suspend pool
  3. dmsetup resume pool
  4. dmsetup remove_all  # concurrent with 3

The root cause of the UAF is that dm_resume() arms the timer after
dm_destroy() has skipped cancelling it because of the suspend status.
After the timeout, run_timer_softirq() runs, but the pool has already been
freed. Therefore, move the timer cancellation to after md->holders drops
to zero.

Signed-off-by: Luo Meng <luomeng12@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
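For illustration, a minimal sketch of the idea behind the fix. The names
follow drivers/md/dm.c, but the exact hunk is assumed from the description
above, not quoted from the patch:

    static void __dm_destroy(struct mapped_device *md, bool wait)
    {
        struct dm_table *map;
        int srcu_idx;

        /* ... */

        /* wait until the last holder (e.g. a concurrent do_resume()) is gone */
        while (atomic_read(&md->holders))
            msleep(1);

        /*
         * Only now tear the targets down: a dm_resume() that held a
         * reference can no longer re-arm the pool's delayed worker, so
         * dm_table_destroy() cannot free the pool while a timer is pending.
         */
        map = dm_get_live_table(md, &srcu_idx);
        if (map)
            dm_table_postsuspend_targets(map);  /* cancels the waker */
        dm_put_live_table(md, srcu_idx);

        dm_table_destroy(map);  /* __pool_destroy() -> kfree(pool) is now safe */
        /* ... */
    }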
-
Submitted by Zheng Yejian

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I60N44
CVE: NA

--------------------------------

A misspelling of 'CONFIG_PREEMPTION' may cause the old function to escape
the stack check, with the result that a function that is still running can
be livepatched.

Fixes: 20106abf ("livepatch: Check whole stack when CONFIG_PREEMPT is set")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
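To illustrate the failure mode (the helper name here is hypothetical; the
point is the silent behavior of a mistyped Kconfig guard):

    /*
     * With the symbol mistyped, e.g. "#ifdef CONFIG_PREEMPTION_X", the
     * preprocessor quietly treats the guard as undefined and the
     * whole-stack check is compiled out; no warning is emitted.
     */
    #ifdef CONFIG_PREEMPTION        /* must match the Kconfig symbol exactly */
        ret = klp_check_stack(task, klp_funcs);  /* hypothetical stack walk */
        if (ret)
            return ret;  /* a to-be-patched function is still on some stack */
    #endif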
-
Submitted by David Vernet

mainline inclusion
from mainline-v5.17-rc1
commit f5bdb34b
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60MYE
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f5bdb34bf0c9314548f2d8e2360b703ff3610303

--------------------------------

When initializing a 'struct klp_object' in klp_init_object_loaded(), and
when performing relocations in klp_resolve_symbols(),
klp_find_object_symbol() is invoked to look up the address of a symbol in
an already-loaded module (or vmlinux). This, in turn, calls
kallsyms_on_each_symbol() or module_kallsyms_on_each_symbol() to find the
address of the symbol that is being patched.

It turns out that symbol lookups often take up the most CPU time when
enabling and disabling a patch, and may hog the CPU and cause other tasks
on that CPU's runqueue to starve -- even in paths where interrupts are
enabled. For example, under certain workloads, enabling a KLP patch with
many objects or functions may cause ksoftirqd to be starved, and thus for
interrupts to be backlogged and delayed. This may end up causing TCP
retransmits on the host where the KLP patch is being applied, and in
general, may cause any interrupts serviced by softirqd to be delayed while
the patch is being applied.

So as to ensure that kallsyms_on_each_symbol() does not end up hogging the
CPU, this patch adds a call to cond_resched() in kallsyms_on_each_symbol()
and module_kallsyms_on_each_symbol(), which are invoked when doing a
symbol lookup in vmlinux and a module respectively. Without this patch, if
a live-patch is applied on a 36-core Intel host with heavy TCP traffic, a
~10x spike is observed in TCP retransmits while the patch is being
applied. Additionally, collecting sched events with perf indicates that
ksoftirqd is awakened ~1.3 seconds before it's eventually scheduled. With
the patch, no increase in TCP retransmit events is observed, and ksoftirqd
is scheduled shortly after it's awakened.

Signed-off-by: David Vernet <void@manifault.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20211229215646.830451-1-void@manifault.com
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
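The vmlinux side of the change is small; reconstructed from the commit
description, this is essentially what the loop in kernel/kallsyms.c looks
like with the reschedule point added:

    int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *,
                                          unsigned long),
                                void *data)
    {
        char namebuf[KSYM_NAME_LEN];
        unsigned long i;
        unsigned int off;
        int ret;

        for (i = 0, off = 0; i < kallsyms_num_syms; i++) {
            off = kallsyms_expand_symbol(off, namebuf, ARRAY_SIZE(namebuf));
            ret = fn(data, namebuf, NULL, kallsyms_sym_address(i));
            if (ret != 0)
                return ret;
            cond_resched();  /* the added line: yield so ksoftirqd can run */
        }
        return 0;
    }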
-
Submitted by Zheng Yejian

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60MKD
CVE: NA

--------------------------------

Fix several code style issues:
 - Do not use magic numbers (the number in question is 10).
 - Do not use parentheses when printing numbers.
 - Braces {} are not necessary for single-statement blocks.
 - Do not add blank lines at the start of a code block defined by braces.

Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Zheng Yejian

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60L10
CVE: NA

--------------------------------

static_call and static_key allow instructions at a call site to be
modified at runtime; the related configs are
CONFIG_HAVE_STATIC_CALL_INLINE for static call and CONFIG_JUMP_LABEL for
static key. When such a site lies in the first several instructions of an
old function, the very bytes livepatch would also modify, the two
mechanisms conflict. To avoid the conflict, refuse to insert a livepatch
module that patches such a function.

Fixes: c33e4283 ("livepatch/core: Allow implementation without ftrace")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
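A hedged sketch of what such a rejection check could look like; the helper
name and the jump-table walk are assumptions based on the description, not
the actual patch:

    static bool klp_static_site_in_prologue(unsigned long old_func)
    {
        struct jump_entry *entry;

        /* is a static key patched within the bytes livepatch overwrites? */
        for (entry = __start___jump_table; entry < __stop___jump_table; entry++) {
            unsigned long code = jump_entry_code(entry);

            if (code >= old_func && code < old_func + KLP_MAX_REPLACE_SIZE)
                return true;  /* reject: both would rewrite these bytes */
        }
        /* an analogous walk over the .static_call_sites section is implied */
        return false;
    }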
-
Submitted by Zheng Yejian

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60L10
CVE: NA

--------------------------------

If a function is patched, the instructions at its beginning are replaced
with jump code that jumps to the new function. This requires the function
to be big enough, otherwise the modification may run past the end of the
function. Currently each architecture implements arch_klp_func_can_patch()
to check the function size, but this has several problems:
 1. arch 'x86' didn't implement arch_klp_func_can_patch();
 2. in the arm64 & ppc32 implementations, the function size is checked
    only if a long jump is needed. So a very short function can be patched
    successfully, but as kernel modules grow, a long jump may one day be
    required, and the function then becomes unpatchable;
 3. the implementations are largely duplicated.

In this patch, introduce the macro KLP_MAX_REPLACE_SIZE to denote the
maximum size that is replaced on patching, and move the check ahead into
klp_init_object_loaded().

Fixes: c33e4283 ("livepatch/core: Allow implementation without ftrace")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
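A sketch of the centralized check implied above, assuming the constant and
a helper called from klp_init_object_loaded() (the real hunk may differ):

    /* arch-provided: longest byte sequence livepatch writes at function entry */
    #ifndef KLP_MAX_REPLACE_SIZE
    #define KLP_MAX_REPLACE_SIZE 16
    #endif

    static int klp_check_func_size(struct klp_func *func)
    {
        unsigned long size = 0, offset = 0;

        if (!kallsyms_lookup_size_offset((unsigned long)func->old_func,
                                         &size, &offset))
            return -ENOENT;

        if (size < KLP_MAX_REPLACE_SIZE) {
            pr_err("%s size (%lu) is smaller than the replace size (%d)\n",
                   func->old_name, size, KLP_MAX_REPLACE_SIZE);
            return -EINVAL;  /* the jump code would overrun the function */
        }
        return 0;
    }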
-
Submitted by Zheng Yejian

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60L10
CVE: NA

--------------------------------

In arm/arm64/ppc32/ppc64 this field is named old_insns, so unify the
naming.

Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Zheng Yejian

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60L10
CVE: NA

--------------------------------

It was reported that if 'static_call' is used in an old function, the
livepatch module created by kpatch for that function cannot be inserted
normally. The root cause is that the relocation of static_call symbols in
the livepatch module has not yet been done at init time:

load_module
  prepare_coming_module
    blocking_notifier_call_chain_robust
      notifier_call_chain_robust
        static_call_module_notify   <-- 1. static_call symbols are
                                          initialized here, but relocation
                                          is done below at mark "2."
  do_init_module
    do_one_initcall
      klp_register_patch
        klp_init_patch
          klp_init_object
            klp_init_object_loaded  <-- 2. .klp.xxx relocations applied here

To solve this, move the static_call initialization after the relocation.

Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Luo Meng

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5P05D
CVE: NA

--------------------------------

When a thin-pool is suspended with fail_io set, resume reports an error as
below:

  device-mapper: resume ioctl on vg-thinpool failed: Invalid argument

The thin-pool also can't be removed if a bio is in the deferred list.

This can be easily reproduced using:
  echo "offline" > /sys/block/sda/device/state
  dd if=/dev/zero of=/dev/mapper/thin bs=4K count=1
  dmsetup suspend /dev/mapper/pool
  mkfs.ext4 /dev/mapper/thin
  dmsetup resume /dev/mapper/pool

The root cause is that maybe_resize_data_dev() checks fail_io and returns
an error before dm_resume() is called. Fix this by adding a FAIL-mode
check at the end of pool_preresume().

Fixes: da105ed5 (dm thin metadata: introduce dm_pool_abort_metadata)
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
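A sketch of the described shape of the fix in drivers/md/dm-thin.c; the
out label and exact placement are assumptions based on the text above:

    static int pool_preresume(struct dm_target *ti)
    {
        int r;
        bool need_commit1, need_commit2;
        struct pool_c *pt = ti->private;
        struct pool *pool = pt->pool;

        /* ... */
        r = maybe_resize_data_dev(ti, &need_commit1);
        if (r)
            goto out;

        r = maybe_resize_metadata_dev(ti, &need_commit2);
        if (r)
            goto out;

        if (need_commit1 || need_commit2)
            (void) commit(pool);
    out:
        /*
         * When the pool is already in FAIL mode, don't block the resume:
         * report success so dm_resume() can proceed and deferred bios are
         * failed by the FAIL-mode I/O path instead of pinning the device.
         */
        if (r && get_pool_mode(pool) == PM_FAIL)
            r = 0;
        return r;
    }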
-
Submitted by Luo Meng

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I62762
CVE: NA

--------------------------------

A crash occurs as follows:

  BUG: KASAN: null-ptr-deref in dev_create.cold+0x12/0x77
  Read of size 8 at addr 0000000000000020 by task dmsetup/683

  CPU: 4 PID: 683 Comm: dmsetup Not tainted 5.10.0-01524-g884de6e91114-dirty #11
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
  Call Trace:
   ? dump_stack+0xdd/0x126
   ? kasan_report.cold+0xd1/0xdb
   ? dev_create.cold+0x12/0x77
   ? __asan_load8+0xae/0x110
   ? dev_create.cold+0x12/0x77
   ? dev_rename+0x720/0x720
   ? cap_capable+0xcf/0x130
   ? ctl_ioctl+0x2f5/0x750
   ? dev_rename+0x720/0x720
   ? free_params+0x50/0x50
   ? unmerge_queues+0x176/0x1b0
   ? __blkcg_punt_bio_submit+0x110/0x110
   ? mem_cgroup_handle_over_high+0x33/0x5e0
   ? dm_ctl_ioctl+0x12/0x20
   ? __se_sys_ioctl+0xc5/0x120
   ? __x64_sys_ioctl+0x46/0x60
   ? do_syscall_64+0x45/0x70
   ? entry_SYSCALL_64_after_hwframe+0x61/0xc6

This can be easily reproduced using:
  dmsetup create xxx --table "0 1000 linear /dev/sda 0"
  dmsetup remove xxx

Fix this by taking the hash lock in dev_create().

Fixes: a5100d07 ("dm ioctl: add DMINFO() to track dm device create/remove")
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
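A hedged sketch in drivers/md/dm-ioctl.c terms; the exact message and the
placement of the DMINFO() introduced by the Fixes commit are assumed:

    static int dev_create(struct file *filp, struct dm_ioctl *param,
                          size_t param_size)
    {
        int r;
        struct mapped_device *md;

        /* ... create the device, populate *param ... */

        /*
         * Hold the ioctl hash lock around the tracking message so a
         * concurrent remove can't tear the hash cell down while the
         * device name is being dereferenced for printing.
         */
        down_read(&_hash_lock);
        DMINFO("%s: device %s is created", __func__, dm_device_name(md));
        up_read(&_hash_lock);

        dm_put(md);
        return r;
    }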
-
Submitted by Vladimir Murzin

mainline inclusion
from mainline-v5.16-rc7
commit 7202216a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I634EK
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7202216a6f34d571a22274e729f841256bf8b1ef

--------------------------------

__secondary_data used to reside in r7 around the call to
PROCINFO_INITFUNC. After commit 95731b8e ("ARM: 9059/1: cache-v7: get rid
of mini-stack"), r7 is used as a scratch register, so we have to reload
__secondary_data before we set up the stack pointer.

conflict: arch/arm/kernel/head-nommu.S

Fixes: 95731b8e ("ARM: 9059/1: cache-v7: get rid of mini-stack")
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Ard Biesheuvel

mainline inclusion
from mainline-v5.13-rc1
commit 95731b8e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I634EK
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=95731b8ee63ec9419822a51cd9878fa32582fdd2

--------------------------------

Now that we have reduced the number of registers that we need to preserve
when calling v7_invalidate_l1 from the boot code, we can use scratch
registers to preserve the remaining ones, and get rid of the mini stack
entirely. This works around any issues regarding cache behavior in
relation to the uncached accesses to this memory, which is hard to get
right in the general case (i.e., both bare metal and under
virtualization).

While at it, switch v7_invalidate_l1 to using ip as a scratch register
instead of r4. This makes the function AAPCS compliant, and removes the
need to stash r4 in ip across the call.

conflict: arch/arm/include/asm/memory.h

Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Ard Biesheuvel

mainline inclusion
from mainline-v5.13-rc1
commit f9e7a99f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I634EK
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f9e7a99fb6b86aa6a00e53b34ee6973840e005aa

--------------------------------

The cache invalidation code in v7_invalidate_l1 can be tweaked to re-read
the associativity from CCSIDR, and keep the way identifier component in a
single register that is assigned in the outer loop. This way, we need two
registers less.

Given that the number of sets is typically much larger than the
associativity, rearrange the code so that the outer loop has the fewer
number of iterations, ensuring that the re-read of CCSIDR only occurs a
handful of times in practice.

Fix the whitespace while at it, and update the comment to indicate that
this code is no longer a clone of anything else.

Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Zhang Jianhua <chris.zjh@huawei.com>
Reviewed-by: Liao Chang <liaochang1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Quan Zhou

virt inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62Q2L
CVE: NA

----------------------------------------------------

Implement the capability of DVMBM. Before each vcpu is loaded, we
re-calculate the VM-wide dvm_cpumask; if it has changed, we kick all
other vcpus out to reload the latest LSUDVMBM value into the register,
and a new request, KVM_REQ_RELOAD_DVMBM, is added to implement this. If
the dvm_cpumask was not changed by this single vcpu, we still reload the
LSUDVMBM value into the register (to guarantee the register contents are
correct), and nothing else needs to be done.

Signed-off-by: Quan Zhou <zhouquan65@huawei.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Nianyao Tang <tangnianyao@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
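A hedged sketch of the vcpu_load-time flow described above. Besides
standard KVM helpers (kvm_for_each_vcpu, kvm_make_all_cpus_request), all
names come from the commit text or are assumptions (write_lsudvmbm() in
particular is hypothetical):

    static void kvm_tlbi_dvmbm_vcpu_load(struct kvm_vcpu *vcpu)
    {
        struct kvm *kvm = vcpu->kvm;
        struct kvm_vcpu *tmp;
        cpumask_t mask;
        int i;

        spin_lock(&kvm->arch.dvm_lock);

        cpumask_clear(&mask);
        kvm_for_each_vcpu(i, tmp, kvm)  /* union of all vcpus' cpus_ptr */
            cpumask_or(&mask, &mask, tmp->arch.cpus_ptr);

        if (!cpumask_equal(&mask, kvm->arch.dvm_cpumask)) {
            cpumask_copy(kvm->arch.dvm_cpumask, &mask);
            /* make every other vcpu reload LSUDVMBM on its next entry */
            kvm_make_all_cpus_request(kvm, KVM_REQ_RELOAD_DVMBM);
        }

        write_lsudvmbm(kvm);  /* always refresh this pcpu's register */
        spin_unlock(&kvm->arch.dvm_lock);
    }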
-
Submitted by Quan Zhou

virt inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62Q2L
CVE: NA

----------------------------------------------------

Introduce dvm_cpumask and dvm_lock in struct kvm_arch. dvm_cpumask stores
the union of all vcpus' cpus_ptr and is used as the TLBI broadcast range;
dvm_lock ensures exclusive manipulation of dvm_cpumask. In vcpu_load, we
decide whether to perform the subsequent update by checking whether
dvm_cpumask has changed.

Signed-off-by: Quan Zhou <zhouquan65@huawei.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Nianyao Tang <tangnianyao@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Quan Zhou

virt inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62Q2L
CVE: NA

----------------------------------------------------

We already have cpus_ptr in the current thread struct, through which we
can know the pcpu range the thread is allowed to run on. So in
kvm_arch_vcpu_{load,put} we can also know the pcpu range the vcpu thread
is allowed to be scheduled on, and that is the range we want to configure
for TLBI broadcast.

Introduce two variables, cpus_ptr and pre_cpus_ptr, in struct
kvm_vcpu_arch. @cpus_ptr always comes from current->cpus_ptr, and
@pre_cpus_ptr always comes from @cpus_ptr.

Signed-off-by: Quan Zhou <zhouquan65@huawei.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Nianyao Tang <tangnianyao@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Quan Zhou

virt inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62Q2L
CVE: NA

----------------------------------------------------

DVMBM is a virtualization extension since HIP09 that allows TLBIs executed
at NS EL1 to be broadcast to a configurable range of physical CPUs (even
with HCR_EL2.FB set), which enables a TLBI broadcast optimization.

Introduce the method to detect and enable this feature. Also add a kernel
command-line parameter, "kvm-arm.dvmbm_enabled" (default 0), so that users
can enable or disable DVMBM as needed. The parameter description is added
under Documentation/.

Signed-off-by: Quan Zhou <zhouquan65@huawei.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Nianyao Tang <tangnianyao@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
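A minimal sketch of the command-line switch; the parameter name comes from
the commit text, while the parsing code and cpu_supports_dvmbm() detection
helper are assumptions:

    static bool kvm_dvmbm_enabled;  /* default: disabled */

    static int __init early_dvmbm_enabled(char *buf)
    {
        return strtobool(buf, &kvm_dvmbm_enabled);
    }
    early_param("kvm-arm.dvmbm_enabled", early_dvmbm_enabled);

    bool kvm_dvmbm_supported(void)
    {
        /* usable only when the user enabled it AND the CPU is HIP09+ */
        return kvm_dvmbm_enabled && cpu_supports_dvmbm();
    }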
-
Submitted by Quan Zhou

virt inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62Q2L
CVE: NA

----------------------------------------------------

Add a new entry ("HIP09") in oem_str[] to support detection of the new
HiSilicon CPU type.

Signed-off-by: Quan Zhou <zhouquan65@huawei.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Nianyao Tang <tangnianyao@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Liu Shixin

stable inclusion
from stable-v5.10.150
commit 45c33966759ea1b4040c08dacda99ef623c0ca29
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I62WRY
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=45c33966759ea1b4040c08dacda99ef623c0ca29

--------------------------------

commit 958f32ce upstream.

The vma_lock and hugetlb_fault_mutex are dropped before handling a
userfault and reacquired after handle_userfault(), but reacquiring the
vma_lock can lead to a UAF [1,2] due to the following race:

hugetlb_fault
  hugetlb_no_page
    /* unlock vma_lock */
    hugetlb_handle_userfault
      handle_userfault
        /* unlock mm->mmap_lock */
                                        vm_mmap_pgoff
                                          do_mmap
                                            mmap_region
                                              munmap_vma_range
                                              /* clean old vma */
    /* lock vma_lock again  <--- UAF */
    /* unlock vma_lock */

Since the vma_lock is unlocked immediately after
hugetlb_handle_userfault() returns, drop the unneeded lock and unlock in
hugetlb_handle_userfault() to fix the issue.

[1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
[2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.com/

Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
Fixes: 1a1aad8a ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reported-by: syzbot+193f9cee8638750b23cf@syzkaller.appspotmail.com
Reported-by: Liu Zixian <liuzixian4@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: <stable@vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Conflicts:
	mm/hugetlb.c

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
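A sketch of the shape of the fix, based on the upstream 958f32ce helper
(details of the 5.10 backport differ because of the Conflicts note): the
fault mutex is handed to hugetlb_handle_userfault(), which drops it and
never reacquires it, since the vma may be unmapped underneath us:

    static vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
                                               struct address_space *mapping,
                                               pgoff_t idx, unsigned int flags,
                                               unsigned long haddr,
                                               unsigned long reason)
    {
        u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
        struct vm_fault vmf = {
            .vma = vma,
            .address = haddr,
            .flags = flags,
            .pgoff = idx,
        };

        /* drop and do NOT retake: handle_userfault() releases mmap_lock,
         * so the vma can be gone when we come back */
        mutex_unlock(&hugetlb_fault_mutex_table[hash]);
        return handle_userfault(&vmf, reason);
    }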
-
Submitted by Yuanzheng Song

stable inclusion
from stable-v5.10.153
commit 935a8b6202101d7f58fe9cd11287f9cec0d8dd32
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5XS4G
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=935a8b6202101d7f58fe9cd11287f9cec0d8dd32

--------------------------------

The vma->anon_vma of the child process may be NULL because the entire vma
does not contain anonymous pages. In this case, a BUG will occur when
copy_present_page() passes a copy of a non-anonymous page of that vma to
page_add_new_anon_rmap() to set up a new anonymous rmap:

  ------------[ cut here ]------------
  kernel BUG at mm/rmap.c:1044!
  Internal error: Oops - BUG: 0 [#1] SMP
  Modules linked in:
  CPU: 2 PID: 3617 Comm: test Not tainted 5.10.149 #1
  Hardware name: linux,dummy-virt (DT)
  pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
  pc : __page_set_anon_rmap+0xbc/0xf8
  lr : __page_set_anon_rmap+0xbc/0xf8
  sp : ffff800014c1b870
  x29: ffff800014c1b870 x28: 0000000000000001
  x27: 0000000010100073 x26: ffff1d65c517baa8
  x25: ffff1d65cab0f000 x24: ffff1d65c416d800
  x23: ffff1d65cab5f248 x22: 0000000020000000
  x21: 0000000000000001 x20: 0000000000000000
  x19: fffffe75970023c0 x18: 0000000000000000
  x17: 0000000000000000 x16: 0000000000000000
  x15: 0000000000000000 x14: 0000000000000000
  x13: 0000000000000000 x12: 0000000000000000
  x11: 0000000000000000 x10: 0000000000000000
  x9 : ffffc3096d5fb858 x8 : 0000000000000000
  x7 : 0000000000000011 x6 : ffff5a5c9089c000
  x5 : 0000000000020000 x4 : ffff5a5c9089c000
  x3 : ffffc3096d200000 x2 : ffffc3096e8d0000
  x1 : ffff1d65ca3da740 x0 : 0000000000000000
  Call trace:
   __page_set_anon_rmap+0xbc/0xf8
   page_add_new_anon_rmap+0x1e0/0x390
   copy_pte_range+0xd00/0x1248
   copy_page_range+0x39c/0x620
   dup_mmap+0x2e0/0x5a8
   dup_mm+0x78/0x140
   copy_process+0x918/0x1a20
   kernel_clone+0xac/0x638
   __do_sys_clone+0x78/0xb0
   __arm64_sys_clone+0x30/0x40
   el0_svc_common.constprop.0+0xb0/0x308
   do_el0_svc+0x48/0xb8
   el0_svc+0x24/0x38
   el0_sync_handler+0x160/0x168
   el0_sync+0x180/0x1c0
  Code: 97f8ff85 f9400294 17ffffeb 97f8ff82 (d4210000)
  ---[ end trace a972347688dc9bd4 ]---
  Kernel panic - not syncing: Oops - BUG: Fatal exception
  SMP: stopping secondary CPUs
  Kernel Offset: 0x43095d200000 from 0xffff800010000000
  PHYS_OFFSET: 0xffffe29a80000000
  CPU features: 0x08200022,61806082
  Memory Limit: none
  ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---

This problem was fixed by commit <fb3d824d> ("mm/rmap: split
page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()"),
but it still exists in the linux-5.10.y branch. That patch is not
applicable to this version because of the large version differences.
Therefore, fix it by adding a non-anonymous page check in
copy_present_page().

Cc: stable@vger.kernel.org
Fixes: 70e806e4 ("mm: Do early cow for pinned pages during fork() for ptes")
Signed-off-by: Yuanzheng Song <songyuanzheng@huawei.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yuanzheng Song <songyuanzheng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
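A sketch of the added guard in mm/memory.c (the surrounding 5.10
copy_present_page() context is reconstructed, not quoted from the patch):

    static inline int
    copy_present_page(struct vm_area_struct *dst_vma,
                      struct vm_area_struct *src_vma,
                      pte_t *dst_pte, pte_t *src_pte, unsigned long addr,
                      int *rss, struct page **prealloc, pte_t pte,
                      struct page *page)
    {
        /* ... */
        if (likely(!page_maybe_dma_pinned(page)))
            return 1;  /* share the page as usual */

        /*
         * The new check: a non-anonymous page must not go down the
         * early-CoW path, because the child's vma->anon_vma may be NULL
         * and page_add_new_anon_rmap() would BUG on it.
         */
        if (!PageAnon(page))
            return 1;

        /* ... early-CoW copy, then page_add_new_anon_rmap(new_page, ...) */
    }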
-
Submitted by liaoguojia

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62HX2

----------------------------------------------------------------------

On HNAE3_DEVICE_VERSION_V2, the TCAM table entries of the FD are obtained
by traversing the list recorded by the driver. HNAE3_DEVICE_VERSION_V3
supports a new FD usage mode, called queue bond mode, in which the
hardware creates rules automatically and the driver does not record the
flow table entries. So we need to check the validity of each entry by
traversing the entire hardware table in order to dump the QB TCAM table.

Signed-off-by: liaoguojia <liaoguojia@huawei.com>
Signed-off-by: Jiantao Xiao <xiaojiantao1@h-partners.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Hao Chen

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62HX2

----------------------------------------------------------------------

When the serdes lanes support 25Gb/s and 50Gb/s speeds and the user wants
to set the port speed to 50Gb/s, it can be configured as either one
50Gb/s serdes lane or two 25Gb/s serdes lanes. So this patch adds support
for querying and setting the lane number via sysfs to cover this scenario.

Signed-off-by: Hao Chen <chenhao418@huawei.com>
Signed-off-by: Jiantao Xiao <xiaojiantao1@h-partners.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Jian Shen

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62HX2

----------------------------------------------------------------------

Because the FD rules for queue bonding are created by hardware
automatically, the driver needs to specify an FD counter for each
function; it is then possible to query how many times the queue bonding
FD rules have been hit.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Jiantao Xiao <xiaojiantao1@h-partners.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Jian Shen

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62HX2

----------------------------------------------------------------------

Device version V3 supports queue bonding mode in hardware. A VF cannot
enable queue bonding mode unless the PF enables it, so the VF needs to
query whether the PF supports queue bonding mode during initialization,
and to query periodically whether the PF has enabled it. Since resources
are limited, to keep one VF from occupying too much FD rule space, only
trusted VFs are allowed to enable queue bonding mode.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Jiantao Xiao <xiaojiantao1@h-partners.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Jian Shen

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62HX2

----------------------------------------------------------------------

Device version V3 supports queue bonding: the hardware can identify the
tuple information of a TCP stream and create flow director rules
automatically, in order to keep the tx and rx packets of the stream in
the same queue pair. The driver sets the FD_ADD field of the TX BD for a
TCP SYN packet and the FD_DEL field for a TCP FIN or RST packet. The
hardware creates or removes an FD rule according to the TX BD, and it
also supports aging out a rule that has not been hit for a long time.
Queue bonding mode is disabled by default and can be enabled/disabled
with the ethtool priv-flags command.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Jiantao Xiao <xiaojiantao1@h-partners.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
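A hedged sketch of the TX-path marking; the BD field and flag names
(fd_op, HNS3_FD_OP_ADD/DEL) are assumptions based on the description, not
the real descriptor layout:

    static void hns3_tx_set_fd_op(struct sk_buff *skb, struct hns3_desc *desc)
    {
        struct tcphdr *th;

        /* IPv4/TCP only in this sketch */
        if (skb->protocol != htons(ETH_P_IP) ||
            ip_hdr(skb)->protocol != IPPROTO_TCP)
            return;

        th = tcp_hdr(skb);
        if (th->syn)
            desc->tx.fd_op = cpu_to_le16(HNS3_FD_OP_ADD);  /* hw creates rule */
        else if (th->fin || th->rst)
            desc->tx.fd_op = cpu_to_le16(HNS3_FD_OP_DEL);  /* hw removes rule */
    }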
-
Submitted by Jian Shen

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62HX2

----------------------------------------------------------------------

Currently the PF checks that a VF is alive by the KEEP_ALIVE mailbox from
the VF, which the VF sends every 2 seconds. Once the PF has lost the
mailbox for more than 8 seconds, it regards the VF as abnormal and stops
notifying state changes to the VF, including link state, VF MAC and
reset, even if it receives the KEEP_ALIVE mailbox again. This is
unreasonable.

This patch fixes it: the PF records the state changes the VF needs to be
notified of while the VF's KEEP_ALIVE mailbox is lost, and notifies the
VF when it receives the mailbox again. Introduce a new flag,
HCLGE_VPORT_STATE_INITED, used to distinguish whether the VF driver is
loaded or not; the VF queries these states when initializing, so it is
unnecessary to notify it in that case.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Jiantao Xiao <xiaojiantao1@h-partners.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by GUO Zihua

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I62DVN
CVE: NA

--------------------------------

Syzkaller reported a UAF in mpi_key_length():

  BUG: KASAN: use-after-free in mpi_key_length+0x34/0xb0
  Read of size 2 at addr ffff888005737e14 by task syz-executor.15/6236

  CPU: 1 PID: 6236 Comm: syz-executor.15 Kdump: loaded Tainted: GF OE 5.10.0.kasan.x86_64 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-20220525_182517-szxrtosci10000 04/01/2014
  Call Trace:
   dump_stack+0x9c/0xd3
   print_address_description.constprop.0+0x19/0x170
   __kasan_report.cold+0x6c/0x84
   kasan_report+0x3a/0x50
   check_memory_region+0xfd/0x1f0
   mpi_key_length+0x34/0xb0
   pgp_calc_pkey_keyid.isra.0+0x100/0x5a0
   pgp_generate_fingerprint+0x159/0x330
   pgp_process_public_key+0x1c5/0x330
   pgp_parse_packets+0xf4/0x200
   pgp_key_parse+0xb6/0x340
   asymmetric_key_preparse+0x8a/0x120
   key_create_or_update+0x31f/0x8c0
   __se_sys_add_key+0x23e/0x400
   do_syscall_64+0x30/0x40
   entry_SYSCALL_64_after_hwframe+0x61/0xc6

The root cause of the issue is that pgp_calc_pkey_keyid() calls
mpi_key_length() to get the length of the public key. That length is then
deducted from keylen, which is an unsigned value. However, the returned
byte count is not checked for legitimacy in mpi_key_length(), resulting in
an underflowed keylen and hence the read overflow.

It turns out that the byte count check was mistakenly left in
mpi_read_from_buffer() when commit 94479061 ("mpi: introduce
mpi_key_length()") extracted mpi_key_length() out of
mpi_read_from_buffer(). This patch moves the check into mpi_key_length().

Fixes: 94479061 ("mpi: introduce mpi_key_length()")
Signed-off-by: GUO Zihua <guozihua@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
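mpi_key_length() is an openEuler-specific helper, so its exact signature
is an assumption here; the sketch only illustrates where the byte-count
sanity check belongs, per the description:

    int mpi_key_length(const void *xbuffer, unsigned int ret_nread,
                       unsigned int *nbits_arg, unsigned int *nbytes_arg)
    {
        const uint8_t *buffer = xbuffer;
        unsigned int nbits, nbytes;

        if (ret_nread < 2)
            return -EINVAL;
        nbits = buffer[0] << 8 | buffer[1];  /* MPI: 2-byte big-endian bit count */
        nbytes = DIV_ROUND_UP(nbits, 8);

        /* the check moved back from mpi_read_from_buffer(): the derived
         * byte count must fit inside the supplied buffer */
        if (nbytes + 2 > ret_nread)
            return -EINVAL;

        if (nbits_arg)
            *nbits_arg = nbits;
        if (nbytes_arg)
            *nbytes_arg = nbytes;
        return 0;
    }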
-
Submitted by Yuyao Lin

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I61XP8

--------------------------------

This reverts commit 098b0e01.

timespec64_to_ns() gained both upper- and lower-limit checks in commit
cb477557 ("time: Prevent undefined behaviour in timespec64_to_ns()"),
whereas timespec64_to_ktime() only checks the upper limit, so reverting
this patch fixes the overflow.

Signed-off-by: Yuyao Lin <linyuyao1@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
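For reference, this is mainline timespec64_to_ns() after commit cb477557
("time: Prevent undefined behaviour in timespec64_to_ns()"), which clamps
in both directions; the reverted timespec64_to_ktime() path lacked the
lower clamp:

    static inline s64 timespec64_to_ns(const struct timespec64 *ts)
    {
        /* Prevent multiplication overflow / underflow */
        if (ts->tv_sec >= KTIME_SEC_MAX)
            return KTIME_MAX;

        if (ts->tv_sec <= KTIME_SEC_MIN)
            return KTIME_MIN;  /* the lower-bound check added by cb477557 */

        return ((s64) ts->tv_sec * NSEC_PER_SEC) + ts->tv_nsec;
    }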
-
Submitted by Luís Henriques

stable inclusion
from stable-v5.10.146
commit 958b0ee23f5ac106e7cc11472b71aa2ea9a033bc
category: bugfix
bugzilla: 187444, https://gitee.com/openeuler/kernel/issues/I6261Z
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=958b0ee23f5ac106e7cc11472b71aa2ea9a033bc

--------------------------------

commit 29a5b8a1 upstream.

When walking through an inode's extents, the ext4_ext_binsearch_idx()
function assumes that the extent header has been previously validated.
However, there are no checks that verify that the number of entries
(eh->eh_entries) is non-zero when depth is > 0. This leads to problems
because EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return garbage and
result in this:

  [ 135.245946] ------------[ cut here ]------------
  [ 135.247579] kernel BUG at fs/ext4/extents.c:2258!
  [ 135.249045] invalid opcode: 0000 [#1] PREEMPT SMP
  [ 135.250320] CPU: 2 PID: 238 Comm: tmp118 Not tainted 5.19.0-rc8+ #4
  [ 135.252067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
  [ 135.255065] RIP: 0010:ext4_ext_map_blocks+0xc20/0xcb0
  [ 135.256475] Code:
  [ 135.261433] RSP: 0018:ffffc900005939f8 EFLAGS: 00010246
  [ 135.262847] RAX: 0000000000000024 RBX: ffffc90000593b70 RCX: 0000000000000023
  [ 135.264765] RDX: ffff8880038e5f10 RSI: 0000000000000003 RDI: ffff8880046e922c
  [ 135.266670] RBP: ffff8880046e9348 R08: 0000000000000001 R09: ffff888002ca580c
  [ 135.268576] R10: 0000000000002602 R11: 0000000000000000 R12: 0000000000000024
  [ 135.270477] R13: 0000000000000000 R14: 0000000000000024 R15: 0000000000000000
  [ 135.272394] FS: 00007fdabdc56740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
  [ 135.274510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 135.276075] CR2: 00007ffc26bd4f00 CR3: 0000000006261004 CR4: 0000000000170ea0
  [ 135.277952] Call Trace:
  [ 135.278635]  <TASK>
  [ 135.279247]  ? preempt_count_add+0x6d/0xa0
  [ 135.280358]  ? percpu_counter_add_batch+0x55/0xb0
  [ 135.281612]  ? _raw_read_unlock+0x18/0x30
  [ 135.282704]  ext4_map_blocks+0x294/0x5a0
  [ 135.283745]  ? xa_load+0x6f/0xa0
  [ 135.284562]  ext4_mpage_readpages+0x3d6/0x770
  [ 135.285646]  read_pages+0x67/0x1d0
  [ 135.286492]  ? folio_add_lru+0x51/0x80
  [ 135.287441]  page_cache_ra_unbounded+0x124/0x170
  [ 135.288510]  filemap_get_pages+0x23d/0x5a0
  [ 135.289457]  ? path_openat+0xa72/0xdd0
  [ 135.290332]  filemap_read+0xbf/0x300
  [ 135.291158]  ? _raw_spin_lock_irqsave+0x17/0x40
  [ 135.292192]  new_sync_read+0x103/0x170
  [ 135.293014]  vfs_read+0x15d/0x180
  [ 135.293745]  ksys_read+0xa1/0xe0
  [ 135.294461]  do_syscall_64+0x3c/0x80
  [ 135.295284]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

This patch simply adds an extra check in __ext4_ext_check(), verifying
that eh_entries is not 0 when eh_depth is > 0.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215941
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216283
Cc: Baokun Li <libaokun1@huawei.com>
Cc: stable@kernel.org
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20220822094235.2690-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
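The added check, reconstructed against __ext4_ext_check() in
fs/ext4/extents.c (the corrupted error label already exists there):

    if (unlikely((eh->eh_entries == 0) && (depth > 0))) {
        error_msg = "eh_entries is 0 but eh_depth is > 0";
        goto corrupted;
    }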
-
Submitted by Ziyang Xuan

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I61PL4
CVE: NA

--------------------------------

In a sockmap redirect scenario, destroying a sock while
psock->ingress_msg is not empty produces the following warning:

  =================================================
  WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:154 inet_sock_destruct+0x408/0x430
  ...
  Call Trace:
   <IRQ>
   __sk_destruct+0x3d/0x590 net/core/sock.c:1784
   sk_destruct net/core/sock.c:1829 [inline]
   __sk_free+0x106/0x2a0 net/core/sock.c:1840
   sk_free+0x7d/0xb0 net/core/sock.c:1851
   sock_put include/net/sock.h:1813 [inline]
   tcp_v4_rcv+0x23af/0x26e0 net/ipv4/tcp_ipv4.c:2085
   ip_protocol_deliver_rcu+0xe5/0x440 net/ipv4/ip_input.c:204
   ip_local_deliver_finish+0xd2/0x110 net/ipv4/ip_input.c:231
   NF_HOOK include/linux/netfilter.h:304 [inline]
   ip_local_deliver+0x10a/0x260 net/ipv4/ip_input.c:252
   dst_input include/net/dst.h:459 [inline]
   ip_rcv_finish+0x126/0x160 net/ipv4/ip_input.c:428
   NF_HOOK include/linux/netfilter.h:304 [inline]
   ip_rcv+0xbf/0x1d0 net/ipv4/ip_input.c:539
   __netif_receive_skb_one_core+0x15f/0x190 net/core/dev.c:5366
   __netif_receive_skb+0x2e/0xe0 net/core/dev.c:5480
   process_backlog+0x132/0x2c0 net/core/dev.c:6386
   napi_poll+0x17e/0x4f0 net/core/dev.c:6837
   net_rx_action+0x183/0x3c0 net/core/dev.c:6907

That is because commit 7e41dfae18b1 ("[Huawei] bpf, sockmap: Add
sk_rmem_alloc check for sockmap") does not consider the redirect
scenario: it reduces sk_rmem_alloc without ever having increased it,
which makes sk_rmem_alloc underflow.

Fixes: 8818e269 ("bpf, sockmap: Add sk_rmem_alloc check for sockmap")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
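A hedged sketch of the balancing idea (the helper name is hypothetical,
and locking around ingress_msg is elided): in the redirect path, charge
the receiving socket's sk_rmem_alloc when a msg is queued, so the matching
uncharge on dequeue/destroy cannot underflow:

    static void sk_psock_queue_msg_charged(struct sk_psock *psock,
                                           struct sk_msg *msg)
    {
        /* charge the receiver to match the later uncharge */
        atomic_add(msg->sg.size, &psock->sk->sk_rmem_alloc);
        list_add_tail(&msg->list, &psock->ingress_msg);
    }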
-
Submitted by Guan Jing

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I61E4M
CVE: NA

--------------------------------

When doing wakeups, attempt to limit superfluous scans of the LLC domain.
ARM64 enables SIS_UTIL and disables SIS_PROP, so that the idle-CPU search
is based on the sum of util_avg.

Signed-off-by: Guan Jing <guanjing6@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
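What the feature flip looks like in kernel/sched/features.h; the
arm64-only conditioning described above is an openEuler detail sketched
here with an #ifdef, not quoted from the patch:

    #ifdef CONFIG_ARM64
    SCHED_FEAT(SIS_PROP, false)
    SCHED_FEAT(SIS_UTIL, true)
    #else
    SCHED_FEAT(SIS_PROP, true)
    SCHED_FEAT(SIS_UTIL, false)
    #endif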
-
Submitted by Guan Jing

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I61E4M
CVE: NA

--------------------------------

The sched_domain_shared structure is only used as a pointer, and other
drivers don't use it directly.

Signed-off-by: Guan Jing <guanjing6@huawei.com>
Reviewed-by: zhangjialin <zhangjialin11@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Chen Yu
mainline inclusion
from mainline-v6.0-rc1
commit 70fb5ccf
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I61E4M

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=70fb5ccf2ebb09a0c8ebba775041567812d45

--------------------------------

[Problem Statement]

select_idle_cpu() might spend too much time searching for an idle CPU,
when the system is overloaded. The following histogram is the time spent
in select_idle_cpu(), when running 224 instances of netperf on a system
with 112 CPUs per LLC domain:

@usecs:
[0]                533 |                                                    |
[1]               5495 |                                                    |
[2, 4)           12008 |                                                    |
[4, 8)          239252 |                                                    |
[8, 16)        4041924 |@@@@@@@@@@@@@@                                      |
[16, 32)      12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[32, 64)      14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)     13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[128, 256)     8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[256, 512)     4507667 |@@@@@@@@@@@@@@@                                     |
[512, 1K)      2600472 |@@@@@@@@@                                           |
[1K, 2K)        927912 |@@@                                                 |
[2K, 4K)        218720 |                                                    |
[4K, 8K)         98161 |                                                    |
[8K, 16K)        37722 |                                                    |
[16K, 32K)        6715 |                                                    |
[32K, 64K)         477 |                                                    |
[64K, 128K)          7 |                                                    |

netperf latency usecs:
=======
case            load            Lat_99th        std%
TCP_RR          thread-224      257.39          (  0.21)

The time spent in select_idle_cpu() is visible to netperf and might have a
negative impact.

[Symptom analysis]

The patch [1] from Mel Gorman has been applied to track the efficiency of
select_idle_sibling. Copy the indicators here:

SIS Search Efficiency (se_eff%):
    A ratio expressed as a percentage of runqueues scanned versus idle
    CPUs found. A 100% efficiency indicates that the target, prev or
    recent CPU of a task was idle at wakeup. The lower the efficiency,
    the more runqueues were scanned before an idle CPU was found.

SIS Domain Search Efficiency (dom_eff%):
    Similar, except only for the slower SIS path.

SIS Fast Success Rate (fast_rate%):
    Percentage of SIS that used target, prev or recent CPUs.

SIS Success Rate (success_rate%):
    Percentage of scans that found an idle CPU.

The test is based on Aubrey's schedtests tool, including netperf,
hackbench, schbench and tbench.
Test on vanilla kernel:

schedstat_parse.py -f netperf_vanilla.log
case        load            se_eff%   dom_eff%  fast_rate%  success_rate%
TCP_RR      28 threads      99.978    18.535    99.995      100.000
TCP_RR      56 threads      99.397    5.671     99.964      100.000
TCP_RR      84 threads      21.721    6.818     73.632      100.000
TCP_RR      112 threads     12.500    5.533     59.000      100.000
TCP_RR      140 threads     8.524     4.535     49.020      100.000
TCP_RR      168 threads     6.438     3.945     40.309      99.999
TCP_RR      196 threads     5.397     3.718     32.320      99.982
TCP_RR      224 threads     4.874     3.661     25.775      99.767
UDP_RR      28 threads      99.988    17.704    99.997      100.000
UDP_RR      56 threads      99.528    5.977     99.970      100.000
UDP_RR      84 threads      24.219    6.992     76.479      100.000
UDP_RR      112 threads     13.907    5.706     62.538      100.000
UDP_RR      140 threads     9.408     4.699     52.519      100.000
UDP_RR      168 threads     7.095     4.077     44.352      100.000
UDP_RR      196 threads     5.757     3.775     35.764      99.991
UDP_RR      224 threads     5.124     3.704     28.748      99.860

schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case        load            se_eff%   dom_eff%  fast_rate%  success_rate%
normal      1 mthread       99.152    6.400     99.941      100.000
normal      2 mthreads      97.844    4.003     99.908      100.000
normal      3 mthreads      96.395    2.118     99.917      99.998
normal      4 mthreads      55.288    1.451     98.615      99.804
normal      5 mthreads      7.004     1.870     45.597      61.036
normal      6 mthreads      3.354     1.346     20.777      34.230
normal      7 mthreads      2.183     1.028     11.257      21.055
normal      8 mthreads      1.653     0.825     7.849       15.549

schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case                load        se_eff%   dom_eff%  fast_rate%  success_rate%
process-pipe        1 group     99.991    7.692     99.999      100.000
process-pipe        2 groups    99.934    4.615     99.997      100.000
process-pipe        3 groups    99.597    3.198     99.987      100.000
process-pipe        4 groups    98.378    2.464     99.958      100.000
process-pipe        5 groups    27.474    3.653     89.811      99.800
process-pipe        6 groups    20.201    4.098     82.763      99.570
process-pipe        7 groups    16.423    4.156     77.398      99.316
process-pipe        8 groups    13.165    3.920     72.232      98.828
process-sockets     1 group     99.977    5.882     99.999      100.000
process-sockets     2 groups    99.927    5.505     99.996      100.000
process-sockets     3 groups    99.397    3.250     99.980      100.000
process-sockets     4 groups    79.680    4.258     98.864      99.998
process-sockets     5 groups    7.673     2.503     63.659      92.115
process-sockets     6 groups    4.642     1.584     58.946      88.048
process-sockets     7 groups    3.493     1.379     49.816      81.164
process-sockets     8 groups    3.015     1.407     40.845      75.500
threads-pipe        1 group     99.997    0.000     100.000     100.000
threads-pipe        2 groups    99.894    2.932     99.997      100.000
threads-pipe        3 groups    99.611    4.117     99.983      100.000
threads-pipe        4 groups    97.703    2.624     99.937      100.000
threads-pipe        5 groups    22.919    3.623     87.150      99.764
threads-pipe        6 groups    18.016    4.038     80.491      99.557
threads-pipe        7 groups    14.663    3.991     75.239      99.247
threads-pipe        8 groups    12.242    3.808     70.651      98.644
threads-sockets     1 group     99.990    6.667     99.999      100.000
threads-sockets     2 groups    99.940    5.114     99.997      100.000
threads-sockets     3 groups    99.469    4.115     99.977      100.000
threads-sockets     4 groups    87.528    4.038     99.400      100.000
threads-sockets     5 groups    6.942     2.398     59.244      88.337
threads-sockets     6 groups    4.359     1.954     49.448      87.860
threads-sockets     7 groups    2.845     1.345     41.198      77.102
threads-sockets     8 groups    2.871     1.404     38.512      74.312

schedstat_parse.py -f tbench_vanilla.log
case        load            se_eff%   dom_eff%  fast_rate%  success_rate%
loopback    28 threads      99.976    18.369    99.995      100.000
loopback    56 threads      99.222    7.799     99.934      100.000
loopback    84 threads      19.723    6.819     70.215      100.000
loopback    112 threads     11.283    5.371     55.371      99.999
loopback    140 threads     0.000     0.000     0.000       0.000
loopback    168 threads     0.000     0.000     0.000       0.000
loopback    196 threads     0.000     0.000     0.000       0.000
loopback    224 threads     0.000     0.000     0.000       0.000
According to the test above, if the system becomes busy, the SIS Search
Efficiency (se_eff%) drops significantly. Although some benchmarks would
finally find an idle CPU (success_rate% = 100%), it is doubtful whether it
is worth it to search the whole LLC domain.

[Proposal]

It would be ideal to have a crystal ball to answer this question: How many
CPUs must a wakeup path walk down, before it can find an idle CPU? Many
potential metrics could be used to predict the number. One candidate is
the sum of util_avg in this LLC domain. The benefit of choosing util_avg
is that it is a metric of accumulated historic activity, which seems to be
smoother than instantaneous metrics (such as rq->nr_running). Besides,
choosing the sum of util_avg would help predict the load of the LLC domain
more precisely, because SIS_PROP uses one CPU's idle time to estimate the
total LLC domain idle time.

In summary, the lower the util_avg is, the more select_idle_cpu() should
scan for idle CPU, and vice versa. When the sum of util_avg in this LLC
domain hits 85% or above, the scan stops. The reason to choose 85% as the
threshold is that this is the imbalance_pct (117) when an LLC sched group
is overloaded.

Introduce the quadratic function:

	y = SCHED_CAPACITY_SCALE - p * x^2
	y' = y / SCHED_CAPACITY_SCALE

x is the ratio of sum_util compared to the CPU capacity:

	x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)

y' is the ratio of CPUs to be scanned in the LLC domain, and the number of
CPUs to scan is calculated by:

	nr_scan = llc_weight * y'

Choosing a quadratic function is because:
 [1] Compared to the linear function, it scans more aggressively when the
     sum_util is low.
 [2] Compared to the exponential function, it is easier to calculate.
 [3] It seems that there is no accurate mapping between the sum of
     util_avg and the number of CPUs to be scanned. Use a heuristic scan
     for now.

For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util%    0    5   15   25   35   45   55   65   75   85   86 ...
scan_nr    112  111  108  102   93   81   65   47   25    1    0 ...

For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
sum_util%    0    5   15   25   35   45   55   65   75   85   86 ...
scan_nr     16   15   15   14   13   11    9    6    3    0    0 ...

Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance. As
mentioned by Abel, on a platform with 112 CPUs per LLC, the sum_util
calculated by periodic load balance after 112 ms would decay to about
0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in reflecting the
latest utilization. But it is a trade-off. Checking the util_avg in
newidle load balance would be more frequent, but it brings overhead -
multiple CPUs write/read the per-LLC shared variable and introduce cache
contention. Tim also mentioned that it is allowed to be non-optimal in
terms of scheduling for short-term variations, but if there is a
long-term trend in the load behavior, the scheduler can adjust for that.

When SIS_UTIL is enabled, select_idle_cpu() uses the nr_scan calculated
by SIS_UTIL instead of the one from SIS_PROP. As Peter and Mel suggested,
SIS_UTIL should be enabled by default.

This patch is based on the util_avg, which is very sensitive to CPU
frequency invariance. There is an issue that, when the max frequency has
been clamped, the util_avg decays insanely fast when the CPU is idle.
Commit addca285 ("cpufreq: intel_pstate: Handle no_turbo in frequency
invariance") could be used to mitigate this symptom, by adjusting the
arch_max_freq_ratio when turbo is disabled.
But this issue is still not thoroughly fixed, because the current code is
unaware of the user-specified max CPU frequency.

[Test result]

netperf and tbench were launched with 25% 50% 75% 100% 125% 150% 175% 200%
of the CPU number respectively. Hackbench and schbench were launched with
1, 2, 4, 8 groups. Each test lasts for 100 seconds and repeats 3 times.

The following is the benchmark result comparison between
baseline: vanilla v5.19-rc1 and compare: patched kernel. Positive compare%
indicates better performance.

Each netperf test is a:
netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100

netperf.throughput
=======
case        load            baseline(std%)   compare%( std%)
TCP_RR      28 threads      1.00 (  0.34)    -0.16 (  0.40)
TCP_RR      56 threads      1.00 (  0.19)    -0.02 (  0.20)
TCP_RR      84 threads      1.00 (  0.39)    -0.47 (  0.40)
TCP_RR      112 threads     1.00 (  0.21)    -0.66 (  0.22)
TCP_RR      140 threads     1.00 (  0.19)    -0.69 (  0.19)
TCP_RR      168 threads     1.00 (  0.18)    -0.48 (  0.18)
TCP_RR      196 threads     1.00 (  0.16)    +194.70 ( 16.43)
TCP_RR      224 threads     1.00 (  0.16)    +197.30 (  7.85)
UDP_RR      28 threads      1.00 (  0.37)    +0.35 (  0.33)
UDP_RR      56 threads      1.00 ( 11.18)    -0.32 (  0.21)
UDP_RR      84 threads      1.00 (  1.46)    -0.98 (  0.32)
UDP_RR      112 threads     1.00 ( 28.85)    -2.48 ( 19.61)
UDP_RR      140 threads     1.00 (  0.70)    -0.71 ( 14.04)
UDP_RR      168 threads     1.00 ( 14.33)    -0.26 ( 11.16)
UDP_RR      196 threads     1.00 ( 12.92)    +186.92 ( 20.93)
UDP_RR      224 threads     1.00 ( 11.74)    +196.79 ( 18.62)

Take the 224 threads as an example, the SIS search metrics changes are
illustrated below:

        vanilla               patched
      4544492   +237.5%   15338634   sched_debug.cpu.sis_domain_search.avg
        38539 +39686.8%   15333634   sched_debug.cpu.sis_failed.avg
    128300000    -87.9%   15551326   sched_debug.cpu.sis_scanned.avg
      5842896   +162.7%   15347978   sched_debug.cpu.sis_search.avg

There are -87.9% fewer CPU scans after patching, which indicates lower
overhead. Besides, with this patch applied, there is -13% less rq lock
contention in
perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function.
This might help explain the performance improvement: this patch allows
the waking task to remain on the previous CPU, rather than grabbing other
CPUs' locks.
Each hackbench test is a:
hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100

hackbench.throughput
=========
case                load        baseline(std%)   compare%( std%)
process-pipe        1 group     1.00 (  1.29)    +0.57 (  0.47)
process-pipe        2 groups    1.00 (  0.27)    +0.77 (  0.81)
process-pipe        4 groups    1.00 (  0.26)    +1.17 (  0.02)
process-pipe        8 groups    1.00 (  0.15)    -4.79 (  0.02)
process-sockets     1 group     1.00 (  0.63)    -0.92 (  0.13)
process-sockets     2 groups    1.00 (  0.03)    -0.83 (  0.14)
process-sockets     4 groups    1.00 (  0.40)    +5.20 (  0.26)
process-sockets     8 groups    1.00 (  0.04)    +3.52 (  0.03)
threads-pipe        1 group     1.00 (  1.28)    +0.07 (  0.14)
threads-pipe        2 groups    1.00 (  0.22)    -0.49 (  0.74)
threads-pipe        4 groups    1.00 (  0.05)    +1.88 (  0.13)
threads-pipe        8 groups    1.00 (  0.09)    -4.90 (  0.06)
threads-sockets     1 group     1.00 (  0.25)    -0.70 (  0.53)
threads-sockets     2 groups    1.00 (  0.10)    -0.63 (  0.26)
threads-sockets     4 groups    1.00 (  0.19)    +11.92 (  0.24)
threads-sockets     8 groups    1.00 (  0.08)    +4.31 (  0.11)

Each tbench test is a:
tbench -t 100 $job 127.0.0.1

tbench.throughput
======
case        load            baseline(std%)   compare%( std%)
loopback    28 threads      1.00 (  0.06)    -0.14 (  0.09)
loopback    56 threads      1.00 (  0.03)    -0.04 (  0.17)
loopback    84 threads      1.00 (  0.05)    +0.36 (  0.13)
loopback    112 threads     1.00 (  0.03)    +0.51 (  0.03)
loopback    140 threads     1.00 (  0.02)    -1.67 (  0.19)
loopback    168 threads     1.00 (  0.38)    +1.27 (  0.27)
loopback    196 threads     1.00 (  0.11)    +1.34 (  0.17)
loopback    224 threads     1.00 (  0.11)    +1.67 (  0.22)

Each schbench test is a:
schbench -m $job -t 28 -r 100 -s 30000 -c 30000

schbench.latency_90%_us
========
case        load            baseline(std%)   compare%( std%)
normal      1 mthread       1.00 ( 31.22)    -7.36 ( 20.25)*
normal      2 mthreads      1.00 (  2.45)    -0.48 (  1.79)
normal      4 mthreads      1.00 (  1.69)    +0.45 (  0.64)
normal      8 mthreads      1.00 (  5.47)    +9.81 ( 14.28)

*Considering the standard deviation, this -7.36% regression might not be
valid.

Also, an OLTP workload with a commercial RDBMS has been tested, and there
is no significant change.

There were concerns that unbalanced tasks among CPUs would cause problems.
For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are
bound to CPU0~CPU6, while CPU7 is idle:

          CPU0  CPU1  CPU2  CPU3  CPU4  CPU5  CPU6  CPU7
util_avg  1024  1024  1024  1024  1024  1024  1024     0

Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%,
select_idle_cpu() will not scan, thus CPU7 is undetected during the scan.
But according to Mel, it is unlikely that CPU7 will be idle all the time,
because it could pull some tasks via CPU_NEWLY_IDLE.

lkp (kernel test robot) has reported a regression on stress-ng.sock on a
very busy system. According to the sched_debug statistics, it might be
caused by SIS_UTIL terminating the scan and choosing a previous CPU
earlier, and this might introduce more context switches, especially
involuntary preemption, which impacts a busy stress-ng. This regression
has shown that not all benchmarks in every scenario benefit from the idle
CPU scan limit, and it needs further investigation.

Besides, there is a slight regression in hackbench's 16 groups case when
the LLC domain has 16 CPUs. Prateek mentioned that we should scan
aggressively in an LLC domain with 16 CPUs, because the cost of searching
for an idle one among 16 CPUs is negligible. The current patch aims to
propose a generic solution and only considers the util_avg.
Something like the below could be applied on top of the current patch to
fulfill the requirement:

	if (llc_weight <= 16)
		nr_scan = nr_scan * 32 / llc_weight;

For an LLC domain with 16 CPUs, nr_scan would be expanded to 2 times its
value. The smaller the CPU number this LLC domain has, the larger nr_scan
will be expanded. This needs further investigation.

There is also ongoing work [2] from Abel to filter out the busy CPUs
during wakeup, to further speed up the idle CPU scan. And it could be a
follow-up optimization on top of this change.

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
Signed-off-by: Guan Jing <guanjing6@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
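To make the quadratic mapping concrete, here is a small self-contained
userspace program that mirrors the integer arithmetic described above
(imbalance_pct = 117 is assumed, per the 85% discussion); it reproduces
the 112-CPU scan_nr table from the commit message:

    #include <stdio.h>

    #define SCHED_CAPACITY_SCALE 1024ULL

    /*
     * y = SCHED_CAPACITY_SCALE - p * x^2, with p derived from
     * imbalance_pct so that y reaches 0 around x = 85%.
     */
    static int nr_scan(unsigned long long sum_util, int llc_weight, int pct)
    {
        unsigned long long x, tmp, y;

        x = sum_util / llc_weight;            /* avg per-CPU util, 0..1024 */
        tmp = x * x * pct * pct;
        tmp /= 10000 * SCHED_CAPACITY_SCALE;  /* p = (pct/100)^2 ~= 1.37 */
        if (tmp > SCHED_CAPACITY_SCALE)
            tmp = SCHED_CAPACITY_SCALE;
        y = SCHED_CAPACITY_SCALE - tmp;

        return (int)(llc_weight * y / SCHED_CAPACITY_SCALE);
    }

    int main(void)
    {
        int sum_util_pct[] = { 0, 5, 15, 25, 35, 45, 55, 65, 75, 85, 86 };
        int llc_weight = 112;

        for (int i = 0; i < 11; i++) {
            unsigned long long sum_util =
                sum_util_pct[i] * llc_weight * SCHED_CAPACITY_SCALE / 100;
            /* prints 112 111 108 102 93 81 65 47 25 1 0 */
            printf("sum_util%%=%2d -> scan_nr=%d\n",
                   sum_util_pct[i], nr_scan(sum_util, llc_weight, 117));
        }
        return 0;
    }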
-
Submitted by Li Lingfeng

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60QE9
CVE: NA

--------------------------------

As explained in 32c39e8a ("block: fix use after free for bd_holder_dir"),
we should make sure the "disk" is still live and then grab a reference to
'bd_holder_dir'. However, the "disk" should be the claimed slave bdev's
disk rather than the holding disk.

Fixes: 32c39e8a ("block: fix use after free for bd_holder_dir")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Michal Simek

mainline inclusion
from mainline-v5.13-rc1
commit 6a37d750
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60OLE
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=6a37d750037827d385672acdebf5788fc2ffa633

--------------------------------

A static analyzer found that the ret variable is not initialized, but the
code expects ret to be >= 0 when pinconf is skipped in the first pinmux
loop. The same expectation holds for pinmux in the pinconf loop. That's
why ret is initialized to 0: to avoid an uninitialized ret value in the
first loop, or reusing the ret value from the first loop in the second.

Addresses-Coverity: ("Uninitialized variables")
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/e5203bae68eb94b4b8b4e67e5e7b4d86bb989724.1615534291.git.michal.simek@xilinx.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Yuyao Lin <linyuyao1@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Michal Simek

mainline inclusion
from mainline-v5.13-rc1
commit b991f8c3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60OLE
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b991f8c3622c8c9d01a1ada382682a731932e651

--------------------------------

Right now the handling order depends on how the entries arrive, which
corresponds to their order in the DT. We have hit a case with DT overlays
where the conf and mux descriptions are exchanged, which ends up in a
sequence where the firmware is asked to perform the configuration before
the pin is requested. The patch enforces the order that the pin is always
requested first, followed by the pin configuration. This change ensures
that the firmware gets the requests in the right order.

Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/cfbe01f791c2dd42a596cbda57e15599969b57aa.1615364211.git.michal.simek@xilinx.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Yuyao Lin <linyuyao1@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Yu Kuai

mainline inclusion
from mainline-v5.16-rc2
commit 76dd2980
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5VGU9
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=76dd298094f484c6250ebd076fa53287477b2328

--------------------------------

Our syzkaller reported a null pointer dereference; the root cause is the
following:

__blk_mq_alloc_map_and_rqs
  set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs
    blk_mq_alloc_map_and_rqs
      blk_mq_alloc_rqs
        // failed due to oom
        alloc_pages_node
      // set->tags[hctx_idx] is still NULL
      blk_mq_free_rqs
        drv_tags = set->tags[hctx_idx];
        // null pointer dereference is triggered
        blk_mq_clear_rq_mapping(drv_tags, ...)

This is because commit 63064be1 ("blk-mq: Add blk_mq_alloc_map_and_rqs()")
merged the two steps:

  1) set->tags[hctx_idx] = blk_mq_alloc_rq_map()
  2) blk_mq_alloc_rqs(..., set->tags[hctx_idx])

into one step:

  set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs()

Since tags is not initialized yet in this case, fix the problem by
checking if tags is a NULL pointer in blk_mq_clear_rq_mapping().

Fixes: 63064be1 ("blk-mq: Add blk_mq_alloc_map_and_rqs()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20221011142253.4015966-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
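The guard, reconstructed from the description against block/blk-mq.c:

    static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
                                        struct blk_mq_tags *tags)
    {
        /*
         * There is no need to clear the mapping if driver tags is not
         * initialized or the mapping belongs to the driver tags.
         */
        if (!drv_tags || drv_tags == tags)
            return;

        /* ... walk tags->static_rqs and clear matching entries ... */
    }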
-
Submitted by Yu Kuai

stable inclusion
from stable-v5.10.152
commit 31b1570677e8bf85f48be8eb95e21804399b8295
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60HVY
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=31b1570677e8bf85f48be8eb95e21804399b8295

-------------------------------

commit 285febab upstream.

commit 8c5035df ("blk-wbt: call rq_qos_add() after wb_normal is
initialized") moves wbt_set_write_cache() before rq_qos_add(), which is
wrong because wbt_rq_qos() is still NULL. Fix the problem by removing
wbt_set_write_cache() and setting 'rwb->wc' directly. Note that this
patch also removes the redundant setting of 'rwb->wc'.

Fixes: 8c5035df ("blk-wbt: call rq_qos_add() after wb_normal is initialized")
Reported-by: kernel test robot <yujie.liu@intel.com>
Link: https://lore.kernel.org/r/202210081045.77ddf59b-yujie.liu@intel.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20221009101038.1692875-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
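A sketch of the fix in block/blk-wbt.c wbt_init() terms (surrounding
context reconstructed, not quoted from the patch):

    int wbt_init(struct request_queue *q)
    {
        struct rq_wb *rwb;

        /* ... allocate rwb, set up callbacks ... */

        rwb->last_comp = rwb->last_issue = jiffies;
        rwb->win_nsec = RWB_WINDOW_NSEC;
        rwb->enable_state = WBT_STATE_ON_DEFAULT;
        /* set the flag directly: wbt_set_write_cache() would go through
         * wbt_rq_qos(), which is still NULL before rq_qos_add() */
        rwb->wc = test_bit(QUEUE_FLAG_WC, &q->queue_flags);

        wbt_queue_depth_changed(&rwb->rqos);

        rq_qos_add(q, &rwb->rqos);  /* only now is wbt_rq_qos() valid */
        return 0;
    }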
-
Submitted by Yu Kuai

stable inclusion
from stable-v5.10.152
commit 910ba49b33450a878128adc7d9c419dd97efd923
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60HVY
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=910ba49b33450a878128adc7d9c419dd97efd923

-------------------------------

commit 8c5035df upstream.

Our test found a problem that the wbt inflight counter goes negative,
which causes an io hang (note that this problem doesn't exist in
mainline):

t1: device create                  t2: issue io
add_disk
  blk_register_queue
    wbt_enable_default
      wbt_init
        rq_qos_add
        // wb_normal is still 0
                                   /*
                                    * in mainline, disk can't be opened
                                    * before bdev_add(); however, in old
                                    * kernels, disk can be opened before
                                    * blk_register_queue().
                                    */
                                   blkdev_issue_flush
                                   // disk size is 0, however, it's not checked
                                     submit_bio_wait
                                       submit_bio
                                         blk_mq_submit_bio
                                           rq_qos_throttle
                                             wbt_wait
                                               bio_to_wbt_flags
                                                 rwb_enabled
                                   // wb_normal is 0, inflight is not increased
        wbt_queue_depth_changed(&rwb->rqos);
          wbt_update_limits
          // wb_normal is initialized
                                           rq_qos_track
                                             wbt_track
                                               rq->wbt_flags |= bio_to_wbt_flags(rwb, bio);
                                   // wb_normal is not 0, wbt_flags will be set

t3: io completion
blk_mq_free_request
  rq_qos_done
    wbt_done
      wbt_is_tracked
      // return true
      __wbt_done
        wbt_rqw_done
          atomic_dec_return(&rqw->inflight);
          // inflight is decreased

commit 8235b5c1 ("block: call bdev_add later in device_add_disk") can
avoid this problem; however, it's better to fix it in wbt:

1) Lower kernels can't backport that patch due to lots of refactoring.
2) The root cause is that wbt calls rq_qos_add() before wb_normal is
   initialized.

Fixes: e34cbd30 ("blk-wbt: add general throttling mechanism")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220913105749.3086243-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Lei Chen

stable inclusion
from stable-v5.10.152
commit 392536023da18086d57565e716ed50193869b8e7
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60HVY
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=392536023da18086d57565e716ed50193869b8e7

-------------------------------

commit 5a20d073 upstream.

It's unnecessary to call wbt_update_limits explicitly within wbt_init,
because it will be called in the following function,
wbt_queue_depth_changed.

Signed-off-by: Lei Chen <lennychen@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-