- 30 Nov 2022, 40 commits
-
Submitted by Luo Meng

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5WBID
CVE: NA

--------------------------------

When dm_resume() and dm_destroy() run concurrently, a use-after-free (UAF)
can occur. One such race is shown below:

use                                     free
do_resume                             |
  __find_device_hash_cell             |
    dm_get                            |
      atomic_inc(&md->holders)        |
                                      | dm_destroy
                                      |   __dm_destroy
                                      |     if (!dm_suspended_md(md))
                                      |       atomic_read(&md->holders)
                                      |       msleep(1)
  dm_resume                           |
    __dm_resume                       |
      dm_table_resume_targets         |
        pool_resume                   |
          do_waker  # add delayed work|
                                      |     dm_table_destroy
                                      |       pool_dtr
                                      |         __pool_dec
                                      |           __pool_destroy
                                      |             destroy_workqueue
                                      |             kfree(pool)  # free pool
time out                              |
__do_softirq                          |
  run_timer_softirq  # pool has already been freed

This can be easily reproduced using:
  1. create thin-pool
  2. dmsetup suspend pool
  3. dmsetup resume pool
  4. dmsetup remove_all  # concurrent with 3

The root cause of the UAF is that dm_resume() arms the timer after
dm_destroy() has skipped cancelling it because of the suspend status.
After the timeout, run_timer_softirq() runs, but the pool has already been
freed. Therefore, move the timer cancellation to after md->holders drops
to zero.

Signed-off-by: Luo Meng <luomeng12@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
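For illustration, a minimal sketch of the idea behind the fix. The names
follow drivers/md/dm.c, but the exact hunk is assumed from the description
above, not quoted from the patch:

    static void __dm_destroy(struct mapped_device *md, bool wait)
    {
        struct dm_table *map;
        int srcu_idx;

        /* ... */

        /* wait until the last holder (e.g. a concurrent do_resume()) is gone */
        while (atomic_read(&md->holders))
            msleep(1);

        /*
         * Only now tear the targets down: a dm_resume() that held a
         * reference can no longer re-arm the pool's delayed worker, so
         * dm_table_destroy() cannot free the pool while a timer is pending.
         */
        map = dm_get_live_table(md, &srcu_idx);
        if (map)
            dm_table_postsuspend_targets(map);  /* cancels the waker */
        dm_put_live_table(md, srcu_idx);

        dm_table_destroy(map);  /* __pool_destroy() -> kfree(pool) is now safe */
        /* ... */
    }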
-
Submitted by Zheng Yejian

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I60N44
CVE: NA

--------------------------------

A misspelling of 'CONFIG_PREEMPTION' may cause the old function to escape
the stack check, with the result that a function that is still running can
be livepatched.

Fixes: 20106abf ("livepatch: Check whole stack when CONFIG_PREEMPT is set")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
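To illustrate the failure mode (the helper name here is hypothetical; the
point is the silent behavior of a mistyped Kconfig guard):

    /*
     * With the symbol mistyped, e.g. "#ifdef CONFIG_PREEMPTION_X", the
     * preprocessor quietly treats the guard as undefined and the
     * whole-stack check is compiled out; no warning is emitted.
     */
    #ifdef CONFIG_PREEMPTION        /* must match the Kconfig symbol exactly */
        ret = klp_check_stack(task, klp_funcs);  /* hypothetical stack walk */
        if (ret)
            return ret;  /* a to-be-patched function is still on some stack */
    #endif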
-
Submitted by David Vernet

mainline inclusion
from mainline-v5.17-rc1
commit f5bdb34b
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60MYE
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f5bdb34bf0c9314548f2d8e2360b703ff3610303

--------------------------------

When initializing a 'struct klp_object' in klp_init_object_loaded(), and
when performing relocations in klp_resolve_symbols(),
klp_find_object_symbol() is invoked to look up the address of a symbol in
an already-loaded module (or vmlinux). This, in turn, calls
kallsyms_on_each_symbol() or module_kallsyms_on_each_symbol() to find the
address of the symbol that is being patched.

It turns out that symbol lookups often take up the most CPU time when
enabling and disabling a patch, and may hog the CPU and cause other tasks
on that CPU's runqueue to starve -- even in paths where interrupts are
enabled. For example, under certain workloads, enabling a KLP patch with
many objects or functions may cause ksoftirqd to be starved, and thus for
interrupts to be backlogged and delayed. This may end up causing TCP
retransmits on the host where the KLP patch is being applied, and in
general, may cause any interrupts serviced by softirqd to be delayed while
the patch is being applied.

So as to ensure that kallsyms_on_each_symbol() does not end up hogging the
CPU, this patch adds a call to cond_resched() in kallsyms_on_each_symbol()
and module_kallsyms_on_each_symbol(), which are invoked when doing a
symbol lookup in vmlinux and a module respectively. Without this patch, if
a live-patch is applied on a 36-core Intel host with heavy TCP traffic, a
~10x spike is observed in TCP retransmits while the patch is being
applied. Additionally, collecting sched events with perf indicates that
ksoftirqd is awakened ~1.3 seconds before it's eventually scheduled. With
the patch, no increase in TCP retransmit events is observed, and ksoftirqd
is scheduled shortly after it's awakened.

Signed-off-by: David Vernet <void@manifault.com>
Acked-by: Miroslav Benes <mbenes@suse.cz>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20211229215646.830451-1-void@manifault.com
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
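The vmlinux side of the change is small; reconstructed from the commit
description, this is essentially what the loop in kernel/kallsyms.c looks
like with the reschedule point added:

    int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *,
                                          unsigned long),
                                void *data)
    {
        char namebuf[KSYM_NAME_LEN];
        unsigned long i;
        unsigned int off;
        int ret;

        for (i = 0, off = 0; i < kallsyms_num_syms; i++) {
            off = kallsyms_expand_symbol(off, namebuf, ARRAY_SIZE(namebuf));
            ret = fn(data, namebuf, NULL, kallsyms_sym_address(i));
            if (ret != 0)
                return ret;
            cond_resched();  /* the added line: yield so ksoftirqd can run */
        }
        return 0;
    }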
-
Submitted by Zheng Yejian

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60MKD
CVE: NA

--------------------------------

Fix several code style issues:
 - Do not use magic numbers (the number in question is 10).
 - Do not use parentheses when printing numbers.
 - Braces {} are not necessary for single-statement blocks.
 - Do not add blank lines at the start of a code block defined by braces.

Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Zheng Yejian

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60L10
CVE: NA

--------------------------------

static_call and static_key allow instructions at a call site to be
modified at runtime; the related configs are
CONFIG_HAVE_STATIC_CALL_INLINE for static call and CONFIG_JUMP_LABEL for
static key. When such a site lies in the first several instructions of an
old function, the very bytes livepatch would also modify, the two
mechanisms conflict. To avoid the conflict, refuse to insert a livepatch
module that patches such a function.

Fixes: c33e4283 ("livepatch/core: Allow implementation without ftrace")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
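A hedged sketch of what such a rejection check could look like; the helper
name and the jump-table walk are assumptions based on the description, not
the actual patch:

    static bool klp_static_site_in_prologue(unsigned long old_func)
    {
        struct jump_entry *entry;

        /* is a static key patched within the bytes livepatch overwrites? */
        for (entry = __start___jump_table; entry < __stop___jump_table; entry++) {
            unsigned long code = jump_entry_code(entry);

            if (code >= old_func && code < old_func + KLP_MAX_REPLACE_SIZE)
                return true;  /* reject: both would rewrite these bytes */
        }
        /* an analogous walk over the .static_call_sites section is implied */
        return false;
    }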
-
Submitted by Zheng Yejian

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60L10
CVE: NA

--------------------------------

If a function is patched, the instructions at its beginning are replaced
with jump code that jumps to the new function. This requires the function
to be big enough, otherwise the modification may run past the end of the
function. Currently each architecture implements arch_klp_func_can_patch()
to check the function size, but this has several problems:
 1. arch 'x86' didn't implement arch_klp_func_can_patch();
 2. in the arm64 & ppc32 implementations, the function size is checked
    only if a long jump is needed. So a very short function can be patched
    successfully, but as kernel modules grow, a long jump may one day be
    required, and the function then becomes unpatchable;
 3. the implementations are largely duplicated.

In this patch, introduce the macro KLP_MAX_REPLACE_SIZE to denote the
maximum size that is replaced on patching, and move the check ahead into
klp_init_object_loaded().

Fixes: c33e4283 ("livepatch/core: Allow implementation without ftrace")
Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
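A sketch of the centralized check implied above, assuming the constant and
a helper called from klp_init_object_loaded() (the real hunk may differ):

    /* arch-provided: longest byte sequence livepatch writes at function entry */
    #ifndef KLP_MAX_REPLACE_SIZE
    #define KLP_MAX_REPLACE_SIZE 16
    #endif

    static int klp_check_func_size(struct klp_func *func)
    {
        unsigned long size = 0, offset = 0;

        if (!kallsyms_lookup_size_offset((unsigned long)func->old_func,
                                         &size, &offset))
            return -ENOENT;

        if (size < KLP_MAX_REPLACE_SIZE) {
            pr_err("%s size (%lu) is smaller than the replace size (%d)\n",
                   func->old_name, size, KLP_MAX_REPLACE_SIZE);
            return -EINVAL;  /* the jump code would overrun the function */
        }
        return 0;
    }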
-
Submitted by Zheng Yejian

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60L10
CVE: NA

--------------------------------

In arm/arm64/ppc32/ppc64 this field is named old_insns, so unify the
naming.

Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Zheng Yejian

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60L10
CVE: NA

--------------------------------

It was reported that if 'static_call' is used in an old function, the
livepatch module created by kpatch for that function cannot be inserted
normally. The root cause is that the relocation of static_call symbols in
the livepatch module has not yet been done at init time:

load_module
  prepare_coming_module
    blocking_notifier_call_chain_robust
      notifier_call_chain_robust
        static_call_module_notify   <-- 1. static_call symbols are
                                          initialized here, but relocation
                                          is done below at mark "2."
  do_init_module
    do_one_initcall
      klp_register_patch
        klp_init_patch
          klp_init_object
            klp_init_object_loaded  <-- 2. .klp.xxx relocations applied here

To solve this, move the static_call initialization after the relocation.

Signed-off-by: Zheng Yejian <zhengyejian1@huawei.com>
Reviewed-by: Kuohai Xu <xukuohai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Luo Meng

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5P05D
CVE: NA

--------------------------------

When a thin-pool is suspended with fail_io set, resume reports an error as
below:

  device-mapper: resume ioctl on vg-thinpool failed: Invalid argument

The thin-pool also can't be removed if a bio is in the deferred list.

This can be easily reproduced using:
  echo "offline" > /sys/block/sda/device/state
  dd if=/dev/zero of=/dev/mapper/thin bs=4K count=1
  dmsetup suspend /dev/mapper/pool
  mkfs.ext4 /dev/mapper/thin
  dmsetup resume /dev/mapper/pool

The root cause is that maybe_resize_data_dev() checks fail_io and returns
an error before dm_resume() is called. Fix this by adding a FAIL-mode
check at the end of pool_preresume().

Fixes: da105ed5 (dm thin metadata: introduce dm_pool_abort_metadata)
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
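A sketch of the described shape of the fix in drivers/md/dm-thin.c; the
out label and exact placement are assumptions based on the text above:

    static int pool_preresume(struct dm_target *ti)
    {
        int r;
        bool need_commit1, need_commit2;
        struct pool_c *pt = ti->private;
        struct pool *pool = pt->pool;

        /* ... */
        r = maybe_resize_data_dev(ti, &need_commit1);
        if (r)
            goto out;

        r = maybe_resize_metadata_dev(ti, &need_commit2);
        if (r)
            goto out;

        if (need_commit1 || need_commit2)
            (void) commit(pool);
    out:
        /*
         * When the pool is already in FAIL mode, don't block the resume:
         * report success so dm_resume() can proceed and deferred bios are
         * failed by the FAIL-mode I/O path instead of pinning the device.
         */
        if (r && get_pool_mode(pool) == PM_FAIL)
            r = 0;
        return r;
    }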
-
Submitted by Luo Meng

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I62762
CVE: NA

--------------------------------

A crash occurs as follows:

  BUG: KASAN: null-ptr-deref in dev_create.cold+0x12/0x77
  Read of size 8 at addr 0000000000000020 by task dmsetup/683

  CPU: 4 PID: 683 Comm: dmsetup Not tainted 5.10.0-01524-g884de6e91114-dirty #11
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
  Call Trace:
   ? dump_stack+0xdd/0x126
   ? kasan_report.cold+0xd1/0xdb
   ? dev_create.cold+0x12/0x77
   ? __asan_load8+0xae/0x110
   ? dev_create.cold+0x12/0x77
   ? dev_rename+0x720/0x720
   ? cap_capable+0xcf/0x130
   ? ctl_ioctl+0x2f5/0x750
   ? dev_rename+0x720/0x720
   ? free_params+0x50/0x50
   ? unmerge_queues+0x176/0x1b0
   ? __blkcg_punt_bio_submit+0x110/0x110
   ? mem_cgroup_handle_over_high+0x33/0x5e0
   ? dm_ctl_ioctl+0x12/0x20
   ? __se_sys_ioctl+0xc5/0x120
   ? __x64_sys_ioctl+0x46/0x60
   ? do_syscall_64+0x45/0x70
   ? entry_SYSCALL_64_after_hwframe+0x61/0xc6

This can be easily reproduced using:
  dmsetup create xxx --table "0 1000 linear /dev/sda 0"
  dmsetup remove xxx

Fix this by taking the hash lock in dev_create().

Fixes: a5100d07 ("dm ioctl: add DMINFO() to track dm device create/remove")
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
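A hedged sketch in drivers/md/dm-ioctl.c terms; the exact message and the
placement of the DMINFO() introduced by the Fixes commit are assumed:

    static int dev_create(struct file *filp, struct dm_ioctl *param,
                          size_t param_size)
    {
        int r;
        struct mapped_device *md;

        /* ... create the device, populate *param ... */

        /*
         * Hold the ioctl hash lock around the tracking message so a
         * concurrent remove can't tear the hash cell down while the
         * device name is being dereferenced for printing.
         */
        down_read(&_hash_lock);
        DMINFO("%s: device %s is created", __func__, dm_device_name(md));
        up_read(&_hash_lock);

        dm_put(md);
        return r;
    }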
-
Submitted by Vladimir Murzin

mainline inclusion
from mainline-v5.16-rc7
commit 7202216a
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I634EK
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7202216a6f34d571a22274e729f841256bf8b1ef

--------------------------------

__secondary_data used to reside in r7 around the call to
PROCINFO_INITFUNC. After commit 95731b8e ("ARM: 9059/1: cache-v7: get rid
of mini-stack"), r7 is used as a scratch register, so we have to reload
__secondary_data before we set up the stack pointer.

conflict: arch/arm/kernel/head-nommu.S

Fixes: 95731b8e ("ARM: 9059/1: cache-v7: get rid of mini-stack")
Signed-off-by: Vladimir Murzin <vladimir.murzin@arm.com>
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Signed-off-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Ard Biesheuvel

mainline inclusion
from mainline-v5.13-rc1
commit 95731b8e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I634EK
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=95731b8ee63ec9419822a51cd9878fa32582fdd2

--------------------------------

Now that we have reduced the number of registers that we need to preserve
when calling v7_invalidate_l1 from the boot code, we can use scratch
registers to preserve the remaining ones, and get rid of the mini stack
entirely. This works around any issues regarding cache behavior in
relation to the uncached accesses to this memory, which is hard to get
right in the general case (i.e., both bare metal and under
virtualization).

While at it, switch v7_invalidate_l1 to using ip as a scratch register
instead of r4. This makes the function AAPCS compliant, and removes the
need to stash r4 in ip across the call.

conflict: arch/arm/include/asm/memory.h

Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Zhang Jianhua <chris.zjh@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Ard Biesheuvel

mainline inclusion
from mainline-v5.13-rc1
commit f9e7a99f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I634EK
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f9e7a99fb6b86aa6a00e53b34ee6973840e005aa

--------------------------------

The cache invalidation code in v7_invalidate_l1 can be tweaked to re-read
the associativity from CCSIDR, and keep the way identifier component in a
single register that is assigned in the outer loop. This way, we need two
registers less.

Given that the number of sets is typically much larger than the
associativity, rearrange the code so that the outer loop has the fewer
number of iterations, ensuring that the re-read of CCSIDR only occurs a
handful of times in practice.

Fix the whitespace while at it, and update the comment to indicate that
this code is no longer a clone of anything else.

Acked-by: Nicolas Pitre <nico@fluxnic.net>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Zhang Jianhua <chris.zjh@huawei.com>
Reviewed-by: Liao Chang <liaochang1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Quan Zhou

virt inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62Q2L
CVE: NA

----------------------------------------------------

Implement the capability of DVMBM. Before each vcpu is loaded, we
re-calculate the VM-wide dvm_cpumask; if it has changed, we kick all
other vcpus out to reload the latest LSUDVMBM value into the register,
and a new request, KVM_REQ_RELOAD_DVMBM, is added to implement this. If
the dvm_cpumask was not changed by this single vcpu, we still reload the
LSUDVMBM value into the register (to guarantee the register contents are
correct), and nothing else needs to be done.

Signed-off-by: Quan Zhou <zhouquan65@huawei.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Nianyao Tang <tangnianyao@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
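A hedged sketch of the vcpu_load-time flow described above. Besides
standard KVM helpers (kvm_for_each_vcpu, kvm_make_all_cpus_request), all
names come from the commit text or are assumptions (write_lsudvmbm() in
particular is hypothetical):

    static void kvm_tlbi_dvmbm_vcpu_load(struct kvm_vcpu *vcpu)
    {
        struct kvm *kvm = vcpu->kvm;
        struct kvm_vcpu *tmp;
        cpumask_t mask;
        int i;

        spin_lock(&kvm->arch.dvm_lock);

        cpumask_clear(&mask);
        kvm_for_each_vcpu(i, tmp, kvm)  /* union of all vcpus' cpus_ptr */
            cpumask_or(&mask, &mask, tmp->arch.cpus_ptr);

        if (!cpumask_equal(&mask, kvm->arch.dvm_cpumask)) {
            cpumask_copy(kvm->arch.dvm_cpumask, &mask);
            /* make every other vcpu reload LSUDVMBM on its next entry */
            kvm_make_all_cpus_request(kvm, KVM_REQ_RELOAD_DVMBM);
        }

        write_lsudvmbm(kvm);  /* always refresh this pcpu's register */
        spin_unlock(&kvm->arch.dvm_lock);
    }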
-
Submitted by Quan Zhou

virt inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62Q2L
CVE: NA

----------------------------------------------------

Introduce dvm_cpumask and dvm_lock in struct kvm_arch. dvm_cpumask stores
the union of all vcpus' cpus_ptr and is used as the TLBI broadcast range;
dvm_lock ensures exclusive manipulation of dvm_cpumask. In vcpu_load, we
decide whether to perform the subsequent update by checking whether
dvm_cpumask has changed.

Signed-off-by: Quan Zhou <zhouquan65@huawei.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Nianyao Tang <tangnianyao@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Quan Zhou

virt inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62Q2L
CVE: NA

----------------------------------------------------

We already have cpus_ptr in the current thread struct, through which we
can know the pcpu range the thread is allowed to run on. So in
kvm_arch_vcpu_{load,put} we can also know the pcpu range the vcpu thread
is allowed to be scheduled on, and that is the range we want to configure
for TLBI broadcast.

Introduce two variables, cpus_ptr and pre_cpus_ptr, in struct
kvm_vcpu_arch. @cpus_ptr always comes from current->cpus_ptr, and
@pre_cpus_ptr always comes from @cpus_ptr.

Signed-off-by: Quan Zhou <zhouquan65@huawei.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Nianyao Tang <tangnianyao@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Quan Zhou

virt inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62Q2L
CVE: NA

----------------------------------------------------

DVMBM is a virtualization extension since HIP09 that allows TLBIs executed
at NS EL1 to be broadcast to a configurable range of physical CPUs (even
with HCR_EL2.FB set), which enables a TLBI broadcast optimization.

Introduce the method to detect and enable this feature. Also add a kernel
command-line parameter, "kvm-arm.dvmbm_enabled" (default 0), so that users
can enable or disable DVMBM as needed. The parameter description is added
under Documentation/.

Signed-off-by: Quan Zhou <zhouquan65@huawei.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Nianyao Tang <tangnianyao@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
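A minimal sketch of the command-line switch; the parameter name comes from
the commit text, while the parsing code and cpu_supports_dvmbm() detection
helper are assumptions:

    static bool kvm_dvmbm_enabled;  /* default: disabled */

    static int __init early_dvmbm_enabled(char *buf)
    {
        return strtobool(buf, &kvm_dvmbm_enabled);
    }
    early_param("kvm-arm.dvmbm_enabled", early_dvmbm_enabled);

    bool kvm_dvmbm_supported(void)
    {
        /* usable only when the user enabled it AND the CPU is HIP09+ */
        return kvm_dvmbm_enabled && cpu_supports_dvmbm();
    }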
-
Submitted by Quan Zhou

virt inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62Q2L
CVE: NA

----------------------------------------------------

Add a new entry ("HIP09") in oem_str[] to support detection of the new
HiSilicon CPU type.

Signed-off-by: Quan Zhou <zhouquan65@huawei.com>
Reviewed-by: Zenghui Yu <yuzenghui@huawei.com>
Reviewed-by: Nianyao Tang <tangnianyao@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Liu Shixin

stable inclusion
from stable-v5.10.150
commit 45c33966759ea1b4040c08dacda99ef623c0ca29
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I62WRY
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=45c33966759ea1b4040c08dacda99ef623c0ca29

--------------------------------

commit 958f32ce upstream.

The vma_lock and hugetlb_fault_mutex are dropped before handling a
userfault and reacquired after handle_userfault(), but reacquiring the
vma_lock can lead to a UAF [1,2] due to the following race:

hugetlb_fault
  hugetlb_no_page
    /* unlock vma_lock */
    hugetlb_handle_userfault
      handle_userfault
        /* unlock mm->mmap_lock */
                                        vm_mmap_pgoff
                                          do_mmap
                                            mmap_region
                                              munmap_vma_range
                                              /* clean old vma */
    /* lock vma_lock again  <--- UAF */
    /* unlock vma_lock */

Since the vma_lock is unlocked immediately after
hugetlb_handle_userfault() returns, drop the unneeded lock and unlock in
hugetlb_handle_userfault() to fix the issue.

[1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
[2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.com/

Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
Fixes: 1a1aad8a ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reported-by: syzbot+193f9cee8638750b23cf@syzkaller.appspotmail.com
Reported-by: Liu Zixian <liuzixian4@huawei.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
Cc: <stable@vger.kernel.org> [4.14+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

Conflicts:
	mm/hugetlb.c

Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
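A sketch of the shape of the fix, based on the upstream 958f32ce helper
(details of the 5.10 backport differ because of the Conflicts note): the
fault mutex is handed to hugetlb_handle_userfault(), which drops it and
never reacquires it, since the vma may be unmapped underneath us:

    static vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
                                               struct address_space *mapping,
                                               pgoff_t idx, unsigned int flags,
                                               unsigned long haddr,
                                               unsigned long reason)
    {
        u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
        struct vm_fault vmf = {
            .vma = vma,
            .address = haddr,
            .flags = flags,
            .pgoff = idx,
        };

        /* drop and do NOT retake: handle_userfault() releases mmap_lock,
         * so the vma can be gone when we come back */
        mutex_unlock(&hugetlb_fault_mutex_table[hash]);
        return handle_userfault(&vmf, reason);
    }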
-
Submitted by Yuanzheng Song

stable inclusion
from stable-v5.10.153
commit 935a8b6202101d7f58fe9cd11287f9cec0d8dd32
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5XS4G
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=935a8b6202101d7f58fe9cd11287f9cec0d8dd32

--------------------------------

The vma->anon_vma of the child process may be NULL because the entire vma
does not contain anonymous pages. In this case, a BUG will occur when
copy_present_page() passes a copy of a non-anonymous page of that vma to
page_add_new_anon_rmap() to set up a new anonymous rmap:

  ------------[ cut here ]------------
  kernel BUG at mm/rmap.c:1044!
  Internal error: Oops - BUG: 0 [#1] SMP
  Modules linked in:
  CPU: 2 PID: 3617 Comm: test Not tainted 5.10.149 #1
  Hardware name: linux,dummy-virt (DT)
  pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
  pc : __page_set_anon_rmap+0xbc/0xf8
  lr : __page_set_anon_rmap+0xbc/0xf8
  sp : ffff800014c1b870
  x29: ffff800014c1b870 x28: 0000000000000001
  x27: 0000000010100073 x26: ffff1d65c517baa8
  x25: ffff1d65cab0f000 x24: ffff1d65c416d800
  x23: ffff1d65cab5f248 x22: 0000000020000000
  x21: 0000000000000001 x20: 0000000000000000
  x19: fffffe75970023c0 x18: 0000000000000000
  x17: 0000000000000000 x16: 0000000000000000
  x15: 0000000000000000 x14: 0000000000000000
  x13: 0000000000000000 x12: 0000000000000000
  x11: 0000000000000000 x10: 0000000000000000
  x9 : ffffc3096d5fb858 x8 : 0000000000000000
  x7 : 0000000000000011 x6 : ffff5a5c9089c000
  x5 : 0000000000020000 x4 : ffff5a5c9089c000
  x3 : ffffc3096d200000 x2 : ffffc3096e8d0000
  x1 : ffff1d65ca3da740 x0 : 0000000000000000
  Call trace:
   __page_set_anon_rmap+0xbc/0xf8
   page_add_new_anon_rmap+0x1e0/0x390
   copy_pte_range+0xd00/0x1248
   copy_page_range+0x39c/0x620
   dup_mmap+0x2e0/0x5a8
   dup_mm+0x78/0x140
   copy_process+0x918/0x1a20
   kernel_clone+0xac/0x638
   __do_sys_clone+0x78/0xb0
   __arm64_sys_clone+0x30/0x40
   el0_svc_common.constprop.0+0xb0/0x308
   do_el0_svc+0x48/0xb8
   el0_svc+0x24/0x38
   el0_sync_handler+0x160/0x168
   el0_sync+0x180/0x1c0
  Code: 97f8ff85 f9400294 17ffffeb 97f8ff82 (d4210000)
  ---[ end trace a972347688dc9bd4 ]---
  Kernel panic - not syncing: Oops - BUG: Fatal exception
  SMP: stopping secondary CPUs
  Kernel Offset: 0x43095d200000 from 0xffff800010000000
  PHYS_OFFSET: 0xffffe29a80000000
  CPU features: 0x08200022,61806082
  Memory Limit: none
  ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---

This problem was fixed by commit <fb3d824d> ("mm/rmap: split
page_dup_rmap() into page_dup_file_rmap() and page_try_dup_anon_rmap()"),
but it still exists in the linux-5.10.y branch. That patch is not
applicable to this version because of the large version differences.
Therefore, fix it by adding a non-anonymous page check in
copy_present_page().

Cc: stable@vger.kernel.org
Fixes: 70e806e4 ("mm: Do early cow for pinned pages during fork() for ptes")
Signed-off-by: Yuanzheng Song <songyuanzheng@huawei.com>
Acked-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yuanzheng Song <songyuanzheng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
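A sketch of the added guard in mm/memory.c (the surrounding 5.10
copy_present_page() context is reconstructed, not quoted from the patch):

    static inline int
    copy_present_page(struct vm_area_struct *dst_vma,
                      struct vm_area_struct *src_vma,
                      pte_t *dst_pte, pte_t *src_pte, unsigned long addr,
                      int *rss, struct page **prealloc, pte_t pte,
                      struct page *page)
    {
        /* ... */
        if (likely(!page_maybe_dma_pinned(page)))
            return 1;  /* share the page as usual */

        /*
         * The new check: a non-anonymous page must not go down the
         * early-CoW path, because the child's vma->anon_vma may be NULL
         * and page_add_new_anon_rmap() would BUG on it.
         */
        if (!PageAnon(page))
            return 1;

        /* ... early-CoW copy, then page_add_new_anon_rmap(new_page, ...) */
    }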
-
Submitted by liaoguojia

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62HX2

----------------------------------------------------------------------

On HNAE3_DEVICE_VERSION_V2, the TCAM table entries of the FD are obtained
by traversing the list recorded by the driver. HNAE3_DEVICE_VERSION_V3
supports a new FD usage mode, called queue bond mode, in which the
hardware creates rules automatically and the driver does not record the
flow table entries. So we need to check the validity of each entry by
traversing the entire hardware table in order to dump the QB TCAM table.

Signed-off-by: liaoguojia <liaoguojia@huawei.com>
Signed-off-by: Jiantao Xiao <xiaojiantao1@h-partners.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Hao Chen

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62HX2

----------------------------------------------------------------------

When the serdes lanes support 25Gb/s and 50Gb/s speeds and the user wants
to set the port speed to 50Gb/s, it can be configured as either one
50Gb/s serdes lane or two 25Gb/s serdes lanes. So this patch adds support
for querying and setting the lane number via sysfs to cover this scenario.

Signed-off-by: Hao Chen <chenhao418@huawei.com>
Signed-off-by: Jiantao Xiao <xiaojiantao1@h-partners.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Jian Shen

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62HX2

----------------------------------------------------------------------

Because the FD rules for queue bonding are created by hardware
automatically, the driver needs to specify an FD counter for each
function; it is then possible to query how many times the queue bonding
FD rules have been hit.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Jiantao Xiao <xiaojiantao1@h-partners.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Jian Shen

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62HX2

----------------------------------------------------------------------

Device version V3 supports queue bonding mode in hardware. A VF cannot
enable queue bonding mode unless the PF enables it, so the VF needs to
query whether the PF supports queue bonding mode during initialization,
and to query periodically whether the PF has enabled it. Since resources
are limited, to keep one VF from occupying too much FD rule space, only
trusted VFs are allowed to enable queue bonding mode.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Jiantao Xiao <xiaojiantao1@h-partners.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Jian Shen

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62HX2

----------------------------------------------------------------------

Device version V3 supports queue bonding: the hardware can identify the
tuple information of a TCP stream and create flow director rules
automatically, in order to keep the tx and rx packets of the stream in
the same queue pair. The driver sets the FD_ADD field of the TX BD for a
TCP SYN packet and the FD_DEL field for a TCP FIN or RST packet. The
hardware creates or removes an FD rule according to the TX BD, and it
also supports aging out a rule that has not been hit for a long time.
Queue bonding mode is disabled by default and can be enabled/disabled
with the ethtool priv-flags command.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Jiantao Xiao <xiaojiantao1@h-partners.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
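A hedged sketch of the TX-path marking; the BD field and flag names
(fd_op, HNS3_FD_OP_ADD/DEL) are assumptions based on the description, not
the real descriptor layout:

    static void hns3_tx_set_fd_op(struct sk_buff *skb, struct hns3_desc *desc)
    {
        struct tcphdr *th;

        /* IPv4/TCP only in this sketch */
        if (skb->protocol != htons(ETH_P_IP) ||
            ip_hdr(skb)->protocol != IPPROTO_TCP)
            return;

        th = tcp_hdr(skb);
        if (th->syn)
            desc->tx.fd_op = cpu_to_le16(HNS3_FD_OP_ADD);  /* hw creates rule */
        else if (th->fin || th->rst)
            desc->tx.fd_op = cpu_to_le16(HNS3_FD_OP_DEL);  /* hw removes rule */
    }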
-
Submitted by Jian Shen

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I62HX2

----------------------------------------------------------------------

Currently the PF checks that a VF is alive by the KEEP_ALIVE mailbox from
the VF, which the VF sends every 2 seconds. Once the PF has lost the
mailbox for more than 8 seconds, it regards the VF as abnormal and stops
notifying state changes to the VF, including link state, VF MAC and
reset, even if it receives the KEEP_ALIVE mailbox again. This is
unreasonable.

This patch fixes it: the PF records the state changes the VF needs to be
notified of while the VF's KEEP_ALIVE mailbox is lost, and notifies the
VF when it receives the mailbox again. Introduce a new flag,
HCLGE_VPORT_STATE_INITED, used to distinguish whether the VF driver is
loaded or not; the VF queries these states when initializing, so it is
unnecessary to notify it in that case.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Jiantao Xiao <xiaojiantao1@h-partners.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by GUO Zihua

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I62DVN
CVE: NA

--------------------------------

Syzkaller reported a UAF in mpi_key_length():

  BUG: KASAN: use-after-free in mpi_key_length+0x34/0xb0
  Read of size 2 at addr ffff888005737e14 by task syz-executor.15/6236

  CPU: 1 PID: 6236 Comm: syz-executor.15 Kdump: loaded Tainted: GF OE 5.10.0.kasan.x86_64 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-20220525_182517-szxrtosci10000 04/01/2014
  Call Trace:
   dump_stack+0x9c/0xd3
   print_address_description.constprop.0+0x19/0x170
   __kasan_report.cold+0x6c/0x84
   kasan_report+0x3a/0x50
   check_memory_region+0xfd/0x1f0
   mpi_key_length+0x34/0xb0
   pgp_calc_pkey_keyid.isra.0+0x100/0x5a0
   pgp_generate_fingerprint+0x159/0x330
   pgp_process_public_key+0x1c5/0x330
   pgp_parse_packets+0xf4/0x200
   pgp_key_parse+0xb6/0x340
   asymmetric_key_preparse+0x8a/0x120
   key_create_or_update+0x31f/0x8c0
   __se_sys_add_key+0x23e/0x400
   do_syscall_64+0x30/0x40
   entry_SYSCALL_64_after_hwframe+0x61/0xc6

The root cause of the issue is that pgp_calc_pkey_keyid() calls
mpi_key_length() to get the length of the public key. That length is then
deducted from keylen, which is an unsigned value. However, the returned
byte count is not checked for legitimacy in mpi_key_length(), resulting in
an underflowed keylen and hence the read overflow.

It turns out that the byte count check was mistakenly left in
mpi_read_from_buffer() when commit 94479061 ("mpi: introduce
mpi_key_length()") extracted mpi_key_length() out of
mpi_read_from_buffer(). This patch moves the check into mpi_key_length().

Fixes: 94479061 ("mpi: introduce mpi_key_length()")
Signed-off-by: GUO Zihua <guozihua@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
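mpi_key_length() is an openEuler-specific helper, so its exact signature
is an assumption here; the sketch only illustrates where the byte-count
sanity check belongs, per the description:

    int mpi_key_length(const void *xbuffer, unsigned int ret_nread,
                       unsigned int *nbits_arg, unsigned int *nbytes_arg)
    {
        const uint8_t *buffer = xbuffer;
        unsigned int nbits, nbytes;

        if (ret_nread < 2)
            return -EINVAL;
        nbits = buffer[0] << 8 | buffer[1];  /* MPI: 2-byte big-endian bit count */
        nbytes = DIV_ROUND_UP(nbits, 8);

        /* the check moved back from mpi_read_from_buffer(): the derived
         * byte count must fit inside the supplied buffer */
        if (nbytes + 2 > ret_nread)
            return -EINVAL;

        if (nbits_arg)
            *nbits_arg = nbits;
        if (nbytes_arg)
            *nbytes_arg = nbytes;
        return 0;
    }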
-
Submitted by Yuyao Lin

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I61XP8

--------------------------------

This reverts commit 098b0e01.

timespec64_to_ns() gained both upper- and lower-limit checks in commit
cb477557 ("time: Prevent undefined behaviour in timespec64_to_ns()"),
whereas timespec64_to_ktime() only checks the upper limit, so reverting
this patch fixes the overflow.

Signed-off-by: Yuyao Lin <linyuyao1@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
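For reference, this is mainline timespec64_to_ns() after commit cb477557
("time: Prevent undefined behaviour in timespec64_to_ns()"), which clamps
in both directions; the reverted timespec64_to_ktime() path lacked the
lower clamp:

    static inline s64 timespec64_to_ns(const struct timespec64 *ts)
    {
        /* Prevent multiplication overflow / underflow */
        if (ts->tv_sec >= KTIME_SEC_MAX)
            return KTIME_MAX;

        if (ts->tv_sec <= KTIME_SEC_MIN)
            return KTIME_MIN;  /* the lower-bound check added by cb477557 */

        return ((s64) ts->tv_sec * NSEC_PER_SEC) + ts->tv_nsec;
    }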
-
Submitted by Luís Henriques

stable inclusion
from stable-v5.10.146
commit 958b0ee23f5ac106e7cc11472b71aa2ea9a033bc
category: bugfix
bugzilla: 187444, https://gitee.com/openeuler/kernel/issues/I6261Z
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=958b0ee23f5ac106e7cc11472b71aa2ea9a033bc

--------------------------------

commit 29a5b8a1 upstream.

When walking through an inode's extents, the ext4_ext_binsearch_idx()
function assumes that the extent header has been previously validated.
However, there are no checks that verify that the number of entries
(eh->eh_entries) is non-zero when depth is > 0. This leads to problems
because EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return garbage and
result in this:

  [ 135.245946] ------------[ cut here ]------------
  [ 135.247579] kernel BUG at fs/ext4/extents.c:2258!
  [ 135.249045] invalid opcode: 0000 [#1] PREEMPT SMP
  [ 135.250320] CPU: 2 PID: 238 Comm: tmp118 Not tainted 5.19.0-rc8+ #4
  [ 135.252067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
  [ 135.255065] RIP: 0010:ext4_ext_map_blocks+0xc20/0xcb0
  [ 135.256475] Code:
  [ 135.261433] RSP: 0018:ffffc900005939f8 EFLAGS: 00010246
  [ 135.262847] RAX: 0000000000000024 RBX: ffffc90000593b70 RCX: 0000000000000023
  [ 135.264765] RDX: ffff8880038e5f10 RSI: 0000000000000003 RDI: ffff8880046e922c
  [ 135.266670] RBP: ffff8880046e9348 R08: 0000000000000001 R09: ffff888002ca580c
  [ 135.268576] R10: 0000000000002602 R11: 0000000000000000 R12: 0000000000000024
  [ 135.270477] R13: 0000000000000000 R14: 0000000000000024 R15: 0000000000000000
  [ 135.272394] FS: 00007fdabdc56740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
  [ 135.274510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 135.276075] CR2: 00007ffc26bd4f00 CR3: 0000000006261004 CR4: 0000000000170ea0
  [ 135.277952] Call Trace:
  [ 135.278635]  <TASK>
  [ 135.279247]  ? preempt_count_add+0x6d/0xa0
  [ 135.280358]  ? percpu_counter_add_batch+0x55/0xb0
  [ 135.281612]  ? _raw_read_unlock+0x18/0x30
  [ 135.282704]  ext4_map_blocks+0x294/0x5a0
  [ 135.283745]  ? xa_load+0x6f/0xa0
  [ 135.284562]  ext4_mpage_readpages+0x3d6/0x770
  [ 135.285646]  read_pages+0x67/0x1d0
  [ 135.286492]  ? folio_add_lru+0x51/0x80
  [ 135.287441]  page_cache_ra_unbounded+0x124/0x170
  [ 135.288510]  filemap_get_pages+0x23d/0x5a0
  [ 135.289457]  ? path_openat+0xa72/0xdd0
  [ 135.290332]  filemap_read+0xbf/0x300
  [ 135.291158]  ? _raw_spin_lock_irqsave+0x17/0x40
  [ 135.292192]  new_sync_read+0x103/0x170
  [ 135.293014]  vfs_read+0x15d/0x180
  [ 135.293745]  ksys_read+0xa1/0xe0
  [ 135.294461]  do_syscall_64+0x3c/0x80
  [ 135.295284]  entry_SYSCALL_64_after_hwframe+0x46/0xb0

This patch simply adds an extra check in __ext4_ext_check(), verifying
that eh_entries is not 0 when eh_depth is > 0.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215941
Link: https://bugzilla.kernel.org/show_bug.cgi?id=216283
Cc: Baokun Li <libaokun1@huawei.com>
Cc: stable@kernel.org
Signed-off-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://lore.kernel.org/r/20220822094235.2690-1-lhenriques@suse.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
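The added check, reconstructed against __ext4_ext_check() in
fs/ext4/extents.c (the corrupted error label already exists there):

    if (unlikely((eh->eh_entries == 0) && (depth > 0))) {
        error_msg = "eh_entries is 0 but eh_depth is > 0";
        goto corrupted;
    }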
-
Submitted by Ziyang Xuan

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I61PL4
CVE: NA

--------------------------------

In a sockmap redirect scenario, destroying a sock while
psock->ingress_msg is not empty produces the following warning:

  =================================================
  WARNING: CPU: 0 PID: 0 at net/ipv4/af_inet.c:154 inet_sock_destruct+0x408/0x430
  ...
  Call Trace:
   <IRQ>
   __sk_destruct+0x3d/0x590 net/core/sock.c:1784
   sk_destruct net/core/sock.c:1829 [inline]
   __sk_free+0x106/0x2a0 net/core/sock.c:1840
   sk_free+0x7d/0xb0 net/core/sock.c:1851
   sock_put include/net/sock.h:1813 [inline]
   tcp_v4_rcv+0x23af/0x26e0 net/ipv4/tcp_ipv4.c:2085
   ip_protocol_deliver_rcu+0xe5/0x440 net/ipv4/ip_input.c:204
   ip_local_deliver_finish+0xd2/0x110 net/ipv4/ip_input.c:231
   NF_HOOK include/linux/netfilter.h:304 [inline]
   ip_local_deliver+0x10a/0x260 net/ipv4/ip_input.c:252
   dst_input include/net/dst.h:459 [inline]
   ip_rcv_finish+0x126/0x160 net/ipv4/ip_input.c:428
   NF_HOOK include/linux/netfilter.h:304 [inline]
   ip_rcv+0xbf/0x1d0 net/ipv4/ip_input.c:539
   __netif_receive_skb_one_core+0x15f/0x190 net/core/dev.c:5366
   __netif_receive_skb+0x2e/0xe0 net/core/dev.c:5480
   process_backlog+0x132/0x2c0 net/core/dev.c:6386
   napi_poll+0x17e/0x4f0 net/core/dev.c:6837
   net_rx_action+0x183/0x3c0 net/core/dev.c:6907

That is because commit 7e41dfae18b1 ("[Huawei] bpf, sockmap: Add
sk_rmem_alloc check for sockmap") does not consider the redirect
scenario: it reduces sk_rmem_alloc without ever having increased it,
which makes sk_rmem_alloc underflow.

Fixes: 8818e269 ("bpf, sockmap: Add sk_rmem_alloc check for sockmap")
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
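A hedged sketch of the balancing idea (the helper name is hypothetical,
and locking around ingress_msg is elided): in the redirect path, charge
the receiving socket's sk_rmem_alloc when a msg is queued, so the matching
uncharge on dequeue/destroy cannot underflow:

    static void sk_psock_queue_msg_charged(struct sk_psock *psock,
                                           struct sk_msg *msg)
    {
        /* charge the receiver to match the later uncharge */
        atomic_add(msg->sg.size, &psock->sk->sk_rmem_alloc);
        list_add_tail(&msg->list, &psock->ingress_msg);
    }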
-
Submitted by Guan Jing

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I61E4M
CVE: NA

--------------------------------

When doing wakeups, attempt to limit superfluous scans of the LLC domain.
ARM64 enables SIS_UTIL and disables SIS_PROP, so that the idle-CPU search
is based on the sum of util_avg.

Signed-off-by: Guan Jing <guanjing6@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
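What the feature flip looks like in kernel/sched/features.h; the
arm64-only conditioning described above is an openEuler detail sketched
here with an #ifdef, not quoted from the patch:

    #ifdef CONFIG_ARM64
    SCHED_FEAT(SIS_PROP, false)
    SCHED_FEAT(SIS_UTIL, true)
    #else
    SCHED_FEAT(SIS_PROP, true)
    SCHED_FEAT(SIS_UTIL, false)
    #endif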
-
Submitted by Guan Jing

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I61E4M
CVE: NA

--------------------------------

The sched_domain_shared structure is only used as a pointer, and other
drivers don't use it directly.

Signed-off-by: Guan Jing <guanjing6@huawei.com>
Reviewed-by: zhangjialin <zhangjialin11@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Chen Yu
mainline inclusion
from mainline-v6.0-rc1
commit 70fb5ccf
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I61E4M

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=70fb5ccf2ebb09a0c8ebba775041567812d45

--------------------------------

[Problem Statement]

select_idle_cpu() might spend too much time searching for an idle CPU,
when the system is overloaded. The following histogram is the time spent
in select_idle_cpu(), when running 224 instances of netperf on a system
with 112 CPUs per LLC domain:

@usecs:
[0]                533 |                                                    |
[1]               5495 |                                                    |
[2, 4)           12008 |                                                    |
[4, 8)          239252 |                                                    |
[8, 16)        4041924 |@@@@@@@@@@@@@@                                      |
[16, 32)      12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
[32, 64)      14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[64, 128)     13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[128, 256)     8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@                        |
[256, 512)     4507667 |@@@@@@@@@@@@@@@                                     |
[512, 1K)      2600472 |@@@@@@@@@                                           |
[1K, 2K)        927912 |@@@                                                 |
[2K, 4K)        218720 |                                                    |
[4K, 8K)         98161 |                                                    |
[8K, 16K)        37722 |                                                    |
[16K, 32K)        6715 |                                                    |
[32K, 64K)         477 |                                                    |
[64K, 128K)          7 |                                                    |

netperf latency usecs:
=======
case            load            Lat_99th        std%
TCP_RR          thread-224      257.39          (  0.21)

The time spent in select_idle_cpu() is visible to netperf and might have a
negative impact.

[Symptom analysis]

The patch [1] from Mel Gorman has been applied to track the efficiency of
select_idle_sibling. Copy the indicators here:

SIS Search Efficiency (se_eff%):
    A ratio expressed as a percentage of runqueues scanned versus idle
    CPUs found. A 100% efficiency indicates that the target, prev or
    recent CPU of a task was idle at wakeup. The lower the efficiency,
    the more runqueues were scanned before an idle CPU was found.

SIS Domain Search Efficiency (dom_eff%):
    Similar, except only for the slower SIS path.

SIS Fast Success Rate (fast_rate%):
    Percentage of SIS that used target, prev or recent CPUs.

SIS Success Rate (success_rate%):
    Percentage of scans that found an idle CPU.

The test is based on Aubrey's schedtests tool, including netperf,
hackbench, schbench and tbench.
Test on vanilla kernel:

schedstat_parse.py -f netperf_vanilla.log
case        load            se_eff%   dom_eff%  fast_rate%  success_rate%
TCP_RR      28 threads      99.978    18.535    99.995      100.000
TCP_RR      56 threads      99.397    5.671     99.964      100.000
TCP_RR      84 threads      21.721    6.818     73.632      100.000
TCP_RR      112 threads     12.500    5.533     59.000      100.000
TCP_RR      140 threads     8.524     4.535     49.020      100.000
TCP_RR      168 threads     6.438     3.945     40.309      99.999
TCP_RR      196 threads     5.397     3.718     32.320      99.982
TCP_RR      224 threads     4.874     3.661     25.775      99.767
UDP_RR      28 threads      99.988    17.704    99.997      100.000
UDP_RR      56 threads      99.528    5.977     99.970      100.000
UDP_RR      84 threads      24.219    6.992     76.479      100.000
UDP_RR      112 threads     13.907    5.706     62.538      100.000
UDP_RR      140 threads     9.408     4.699     52.519      100.000
UDP_RR      168 threads     7.095     4.077     44.352      100.000
UDP_RR      196 threads     5.757     3.775     35.764      99.991
UDP_RR      224 threads     5.124     3.704     28.748      99.860

schedstat_parse.py -f schbench_vanilla.log
(each group has 28 tasks)
case        load            se_eff%   dom_eff%  fast_rate%  success_rate%
normal      1 mthread       99.152    6.400     99.941      100.000
normal      2 mthreads      97.844    4.003     99.908      100.000
normal      3 mthreads      96.395    2.118     99.917      99.998
normal      4 mthreads      55.288    1.451     98.615      99.804
normal      5 mthreads      7.004     1.870     45.597      61.036
normal      6 mthreads      3.354     1.346     20.777      34.230
normal      7 mthreads      2.183     1.028     11.257      21.055
normal      8 mthreads      1.653     0.825     7.849       15.549

schedstat_parse.py -f hackbench_vanilla.log
(each group has 28 tasks)
case                load        se_eff%   dom_eff%  fast_rate%  success_rate%
process-pipe        1 group     99.991    7.692     99.999      100.000
process-pipe        2 groups    99.934    4.615     99.997      100.000
process-pipe        3 groups    99.597    3.198     99.987      100.000
process-pipe        4 groups    98.378    2.464     99.958      100.000
process-pipe        5 groups    27.474    3.653     89.811      99.800
process-pipe        6 groups    20.201    4.098     82.763      99.570
process-pipe        7 groups    16.423    4.156     77.398      99.316
process-pipe        8 groups    13.165    3.920     72.232      98.828
process-sockets     1 group     99.977    5.882     99.999      100.000
process-sockets     2 groups    99.927    5.505     99.996      100.000
process-sockets     3 groups    99.397    3.250     99.980      100.000
process-sockets     4 groups    79.680    4.258     98.864      99.998
process-sockets     5 groups    7.673     2.503     63.659      92.115
process-sockets     6 groups    4.642     1.584     58.946      88.048
process-sockets     7 groups    3.493     1.379     49.816      81.164
process-sockets     8 groups    3.015     1.407     40.845      75.500
threads-pipe        1 group     99.997    0.000     100.000     100.000
threads-pipe        2 groups    99.894    2.932     99.997      100.000
threads-pipe        3 groups    99.611    4.117     99.983      100.000
threads-pipe        4 groups    97.703    2.624     99.937      100.000
threads-pipe        5 groups    22.919    3.623     87.150      99.764
threads-pipe        6 groups    18.016    4.038     80.491      99.557
threads-pipe        7 groups    14.663    3.991     75.239      99.247
threads-pipe        8 groups    12.242    3.808     70.651      98.644
threads-sockets     1 group     99.990    6.667     99.999      100.000
threads-sockets     2 groups    99.940    5.114     99.997      100.000
threads-sockets     3 groups    99.469    4.115     99.977      100.000
threads-sockets     4 groups    87.528    4.038     99.400      100.000
threads-sockets     5 groups    6.942     2.398     59.244      88.337
threads-sockets     6 groups    4.359     1.954     49.448      87.860
threads-sockets     7 groups    2.845     1.345     41.198      77.102
threads-sockets     8 groups    2.871     1.404     38.512      74.312

schedstat_parse.py -f tbench_vanilla.log
case        load            se_eff%   dom_eff%  fast_rate%  success_rate%
loopback    28 threads      99.976    18.369    99.995      100.000
loopback    56 threads      99.222    7.799     99.934      100.000
loopback    84 threads      19.723    6.819     70.215      100.000
loopback    112 threads     11.283    5.371     55.371      99.999
loopback    140 threads     0.000     0.000     0.000       0.000
loopback    168 threads     0.000     0.000     0.000       0.000
loopback    196 threads     0.000     0.000     0.000       0.000
loopback    224 threads     0.000     0.000     0.000       0.000
According to the test above, if the system becomes busy, the SIS Search
Efficiency (se_eff%) drops significantly. Although some benchmarks would
finally find an idle CPU (success_rate% = 100%), it is doubtful whether it
is worth it to search the whole LLC domain.

[Proposal]

It would be ideal to have a crystal ball to answer this question: How many
CPUs must a wakeup path walk down, before it can find an idle CPU? Many
potential metrics could be used to predict the number. One candidate is
the sum of util_avg in this LLC domain. The benefit of choosing util_avg
is that it is a metric of accumulated historic activity, which seems to be
smoother than instantaneous metrics (such as rq->nr_running). Besides,
choosing the sum of util_avg would help predict the load of the LLC domain
more precisely, because SIS_PROP uses one CPU's idle time to estimate the
total LLC domain idle time.

In summary, the lower the util_avg is, the more select_idle_cpu() should
scan for idle CPU, and vice versa. When the sum of util_avg in this LLC
domain hits 85% or above, the scan stops. The reason to choose 85% as the
threshold is that this is the imbalance_pct (117) when an LLC sched group
is overloaded.

Introduce the quadratic function:

	y = SCHED_CAPACITY_SCALE - p * x^2
	y' = y / SCHED_CAPACITY_SCALE

x is the ratio of sum_util compared to the CPU capacity:

	x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)

y' is the ratio of CPUs to be scanned in the LLC domain, and the number of
CPUs to scan is calculated by:

	nr_scan = llc_weight * y'

Choosing a quadratic function is because:
 [1] Compared to the linear function, it scans more aggressively when the
     sum_util is low.
 [2] Compared to the exponential function, it is easier to calculate.
 [3] It seems that there is no accurate mapping between the sum of
     util_avg and the number of CPUs to be scanned. Use a heuristic scan
     for now.

For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
sum_util%    0    5   15   25   35   45   55   65   75   85   86 ...
scan_nr    112  111  108  102   93   81   65   47   25    1    0 ...

For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
sum_util%    0    5   15   25   35   45   55   65   75   85   86 ...
scan_nr     16   15   15   14   13   11    9    6    3    0    0 ...

Furthermore, to minimize the overhead of calculating the metrics in
select_idle_cpu(), borrow the statistics from periodic load balance. As
mentioned by Abel, on a platform with 112 CPUs per LLC, the sum_util
calculated by periodic load balance after 112 ms would decay to about
0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay in reflecting the
latest utilization. But it is a trade-off. Checking the util_avg in
newidle load balance would be more frequent, but it brings overhead -
multiple CPUs write/read the per-LLC shared variable and introduce cache
contention. Tim also mentioned that it is allowed to be non-optimal in
terms of scheduling for short-term variations, but if there is a
long-term trend in the load behavior, the scheduler can adjust for that.

When SIS_UTIL is enabled, select_idle_cpu() uses the nr_scan calculated
by SIS_UTIL instead of the one from SIS_PROP. As Peter and Mel suggested,
SIS_UTIL should be enabled by default.

This patch is based on the util_avg, which is very sensitive to CPU
frequency invariance. There is an issue that, when the max frequency has
been clamped, the util_avg decays insanely fast when the CPU is idle.
Commit addca285 ("cpufreq: intel_pstate: Handle no_turbo in frequency
invariance") could be used to mitigate this symptom, by adjusting the
arch_max_freq_ratio when turbo is disabled.
But this issue is still not thoroughly fixed, because the current code is
unaware of the user-specified max CPU frequency.

[Test result]

netperf and tbench were launched with 25% 50% 75% 100% 125% 150% 175% 200%
of the CPU number respectively. Hackbench and schbench were launched with
1, 2, 4, 8 groups. Each test lasts for 100 seconds and repeats 3 times.

The following is the benchmark result comparison between
baseline: vanilla v5.19-rc1 and compare: patched kernel. Positive compare%
indicates better performance.

Each netperf test is a:
netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100

netperf.throughput
=======
case        load            baseline(std%)   compare%( std%)
TCP_RR      28 threads      1.00 (  0.34)    -0.16 (  0.40)
TCP_RR      56 threads      1.00 (  0.19)    -0.02 (  0.20)
TCP_RR      84 threads      1.00 (  0.39)    -0.47 (  0.40)
TCP_RR      112 threads     1.00 (  0.21)    -0.66 (  0.22)
TCP_RR      140 threads     1.00 (  0.19)    -0.69 (  0.19)
TCP_RR      168 threads     1.00 (  0.18)    -0.48 (  0.18)
TCP_RR      196 threads     1.00 (  0.16)    +194.70 ( 16.43)
TCP_RR      224 threads     1.00 (  0.16)    +197.30 (  7.85)
UDP_RR      28 threads      1.00 (  0.37)    +0.35 (  0.33)
UDP_RR      56 threads      1.00 ( 11.18)    -0.32 (  0.21)
UDP_RR      84 threads      1.00 (  1.46)    -0.98 (  0.32)
UDP_RR      112 threads     1.00 ( 28.85)    -2.48 ( 19.61)
UDP_RR      140 threads     1.00 (  0.70)    -0.71 ( 14.04)
UDP_RR      168 threads     1.00 ( 14.33)    -0.26 ( 11.16)
UDP_RR      196 threads     1.00 ( 12.92)    +186.92 ( 20.93)
UDP_RR      224 threads     1.00 ( 11.74)    +196.79 ( 18.62)

Take the 224 threads as an example, the SIS search metrics changes are
illustrated below:

        vanilla               patched
      4544492   +237.5%   15338634   sched_debug.cpu.sis_domain_search.avg
        38539 +39686.8%   15333634   sched_debug.cpu.sis_failed.avg
    128300000    -87.9%   15551326   sched_debug.cpu.sis_scanned.avg
      5842896   +162.7%   15347978   sched_debug.cpu.sis_search.avg

There are -87.9% fewer CPU scans after patching, which indicates lower
overhead. Besides, with this patch applied, there is -13% less rq lock
contention in
perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
.try_to_wake_up.default_wake_function.woken_wake_function.
This might help explain the performance improvement: this patch allows
the waking task to remain on the previous CPU, rather than grabbing other
CPUs' locks.
Each hackbench test is a:
hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100

hackbench.throughput
=========
case                load        baseline(std%)   compare%( std%)
process-pipe        1 group     1.00 (  1.29)    +0.57 (  0.47)
process-pipe        2 groups    1.00 (  0.27)    +0.77 (  0.81)
process-pipe        4 groups    1.00 (  0.26)    +1.17 (  0.02)
process-pipe        8 groups    1.00 (  0.15)    -4.79 (  0.02)
process-sockets     1 group     1.00 (  0.63)    -0.92 (  0.13)
process-sockets     2 groups    1.00 (  0.03)    -0.83 (  0.14)
process-sockets     4 groups    1.00 (  0.40)    +5.20 (  0.26)
process-sockets     8 groups    1.00 (  0.04)    +3.52 (  0.03)
threads-pipe        1 group     1.00 (  1.28)    +0.07 (  0.14)
threads-pipe        2 groups    1.00 (  0.22)    -0.49 (  0.74)
threads-pipe        4 groups    1.00 (  0.05)    +1.88 (  0.13)
threads-pipe        8 groups    1.00 (  0.09)    -4.90 (  0.06)
threads-sockets     1 group     1.00 (  0.25)    -0.70 (  0.53)
threads-sockets     2 groups    1.00 (  0.10)    -0.63 (  0.26)
threads-sockets     4 groups    1.00 (  0.19)    +11.92 (  0.24)
threads-sockets     8 groups    1.00 (  0.08)    +4.31 (  0.11)

Each tbench test is a:
tbench -t 100 $job 127.0.0.1

tbench.throughput
======
case        load            baseline(std%)   compare%( std%)
loopback    28 threads      1.00 (  0.06)    -0.14 (  0.09)
loopback    56 threads      1.00 (  0.03)    -0.04 (  0.17)
loopback    84 threads      1.00 (  0.05)    +0.36 (  0.13)
loopback    112 threads     1.00 (  0.03)    +0.51 (  0.03)
loopback    140 threads     1.00 (  0.02)    -1.67 (  0.19)
loopback    168 threads     1.00 (  0.38)    +1.27 (  0.27)
loopback    196 threads     1.00 (  0.11)    +1.34 (  0.17)
loopback    224 threads     1.00 (  0.11)    +1.67 (  0.22)

Each schbench test is a:
schbench -m $job -t 28 -r 100 -s 30000 -c 30000

schbench.latency_90%_us
========
case        load            baseline(std%)   compare%( std%)
normal      1 mthread       1.00 ( 31.22)    -7.36 ( 20.25)*
normal      2 mthreads      1.00 (  2.45)    -0.48 (  1.79)
normal      4 mthreads      1.00 (  1.69)    +0.45 (  0.64)
normal      8 mthreads      1.00 (  5.47)    +9.81 ( 14.28)

*Considering the standard deviation, this -7.36% regression might not be
valid.

Also, an OLTP workload with a commercial RDBMS has been tested, and there
is no significant change.

There were concerns that unbalanced tasks among CPUs would cause problems.
For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are
bound to CPU0~CPU6, while CPU7 is idle:

          CPU0  CPU1  CPU2  CPU3  CPU4  CPU5  CPU6  CPU7
util_avg  1024  1024  1024  1024  1024  1024  1024     0

Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%,
select_idle_cpu() will not scan, thus CPU7 is undetected during the scan.
But according to Mel, it is unlikely that CPU7 will be idle all the time,
because it could pull some tasks via CPU_NEWLY_IDLE.

lkp (kernel test robot) has reported a regression on stress-ng.sock on a
very busy system. According to the sched_debug statistics, it might be
caused by SIS_UTIL terminating the scan and choosing a previous CPU
earlier, and this might introduce more context switches, especially
involuntary preemption, which impacts a busy stress-ng. This regression
has shown that not all benchmarks in every scenario benefit from the idle
CPU scan limit, and it needs further investigation.

Besides, there is a slight regression in hackbench's 16 groups case when
the LLC domain has 16 CPUs. Prateek mentioned that we should scan
aggressively in an LLC domain with 16 CPUs, because the cost of searching
for an idle one among 16 CPUs is negligible. The current patch aims to
propose a generic solution and only considers the util_avg.
Something like the below could be applied on top of the current patch to
fulfill the requirement:

	if (llc_weight <= 16)
		nr_scan = nr_scan * 32 / llc_weight;

For an LLC domain with 16 CPUs, nr_scan would be expanded to 2 times its
value. The smaller the CPU number this LLC domain has, the larger nr_scan
will be expanded. This needs further investigation.

There is also ongoing work [2] from Abel to filter out the busy CPUs
during wakeup, to further speed up the idle CPU scan. And it could be a
follow-up optimization on top of this change.

Suggested-by: Tim Chen <tim.c.chen@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
Signed-off-by: Guan Jing <guanjing6@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
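To make the quadratic mapping concrete, here is a small self-contained
userspace program that mirrors the integer arithmetic described above
(imbalance_pct = 117 is assumed, per the 85% discussion); it reproduces
the 112-CPU scan_nr table from the commit message:

    #include <stdio.h>

    #define SCHED_CAPACITY_SCALE 1024ULL

    /*
     * y = SCHED_CAPACITY_SCALE - p * x^2, with p derived from
     * imbalance_pct so that y reaches 0 around x = 85%.
     */
    static int nr_scan(unsigned long long sum_util, int llc_weight, int pct)
    {
        unsigned long long x, tmp, y;

        x = sum_util / llc_weight;            /* avg per-CPU util, 0..1024 */
        tmp = x * x * pct * pct;
        tmp /= 10000 * SCHED_CAPACITY_SCALE;  /* p = (pct/100)^2 ~= 1.37 */
        if (tmp > SCHED_CAPACITY_SCALE)
            tmp = SCHED_CAPACITY_SCALE;
        y = SCHED_CAPACITY_SCALE - tmp;

        return (int)(llc_weight * y / SCHED_CAPACITY_SCALE);
    }

    int main(void)
    {
        int sum_util_pct[] = { 0, 5, 15, 25, 35, 45, 55, 65, 75, 85, 86 };
        int llc_weight = 112;

        for (int i = 0; i < 11; i++) {
            unsigned long long sum_util =
                sum_util_pct[i] * llc_weight * SCHED_CAPACITY_SCALE / 100;
            /* prints 112 111 108 102 93 81 65 47 25 1 0 */
            printf("sum_util%%=%2d -> scan_nr=%d\n",
                   sum_util_pct[i], nr_scan(sum_util, llc_weight, 117));
        }
        return 0;
    }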
-
Submitted by Li Lingfeng

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60QE9
CVE: NA

--------------------------------

As explained in 32c39e8a ("block: fix use after free for bd_holder_dir"),
we should make sure the "disk" is still live and then grab a reference to
'bd_holder_dir'. However, the "disk" should be the claimed slave bdev's
disk rather than the holding disk.

Fixes: 32c39e8a ("block: fix use after free for bd_holder_dir")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Michal Simek

mainline inclusion
from mainline-v5.13-rc1
commit 6a37d750
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60OLE
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=6a37d750037827d385672acdebf5788fc2ffa633

--------------------------------

A static analyzer found that the ret variable is not initialized, but the
code expects ret to be >= 0 when pinconf is skipped in the first pinmux
loop. The same expectation holds for pinmux in the pinconf loop. That's
why ret is initialized to 0: to avoid an uninitialized ret value in the
first loop, or reusing the ret value from the first loop in the second.

Addresses-Coverity: ("Uninitialized variables")
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Colin Ian King <colin.king@canonical.com>
Link: https://lore.kernel.org/r/e5203bae68eb94b4b8b4e67e5e7b4d86bb989724.1615534291.git.michal.simek@xilinx.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Yuyao Lin <linyuyao1@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Michal Simek

mainline inclusion
from mainline-v5.13-rc1
commit b991f8c3
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60OLE
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b991f8c3622c8c9d01a1ada382682a731932e651

--------------------------------

Right now the handling order depends on how the entries arrive, which
corresponds to their order in the DT. We have hit a case with DT overlays
where the conf and mux descriptions are exchanged, which ends up in a
sequence where the firmware is asked to perform the configuration before
the pin is requested. The patch enforces the order that the pin is always
requested first, followed by the pin configuration. This change ensures
that the firmware gets the requests in the right order.

Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Link: https://lore.kernel.org/r/cfbe01f791c2dd42a596cbda57e15599969b57aa.1615364211.git.michal.simek@xilinx.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Yuyao Lin <linyuyao1@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Yu Kuai

mainline inclusion
from mainline-v5.16-rc2
commit 76dd2980
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5VGU9
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=76dd298094f484c6250ebd076fa53287477b2328

--------------------------------

Our syzkaller reported a null pointer dereference; the root cause is the
following:

__blk_mq_alloc_map_and_rqs
  set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs
    blk_mq_alloc_map_and_rqs
      blk_mq_alloc_rqs
        // failed due to oom
        alloc_pages_node
      // set->tags[hctx_idx] is still NULL
      blk_mq_free_rqs
        drv_tags = set->tags[hctx_idx];
        // null pointer dereference is triggered
        blk_mq_clear_rq_mapping(drv_tags, ...)

This is because commit 63064be1 ("blk-mq: Add blk_mq_alloc_map_and_rqs()")
merged the two steps:

  1) set->tags[hctx_idx] = blk_mq_alloc_rq_map()
  2) blk_mq_alloc_rqs(..., set->tags[hctx_idx])

into one step:

  set->tags[hctx_idx] = blk_mq_alloc_map_and_rqs()

Since tags is not initialized yet in this case, fix the problem by
checking if tags is a NULL pointer in blk_mq_clear_rq_mapping().

Fixes: 63064be1 ("blk-mq: Add blk_mq_alloc_map_and_rqs()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: John Garry <john.garry@huawei.com>
Link: https://lore.kernel.org/r/20221011142253.4015966-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
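The guard, reconstructed from the description against block/blk-mq.c:

    static void blk_mq_clear_rq_mapping(struct blk_mq_tags *drv_tags,
                                        struct blk_mq_tags *tags)
    {
        /*
         * There is no need to clear the mapping if driver tags is not
         * initialized or the mapping belongs to the driver tags.
         */
        if (!drv_tags || drv_tags == tags)
            return;

        /* ... walk tags->static_rqs and clear matching entries ... */
    }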
-
Submitted by Yu Kuai

stable inclusion
from stable-v5.10.152
commit 31b1570677e8bf85f48be8eb95e21804399b8295
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60HVY
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=31b1570677e8bf85f48be8eb95e21804399b8295

-------------------------------

commit 285febab upstream.

commit 8c5035df ("blk-wbt: call rq_qos_add() after wb_normal is
initialized") moves wbt_set_write_cache() before rq_qos_add(), which is
wrong because wbt_rq_qos() is still NULL. Fix the problem by removing
wbt_set_write_cache() and setting 'rwb->wc' directly. Note that this
patch also removes the redundant setting of 'rwb->wc'.

Fixes: 8c5035df ("blk-wbt: call rq_qos_add() after wb_normal is initialized")
Reported-by: kernel test robot <yujie.liu@intel.com>
Link: https://lore.kernel.org/r/202210081045.77ddf59b-yujie.liu@intel.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20221009101038.1692875-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
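A sketch of the fix in block/blk-wbt.c wbt_init() terms (surrounding
context reconstructed, not quoted from the patch):

    int wbt_init(struct request_queue *q)
    {
        struct rq_wb *rwb;

        /* ... allocate rwb, set up callbacks ... */

        rwb->last_comp = rwb->last_issue = jiffies;
        rwb->win_nsec = RWB_WINDOW_NSEC;
        rwb->enable_state = WBT_STATE_ON_DEFAULT;
        /* set the flag directly: wbt_set_write_cache() would go through
         * wbt_rq_qos(), which is still NULL before rq_qos_add() */
        rwb->wc = test_bit(QUEUE_FLAG_WC, &q->queue_flags);

        wbt_queue_depth_changed(&rwb->rqos);

        rq_qos_add(q, &rwb->rqos);  /* only now is wbt_rq_qos() valid */
        return 0;
    }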
-
Submitted by Yu Kuai

stable inclusion
from stable-v5.10.152
commit 910ba49b33450a878128adc7d9c419dd97efd923
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60HVY
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=910ba49b33450a878128adc7d9c419dd97efd923

-------------------------------

commit 8c5035df upstream.

Our test found a problem that the wbt inflight counter goes negative,
which causes an io hang (note that this problem doesn't exist in
mainline):

t1: device create                  t2: issue io
add_disk
  blk_register_queue
    wbt_enable_default
      wbt_init
        rq_qos_add
        // wb_normal is still 0
                                   /*
                                    * in mainline, disk can't be opened
                                    * before bdev_add(); however, in old
                                    * kernels, disk can be opened before
                                    * blk_register_queue().
                                    */
                                   blkdev_issue_flush
                                   // disk size is 0, however, it's not checked
                                     submit_bio_wait
                                       submit_bio
                                         blk_mq_submit_bio
                                           rq_qos_throttle
                                             wbt_wait
                                               bio_to_wbt_flags
                                                 rwb_enabled
                                   // wb_normal is 0, inflight is not increased
        wbt_queue_depth_changed(&rwb->rqos);
          wbt_update_limits
          // wb_normal is initialized
                                           rq_qos_track
                                             wbt_track
                                               rq->wbt_flags |= bio_to_wbt_flags(rwb, bio);
                                   // wb_normal is not 0, wbt_flags will be set

t3: io completion
blk_mq_free_request
  rq_qos_done
    wbt_done
      wbt_is_tracked
      // return true
      __wbt_done
        wbt_rqw_done
          atomic_dec_return(&rqw->inflight);
          // inflight is decreased

commit 8235b5c1 ("block: call bdev_add later in device_add_disk") can
avoid this problem; however, it's better to fix it in wbt:

1) Lower kernels can't backport that patch due to lots of refactoring.
2) The root cause is that wbt calls rq_qos_add() before wb_normal is
   initialized.

Fixes: e34cbd30 ("blk-wbt: add general throttling mechanism")
Cc: <stable@vger.kernel.org>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20220913105749.3086243-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Lei Chen

stable inclusion
from stable-v5.10.152
commit 392536023da18086d57565e716ed50193869b8e7
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I60HVY
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=392536023da18086d57565e716ed50193869b8e7

-------------------------------

commit 5a20d073 upstream.

It's unnecessary to call wbt_update_limits explicitly within wbt_init,
because it will be called in the following function,
wbt_queue_depth_changed.

Signed-off-by: Lei Chen <lennychen@tencent.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-