Commits · c6aaa31066768a37bef9ec3a57f1d4ac07670c96 · openeuler / Kernel

01 Jun, 2023 28 commits

sched: fix performance degradation on lmbench · c6aaa310

Committed by Hui Tang on Jun 01, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7A718

--------------------------------

Performance is worse with the commit named in 'Fixes'
when running "./lat_ctx -P $SYNC_MAX -s 64 16".

That commit allocates memory for p->prefer_cpus
even if "prefer_cpus" is not set.

Before that commit, only "p->prefer_cpus" was tested;
afterwards, the test "!cpumask_empty(p->prefer_cpus)" was added,
which causes the performance degradation.

select_task_rq_fair
  ->set_task_select_cpus
    ->prefer_cpus_valid  ----  test cpumask_empty(p->prefer_cpus)
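
The difference between the two tests can be sketched in user space. This is a minimal model, not the actual patch: the types, the mask size, and every `model_*` / `prefer_cpus_set_*` name below are assumptions made for illustration. It only shows that the added emptiness test has to scan the mask words, while the original test is a single pointer comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; the size is illustrative. */
#define MODEL_NR_CPUS 256
#define BITS_PER_WORD (8 * sizeof(unsigned long))
#define MASK_WORDS ((MODEL_NR_CPUS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct model_cpumask { unsigned long bits[MASK_WORDS]; };
struct model_task { struct model_cpumask *prefer_cpus; };

/* The cost of this scan grows with the number of words in the mask. */
static bool model_cpumask_empty(const struct model_cpumask *m)
{
    for (size_t i = 0; i < MASK_WORDS; i++)
        if (m->bits[i])
            return false;
    return true;
}

/* Test as described for the time before the 'Fixes' commit: pointer only. */
static bool prefer_cpus_set_old(const struct model_task *p)
{
    return p->prefer_cpus != NULL;
}

/* Test as described after it: pointer test plus a scan of the mask. */
static bool prefer_cpus_set_new(const struct model_task *p)
{
    return p->prefer_cpus != NULL && !model_cpumask_empty(p->prefer_cpus);
}
```

Because the mask is now always allocated, the pointer test alone no longer says whether "prefer_cpus" was configured, which is why the scan ends up on the select_task_rq_fair path.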

Fixes: ebeb84ad ("cpuset: Introduce new interface for scheduler ...")
Signed-off-by: Hui Tang <tanghui20@huawei.com>
(cherry picked from commit d8f77f89)

!871 [sync] PR-866: arm64: kdump: Avoid reserving low memory repeatedly · c1481312

Committed by openeuler-ci-bot on Jun 01, 2023

Merge Pull Request from: @openeuler-sync-bot

Origin pull request:
https://gitee.com/openeuler/kernel/pulls/866

PR sync from: Li Huafei <lihuafei1@huawei.com>
https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/thread/QXUYQDQ4DEHNVIFCOSKUQF5GRGQKLRPI/

Link: https://gitee.com/openeuler/kernel/pulls/871

Reviewed-by: Jialin Zhang <zhangjialin11@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>

!872 [sync] PR-863: Backport CVEs and bugfixes · d6879a8e

Committed by openeuler-ci-bot on Jun 01, 2023

Merge Pull Request from: @openeuler-sync-bot

Origin pull request:
https://gitee.com/openeuler/kernel/pulls/863

PR sync from: Jialin Zhang <zhangjialin11@huawei.com>
https://mailweb.openeuler.org/hyperkitty/list/kernel@openeuler.org/thread/UAMVHA4ICOFJJXDMX2CXEV6TEZSY7Y7U/

Pull new CVEs:
CVE-2023-22998

cgroup bugfix from Gaosheng Cui
sched bugfix from Xia Fukun
block bugfixes from Zhong Jinghua and Yu Kuai
iomap and ext4 bugfixes from Baokun Li
md and eulerfs bugfixes from Yu Kuai

Link: https://gitee.com/openeuler/kernel/pulls/872

Reviewed-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

drm/virtio: Fix error code in virtio_gpu_object_shmem_init() · 9bd94292

Committed by Harshit Mogalapalli on May 31, 2023

stable inclusion
from stable-v5.10.173
commit c5fe3fba1b7bfecb6f17f93a433782b8500fe377
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6IKWF
CVE: CVE-2023-22998

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c5fe3fba1b7bfecb6f17f93a433782b8500fe377

--------------------------------

In virtio_gpu_object_shmem_init() we are passing NULL to PTR_ERR, which
is returning 0/success.

Fix this by storing error value in 'ret' variable before assigning
shmem->pages to NULL.
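
The ordering problem can be reproduced with a small user-space model of the error-pointer helpers. This is a sketch under assumptions: the `model_*` helpers only imitate the kernel's ERR_PTR/PTR_ERR/IS_ERR, and `init_buggy` / `init_fixed` are hypothetical simplifications, not the driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* User-space imitations of ERR_PTR(), PTR_ERR() and IS_ERR(). */
static void *model_err_ptr(long err) { return (void *)(intptr_t)err; }
static long model_ptr_err(const void *p) { return (long)(intptr_t)p; }
static int model_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

struct model_shmem { void *pages; };

/* Buggy order: the pointer is cleared first, so the error code is lost. */
static int init_buggy(struct model_shmem *s, void *sgt)
{
    s->pages = sgt;
    if (model_is_err(s->pages)) {
        s->pages = NULL;
        return (int)model_ptr_err(s->pages); /* PTR_ERR(NULL) is 0 */
    }
    return 0;
}

/* Fixed order: store the error in 'ret' before clearing the pointer. */
static int init_fixed(struct model_shmem *s, void *sgt)
{
    int ret;

    s->pages = sgt;
    if (model_is_err(s->pages)) {
        ret = (int)model_ptr_err(s->pages);
        s->pages = NULL;
        return ret;
    }
    return 0;
}
```

In the buggy variant the caller sees success even though the allocation failed; the fixed variant still clears the pointer for the cleanup path but reports the original error.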

Found using static analysis with Smatch.

Fixes: 64b88afb ("drm/virtio: Correct drm_gem_shmem_get_sg_table() error handling")
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Reviewed-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Guo Mengqi <guomengqi3@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 1c498218)

drm/virtio: Correct drm_gem_shmem_get_sg_table() error handling · c2033fc1

Committed by Dmitry Osipenko on May 31, 2023

stable inclusion
from stable-v5.10.171
commit 87c647def389354c95263d6635c62ca0de7d12ca
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6IKWF
CVE: CVE-2023-22998

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=87c647def389354c95263d6635c62ca0de7d12ca

--------------------------------

commit 64b88afb upstream.

The previous commit fixed the checking of the ERR_PTR value returned by
drm_gem_shmem_get_sg_table(), but it failed to zero out shmem->pages,
which will crash virtio_gpu_cleanup_object(). Add the missing zeroing of
shmem->pages.

Fixes: c2496873 ("drm/virtio: Fix NULL vs IS_ERR checking in virtio_gpu_object_shmem_init")
Reviewed-by: Emil Velikov <emil.l.velikov@gmail.com>
Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20220630200726.1884320-2-dmitry.osipenko@collabora.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Guo Mengqi <guomengqi3@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 98019109)

drm/virtio: Fix NULL vs IS_ERR checking in virtio_gpu_object_shmem_init · c44e5c25

Committed by Miaoqian Lin on May 31, 2023

stable inclusion
from stable-v5.10.171
commit 0a4181b23acf53e9c95b351df6a7891116b98f9b
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6IKWF
CVE: CVE-2023-22998

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0a4181b23acf53e9c95b351df6a7891116b98f9b

--------------------------------

commit c2496873 upstream.

Since the drm_prime_pages_to_sg() function returns error pointers,
the drm_gem_shmem_get_sg_table() function returns error pointers too.
Use IS_ERR() to check the return value to fix this.

Fixes: 2f2aa137 ("drm/virtio: move virtio_gpu_mem_entry initialization to new function")
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20220602104223.54527-1-linmq006@gmail.com
Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ovidiu Panait <ovidiu.panait@windriver.com>
Signed-off-by: Guo Mengqi <guomengqi3@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit bb502cea)

cgroup: Stop task iteration when rebinding subsystem · 7ad6b560

Committed by Xiu Jianfeng on May 31, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I798WQ
CVE: NA

----------------------------------------------------------------------

We found a refcount UAF bug as follows:

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 1 PID: 342 at lib/refcount.c:25 refcount_warn_saturate+0xa0/0x148
Workqueue: events cpuset_hotplug_workfn
Call trace:
 refcount_warn_saturate+0xa0/0x148
 __refcount_add.constprop.0+0x5c/0x80
 css_task_iter_advance_css_set+0xd8/0x210
 css_task_iter_advance+0xa8/0x120
 css_task_iter_next+0x94/0x158
 update_tasks_root_domain+0x58/0x98
 rebuild_root_domains+0xa0/0x1b0
 rebuild_sched_domains_locked+0x144/0x188
 cpuset_hotplug_workfn+0x138/0x5a0
 process_one_work+0x1e8/0x448
 worker_thread+0x228/0x3e0
 kthread+0xe0/0xf0
 ret_from_fork+0x10/0x20

then a kernel panic will be triggered as below:

Unable to handle kernel paging request at virtual address 00000000c0000010
Call trace:
 cgroup_apply_control_disable+0xa4/0x16c
 rebind_subsystems+0x224/0x590
 cgroup_destroy_root+0x64/0x2e0
 css_free_rwork_fn+0x198/0x2a0
 process_one_work+0x1d4/0x4bc
 worker_thread+0x158/0x410
 kthread+0x108/0x13c
 ret_from_fork+0x10/0x18

The race that causes this bug can be shown as below:

(hotplug cpu)                | (umount cpuset)
mutex_lock(&cpuset_mutex)    | mutex_lock(&cgroup_mutex)
cpuset_hotplug_workfn        |
 rebuild_root_domains        |  rebind_subsystems
  update_tasks_root_domain   |   spin_lock_irq(&css_set_lock)
   css_task_iter_start       |    list_move_tail(&cset->e_cset_node[ss->id]
   while(css_task_iter_next) |                  &dcgrp->e_csets[ss->id]);
   css_task_iter_end         |   spin_unlock_irq(&css_set_lock)
mutex_unlock(&cpuset_mutex)  | mutex_unlock(&cgroup_mutex)

Inside css_task_iter_start/next/end, css_set_lock is held and then
released, so while tasks are being iterated (left side), the css_set may be
moved to another list (right side). Then it->cset_head points to the old
list head and it->cset_pos->next points to the head node of the new list,
which can't be used as a struct css_set.

To fix this issue, introduce the CSS_TASK_ITER_STOPPED flag for css_task_iter.
When moving a css_set to dcgrp->e_csets[ss->id] in rebind_subsystems(), stop
the task iteration.
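
The idea of a stop flag on an iterator can be sketched as follows. This is a simplified model, not the kernel change: the list layout, the `model_*` names and MODEL_ITER_STOPPED (standing in for CSS_TASK_ITER_STOPPED) are assumptions, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_ITER_STOPPED 0x1u /* stands in for CSS_TASK_ITER_STOPPED */

/* Circular singly linked list with a sentinel head node. */
struct model_node { struct model_node *next; int id; };

struct model_iter {
    struct model_node *head; /* list head the walk started on */
    struct model_node *pos;  /* current position */
    unsigned int flags;
};

static void model_iter_start(struct model_iter *it, struct model_node *head)
{
    it->head = head;
    it->pos = head;
    it->flags = 0;
}

/* Called by the side that is about to move nodes to another list. */
static void model_iter_stop(struct model_iter *it)
{
    it->flags |= MODEL_ITER_STOPPED;
}

/* Returns the next node, or NULL at the end or once the walk was stopped. */
static struct model_node *model_iter_next(struct model_iter *it)
{
    if (it->flags & MODEL_ITER_STOPPED)
        return NULL; /* never follow pointers into a list we did not start on */
    it->pos = it->pos->next;
    return it->pos == it->head ? NULL : it->pos;
}
```

Once the flag is set, the iterator ends early instead of comparing its position against a head that belongs to a different list.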

Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Link: https://www.spinics.net/lists/cgroups/msg37935.html
Fixes: f9a25f77 ("cpusets: Rebuild root domain deadline accounting information")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huaweicloud.com>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit e52586f4)

sched/topology: Fix exceptional memory access in sd_llc_free_all() · 0a82b3e9

Committed by Xia Fukun on May 31, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6YJJQ
CVE: NA

----------------------------------------

The function sd_llc_free_all() is called to release allocated
resources when space allocation for the scheduling domain
structure fails. However, this function does not check whether sd
is a null pointer when releasing sdd resources, resulting in
an error: "Unable to handle kernel paging request at virtual
address".

Fix this issue by adding a null pointer check.
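
A cleanup routine that tolerates a partially completed allocation might look like the sketch below. It is a user-space illustration only: `model_sd`, `model_free_all` and MODEL_NR_CPUS are hypothetical names, and the real function walks scheduler topology data rather than a flat array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 4

struct model_sd { void *shared; };

/*
 * Frees per-CPU entries. Entries may be NULL when an earlier allocation
 * failed part-way, so each one is checked before it is dereferenced.
 * Returns the number of entries that were actually freed.
 */
static int model_free_all(struct model_sd *sd[MODEL_NR_CPUS])
{
    int freed = 0;

    for (int i = 0; i < MODEL_NR_CPUS; i++) {
        if (!sd[i]) /* the kind of check the fix adds */
            continue;
        free(sd[i]->shared);
        free(sd[i]);
        sd[i] = NULL;
        freed++;
    }
    return freed;
}
```

Without the check, the first unallocated entry would be dereferenced on the error path, which matches the paging-request fault described above.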

Fixes: 79bec4c6 ("sched/topology: Provide hooks to allocate data shared per LLC")
Signed-off-by: Xia Fukun <xiafukun@huawei.com>
Reviewed-by: songping yu <yusongping@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit d73bbd3f)

block: Fix the partition start may overflow in add_partition() · e5ecbf78

Committed by Zhong Jinghua on May 31, 2023

hulk inclusion
category: bugfix
bugzilla: 187268, https://gitee.com/openeuler/kernel/issues/I76JDY
CVE: NA

----------------------------------------

In the block_ioctl, we can pass in the unsigned number 0x8000000000000000
as an input parameter, like below:

block_ioctl
  blkdev_ioctl
    blkpg_ioctl
      blkpg_do_ioctl
        copy_from_user
        bdev_add_partition
          add_partition
            p->start_sect = start; // start = 0x8000000000000000

Then a warning is triggered when submitting a bio:

WARNING: CPU: 0 PID: 382 at fs/iomap/apply.c:54
Call trace:
 iomap_apply+0x644/0x6e0
 __iomap_dio_rw+0x5cc/0xa24
 iomap_dio_rw+0x4c/0xcc
 ext4_dio_read_iter
 ext4_file_read_iter
 ext4_file_read_iter+0x318/0x39c
 call_read_iter
 lo_rw_aio.isra.0+0x748/0x75c
 do_req_filebacked+0x2d4/0x370
 loop_handle_cmd
 loop_queue_work+0x94/0x23c
 kthread_worker_fn+0x160/0x6bc
 loop_kthread_worker_fn+0x3c/0x50
 kthread+0x20c/0x25c
 ret_from_fork+0x10/0x18

Stack:

submit_bio_noacct
  submit_bio_checks
    blk_partition_remap
      bio->bi_iter.bi_sector += p->start_sect
      // bio->bi_iter.bi_sector = 0xffc0000000000000 + 65408
..
loop_queue_work
 loop_handle_cmd
  do_req_filebacked
   pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset // pos < 0
   lo_rw_aio
     call_read_iter
      ext4_dio_read_iter
       __iomap_dio_rw
        iomap_apply
         ext4_iomap_begin
           map.m_lblk = offset >> blkbits
             ext4_set_iomap
             iomap->offset = (u64) map->m_lblk << blkbits
             // iomap->offset = 64512
         WARN_ON(iomap.offset > pos) // iomap.offset = 64512 and pos < 0

It is unreasonable for start + length to exceed disk->part0.nr_sects. There
is already a similar check in blk_add_partition().
Fix it by adding a check in bdev_add_partition().
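
The range test and the sign flip from the trace can be sketched in user space. This is not the code of the patch: both function names are hypothetical, and the check is written in an overflow-safe form so that a huge start cannot wrap the sum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Overflow-safe form of "start + length > nr_sects" (all in sectors). */
static bool model_partition_out_of_range(uint64_t start, uint64_t length,
                                         uint64_t nr_sects)
{
    return start > nr_sects || length > nr_sects - start;
}

/*
 * Byte position as computed in the trace: the sector shifted left by 9,
 * then interpreted as a signed offset (loff_t in the kernel).
 */
static int64_t model_sector_to_pos(uint64_t sector)
{
    return (int64_t)(sector << 9);
}
```

With the sector value from the trace, 0xffc0000000000000 + 65408, the shifted result has its top bit set, so the signed position is negative, which is what trips WARN_ON(iomap.offset > pos).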

Signed-off-by: Zhong Jinghua <zhongjinghua@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 1ae011cf)

ext4: avoid a potential slab-out-of-bounds in ext4_group_desc_csum · 2794f826

Committed by Tudor Ambarus on May 31, 2023

stable inclusion
from stable-v5.10.180
commit 0dde3141c527b09b96bef1e7eeb18b8127810ce9
category: bugfix
bugzilla: 188791,https://gitee.com/openeuler/kernel/issues/I76XUJ

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0dde3141c527b09b96bef1e7eeb18b8127810ce9

--------------------------------

commit 4f043518 upstream.

When modifying the block device while it is mounted by the filesystem,
syzbot reported the following:

BUG: KASAN: slab-out-of-bounds in crc16+0x206/0x280 lib/crc16.c:58
Read of size 1 at addr ffff888075f5c0a8 by task syz-executor.2/15586

CPU: 1 PID: 15586 Comm: syz-executor.2 Not tainted 6.2.0-rc5-syzkaller-00205-gc9661827 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/12/2023
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
 print_address_description+0x74/0x340 mm/kasan/report.c:306
 print_report+0x107/0x1f0 mm/kasan/report.c:417
 kasan_report+0xcd/0x100 mm/kasan/report.c:517
 crc16+0x206/0x280 lib/crc16.c:58
 ext4_group_desc_csum+0x81b/0xb20 fs/ext4/super.c:3187
 ext4_group_desc_csum_set+0x195/0x230 fs/ext4/super.c:3210
 ext4_mb_clear_bb fs/ext4/mballoc.c:6027 [inline]
 ext4_free_blocks+0x191a/0x2810 fs/ext4/mballoc.c:6173
 ext4_remove_blocks fs/ext4/extents.c:2527 [inline]
 ext4_ext_rm_leaf fs/ext4/extents.c:2710 [inline]
 ext4_ext_remove_space+0x24ef/0x46a0 fs/ext4/extents.c:2958
 ext4_ext_truncate+0x177/0x220 fs/ext4/extents.c:4416
 ext4_truncate+0xa6a/0xea0 fs/ext4/inode.c:4342
 ext4_setattr+0x10c8/0x1930 fs/ext4/inode.c:5622
 notify_change+0xe50/0x1100 fs/attr.c:482
 do_truncate+0x200/0x2f0 fs/open.c:65
 handle_truncate fs/namei.c:3216 [inline]
 do_open fs/namei.c:3561 [inline]
 path_openat+0x272b/0x2dd0 fs/namei.c:3714
 do_filp_open+0x264/0x4f0 fs/namei.c:3741
 do_sys_openat2+0x124/0x4e0 fs/open.c:1310
 do_sys_open fs/open.c:1326 [inline]
 __do_sys_creat fs/open.c:1402 [inline]
 __se_sys_creat fs/open.c:1396 [inline]
 __x64_sys_creat+0x11f/0x160 fs/open.c:1396
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f72f8a8c0c9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 19 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f72f97e3168 EFLAGS: 00000246 ORIG_RAX: 0000000000000055
RAX: ffffffffffffffda RBX: 00007f72f8bac050 RCX: 00007f72f8a8c0c9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000020000280
RBP: 00007f72f8ae7ae9 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd165348bf R14: 00007f72f97e3300 R15: 0000000000022000

Replace
	le16_to_cpu(sbi->s_es->s_desc_size)
with
	sbi->s_desc_size

It reduces ext4's compiled text size, and makes the code more efficient
(we remove an extra indirect reference and a potential byte
swap on big endian systems), and there is no downside. It also avoids the
potential KASAN / syzkaller failure, as a bonus.

Reported-by: syzbot+fc51227e7100c9294894@syzkaller.appspotmail.com
Reported-by: syzbot+8785e41224a3afd04321@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=70d28d11ab14bd7938f3e088365252aa923cff42
Link: https://syzkaller.appspot.com/bug?id=b85721b38583ecc6b5e72ff524c67302abbc30f3
Link: https://lore.kernel.org/all/000000000000ece18705f3b20934@google.com/
Fixes: 717d50e4 ("Ext4: Uninitialized Block Groups")
Cc: stable@vger.kernel.org
Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Link: https://lore.kernel.org/r/20230504121525.3275886-1-tudor.ambarus@linaro.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 0d4053b9)

iomap: don't invalidate folios after writeback errors · 9aba6809

Committed by Darrick J. Wong on May 31, 2023

mainline inclusion
from mainline-v5.19-rc1
commit e9c3a8e8
category: bugfix
bugzilla: 188775, https://gitee.com/openeuler/kernel/issues/I73IFH

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e9c3a8e820ed0eeb2be05072f29f80d1b79f053b

--------------------------------

XFS has the unique behavior (as compared to the other Linux filesystems)
that on writeback errors it will completely invalidate the affected
folio and force the page cache to reread the contents from disk.  All
other filesystems leave the page mapped and up to date.

This is a rude awakening for user programs, since (in the case where
write fails but reread doesn't) file contents will appear to revert to
old disk contents with no notification other than an EIO on fsync.  This
might have been annoying back in the days when iomap dealt with one page
at a time, but with multipage folios, we can now throw away *megabytes*
worth of data for a single write error.

On *most* Linux filesystems, a program can respond to an EIO on write by
redirtying the entire file and scheduling it for writeback.  This isn't
foolproof, since the page that failed writeback is no longer dirty and
could be evicted, but programs that want to recover properly *also*
have to detect XFS and regenerate every write they've made to the file.

When running xfs/314 on arm64, I noticed a UAF when xfs_discard_folio
invalidates multipage folios that could be undergoing writeback.  If,
say, we have a 256K folio caching a mix of written and unwritten
extents, it's possible that we could start writeback of the first (say)
64K of the folio and then hit a writeback error on the next 64K.  We
then free the iop attached to the folio, which is really bad because
writeback completion on the first 64k will trip over the "blocks per
folio > 1 && !iop" assertion.

This can't be fixed by only invalidating the folio if writeback fails at
the start of the folio, since the folio is marked !uptodate, which trips
other assertions elsewhere.  Get rid of the whole behavior entirely.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>

Conflicts:
	fs/xfs/xfs_aops.c
	fs/iomap/buffered-io.c
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 6d98d507)

iomap: Don't create iomap_page objects in iomap_page_mkwrite_actor · 482ba5e5

Committed by Andreas Gruenbacher on May 31, 2023

mainline inclusion
from mainline-v5.14-rc2
commit 229adf3c
category: bugfix
bugzilla: 188764, https://gitee.com/openeuler/kernel/issues/I736LW

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=229adf3c64dbeae4e2f45fb561907ada9fcc0d0c

--------------------------------

Now that we create those objects in iomap_writepage_map when needed,
there's no need to pre-create them in iomap_page_mkwrite_actor anymore.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 9df33786)

iomap: Don't create iomap_page objects for inline files · e31a56a9

Committed by Andreas Gruenbacher on May 31, 2023

mainline inclusion
from mainline-v5.14-rc2
commit 637d3375
category: bugfix
bugzilla: 188764, https://gitee.com/openeuler/kernel/issues/I736LW

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=637d3375953e052a62c0db409557e3b3354be88a

--------------------------------

In iomap_readpage_actor, don't create iop objects for inline inodes.
Otherwise, iomap_read_inline_data will set PageUptodate without setting
iop->uptodate, and iomap_page_release will eventually complain.

To prevent this kind of bug from occurring in the future, make sure the
page doesn't have private data attached in iomap_read_inline_data.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit e841540c)

iomap: Permit pages without an iop to enter writeback · b6c52453

Committed by Andreas Gruenbacher on May 31, 2023

mainline inclusion
from mainline-v5.14-rc2
commit 8e1bcef8
category: bugfix
bugzilla: 188764, https://gitee.com/openeuler/kernel/issues/I736LW

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8e1bcef8e18d0fec4afe527c074bb1fd6c2b140c

--------------------------------

Create an iop in the writeback path if one doesn't exist.  This allows us
to avoid creating the iop in some cases.  We'll initially do that for pages
with inline data, but it can be extended to pages which are entirely within
an extent.  It also allows for an iop to be removed from pages in the
future (eg page split).

Co-developed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit e6eaa18c)

eulerfs: fix null-ptr-dereference when allocate page failed · b13a8d3d

Committed by Yu Kuai on May 31, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I78RYS
CVE: NA

--------------------------------

Currently, callers of eufs_alloc_page() and eufs_zalloc_page() expect
that allocation won't fail; otherwise a null-ptr-dereference will be
triggered.

Fix this problem by adding the __GFP_NOFAIL flag.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 7b99df55)

eulerfs: add error handling for nv_init() · aa182265

Committed by Yu Kuai on May 31, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I78RUK
CVE: NA

--------------------------------

Currently nv_init() doesn't handle errors; a null-ptr-dereference will be
triggered if errors occur.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit e15e6869)

md: fix kabi broken in struct mddev · 6916fc98

Committed by Yu Kuai on May 31, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA

--------------------------------

Struct mddev is only used inside raid. Fix the kabi just in case md_mod is
compiled from a new kernel while raid1/raid10 or other out-of-tree raid
modules are compiled from an old kernel.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 2eb22263)

md: use interruptible apis in idle/frozen_sync_thread · 719d1c09

Committed by Yu Kuai on May 31, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA

--------------------------------

Before idle and frozen were refactored out of action_store, interruptible
APIs were used so that a hung task warning wouldn't be triggered if it took
too long to finish the idle/frozen sync_thread. This patch does the same.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 73f974e1)

md: wake up 'resync_wait' at last in md_reap_sync_thread() · ed109fc3

Committed by Yu Kuai on May 31, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA

--------------------------------

We just replaced md_reap_sync_thread() with wait_event(resync_wait, ...)
in action_store(); this patch makes sure action_store() will still
wait for everything to be done in md_reap_sync_thread().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 54570486)

md: refactor idle/frozen_sync_thread() · 7881435d

Committed by Yu Kuai on May 31, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA

--------------------------------

Our testing found the following deadlock in raid10:

1) Issue a normal write, and such write failed:

  raid10_end_write_request
   set_bit(R10BIO_WriteError, &r10_bio->state)
   one_write_done
    reschedule_retry

  // later from md thread
  raid10d
   handle_write_completed
    list_add(&r10_bio->retry_list, &conf->bio_end_io_list)

  // later from md thread
  raid10d
   if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
    list_move(conf->bio_end_io_list.prev, &tmp)
    r10_bio = list_first_entry(&tmp, struct r10bio, retry_list)
    raid_end_bio_io(r10_bio)

Dependency chain 1: normal io is waiting for updating superblock

2) Trigger a recovery:

  raid10_sync_request
   raise_barrier

Dependency chain 2: sync thread is waiting for normal io

3) echo idle/frozen to sync_action:

  action_store
   mddev_lock
    md_unregister_thread
     kthread_stop

Dependency chain 3: drop 'reconfig_mutex' is waiting for sync thread

4) md thread can't update superblock:

  raid10d
   md_check_recovery
    if (mddev_trylock(mddev))
     md_update_sb

Dependency chain 4: update superblock is waiting for 'reconfig_mutex'

Hence a cyclic dependency exists. In order to fix the problem, we must
break one of the dependencies. Dependencies 1 and 2 can't be broken because
they are fundamental to the design. Breaking dependency 4 may be possible if
it can be guaranteed that no io is inflight; however, this requires a new
mechanism which seems complex. Dependency 3 is a good choice, because
idle/frozen only requires the sync thread to finish, which can be done
asynchronously and is already implemented, and 'reconfig_mutex' is not
needed anymore.

This patch switches 'idle' and 'frozen' to wait for the sync thread to
finish asynchronously. It also adds a sequence counter that records how
many times a sync thread has finished, so that 'idle' won't keep waiting on
a newly started sync thread.
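
The role of the sequence counter can be illustrated with a single-threaded model. Everything here is an assumption made for the sketch: the struct, the field names and the `model_*` helpers are hypothetical, and the real code waits on a wait queue rather than polling a predicate.

```c
#include <assert.h>
#include <stdbool.h>

struct model_mddev {
    bool sync_running;     /* a sync thread is currently active */
    unsigned int sync_seq; /* incremented each time a sync thread finishes */
};

static void model_sync_start(struct model_mddev *m)
{
    m->sync_running = true;
}

static void model_sync_done(struct model_mddev *m)
{
    m->sync_running = false;
    m->sync_seq++;
}

/*
 * Wait condition for 'idle': true once the sync thread observed at 'seq'
 * has finished, even if a new sync thread has already been started.
 */
static bool model_idle_wait_done(const struct model_mddev *m, unsigned int seq)
{
    return !m->sync_running || m->sync_seq != seq;
}
```

A waiter that only tested "no sync thread is running" could miss the short window between one thread finishing and the next one starting; comparing the saved counter value closes that gap.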

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 1ade24b6)

md: add a mutex to synchronize idle and frozen in action_store() · 9b644924

Committed by Yu Kuai on May 31, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA

--------------------------------

Currently, for idle and frozen, action_store() holds 'reconfig_mutex'
and calls md_reap_sync_thread() to stop the sync thread; however, this
causes a deadlock (explained in the next patch). In order to fix the
problem, the following patch will release 'reconfig_mutex' and wait on
'resync_wait', like md_set_readonly() and do_md_stop() do.

Consider that action_store() sets/clears 'MD_RECOVERY_FROZEN'
unconditionally, which might cause unexpected problems. For example,
'frozen' has just set 'MD_RECOVERY_FROZEN' and is still in progress, while
'idle' clears 'MD_RECOVERY_FROZEN' and a new sync thread is started, which
might starve the in-progress 'frozen'.

This patch adds a mutex to synchronize idle and frozen in
action_store().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 1c617ac5)

md: refactor action_store() for 'idle' and 'frozen' · e8a6dd98

Committed by Yu Kuai on May 31, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA

--------------------------------

Prepare to handle 'idle' and 'frozen' differently to fix a deadlock. There
are no functional changes except that MD_RECOVERY_RUNNING is checked
again after 'reconfig_mutex' is held.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit e98a235f)

Revert "md: unlock mddev before reap sync_thread in action_store" · ecbb08a8

Committed by Yu Kuai on May 31, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA

--------------------------------

This reverts commit 9dfbdafd.

Because it will introduce a defect that sync_thread can be running while
MD_RECOVERY_RUNNING is cleared, which will cause some unexpected problems,
for example:

list_add corruption. prev->next should be next (ffff0001ac1daba0), but was ffff0000ce1a02a0. (prev=ffff0000ce1a02a0).
Call trace:
 __list_add_valid+0xfc/0x140
 insert_work+0x78/0x1a0
 __queue_work+0x500/0xcf4
 queue_work_on+0xe8/0x12c
 md_check_recovery+0xa34/0xf30
 raid10d+0xb8/0x900 [raid10]
 md_thread+0x16c/0x2cc
 kthread+0x1a4/0x1ec
 ret_from_fork+0x10/0x18

This is because work is requeued while it's still inside workqueue:

t1:			t2:
action_store
 mddev_lock
  if (mddev->sync_thread)
   mddev_unlock
   md_unregister_thread
   // first sync_thread is done
			md_check_recovery
			 mddev_try_lock
			 /*
			  * once MD_RECOVERY_DONE is set, new sync_thread
			  * can start.
			  */
			 set_bit(MD_RECOVERY_RUNNING, &mddev->recovery)
			 INIT_WORK(&mddev->del_work, md_start_sync)
			 queue_work(md_misc_wq, &mddev->del_work)
			  test_and_set_bit(WORK_STRUCT_PENDING_BIT, ...)
			  // set pending bit
			  insert_work
			   list_add_tail
			 mddev_unlock
   mddev_lock_nointr
   md_reap_sync_thread
   // MD_RECOVERY_RUNNING is cleared
 mddev_unlock

t3:

// before queued work started from t2
md_check_recovery
 // MD_RECOVERY_RUNNING is not set, a new sync_thread can be started
 INIT_WORK(&mddev->del_work, md_start_sync)
  work->data = 0
  // work pending bit is cleared
 queue_work(md_misc_wq, &mddev->del_work)
  insert_work
   list_add_tail
   // list is corrupted

This patch reverts the commit to fix the problem; the deadlock that
commit tried to fix will be fixed in the following patches.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230322064122.2384589-2-yukuai1@huaweicloud.com
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 4a53e631)

md: unlock mddev before reap sync_thread in action_store · f40aae37

Committed by Guoqing Jiang on May 31, 2023

mainline inclusion
from mainline-v6.0-rc1
commit 9dfbdafd
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6OMCC
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.3-rc3&id=9dfbdafda3b34e262e43e786077bab8e476a89d1

--------------------------------

Since the bug fixed by commit 8b48ec23 ("md: don't unregister sync_thread
with reconfig_mutex held") is specific to the action_store path, other
callers that reap sync_thread do not need to be changed.

Let's pull md_unregister_thread out of md_reap_sync_thread, then fix the
previous bug as follows:

1. unlock mddev before md_reap_sync_thread in action_store.
2. save reshape_position before unlock, then restore it to ensure the
   position is not changed accidentally by others.
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit dd3bd170)

f40aae37

block: fix wrong mode for blkdev_put() from disk_scan_partitions() · 270d1e09

Committed by Yu Kuai on May 31, 2023

mainline inclusion
from mainline-v6.3-rc2
commit 428913bc
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6MQLP
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e5cfefa97bccf956ea0bb6464c1f6c84fd7a8d9f

--------------------------------

If disk_scan_partitions() is called with 'FMODE_EXCL',
blkdev_get_by_dev() will be called without 'FMODE_EXCL'; however, the
following blkdev_put() is still called with 'FMODE_EXCL', which will cause
the 'bd_holders' counter to leak.

Fix the problem by using the right mode for blkdev_put().
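
The mode mismatch can be illustrated with a toy model. This is a rough sketch, assuming a simplified counter pair; the struct, the helper names and the flag value are hypothetical stand-ins, and the kernel's real bd_holders accounting is more involved than this.

```c
#include <assert.h>

#define DEMO_FMODE_EXCL 0x80u   /* stand-in flag bit for this sketch */

struct demo_bdev {
    int openers;   /* every open */
    int holders;   /* exclusive opens only */
};

static void demo_get(struct demo_bdev *b, unsigned int mode)
{
    b->openers++;
    if (mode & DEMO_FMODE_EXCL)
        b->holders++;
}

static void demo_put(struct demo_bdev *b, unsigned int mode)
{
    assert(b->openers > 0);
    b->openers--;
    if (mode & DEMO_FMODE_EXCL)
        b->holders--;
}

/* Open without the exclusive bit, but release with the caller's mode. */
static void demo_scan_mismatched(struct demo_bdev *b, unsigned int mode)
{
    demo_get(b, mode & ~DEMO_FMODE_EXCL);
    demo_put(b, mode);          /* mode may still carry the exclusive bit */
}

/* Open and release with the same, non-exclusive mode. */
static void demo_scan_matched(struct demo_bdev *b, unsigned int mode)
{
    unsigned int scan_mode = mode & ~DEMO_FMODE_EXCL;

    demo_get(b, scan_mode);
    demo_put(b, scan_mode);
}
```

With an existing exclusive holder, the mismatched variant leaves the holder count out of step with the opens that created it, while the matched variant leaves both counters as they were.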

Reported-by: syzbot+2bcc0d79e548c4f62a59@syzkaller.appspotmail.com
Link: https://lore.kernel.org/lkml/f9649d501bc8c3444769418f6c26263555d9d3be.camel@linux.ibm.com/T/
Tested-by: Julian Ruess <julianr@linux.ibm.com>
Fixes: e5cfefa9 ("block: fix scan partition for exclusively open device again")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>

270d1e09

block: fix scan partition for exclusively open device again · 25745211

Committed by Yu Kuai on May 31, 2023

mainline inclusion
from mainline-v6.3-rc1
commit e5cfefa9
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6MQLP
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e5cfefa97bccf956ea0bb6464c1f6c84fd7a8d9f

--------------------------------

As explained in commit 36369f46 ("block: Do not reread partition table
on exclusively open device"), rereading the partition table on a device
that is exclusively opened by someone else is problematic.

This patch makes sure a partition scan only proceeds if the current
thread has opened the device exclusively, or the device is not opened
exclusively; in the latter case, other scanners and exclusive openers
will be blocked temporarily until the partition scan is done.

Fixes: 10c70d95 ("block: remove the bd_openers checks in blk_drop_partitions")
Cc: <stable@vger.kernel.org>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230217022200.3092987-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	block/genhd.c
	block/ioctl.c
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 8f9c8fc5)

25745211

block: merge disk_scan_partitions and blkdev_reread_part · 9b0317ab

Committed by Christoph Hellwig on May 31, 2023

mainline inclusion
from mainline-v5.17-rc1
commit e16e506c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6MQLP
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e16e506ccd673a3a888a34f8f694698305840044

--------------------------------

Unify the functionality that implements a partition rescan for a
gendisk.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Conflicts:
	block/blk.h
	block/genhd.c
	block/ioctl.c
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit 1c0b1b48)

9b0317ab

arm64: kdump: Avoid reserving low memory repeatedly · 133760dd

Committed by Li Huafei on May 31, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6Y5Y1

-------------------------------

We call reserve_crashkernel_high() before map_mem() to reserve high
memory in advance, which in turn can avoid using page level mapping for
all memory above 4G to optimize performance. And after
reserve_crashkernel_high(), reserve_crashkernel_low() is also needed to
reserve low memory. But when the system RAM is less than 4G, the memory
reserved by reserve_crashkernel_high() is already low memory (less than
4G), reserve_crashkernel_low() may reserve low memory again and the
memory it reserves may be higher than that reserved by
reserve_crashkernel_high(). Looking at /proc/iomem would have:

 # cat /proc/iomem | grep -i crash
    65400000-953fffff : Crash kernel  ==> crashk_res
    a7800000-b77fffff : Crash kernel  ==> crashk_res_low

At this point kexec-tools will incorrectly use the second memory segment
for the kdump kernel image load, causing the kernel load address check
to fail during kexec load (see sanity_check_segment_list()).

When the memory reserved by reserve_crashkernel_high() meets the low
memory requirement, reserve_crashkernel_low() is no longer called to
reserve memory and avoid introducing problems with duplicate
reservations.
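
The decision described above can be sketched as a small predicate. This is a sketch under assumptions: "meets the low memory requirement" is read here as "lies entirely below 4G and is at least as large as the requested low size", and every name below is hypothetical rather than taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_LOW_LIMIT (UINT64_C(1) << 32)   /* the 4G boundary from the text */

/* Inclusive range, written the way /proc/iomem prints it. */
struct demo_res {
    uint64_t start;
    uint64_t end;
};

/* True when the region is entirely below 4G and holds at least low_size bytes. */
static bool demo_high_res_covers_low(const struct demo_res *r, uint64_t low_size)
{
    assert(r->end >= r->start);
    return r->end < DEMO_LOW_LIMIT && (r->end - r->start + 1) >= low_size;
}

/* A separate low reservation is only needed when the first one does not qualify. */
static bool demo_need_low_reservation(const struct demo_res *high, uint64_t low_size)
{
    return !demo_high_res_covers_low(high, low_size);
}
```

Applied to the listing above, the first region (65400000-953fffff) already sits below 4G and is larger than the 256M second region, so under this reading no second reservation would be made.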

Fixes: baac34dd ("arm64: kdump: Use page-level mapping for the high memory of crashkernel")
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Reviewed-by: Yang Jihong <yangjihong1@huawei.com>
(cherry picked from commit e5c9d379)

133760dd

May 30, 2023 · 12 commits

!795 sched/fair: Introduce multiple qos level · c4fb2bc6

Committed by openeuler-ci-bot on May 30, 2023

Merge Pull Request from: @zhaowenhui8 
 
Expand qos_level from {-1,0} to [-2, 2], to distinguish the tasks expected
to be with extremely high or low priority level. Using qos_level_weight
to reweight the shares when calculating group's weight. Meanwhile,
set offline task's schedule policy to SCHED_IDLE so that it can be
preempted at check_preempt_wakeup.
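
As a rough illustration of the reweighting idea, the sketch below scales a group's shares by a per-level multiplier. The multiplier table is invented for the example and is not the patch's qos_level_weight values; the function names are hypothetical as well.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_QOS_LEVEL_MIN (-2)
#define DEMO_QOS_LEVEL_MAX 2

/* Hypothetical multipliers, indexed by qos_level + 2. Not the patch's values. */
static const unsigned long demo_qos_level_weight[] = { 1, 2, 4, 8, 16 };

static bool demo_qos_level_valid(int level)
{
    return level >= DEMO_QOS_LEVEL_MIN && level <= DEMO_QOS_LEVEL_MAX;
}

/* Scale shares by the multiplier of the given level. */
static unsigned long demo_reweight_shares(unsigned long shares, int qos_level)
{
    assert(demo_qos_level_valid(qos_level));
    return shares * demo_qos_level_weight[qos_level - DEMO_QOS_LEVEL_MIN];
}
```

With any monotonically increasing table, a group at a higher level ends up with a larger effective weight than a group at a lower level that has the same configured shares.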

kernel option:
CONFIG_QOS_SCHED_MULTILEVEL 
 
Link:https://gitee.com/openeuler/kernel/pulls/795 

Reviewed-by: Zucheng Zheng <zhengzucheng@huawei.com> 
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

c4fb2bc6

!850 Fix race condition in __percpu_counter_sum() function within cpu hotplug · 623763f1

Committed by openeuler-ci-bot on May 30, 2023

Merge Pull Request from: @henryze

The dying CPU has been removed from the online_mask, but the hotplug notifier has not yet been called to fold the percpu count into the global counter sum.
This race condition is avoided by including the dying CPU in the iteration mask.
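
A small userspace model shows why the iteration mask matters. This is only a sketch, assuming a fixed CPU count and plain bitmasks; the struct and function names are hypothetical and do not reproduce the kernel's percpu_counter implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NR_CPUS 8   /* hypothetical small machine for the sketch */

struct demo_pcpu_counter {
    int64_t count;                    /* global part */
    int32_t percpu[DEMO_NR_CPUS];     /* per-CPU deltas not yet folded in */
};

/* Sum the global count plus the delta of every CPU whose bit is set in mask. */
static int64_t demo_counter_sum(const struct demo_pcpu_counter *c, uint32_t mask)
{
    int64_t sum;
    int cpu;

    assert(c != NULL);
    sum = c->count;
    for (cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
        if (mask & (UINT32_C(1) << cpu))
            sum += c->percpu[cpu];
    return sum;
}
```

If CPU 3 has left the online mask but its delta has not been folded into the global count yet, summing over the online mask alone misses that delta, whereas summing over online plus dying CPUs still includes it.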

Link:https://gitee.com/openeuler/kernel/pulls/850

Reviewed-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

623763f1

!849 drivers/cpufreq: gain accurate CPU frequency from cpufreq/cpuinfo_cur_freq · f1189855

Committed by openeuler-ci-bot on May 30, 2023

Merge Pull Request from: @henryze 
 
When users read the frequency from cpuinfo_cur_freq under cpufreq sysfs,
they often get an invalid result such as:

$ cat /sys/devices/system/cpu/cpu6/cpufreq/cpuinfo_cur_freq
4294967295

So this series provides fixes to the concerned issue.
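
The value shown above is recognisable on its own: 4294967295 is the all-ones 32-bit pattern, i.e. what -1 becomes when stored in an unsigned 32-bit integer. The text does not state where the value originates, so the helper below is only a hypothetical sanity check for such readings, not part of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reinterpret a signed 32-bit value as unsigned, as an implicit conversion would. */
static uint32_t demo_as_unsigned(int32_t v)
{
    return (uint32_t)v;
}

/* True when a reported kHz value is the all-ones pattern rather than a frequency. */
static bool demo_freq_looks_invalid(uint32_t khz)
{
    return khz == UINT32_MAX;
}
```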

Reference: https://lore.kernel.org/all/20230516133248.712242-3-zengheng4@huawei.com/ 
 
Link:https://gitee.com/openeuler/kernel/pulls/849 

Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> 
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

f1189855

!773 Compiler: Add value profile support for kernel. · dec74be4

Committed by openeuler-ci-bot on May 30, 2023

Merge Pull Request from: @xiongzhou4

Provides value profile support for kernel.
The implementation is based on the existing GCOV feature of the kernel. When the option is enabled, the GCOV option `-fprofile-arcs` is changed to `-fprofile-generate`. The latter includes the former plus value profiling, which provides more comprehensive feedback-directed optimization.
The added feature is called _PGO kernel_ , which can be used to improve the performance of a single application runtime environment.

kernel option(default is n):
CONFIG_PGO_KERNEL=y

Link:https://gitee.com/openeuler/kernel/pulls/773

Reviewed-by: Liu Chao <liuchao173@huawei.com>
Reviewed-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>

dec74be4

!842 net: hns3: add support for Hisilicon ptp sync device · 866cc5dd

Committed by openeuler-ci-bot on May 30, 2023

Merge Pull Request from: @svishen 
 
This pull request adds hns3 driver support for a PTP driver that obtains the 1588 clock from Ethernet.
Only the first PF on the main chip supports this, so getting the PTP time from another chip
may incur some bus latency. The PTP sync device is used to eliminate that bus latency.

issue:
https://gitee.com/openeuler/kernel/issues/I78MGV 
 
Link:https://gitee.com/openeuler/kernel/pulls/842 

Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

866cc5dd

!844 A patchset of sched to improve benchmark performance · df9cfeee

Committed by openeuler-ci-bot on May 30, 2023

Merge Pull Request from: @NNNNicole 
 
1. sched/pelt: Relax the sync of *_sum with *_avg (patch1-patch3)
2. Adjust NUMA imbalance for multiple LLCs (patch4-patch6)
3. sched: Queue task on wakelist in the same llc if the wakee cpu is idle (patch7)
4. Clear ttwu_pending after enqueue_task (patch8)
 
 
Link:https://gitee.com/openeuler/kernel/pulls/844 

Reviewed-by: Zucheng Zheng <zhengzucheng@huawei.com> 
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

df9cfeee

!837 Backport bugfixes for RDMA/hns · 162d1b0b

Committed by openeuler-ci-bot on May 30, 2023

Merge Pull Request from: @stinft 
 
#I76PY9 
#I76PUJ 
#I76PRT  
 
Link:https://gitee.com/openeuler/kernel/pulls/837 

Reviewed-by: Chengchang Tang <tangchengchang@huawei.com> 
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

162d1b0b

GCC: Add value profile support for kernel. · 2872514e

Committed by xiongzhou4 on May 16, 2023

GCC inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I734PM

---------------------------------

This feature adds value profile support for the kernel by changing the GCOV
option "-fprofile-arcs" to "-fprofile-generate" when the newly added
config "PGO_KERNEL" is set to y.

Like GCOV, the symbols required by value profile are migrated from
GCC source codes as they cannot be linked to kernel. Specifically,
from libgcc/libgcov-profiler.c to kernel/gcov/gcc_base.c.

kernel options:
CONFIG_PGO_KERNEL=y
Signed-off-by: Xiong Zhou <xiongzhou4@huawei.com>
Reviewed-by: Li Yancheng <liyancheng@huawei.com>

2872514e

!803 ACC support no-sva feature · edb5d824

Committed by openeuler-ci-bot on May 30, 2023

Merge Pull Request from: @xiao_jiang_shui 
 
ACC support no-sva feature
issue: https://gitee.com/openeuler/kernel/issues/I773SD
 
 
Link:https://gitee.com/openeuler/kernel/pulls/803 

Reviewed-by: Yang Shen <shenyang39@huawei.com> 
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

edb5d824

sched/fair: Introduce multiple qos level · c51ad919

Committed by Zhao Wenhui on May 30, 2023

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I737X1

-------------------------------

Expand qos_level from {-1,0} to [-2, 2], to distinguish the tasks expected
to be with extremely high or low priority level. Using qos_level_weight
to reweight the shares when calculating group's weight. Meanwhile,
set offline task's schedule policy to SCHED_IDLE so that it can be
preempted at check_preempt_wakeup.
Signed-off-by: Zhao Wenhui <zhaowenhui8@huawei.com>

c51ad919

sched: Clear ttwu_pending after enqueue_task() · a6dcd26f

Committed by Tianchen Ding on May 28, 2023

mainline inclusion
from mainline-v6.2-rc1
commit d6962c4f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I78WM8

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.4-rc3&id=d6962c4fe8f96f7d384d6489b6b5ab5bf3e35991

--------------------------------

We found a long tail latency in schbench when m*t is close to nr_cpus.
(e.g., "schbench -m 2 -t 16" on a machine with 32 cpus.)

This is because when the wakee cpu is idle, rq->ttwu_pending is cleared
too early, and idle_cpu() will return true until the wakee task is enqueued.
This will mislead the waker when selecting idle cpu, and wake multiple
worker threads on the same wakee cpu. This situation is enlarged by
commit f3dd3f67 ("sched: Remove the limitation of WF_ON_CPU on
wakelist if wakee cpu is idle") because it tends to use wakelist.
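
The ordering problem can be modelled in a few lines. This is a simplified sketch, assuming an idle test that looks only at nr_running and the pending flag; the struct and helpers are hypothetical and run single-threaded, with the "concurrent" waker represented by a sample taken in the middle of the sequence.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_rq {
    unsigned int nr_running;   /* tasks already enqueued */
    bool ttwu_pending;         /* a wakeup is still queued for this cpu */
};

/* Simplified idle test: a runqueue with a pending wakeup is not idle. */
static bool demo_idle_cpu(const struct demo_rq *rq)
{
    return rq->nr_running == 0 && !rq->ttwu_pending;
}

/* Old order: clear the flag, then enqueue. Returns what a waker sees in between. */
static bool demo_idle_seen_old_order(void)
{
    struct demo_rq rq = { 0, true };
    bool seen;

    rq.ttwu_pending = false;        /* cleared too early */
    seen = demo_idle_cpu(&rq);      /* another waker samples here */
    rq.nr_running++;                /* wakee finally enqueued */
    assert(rq.nr_running == 1);
    return seen;
}

/* New order: enqueue, then clear the flag. */
static bool demo_idle_seen_new_order(void)
{
    struct demo_rq rq = { 0, true };
    bool seen;

    rq.nr_running++;                /* wakee enqueued first */
    seen = demo_idle_cpu(&rq);      /* another waker samples here */
    rq.ttwu_pending = false;
    assert(rq.nr_running == 1);
    return seen;
}
```

In the old order the sampled state looks idle even though a task is about to land on that cpu, which is how several workers can be steered to the same wakee cpu; in the new order the window is closed.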

Here is the result of "schbench -m 2 -t 16" on a VM with 32vcpu
(Intel(R) Xeon(R) Platinum 8369B).

Latency percentiles (usec):
                base      base+revert_f3dd3f67   base+this_patch
50.0000th:         9                            13                 9
75.0000th:        12                            19                12
90.0000th:        15                            22                15
95.0000th:        18                            24                17
*99.0000th:       27                            31                24
99.5000th:      3364                            33                27
99.9000th:     12560                            36                30

We also tested on unixbench and hackbench, and saw no performance
change.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20221104023601.12844-1-dtcccc@linux.alibaba.com

a6dcd26f

sched: Remove the limitation of WF_ON_CPU on wakelist if wakee cpu is idle · 588d8f44

Committed by Guan Jing on May 28, 2023

mainline inclusion
from mainline-v6.0-rc1
commit f3dd3f67
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I78WM8

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.4-rc3&id=f3dd3f674555bd9455c5ae7fafce0696bd9931b3

--------------------------------

Wakelist can help avoid cache bouncing and offload the overhead of waker
cpu. So far, using wakelist within the same llc only happens on
WF_ON_CPU, and this limitation could be removed to further improve
wakeup performance.

The commit 518cd623 ("sched: Only queue remote wakeups when
crossing cache boundaries") disabled queuing tasks on wakelist when
the cpus share llc. This is because, at that time, the scheduler must
send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist also
supports TIF_POLLING, so this is not a problem now when the wakee cpu is
in idle polling.
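
The change in policy described above can be written as a small predicate. This is a deliberately reduced sketch: it models only the three inputs named in the text (shared llc, WF_ON_CPU, idle wakee cpu), the names are hypothetical, and any further conditions in the real scheduler code are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_wakeup {
    bool share_llc;    /* waker and wakee cpus share a last-level cache */
    bool wf_on_cpu;    /* wakeup carries the WF_ON_CPU condition */
    bool wakee_idle;   /* wakee cpu is idle */
};

/* Decide whether the wakeup goes through the wakee cpu's wakelist. */
static bool demo_use_wakelist(const struct demo_wakeup *w, bool patched)
{
    assert(w != NULL);
    if (!w->share_llc)
        return true;                    /* crossing cache boundaries */
    if (w->wf_on_cpu)
        return true;                    /* the only same-llc case before */
    return patched && w->wakee_idle;    /* newly allowed by this patch */
}
```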

Benefits:
  Queuing the task on idle cpu can help improving performance on waker cpu
  and utilization on wakee cpu, and further improve locality because
  the wakee cpu can handle its own rq. This patch helps improving rt on
  our real java workloads where wakeup happens frequently.

  Consider the normal condition (CPU0 and CPU1 share same llc)
  Before this patch:

         CPU0                                       CPU1

    select_task_rq()                                idle
    rq_lock(CPU1->rq)
    enqueue_task(CPU1->rq)
    notify CPU1 (by sending IPI or CPU1 polling)

                                                    resched()

  After this patch:

         CPU0                                       CPU1

    select_task_rq()                                idle
    add to wakelist of CPU1
    notify CPU1 (by sending IPI or CPU1 polling)

                                                    rq_lock(CPU1->rq)
                                                    enqueue_task(CPU1->rq)
                                                    resched()

  We see CPU0 can finish its work earlier. It only needs to put task to
  wakelist and return.
  While CPU1 is idle, so let itself handle its own runqueue data.

This patch brings no difference about IPI.
  This patch only takes effect when the wakee cpu is:
  1) idle polling
  2) idle not polling

  For 1), there will be no IPI with or without this patch.

  For 2), there will always be an IPI before or after this patch.
  Before this patch: waker cpu will enqueue task and check preempt. Since
  "idle" will be sure to be preempted, waker cpu must send a resched IPI.
  After this patch: waker cpu will put the task to the wakelist of wakee
  cpu, and send an IPI.

Benchmark:
We've tested schbench, unixbench, and hackbench on both x86 and arm64.

On x86 (Intel Xeon Platinum 8269CY):
  schbench -m 2 -t 8

    Latency percentiles (usec)              before        after
        50.0000th:                             8            6
        75.0000th:                            10            7
        90.0000th:                            11            8
        95.0000th:                            12            8
        *99.0000th:                           13           10
        99.5000th:                            15           11
        99.9000th:                            18           14

  Unixbench with full threads (104)
                                            before        after
    Dhrystone 2 using register variables  3011862938    3009935994  -0.06%
    Double-Precision Whetstone              617119.3      617298.5   0.03%
    Execl Throughput                         27667.3       27627.3  -0.14%
    File Copy 1024 bufsize 2000 maxblocks   785871.4      784906.2  -0.12%
    File Copy 256 bufsize 500 maxblocks     210113.6      212635.4   1.20%
    File Copy 4096 bufsize 8000 maxblocks  2328862.2     2320529.1  -0.36%
    Pipe Throughput                      145535622.8   145323033.2  -0.15%
    Pipe-based Context Switching           3221686.4     3583975.4  11.25%
    Process Creation                        101347.1      103345.4   1.97%
    Shell Scripts (1 concurrent)            120193.5      123977.8   3.15%
    Shell Scripts (8 concurrent)             17233.4       17138.4  -0.55%
    System Call Overhead                   5300604.8     5312213.6   0.22%

  hackbench -g 1 -l 100000
                                            before        after
    Time                                     3.246        2.251

On arm64 (Ampere Altra):
  schbench -m 2 -t 8

    Latency percentiles (usec)              before        after
        50.0000th:                            14           10
        75.0000th:                            19           14
        90.0000th:                            22           16
        95.0000th:                            23           16
        *99.0000th:                           24           17
        99.5000th:                            24           17
        99.9000th:                            28           25

  Unixbench with full threads (80)
                                            before        after
    Dhrystone 2 using register variables  3536194249    3537019613   0.02%
    Double-Precision Whetstone              629383.6      629431.6   0.01%
    Execl Throughput                         65920.5       65846.2  -0.11%
    File Copy 1024 bufsize 2000 maxblocks  1063722.8     1064026.8   0.03%
    File Copy 256 bufsize 500 maxblocks     322684.5      318724.5  -1.23%
    File Copy 4096 bufsize 8000 maxblocks  2348285.3     2328804.8  -0.83%
    Pipe Throughput                      133542875.3   131619389.8  -1.44%
    Pipe-based Context Switching           3215356.1     3576945.1  11.25%
    Process Creation                        108520.5      120184.6  10.75%
    Shell Scripts (1 concurrent)            122636.3        121888  -0.61%
    Shell Scripts (8 concurrent)             17462.1       17381.4  -0.46%
    System Call Overhead                   4429998.9     4435006.7   0.11%

  hackbench -g 1 -l 100000
                                            before        after
    Time                                     4.217        2.916

Our patch has improvement on schbench, hackbench
and Pipe-based Context Switching of unixbench
when there exists idle cpus,
and no obvious regression on other tests of unixbench.
This can help improve rt in scenes where wakeup happens frequently.
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20220608233412.327341-3-dtcccc@linux.alibaba.com
Signed-off-by: Guan Jing <guanjing6@huawei.com>

588d8f44
