- 12 April 2023, 17 commits
-
-
Committed by Zhihao Cheng
maillist inclusion category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6MMUV CVE: NA Reference: https://www.spinics.net/lists/linux-ext4/msg88237.html -------------------------------- Following process makes ext4 load stale buffer heads from last failed mounting in a new mounting operation: mount_bdev ext4_fill_super | ext4_load_and_init_journal | ext4_load_journal | jbd2_journal_load | load_superblock | journal_get_superblock | set_buffer_verified(bh) // buffer head is verified | jbd2_journal_recover // failed caused by EIO | goto failed_mount3a // skip 'sb->s_root' initialization deactivate_locked_super kill_block_super generic_shutdown_super if (sb->s_root) // false, skip ext4_put_super->invalidate_bdev-> // invalidate_mapping_pages->mapping_evict_folio-> // filemap_release_folio->try_to_free_buffers, which // cannot drop buffer head. blkdev_put blkdev_put_whole if (atomic_dec_and_test(&bdev->bd_openers)) // false, systemd-udev happens to open the device. Then // blkdev_flush_mapping->kill_bdev->truncate_inode_pages-> // truncate_inode_folio->truncate_cleanup_folio-> // folio_invalidate->block_invalidate_folio-> // filemap_release_folio->try_to_free_buffers will be skipped, // dropping buffer head is missed again. Second mount: ext4_fill_super ext4_load_and_init_journal ext4_load_journal ext4_get_journal jbd2_journal_init_inode journal_init_common bh = getblk_unmovable bh = __find_get_block // Found stale bh in last failed mounting journal->j_sb_buffer = bh jbd2_journal_load load_superblock journal_get_superblock if (buffer_verified(bh)) // true, skip journal->j_format_version = 2, value is 0 jbd2_journal_recover do_one_pass next_log_block += count_tags(journal, bh) // According to journal_tag_bytes(), 'tag_bytes' calculating is // affected by jbd2_has_feature_csum3(), jbd2_has_feature_csum3() // returns false because 'j->j_format_version >= 2' is not true, // then we get wrong next_log_block. The do_one_pass may exit // early whenoccuring non JBD2_MAGIC_NUMBER in 'next_log_block'. The filesystem is corrupted here, journal is partially replayed, and new journal sequence number actually is already used by last mounting. The invalidate_bdev() can drop all buffer heads even racing with bare reading block device(eg. systemd-udev), so we can fix it by invalidating bdev in error handling path in __ext4_fill_super(). Fetch a reproducer in [Link]. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217171 Fixes: 25ed6e8a ("jbd2: enable journal clients to enable v2 checksumming") Cc: stable@vger.kernel.org # v3.5 Conflicts: fs/ext4/super.c [ a7a79c29("ext4: unify the ext4 super block loading operation") is not applied. 7edfd85b("ext4: Completely separate options parsing and sb setup") is not applied. ] Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
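A minimal sketch of the fix described above, assuming only the error-path placement stated in the message (the label and surrounding cleanup are illustrative, not the exact hunk):

	failed_mount:
		/* ... existing error-path cleanup ... */

		/* sb->s_root was never set, so generic_shutdown_super() will not
		 * reach invalidate_bdev() for us; drop any buffer heads cached
		 * during this failed mount so the next mount re-reads and
		 * re-verifies the journal superblock instead of trusting a
		 * stale, already-verified one. */
		invalidate_bdev(sb->s_bdev);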
-
Committed by Filipe Manana
mainline inclusion from mainline-v6.2-rc8 commit 2f1a6be1 category: bugfix bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6PQCT CVE: CVE-2023-1611 Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2f1a6be12ab6c8470d5776e68644726c94257c54 -------------------------------- The quota assign ioctl can currently run in parallel with a quota disable ioctl call. The assign ioctl uses the quota root, while the disable ioctl frees that root, and therefore we can have a use-after-free triggered in the assign ioctl, leading to a trace like the following when KASAN is enabled: [672.723][T736] BUG: KASAN: slab-use-after-free in btrfs_search_slot+0x2962/0x2db0 [672.723][T736] Read of size 8 at addr ffff888022ec0208 by task btrfs_search_sl/27736 [672.724][T736] [672.725][T736] CPU: 1 PID: 27736 Comm: btrfs_search_sl Not tainted 6.3.0-rc3 #37 [672.723][T736] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 [672.727][T736] Call Trace: [672.728][T736] <TASK> [672.728][T736] dump_stack_lvl+0xd9/0x150 [672.725][T736] print_report+0xc1/0x5e0 [672.720][T736] ? __virt_addr_valid+0x61/0x2e0 [672.727][T736] ? __phys_addr+0xc9/0x150 [672.725][T736] ? btrfs_search_slot+0x2962/0x2db0 [672.722][T736] kasan_report+0xc0/0xf0 [672.729][T736] ? btrfs_search_slot+0x2962/0x2db0 [672.724][T736] btrfs_search_slot+0x2962/0x2db0 [672.723][T736] ? fs_reclaim_acquire+0xba/0x160 [672.722][T736] ? split_leaf+0x13d0/0x13d0 [672.726][T736] ? rcu_is_watching+0x12/0xb0 [672.723][T736] ? kmem_cache_alloc+0x338/0x3c0 [672.722][T736] update_qgroup_status_item+0xf7/0x320 [672.724][T736] ? add_qgroup_rb+0x3d0/0x3d0 [672.739][T736] ? do_raw_spin_lock+0x12d/0x2b0 [672.730][T736] ? spin_bug+0x1d0/0x1d0 [672.737][T736] btrfs_run_qgroups+0x5de/0x840 [672.730][T736] ? btrfs_qgroup_rescan_worker+0xa70/0xa70 [672.738][T736] ? __del_qgroup_relation+0x4ba/0xe00 [672.738][T736] btrfs_ioctl+0x3d58/0x5d80 [672.735][T736] ? tomoyo_path_number_perm+0x16a/0x550 [672.737][T736] ? tomoyo_execute_permission+0x4a0/0x4a0 [672.731][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50 [672.737][T736] ? __sanitizer_cov_trace_switch+0x54/0x90 [672.734][T736] ? do_vfs_ioctl+0x132/0x1660 [672.730][T736] ? vfs_fileattr_set+0xc40/0xc40 [672.730][T736] ? _raw_spin_unlock_irq+0x2e/0x50 [672.732][T736] ? sigprocmask+0xf2/0x340 [672.737][T736] ? __fget_files+0x26a/0x480 [672.732][T736] ? bpf_lsm_file_ioctl+0x9/0x10 [672.738][T736] ? 
btrfs_ioctl_get_supported_features+0x50/0x50 [672.736][T736] __x64_sys_ioctl+0x198/0x210 [672.736][T736] do_syscall_64+0x39/0xb0 [672.731][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.739][T736] RIP: 0033:0x4556ad [672.742][T736] </TASK> [672.743][T736] [672.748][T736] Allocated by task 27677: [672.743][T736] kasan_save_stack+0x22/0x40 [672.741][T736] kasan_set_track+0x25/0x30 [672.741][T736] __kasan_kmalloc+0xa4/0xb0 [672.749][T736] btrfs_alloc_root+0x48/0x90 [672.746][T736] btrfs_create_tree+0x146/0xa20 [672.744][T736] btrfs_quota_enable+0x461/0x1d20 [672.743][T736] btrfs_ioctl+0x4a1c/0x5d80 [672.747][T736] __x64_sys_ioctl+0x198/0x210 [672.749][T736] do_syscall_64+0x39/0xb0 [672.744][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.756][T736] [672.757][T736] Freed by task 27677: [672.759][T736] kasan_save_stack+0x22/0x40 [672.759][T736] kasan_set_track+0x25/0x30 [672.756][T736] kasan_save_free_info+0x2e/0x50 [672.751][T736] ____kasan_slab_free+0x162/0x1c0 [672.758][T736] slab_free_freelist_hook+0x89/0x1c0 [672.752][T736] __kmem_cache_free+0xaf/0x2e0 [672.752][T736] btrfs_put_root+0x1ff/0x2b0 [672.759][T736] btrfs_quota_disable+0x80a/0xbc0 [672.752][T736] btrfs_ioctl+0x3e5f/0x5d80 [672.756][T736] __x64_sys_ioctl+0x198/0x210 [672.753][T736] do_syscall_64+0x39/0xb0 [672.765][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.769][T736] [672.768][T736] The buggy address belongs to the object at ffff888022ec0000 [672.768][T736] which belongs to the cache kmalloc-4k of size 4096 [672.769][T736] The buggy address is located 520 bytes inside of [672.769][T736] freed 4096-byte region [ffff888022ec0000, ffff888022ec1000) [672.760][T736] [672.764][T736] The buggy address belongs to the physical page: [672.761][T736] page:ffffea00008bb000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22ec0 [672.766][T736] head:ffffea00008bb000 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [672.779][T736] flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff) [672.770][T736] raw: 00fff00000010200 ffff888012842140 ffffea000054ba00 dead000000000002 [672.770][T736] raw: 0000000000000000 0000000000040004 00000001ffffffff 0000000000000000 [672.771][T736] page dumped because: kasan: bad access detected [672.778][T736] page_owner tracks the page as allocated [672.777][T736] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd2040(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 88 [672.779][T736] get_page_from_freelist+0x119c/0x2d50 [672.779][T736] __alloc_pages+0x1cb/0x4a0 [672.776][T736] alloc_pages+0x1aa/0x270 [672.773][T736] allocate_slab+0x260/0x390 [672.771][T736] ___slab_alloc+0xa9a/0x13e0 [672.778][T736] __slab_alloc.constprop.0+0x56/0xb0 [672.771][T736] __kmem_cache_alloc_node+0x136/0x320 [672.789][T736] __kmalloc+0x4e/0x1a0 [672.783][T736] tomoyo_realpath_from_path+0xc3/0x600 [672.781][T736] tomoyo_path_perm+0x22f/0x420 [672.782][T736] tomoyo_path_unlink+0x92/0xd0 [672.780][T736] security_path_unlink+0xdb/0x150 [672.788][T736] do_unlinkat+0x377/0x680 [672.788][T736] __x64_sys_unlink+0xca/0x110 [672.789][T736] do_syscall_64+0x39/0xb0 [672.783][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.784][T736] page last free stack trace: [672.787][T736] free_pcp_prepare+0x4e5/0x920 [672.787][T736] free_unref_page+0x1d/0x4e0 [672.784][T736] __unfreeze_partials+0x17c/0x1a0 [672.797][T736] qlist_free_all+0x6a/0x180 [672.796][T736] kasan_quarantine_reduce+0x189/0x1d0 [672.797][T736] __kasan_slab_alloc+0x64/0x90 [672.793][T736] 
kmem_cache_alloc+0x17c/0x3c0 [672.799][T736] getname_flags.part.0+0x50/0x4e0 [672.799][T736] getname_flags+0x9e/0xe0 [672.792][T736] vfs_fstatat+0x77/0xb0 [672.791][T736] __do_sys_newlstat+0x84/0x100 [672.798][T736] do_syscall_64+0x39/0xb0 [672.796][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd [672.790][T736] [672.791][T736] Memory state around the buggy address: [672.799][T736] ffff888022ec0100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.805][T736] ffff888022ec0180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.802][T736] >ffff888022ec0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.809][T736] ^ [672.809][T736] ffff888022ec0280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [672.809][T736] ffff888022ec0300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fix this by having the qgroup assign ioctl take the qgroup ioctl mutex before calling btrfs_run_qgroups(), which is what all qgroup ioctls should call. Reported-by: Nbutt3rflyh4ck <butterflyhuangxx@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CAFcO6XN3VD8ogmHwqRk4kbiwtpUSNySu2VAxN8waEPciCHJvMA@mail.gmail.com/ CC: stable@vger.kernel.org # 5.10+ Reviewed-by: NQu Wenruo <wqu@suse.com> Signed-off-by: NFilipe Manana <fdmanana@suse.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Reviewed-by: NWang Weiyang <wangweiyang2@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
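A minimal sketch of the described fix, assuming the assign ioctl serializes on fs_info->qgroup_ioctl_lock as stated above (not the verbatim mainline hunk): wrapping btrfs_run_qgroups() in that mutex keeps it from racing with quota disable freeing the quota root.

	/* in btrfs_ioctl_qgroup_assign(), before the transaction is ended */
	mutex_lock(&fs_info->qgroup_ioctl_lock);
	ret = btrfs_run_qgroups(trans);
	mutex_unlock(&fs_info->qgroup_ioctl_lock);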
-
Committed by Mikulas Patocka

mainline inclusion
from mainline-v6.3-rc4
commit fb294b1c
category: bugfix
bugzilla: 188393, https://gitee.com/openeuler/kernel/issues/I6JPSH
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fb294b1c0ba982144ca467a75e7d01ff26304e2b

----------------------------------------

The loop in dmcrypt_write may run for an unbounded amount of time, thus we need cond_resched() in it.

This commit fixes the following warning:

[ 3391.153255][ C12] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [dmcrypt_write/2:2897]
...
[ 3391.387210][ C12] Call trace:
[ 3391.390338][ C12]  blk_attempt_bio_merge.part.6+0x38/0x158
[ 3391.395970][ C12]  blk_attempt_plug_merge+0xc0/0x1b0
[ 3391.401085][ C12]  blk_mq_submit_bio+0x398/0x550
[ 3391.405856][ C12]  submit_bio_noacct+0x308/0x380
[ 3391.410630][ C12]  dmcrypt_write+0x1e4/0x208 [dm_crypt]
[ 3391.416005][ C12]  kthread+0x130/0x138
[ 3391.419911][ C12]  ret_from_fork+0x10/0x18

Reported-by: yangerkun <yangerkun@huawei.com>
Fixes: dc267621 ("dm crypt: offload writes to thread")
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
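A simplified sketch of the writer loop with the fix applied; pop_pending_write() is a hypothetical stand-in for the driver's real list handling, the only point being that cond_resched() now bounds the time between reschedules:

	while ((bio = pop_pending_write(cc)) != NULL) {
		submit_bio_noacct(bio);
		cond_resched();		/* the fix: yield between submitted bios */
	}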
-
Committed by Saravana Kannan
mainline inclusion from mainline-v5.11-rc1 commit 7008e58c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I6LM81 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7008e58c63bc8468e8d16154e25d780198b3ecfc ---------------------------------------------- There's a potential deadlock with the following cycle: wfs_lock --> device_links_lock --> kn->count Fix this by simply dropping the lock around a list_empty() check that's just exported to a sysfs file. The sysfs file output is an instantaneous check anyway and the lock doesn't really add any protection. Lockdep log: [ 48.808132] [ 48.808132] the existing dependency chain (in reverse order) is: [ 48.809069] [ 48.809069] -> #2 (kn->count){++++}: [ 48.809707] __kernfs_remove.llvm.7860393000964815146+0x2d4/0x460 [ 48.810537] kernfs_remove_by_name_ns+0x54/0x9c [ 48.811171] sysfs_remove_file_ns+0x18/0x24 [ 48.811762] device_del+0x2b8/0x5a8 [ 48.812269] __device_link_del+0x98/0xb8 [ 48.812829] device_links_driver_bound+0x210/0x2d8 [ 48.813496] driver_bound+0x44/0xf8 [ 48.814000] really_probe+0x340/0x6e0 [ 48.814526] driver_probe_device+0xb8/0x100 [ 48.815117] device_driver_attach+0x78/0xb8 [ 48.815708] __driver_attach+0xe0/0x194 [ 48.816255] bus_for_each_dev+0xa8/0x11c [ 48.816816] driver_attach+0x24/0x30 [ 48.817331] bus_add_driver+0x100/0x1e0 [ 48.817880] driver_register+0x78/0x114 [ 48.818427] __platform_driver_register+0x44/0x50 [ 48.819089] 0xffffffdbb3227038 [ 48.819551] do_one_initcall+0xd8/0x1e0 [ 48.820099] do_init_module+0xd8/0x298 [ 48.820636] load_module+0x3afc/0x44c8 [ 48.821173] __arm64_sys_finit_module+0xbc/0xf0 [ 48.821807] el0_svc_common+0xbc/0x1d0 [ 48.822344] el0_svc_handler+0x74/0x98 [ 48.822882] el0_svc+0x8/0xc [ 48.823310] [ 48.823310] -> #1 (device_links_lock){+.+.}: [ 48.824036] __mutex_lock_common+0xe0/0xe44 [ 48.824626] mutex_lock_nested+0x28/0x34 [ 48.825185] device_link_add+0xd4/0x4ec [ 48.825734] of_link_to_suppliers+0x158/0x204 [ 48.826347] of_fwnode_add_links+0x50/0x64 [ 48.826928] device_link_add_missing_supplier_links+0x90/0x11c [ 48.827725] fw_devlink_resume+0x58/0x130 [ 48.828296] of_platform_default_populate_init+0xb4/0xd0 [ 48.829030] do_one_initcall+0xd8/0x1e0 [ 48.829578] do_initcall_level+0xb8/0xcc [ 48.830137] do_basic_setup+0x60/0x7c [ 48.830662] kernel_init_freeable+0x128/0x1ac [ 48.831275] kernel_init+0x18/0x29c [ 48.831781] ret_from_fork+0x10/0x18 [ 48.832297] [ 48.832297] -> #0 (wfs_lock){+.+.}: [ 48.832922] __lock_acquire+0xe04/0x2e20 [ 48.833480] lock_acquire+0xbc/0xec [ 48.833984] __mutex_lock_common+0xe0/0xe44 [ 48.834577] mutex_lock_nested+0x28/0x34 [ 48.835136] waiting_for_supplier_show+0x3c/0x98 [ 48.835781] dev_attr_show+0x48/0xb4 [ 48.836295] sysfs_kf_seq_show+0xe8/0x184 [ 48.836864] kernfs_seq_show+0x48/0x8c [ 48.837401] seq_read+0x1c8/0x600 [ 48.837884] kernfs_fop_read+0x68/0x204 [ 48.838431] __vfs_read+0x60/0x214 [ 48.838925] vfs_read+0xbc/0x15c [ 48.839397] ksys_read+0x78/0xe4 [ 48.839869] __arm64_sys_read+0x1c/0x28 [ 48.840416] el0_svc_common+0xbc/0x1d0 [ 48.840953] el0_svc_handler+0x74/0x98 [ 48.841490] el0_svc+0x8/0xc [ 48.841917] [ 48.841917] other info that might help us debug this: [ 48.841917] [ 48.842920] Chain exists of: [ 48.842920] wfs_lock --> device_links_lock --> kn->count [ 48.842920] [ 48.844152] Possible unsafe locking scenario: [ 48.844152] [ 48.844895] CPU0 CPU1 [ 48.845463] ---- ---- [ 48.846032] lock(kn->count); [ 48.846417] lock(device_links_lock); [ 48.847203] lock(kn->count); [ 48.847902] 
lock(wfs_lock); [ 48.848276] [ 48.848276] *** DEADLOCK *** Reported-by: Cheng-Jui.Wang@mediatek.com Signed-off-by: NSaravana Kannan <saravanak@google.com> Link: https://lore.kernel.org/r/20201104205431.3795207-1-saravanak@google.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NZhang Zekun <zhangzekun11@huawei.com> Reviewed-by: NWeilong Chen <chenweilong@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
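A hedged sketch of the resulting show method (field and function names are approximate reconstructions, not verified against this tree): the value reported is an instantaneous snapshot, so the list_empty() check is issued without wfs_lock and the reported lock cycle is broken.

	static ssize_t waiting_for_supplier_show(struct device *dev,
						 struct device_attribute *attr,
						 char *buf)
	{
		/* snapshot only; wfs_lock is no longer taken here */
		bool waiting = !list_empty(&dev->links.needs_suppliers);

		return sprintf(buf, "%u\n", waiting);
	}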
-
Committed by John Garry

mainline inclusion
from mainline-v5.11-rc5
commit e1dc2099
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6LM81
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e1dc20995cb9fa04b46e8f37113a7203c906d2bf

---------------------------------------------

The current check of nvec < minvec for nvec returned from platform_irq_count() will not detect a negative error code in nvec. This is because minvec is unsigned, and, as such, nvec is promoted to unsigned in that check, which makes it a huge number (if it contained -EPROBE_DEFER). In practice, an error should not occur in nvec for the only in-tree user, but add a check anyway.

Fixes: e15f2fa9 ("driver core: platform: Add devm_platform_get_irqs_affinity()")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: John Garry <john.garry@huawei.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/1608561055-231244-1-git-send-email-john.garry@huawei.com
Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
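The promotion pitfall is easy to reproduce outside the kernel. The standalone program below (illustrative, not driver code) shows why "nvec < minvec" cannot catch a negative error code when minvec is unsigned, and why the fix adds an explicit signed check first:

	#include <stdio.h>

	#define EPROBE_DEFER 517	/* same numeric value as the kernel errno */

	int main(void)
	{
		int nvec = -EPROBE_DEFER;	/* error code from platform_irq_count() */
		unsigned int minvec = 1;

		if (nvec < minvec)		/* nvec is promoted to unsigned here */
			printf("error detected\n");
		else
			printf("error missed: nvec compares as %u\n", (unsigned int)nvec);

		if (nvec < 0)			/* the added explicit check catches it */
			printf("signed check catches %d\n", nvec);
		return 0;
	}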
-
Committed by Darrick J. Wong
mainline inclusion from mainline-v5.19-rc5 commit c78c2d09 category: bugfix bugzilla: 187164, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c78c2d0903183a41beb90c56a923e30f90fa91b9 -------------------------------- I observed the following evidence of a memory leak while running xfs/399 from the xfs fsck test suite (edited for brevity): XFS (sde): Metadata corruption detected at xfs_attr_shortform_verify_struct.part.0+0x7b/0xb0 [xfs], inode 0x1172 attr fork XFS: Assertion failed: ip->i_af.if_u1.if_data == NULL, file: fs/xfs/libxfs/xfs_inode_fork.c, line: 315 ------------[ cut here ]------------ WARNING: CPU: 2 PID: 91635 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs] CPU: 2 PID: 91635 Comm: xfs_scrub Tainted: G W 5.19.0-rc7-xfsx #rc7 6e6475eb29fd9dda3181f81b7ca7ff961d277a40 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 RIP: 0010:assfail+0x46/0x4a [xfs] Call Trace: <TASK> xfs_ifork_zap_attr+0x7c/0xb0 xfs_iformat_attr_fork+0x86/0x110 xfs_inode_from_disk+0x41d/0x480 xfs_iget+0x389/0xd70 xfs_bulkstat_one_int+0x5b/0x540 xfs_bulkstat_iwalk+0x1e/0x30 xfs_iwalk_ag_recs+0xd1/0x160 xfs_iwalk_run_callbacks+0xb9/0x180 xfs_iwalk_ag+0x1d8/0x2e0 xfs_iwalk+0x141/0x220 xfs_bulkstat+0x105/0x180 xfs_ioc_bulkstat.constprop.0.isra.0+0xc5/0x130 xfs_file_ioctl+0xa5f/0xef0 __x64_sys_ioctl+0x82/0xa0 do_syscall_64+0x2b/0x80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 This newly-added assertion checks that there aren't any incore data structures hanging off the incore fork when we're trying to reset its contents. From the call trace, it is evident that iget was trying to construct an incore inode from the ondisk inode, but the attr fork verifier failed and we were trying to undo all the memory allocations that we had done earlier. The three assertions in xfs_ifork_zap_attr check that the caller has already called xfs_idestroy_fork, which clearly has not been done here. As the zap function then zeroes the pointers, we've effectively leaked the memory. The shortest change would have been to insert an extra call to xfs_idestroy_fork, but it makes more sense to bundle the _idestroy_fork call into _zap_attr, since all other callsites call _idestroy_fork immediately prior to calling _zap_attr. IOWs, it eliminates one way to fail. Note: This change only applies cleanly to 2ed5b09b, since we just reworked the attr fork lifetime. However, I think this memory leak has existed since 0f45a1b2, since the chain xfs_iformat_attr_fork -> xfs_iformat_local -> xfs_init_local_fork will allocate ifp->if_u1.if_data, but if xfs_ifork_verify_local_attr fails, xfs_iformat_attr_fork will free i_afp without freeing any of the stuff hanging off i_afp. The solution for older kernels I think is to add the missing call to xfs_idestroy_fork just prior to calling kmem_cache_free. Found by fuzzing a.sfattr.hdr.totsize = lastbit in xfs/399. Fixes: 2ed5b09b ("xfs: make inode attribute forks a permanent part of struct xfs_inode") Probably-Fixes: 0f45a1b2 ("xfs: improve local fork verification") Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com> conflicts: fs/xfs/libxfs/xfs_attr_leaf.c Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
Committed by Dan Carpenter

mainline inclusion
from mainline-v5.19-rc5
commit 3f52e016
category: bugfix
bugzilla: 187164, https://gitee.com/openeuler/kernel/issues/I4KIAO
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3f52e016af600982989b5dee958d313c52483c92

--------------------------------

These NULL checks are no longer needed after commit 2ed5b09b ("xfs: make inode attribute forks a permanent part of struct xfs_inode").

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Darrick J. Wong

mainline inclusion
from mainline-v5.19-rc5
commit c01147d9
category: bugfix
bugzilla: 187164, https://gitee.com/openeuler/kernel/issues/I4KIAO
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c01147d929899f02a0a8b15e406d12784768ca72

--------------------------------

Replace the shouty macros here with typechecked helper functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

conflicts:
	fs/xfs/libxfs/xfs_attr_leaf.c
	fs/xfs/libxfs/xfs_bmap.c
	fs/xfs/libxfs/xfs_bmap_btree.c
	fs/xfs/libxfs/xfs_dir2.c
	fs/xfs/libxfs/xfs_inode_fork.h
	fs/xfs/scrub/symlink.c
	fs/xfs/xfs_itable.c
	fs/xfs/xfs_symlink.c
	fs/xfs/xfs_trace.h

Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Darrick J. Wong

mainline inclusion
from mainline-v5.19-rc5
commit 932b42c6
category: bugfix
bugzilla: 187164, https://gitee.com/openeuler/kernel/issues/I4KIAO
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=932b42c66cb5d0ca9800b128415b4ad6b1952b3e

--------------------------------

Replace this shouty macro with a real C function that has a more descriptive name.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

conflicts:
	fs/xfs/libxfs/xfs_attr.h
	fs/xfs/libxfs/xfs_inode_fork.c
	fs/xfs/scrub/btree.c
	fs/xfs/xfs_inode.c

Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Darrick J. Wong

mainline inclusion
from mainline-v5.19-rc5
commit e45d7cb2
category: bugfix
bugzilla: 187164, https://gitee.com/openeuler/kernel/issues/I4KIAO
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e45d7cb2356e6b59fe64da28324025cc6fcd3fbd

--------------------------------

Modify xfs_ifork_ptr to return a NULL pointer if the caller asks for the attribute fork but i_forkoff is zero. This eliminates the ambiguity between i_forkoff and i_af.if_present, which should make it easier to understand the lifetime of attr forks.

While we're at it, remove the if_present checks around calls to xfs_idestroy_fork and xfs_ifork_zap_attr since they can both handle attr forks that have already been torn down.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

conflicts:
	fs/xfs/libxfs/xfs_attr.h
	fs/xfs/libxfs/xfs_inode_fork.c
	fs/xfs/libxfs/xfs_inode_fork.h
	fs/xfs/xfs_icache.c
	fs/xfs/xfs_inode.c

Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Committed by Darrick J. Wong
mainline inclusion from mainline-v5.19-rc5 commit 2ed5b09b category: bugfix bugzilla: 187164, https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2ed5b09b3e8fc274ae8fecd6ab7c5106a364bed1 -------------------------------- Syzkaller reported a UAF bug a while back: ================================================================== BUG: KASAN: use-after-free in xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127 Read of size 4 at addr ffff88802cec919c by task syz-executor262/2958 CPU: 2 PID: 2958 Comm: syz-executor262 Not tainted 5.15.0-0.30.3-20220406_1406 #3 Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x82/0xa9 lib/dump_stack.c:106 print_address_description.constprop.9+0x21/0x2d5 mm/kasan/report.c:256 __kasan_report mm/kasan/report.c:442 [inline] kasan_report.cold.14+0x7f/0x11b mm/kasan/report.c:459 xfs_ilock_attr_map_shared+0xe3/0xf6 fs/xfs/xfs_inode.c:127 xfs_attr_get+0x378/0x4c2 fs/xfs/libxfs/xfs_attr.c:159 xfs_xattr_get+0xe3/0x150 fs/xfs/xfs_xattr.c:36 __vfs_getxattr+0xdf/0x13d fs/xattr.c:399 cap_inode_need_killpriv+0x41/0x5d security/commoncap.c:300 security_inode_need_killpriv+0x4c/0x97 security/security.c:1408 dentry_needs_remove_privs.part.28+0x21/0x63 fs/inode.c:1912 dentry_needs_remove_privs+0x80/0x9e fs/inode.c:1908 do_truncate+0xc3/0x1e0 fs/open.c:56 handle_truncate fs/namei.c:3084 [inline] do_open fs/namei.c:3432 [inline] path_openat+0x30ab/0x396d fs/namei.c:3561 do_filp_open+0x1c4/0x290 fs/namei.c:3588 do_sys_openat2+0x60d/0x98c fs/open.c:1212 do_sys_open+0xcf/0x13c fs/open.c:1228 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 RIP: 0033:0x7f7ef4bb753d Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007f7ef52c2ed8 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 RAX: ffffffffffffffda RBX: 0000000000404148 RCX: 00007f7ef4bb753d RDX: 00007f7ef4bb753d RSI: 0000000000000000 RDI: 0000000020004fc0 RBP: 0000000000404140 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0030656c69662f2e R13: 00007ffd794db37f R14: 00007ffd794db470 R15: 00007f7ef52c2fc0 </TASK> Allocated by task 2953: kasan_save_stack+0x19/0x38 mm/kasan/common.c:38 kasan_set_track mm/kasan/common.c:46 [inline] set_alloc_info mm/kasan/common.c:434 [inline] __kasan_slab_alloc+0x68/0x7c mm/kasan/common.c:467 kasan_slab_alloc include/linux/kasan.h:254 [inline] slab_post_alloc_hook mm/slab.h:519 [inline] slab_alloc_node mm/slub.c:3213 [inline] slab_alloc mm/slub.c:3221 [inline] kmem_cache_alloc+0x11b/0x3eb mm/slub.c:3226 kmem_cache_zalloc include/linux/slab.h:711 [inline] xfs_ifork_alloc+0x25/0xa2 fs/xfs/libxfs/xfs_inode_fork.c:287 xfs_bmap_add_attrfork+0x3f2/0x9b1 fs/xfs/libxfs/xfs_bmap.c:1098 xfs_attr_set+0xe38/0x12a7 fs/xfs/libxfs/xfs_attr.c:746 xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59 __vfs_setxattr+0x11b/0x177 fs/xattr.c:180 __vfs_setxattr_noperm+0x128/0x5e0 fs/xattr.c:214 __vfs_setxattr_locked+0x1d4/0x258 fs/xattr.c:275 vfs_setxattr+0x154/0x33d fs/xattr.c:301 setxattr+0x216/0x29f fs/xattr.c:575 __do_sys_fsetxattr fs/xattr.c:632 [inline] __se_sys_fsetxattr fs/xattr.c:621 [inline] __x64_sys_fsetxattr+0x243/0x2fe 
fs/xattr.c:621 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 Freed by task 2949: kasan_save_stack+0x19/0x38 mm/kasan/common.c:38 kasan_set_track+0x1c/0x21 mm/kasan/common.c:46 kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:360 ____kasan_slab_free mm/kasan/common.c:366 [inline] ____kasan_slab_free mm/kasan/common.c:328 [inline] __kasan_slab_free+0xe2/0x10e mm/kasan/common.c:374 kasan_slab_free include/linux/kasan.h:230 [inline] slab_free_hook mm/slub.c:1700 [inline] slab_free_freelist_hook mm/slub.c:1726 [inline] slab_free mm/slub.c:3492 [inline] kmem_cache_free+0xdc/0x3ce mm/slub.c:3508 xfs_attr_fork_remove+0x8d/0x132 fs/xfs/libxfs/xfs_attr_leaf.c:773 xfs_attr_sf_removename+0x5dd/0x6cb fs/xfs/libxfs/xfs_attr_leaf.c:822 xfs_attr_remove_iter+0x68c/0x805 fs/xfs/libxfs/xfs_attr.c:1413 xfs_attr_remove_args+0xb1/0x10d fs/xfs/libxfs/xfs_attr.c:684 xfs_attr_set+0xf1e/0x12a7 fs/xfs/libxfs/xfs_attr.c:802 xfs_xattr_set+0xeb/0x1a9 fs/xfs/xfs_xattr.c:59 __vfs_removexattr+0x106/0x16a fs/xattr.c:468 cap_inode_killpriv+0x24/0x47 security/commoncap.c:324 security_inode_killpriv+0x54/0xa1 security/security.c:1414 setattr_prepare+0x1a6/0x897 fs/attr.c:146 xfs_vn_change_ok+0x111/0x15e fs/xfs/xfs_iops.c:682 xfs_vn_setattr_size+0x5f/0x15a fs/xfs/xfs_iops.c:1065 xfs_vn_setattr+0x125/0x2ad fs/xfs/xfs_iops.c:1093 notify_change+0xae5/0x10a1 fs/attr.c:410 do_truncate+0x134/0x1e0 fs/open.c:64 handle_truncate fs/namei.c:3084 [inline] do_open fs/namei.c:3432 [inline] path_openat+0x30ab/0x396d fs/namei.c:3561 do_filp_open+0x1c4/0x290 fs/namei.c:3588 do_sys_openat2+0x60d/0x98c fs/open.c:1212 do_sys_open+0xcf/0x13c fs/open.c:1228 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3a/0x7e arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0x0 The buggy address belongs to the object at ffff88802cec9188 which belongs to the cache xfs_ifork of size 40 The buggy address is located 20 bytes inside of 40-byte region [ffff88802cec9188, ffff88802cec91b0) The buggy address belongs to the page: page:00000000c3af36a1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x2cec9 flags: 0xfffffc0000200(slab|node=0|zone=1|lastcpupid=0x1fffff) raw: 000fffffc0000200 ffffea00009d2580 0000000600000006 ffff88801a9ffc80 raw: 0000000000000000 0000000080490049 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88802cec9080: fb fb fb fc fc fa fb fb fb fb fc fc fb fb fb fb ffff88802cec9100: fb fc fc fb fb fb fb fb fc fc fb fb fb fb fb fc >ffff88802cec9180: fc fa fb fb fb fb fc fc fa fb fb fb fb fc fc fb ^ ffff88802cec9200: fb fb fb fb fc fc fb fb fb fb fb fc fc fb fb fb ffff88802cec9280: fb fb fc fc fa fb fb fb fb fc fc fa fb fb fb fb ================================================================== The root cause of this bug is the unlocked access to xfs_inode.i_afp from the getxattr code paths while trying to determine which ILOCK mode to use to stabilize the xattr data. 
Unfortunately, the VFS does not acquire i_rwsem when vfs_getxattr (or listxattr) call into the filesystem, which means that getxattr can race with a removexattr that's tearing down the attr fork and crash: xfs_attr_set: xfs_attr_get: xfs_attr_fork_remove: xfs_ilock_attr_map_shared: xfs_idestroy_fork(ip->i_afp); kmem_cache_free(xfs_ifork_cache, ip->i_afp); if (ip->i_afp && ip->i_afp = NULL; xfs_need_iread_extents(ip->i_afp)) <KABOOM> ip->i_forkoff = 0; Regrettably, the VFS is much more lax about i_rwsem and getxattr than is immediately obvious -- not only does it not guarantee that we hold i_rwsem, it actually doesn't guarantee that we *don't* hold it either. The getxattr system call won't acquire the lock before calling XFS, but the file capabilities code calls getxattr with and without i_rwsem held to determine if the "security.capabilities" xattr is set on the file. Fixing the VFS locking requires a treewide investigation into every code path that could touch an xattr and what i_rwsem state it expects or sets up. That could take years or even prove impossible; fortunately, we can fix this UAF problem inside XFS. An earlier version of this patch used smp_wmb in xfs_attr_fork_remove to ensure that i_forkoff is always zeroed before i_afp is set to null and changed the read paths to use smp_rmb before accessing i_forkoff and i_afp, which avoided these UAF problems. However, the patch author was too busy dealing with other problems in the meantime, and by the time he came back to this issue, the situation had changed a bit. On a modern system with selinux, each inode will always have at least one xattr for the selinux label, so it doesn't make much sense to keep incurring the extra pointer dereference. Furthermore, Allison's upcoming parent pointer patchset will also cause nearly every inode in the filesystem to have extended attributes. Therefore, make the inode attribute fork structure part of struct xfs_inode, at a cost of 40 more bytes. This patch adds a clunky if_present field where necessary to maintain the existing logic of xattr fork null pointer testing in the existing codebase. The next patch switches the logic over to XFS_IFORK_Q and it all goes away. Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NDave Chinner <dchinner@redhat.com> conflicts: fs/xfs/libxfs/xfs_attr.c fs/xfs/libxfs/xfs_attr.h fs/xfs/libxfs/xfs_attr_leaf.c fs/xfs/libxfs/xfs_bmap.c fs/xfs/libxfs/xfs_inode_buf.c fs/xfs/libxfs/xfs_inode_fork.c fs/xfs/libxfs/xfs_inode_fork.h fs/xfs/xfs_attr_inactive.c fs/xfs/xfs_attr_list.c fs/xfs/xfs_icache.c fs/xfs/xfs_inode.c fs/xfs/xfs_inode.h fs/xfs/xfs_inode_item.c fs/xfs/xfs_itable.c Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
Committed by Darrick J. Wong

mainline inclusion
from mainline-v5.19-rc5
commit 732436ef
category: bugfix
bugzilla: 187164, https://gitee.com/openeuler/kernel/issues/I4KIAO
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=732436ef916b4f338d672ea56accfdb11e8d0732

--------------------------------

We're about to make this logic do a bit more, so convert the macro to a static inline function for better typechecking and fewer shouty macros. No functional changes here.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

conflicts:
	fs/xfs/libxfs/xfs_bmap.c
	fs/xfs/libxfs/xfs_bmap_btree.c
	fs/xfs/libxfs/xfs_inode_fork.c
	fs/xfs/libxfs/xfs_inode_fork.h
	fs/xfs/scrub/bmap.c
	fs/xfs/scrub/symlink.c
	fs/xfs/xfs_inode.c
	fs/xfs/xfs_ioctl.c
	fs/xfs/xfs_qm.c
	fs/xfs/xfs_reflink.c

Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
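As an illustration of the conversion pattern (simplified from the mainline helper; not every detail of the real function is reproduced), the macro becomes a static inline with real parameter and return types the compiler can check:

	/* before: #define XFS_IFORK_PTR(ip, w) ... -- no type checking on ip or w */

	/* after: */
	static inline struct xfs_ifork *
	xfs_ifork_ptr(struct xfs_inode *ip, int whichfork)
	{
		switch (whichfork) {
		case XFS_DATA_FORK:
			return &ip->i_df;
		case XFS_ATTR_FORK:
			return ip->i_afp;	/* may be NULL if there is no attr fork */
		case XFS_COW_FORK:
			return ip->i_cowfp;
		default:
			return NULL;
		}
	}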
-
Committed by Brian Foster
mainline inclusion from mainline-v5.11-rc4 commit 06058bc4 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=06058bc40534530e617e5623775c53bb24f032cb -------------------------------- Freed extents are marked busy from the point the freeing transaction commits until the associated CIL context is checkpointed to the log. This prevents reuse and overwrite of recently freed blocks before the changes are committed to disk, which can lead to corruption after a crash. The exception to this rule is that metadata allocation is allowed to reuse busy extents because metadata changes are also logged. As of commit 97d3ac75 ("xfs: exact busy extent tracking"), XFS has allowed modification or complete invalidation of outstanding busy extents for metadata allocations. This implementation assumes that use of the associated extent is imminent, which is not always the case. For example, the trimmed extent might not satisfy the minimum length of the allocation request, or the allocation algorithm might be involved in a search for the optimal result based on locality. generic/019 reproduces a corruption caused by this scenario. First, a metadata block (usually a bmbt or symlink block) is freed from an inode. A subsequent bmbt split on an unrelated inode attempts a near mode allocation request that invalidates the busy block during the search, but does not ultimately allocate it. Due to the busy state invalidation, the block is no longer considered busy to subsequent allocation. A direct I/O write request immediately allocates the block and writes to it. Finally, the filesystem crashes while in a state where the initial metadata block free had not committed to the on-disk log. After recovery, the original metadata block is in its original location as expected, but has been corrupted by the aforementioned dio. This demonstrates that it is fundamentally unsafe to modify busy extent state for extents that are not guaranteed to be allocated. This applies to pretty much all of the code paths that currently trim busy extents for one reason or another. Therefore to address this problem, drop the reuse mechanism from the busy extent trim path. This code already knows how to return partial non-busy ranges of the targeted free extent and higher level code tracks the busy state of the allocation attempt. If a block allocation fails where one or more candidate extents is busy, we force the log and retry the allocation. Signed-off-by: NBrian Foster <bfoster@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NChandan Babu R <chandanrlinux@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
Committed by Zheng Yongjun

mainline inclusion
from mainline-v5.10-rc5
commit 1189686e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1189686e5440041057f8cc21a7c1d13bb6642cb9

--------------------------------

Replace a comma between expression statements by a semicolon.

Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
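The change is purely cosmetic where it lands, but the pattern it removes is a real trap. The standalone snippet below (illustrative, not the xfs code) shows how a comma quietly glues two statements into one expression, so both end up guarded by a preceding if:

	#include <stdio.h>

	int main(void)
	{
		int a = 0, b = 0;

		if (a != 0)
			a = 1,	/* comma: "b = 2" belongs to the same statement ... */
			b = 2;	/* ... so it is skipped together with "a = 1" */

		/* with a semicolon after "a = 1", "b = 2" would always execute */
		printf("a=%d b=%d\n", a, b);	/* prints: a=0 b=0 */
		return 0;
	}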
-
Committed by Dave Chinner
mainline inclusion from mainline-v5.17-rc6 commit 941fbdfd category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=941fbdfd6dd0f1d7961c28123b5460912f678cb5 -------------------------------- xfs_ail_push_all_sync() has a loop like this: while max_ail_lsn { prepare_to_wait(ail_empty) target = max_ail_lsn wake_up(ail_task); schedule() } Which is designed to sleep until the AIL is emptied. When xfs_ail_update_finish() moves the tail of the log, it does: if (list_empty(&ailp->ail_head)) wake_up_all(&ailp->ail_empty); So it will only wake up the sync push waiter when the AIL goes empty. If, by the time the push waiter has woken, the AIL has more in it, it will reset the target, wake the push task and go back to sleep. The problem here is that if the AIL is having items added to it when xfs_ail_push_all_sync() is called, then they may get inserted into the AIL at a LSN higher than the target LSN. At this point, xfsaild_push() will see that the target is X, the item LSNs are (X+N) and skip over them, hence never pushing the out. The result of this the AIL will not get emptied by the AIL push thread, hence xfs_ail_finish_update() will never see the AIL being empty even if it moves the tail. Hence xfs_ail_push_all_sync() never gets woken and hence cannot update the push target to capture the items beyond the current target on the LSN. This is a TOCTOU type of issue so the way to avoid it is to not use the push target at all for sync pushes. We know that a sync push is being requested by the fact the ail_empty wait queue is active, hence the xfsaild can just set the target to max_ail_lsn on every push that we see the wait queue active. Hence we no longer will leave items on the AIL that are beyond the LSN sampled at the start of a sync push. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChandan Babu R <chandan.babu@oracle.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
Committed by Dave Chinner
mainline inclusion from mainline-v5.17-rc6 commit dbd0f529 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dbd0f5299302f8506637592e2373891a748c6990 -------------------------------- AIL flushing can get stuck here: [316649.005769] INFO: task xfsaild/pmem1:324525 blocked for more than 123 seconds. [316649.007807] Not tainted 5.17.0-rc6-dgc+ #975 [316649.009186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [316649.011720] task:xfsaild/pmem1 state:D stack:14544 pid:324525 ppid: 2 flags:0x00004000 [316649.014112] Call Trace: [316649.014841] <TASK> [316649.015492] __schedule+0x30d/0x9e0 [316649.017745] schedule+0x55/0xd0 [316649.018681] io_schedule+0x4b/0x80 [316649.019683] xfs_buf_wait_unpin+0x9e/0xf0 [316649.021850] __xfs_buf_submit+0x14a/0x230 [316649.023033] xfs_buf_delwri_submit_buffers+0x107/0x280 [316649.024511] xfs_buf_delwri_submit_nowait+0x10/0x20 [316649.025931] xfsaild+0x27e/0x9d0 [316649.028283] kthread+0xf6/0x120 [316649.030602] ret_from_fork+0x1f/0x30 in the situation where flushing gets preempted between the unpin check and the buffer trylock under nowait conditions: blk_start_plug(&plug); list_for_each_entry_safe(bp, n, buffer_list, b_list) { if (!wait_list) { if (xfs_buf_ispinned(bp)) { pinned++; continue; } Here >>>>>> if (!xfs_buf_trylock(bp)) continue; This means submission is stuck until something else triggers a log force to unpin the buffer. To get onto the delwri list to begin with, the buffer pin state has already been checked, and hence it's relatively rare we get a race between flushing and encountering a pinned buffer in delwri submission to begin with. Further, to increase the pin count the buffer has to be locked, so the only way we can hit this race without failing the trylock is to be preempted between the pincount check seeing zero and the trylock being run. Hence to avoid this problem, just invert the order of trylock vs pin check. We shouldn't hit that many pinned buffers here, so optimising away the trylock for pinned buffers should not matter for performance at all. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChandan Babu R <chandan.babu@oracle.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
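A condensed sketch of the reordering, based on the loop quoted above (not the verbatim hunk): the trylock comes first, and the pin count is only sampled while the buffer is held locked, which closes the preemption window between the two checks.

	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		if (!wait_list) {
			if (!xfs_buf_trylock(bp))
				continue;
			if (xfs_buf_ispinned(bp)) {
				xfs_buf_unlock(bp);
				pinned++;
				continue;
			}
		}
		/* ... submit bp ... */
	}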
-
Committed by Dave Chinner
mainline inclusion from mainline-v5.17-rc6 commit a9a4bc8c category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a9a4bc8c76d747aa40b30e2dfc176c781f353a08 -------------------------------- After 963 iterations of generic/530, it deadlocked during recovery on a pinned inode cluster buffer like so: XFS (pmem1): Starting recovery (logdev: internal) INFO: task kworker/8:0:306037 blocked for more than 122 seconds. Not tainted 5.17.0-rc6-dgc+ #975 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/8:0 state:D stack:13024 pid:306037 ppid: 2 flags:0x00004000 Workqueue: xfs-inodegc/pmem1 xfs_inodegc_worker Call Trace: <TASK> __schedule+0x30d/0x9e0 schedule+0x55/0xd0 schedule_timeout+0x114/0x160 __down+0x99/0xf0 down+0x5e/0x70 xfs_buf_lock+0x36/0xf0 xfs_buf_find+0x418/0x850 xfs_buf_get_map+0x47/0x380 xfs_buf_read_map+0x54/0x240 xfs_trans_read_buf_map+0x1bd/0x490 xfs_imap_to_bp+0x4f/0x70 xfs_iunlink_map_ino+0x66/0xd0 xfs_iunlink_map_prev.constprop.0+0x148/0x2f0 xfs_iunlink_remove_inode+0xf2/0x1d0 xfs_inactive_ifree+0x1a3/0x900 xfs_inode_unlink+0xcc/0x210 xfs_inodegc_worker+0x1ac/0x2f0 process_one_work+0x1ac/0x390 worker_thread+0x56/0x3c0 kthread+0xf6/0x120 ret_from_fork+0x1f/0x30 </TASK> task:mount state:D stack:13248 pid:324509 ppid:324233 flags:0x00004000 Call Trace: <TASK> __schedule+0x30d/0x9e0 schedule+0x55/0xd0 schedule_timeout+0x114/0x160 __down+0x99/0xf0 down+0x5e/0x70 xfs_buf_lock+0x36/0xf0 xfs_buf_find+0x418/0x850 xfs_buf_get_map+0x47/0x380 xfs_buf_read_map+0x54/0x240 xfs_trans_read_buf_map+0x1bd/0x490 xfs_imap_to_bp+0x4f/0x70 xfs_iget+0x300/0xb40 xlog_recover_process_one_iunlink+0x4c/0x170 xlog_recover_process_iunlinks.isra.0+0xee/0x130 xlog_recover_finish+0x57/0x110 xfs_log_mount_finish+0xfc/0x1e0 xfs_mountfs+0x540/0x910 xfs_fs_fill_super+0x495/0x850 get_tree_bdev+0x171/0x270 xfs_fs_get_tree+0x15/0x20 vfs_get_tree+0x24/0xc0 path_mount+0x304/0xba0 __x64_sys_mount+0x108/0x140 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> task:xfsaild/pmem1 state:D stack:14544 pid:324525 ppid: 2 flags:0x00004000 Call Trace: <TASK> __schedule+0x30d/0x9e0 schedule+0x55/0xd0 io_schedule+0x4b/0x80 xfs_buf_wait_unpin+0x9e/0xf0 __xfs_buf_submit+0x14a/0x230 xfs_buf_delwri_submit_buffers+0x107/0x280 xfs_buf_delwri_submit_nowait+0x10/0x20 xfsaild+0x27e/0x9d0 kthread+0xf6/0x120 ret_from_fork+0x1f/0x30 We have the mount process waiting on an inode cluster buffer read, inodegc doing unlink waiting on the same inode cluster buffer, and the AIL push thread blocked in writeback waiting for the inode cluster buffer to become unpinned. What has happened here is that the AIL push thread has raced with the inodegc process modifying, committing and pinning the inode cluster buffer here in xfs_buf_delwri_submit_buffers() here: blk_start_plug(&plug); list_for_each_entry_safe(bp, n, buffer_list, b_list) { if (!wait_list) { if (xfs_buf_ispinned(bp)) { pinned++; continue; } Here >>>>>> if (!xfs_buf_trylock(bp)) continue; Basically, the AIL has found the buffer wasn't pinned and got the lock without blocking, but then the buffer was pinned. This implies the processing here was pre-empted between the pin check and the lock, because the pin count can only be increased while holding the buffer locked. Hence when it has gone to submit the IO, it has blocked waiting for the buffer to be unpinned. 
With all executing threads now waiting on the buffer to be unpinned, we normally get out of situations like this via the background log worker issuing a log force which will unpinned stuck buffers like this. But at this point in recovery, we haven't started the log worker. In fact, the first thing we do after processing intents and unlinked inodes is *start the log worker*. IOWs, we start it too late to have it break deadlocks like this. Avoid this and any other similar deadlock vectors in intent and unlinked inode recovery by starting the log worker before we recover intents and unlinked inodes. This part of recovery runs as though the filesystem is fully active, so we really should have the same infrastructure running as we normally do at runtime. Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Reviewed-by: NChandan Babu R <chandan.babu@oracle.com> Signed-off-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NLong Li <leo.lilong@huawei.com> Reviewed-by: NYang Erkun <yangerkun@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NJialin Zhang <zhangjialin11@huawei.com>
-
- 10 April 2023, 1 commit
-
-
Committed by openeuler-ci-bot

Merge Pull Request from: @zhangjian210

This patchset supports the dynamic affinity feature. Dynamic affinity sets preferred CPUs for a task. When the utilization of the task group's preferred CPUs is low, the task runs only on the preferred CPUs to enhance CPU resource locality and reduce interference between task cgroups; otherwise the task can burst out of the preferred CPUs and use other CPUs within cpus_allowed.

Link: https://gitee.com/openeuler/kernel/pulls/256

Reviewed-by: Zucheng Zheng <zhengzucheng@huawei.com>
Reviewed-by: Jialin Zhang <zhangjialin11@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
- 08 April 2023, 6 commits
-
-
Committed by tanghui

hulk inclusion
category: feature
bugzilla: 186575, https://gitee.com/openeuler/kernel/issues/I526XC

--------------------------------

Signed-off-by: tanghui <tanghui20@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
-
Committed by tanghui

hulk inclusion
category: feature
bugzilla: 186575, https://gitee.com/openeuler/kernel/issues/I526XC

--------------------------------

Signed-off-by: tanghui <tanghui20@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
-
Committed by tanghui

hulk inclusion
category: feature
bugzilla: 186575, https://gitee.com/openeuler/kernel/issues/I526XC

--------------------------------

Do not allow a task to migrate out of its preferred CPUs.

Signed-off-by: tanghui <tanghui20@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
-
Committed by tanghui

hulk inclusion
category: feature
bugzilla: 186575, https://gitee.com/openeuler/kernel/issues/I526XC

--------------------------------

Compare the task group's 'util_avg' on the preferred CPUs with the capacity of the preferred CPUs, and dynamically adjust the CPU range for the task wakeup process.

Signed-off-by: tanghui <tanghui20@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
-
Committed by tanghui

hulk inclusion
category: feature
bugzilla: 186575, https://gitee.com/openeuler/kernel/issues/I526XC

--------------------------------

Add a 'prefer_cpus' sysfs file and the related interface in the cpuset cgroup.

Signed-off-by: tanghui <tanghui20@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
-
Committed by tanghui

hulk inclusion
category: feature
bugzilla: 186575, https://gitee.com/openeuler/kernel/issues/I526XC

--------------------------------

Dynamic affinity sets preferred CPUs for a task. When the utilization of the task group's preferred CPUs is low, the task runs only on the preferred CPUs to enhance CPU resource locality and reduce interference between task cgroups; otherwise the task can burst out of the preferred CPUs and use other CPUs within cpus_allowed.

Signed-off-by: tanghui <tanghui20@huawei.com>
Signed-off-by: Zheng Zucheng <zhengzucheng@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>
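A conceptual sketch of the wakeup-time decision this series describes; every name below (the accessor, the threshold, the helper itself) is illustrative rather than the actual openEuler interface:

	/* illustrative only: shape of the dynamic-affinity CPU selection */
	static void sketch_select_cpus(struct task_struct *p, struct cpumask *select)
	{
		unsigned long util_sum = 0, cap_sum = 0;
		int cpu;

		cpumask_copy(select, p->cpus_ptr);		/* default: full cpus_allowed */

		for_each_cpu(cpu, task_prefer_cpus(p)) {	/* hypothetical accessor */
			util_sum += cpu_util(cpu);
			cap_sum  += capacity_of(cpu);
		}

		/* preferred CPUs still have headroom: restrict the wakeup to them,
		 * otherwise keep the full mask so the task can burst out */
		if (cap_sum && util_sum * 100 < cap_sum * util_low_pct)	/* util_low_pct: illustrative threshold */
			cpumask_and(select, p->cpus_ptr, task_prefer_cpus(p));
	}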
-
- 07 April 2023, 5 commits
-
-
Committed by openeuler-ci-bot

Merge Pull Request from: @zhangsong234

euleros inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5HF3M
CVE: NA

Add a new sysctl interface: `/proc/sys/kernel/sched_prio_load_balance_enabled`

0: default behavior
1: enable priority load balance for the QoS scheduler

For tasks co-located with the QoS scheduler, when CFS does load balance, it is reasonable to prefer migrating online (latency sensitive) tasks. So the CFS load balance is changed as follows:

1) The `cfs_tasks` list is owned by online tasks.
2) Add a new `cfs_offline_tasks` list which is owned by offline tasks.
3) Prefer to migrate the online tasks of the `cfs_tasks` list to the dst rq.

In the scenario of hyperthread interference, with the SMT expeller feature enabled, CPU A and CPU B are two hyperthreads on a physical core: CPU A runs online tasks while CPU B only has offline tasks, and the offline tasks on CPU B are expelled by the online tasks on CPU A and cannot be scheduled. However, when load balance is triggered, before CPU B can migrate some online tasks from CPU A, the load on the two CPUs is already balanced. As a result, CPU B cannot run online tasks and online tasks cannot be evenly distributed among different CPUs.

Link: https://gitee.com/openeuler/kernel/pulls/323

Reviewed-by: Zheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: Zucheng Zheng <zhengzucheng@huawei.com>
Reviewed-by: Liu Chao <liuchao173@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Song Zhang

euleros inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I60S5R
CVE: NA
Reference: NA

--------------------------------

When priority load balance is enabled for online/offline task co-location, offline tasks may starve in load balance detach_tasks() if env->loop exceeds env->loop_max and they cannot be migrated to idle CPUs. So we should set env->loop to 0 and detach offline tasks again.

Fixes: fbfd4454 ("sched: Introduce priority load balance for CFS")
Signed-off-by: Song Zhang <zhangsong34@huawei.com>
-
Committed by zhangsong

euleros inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5HF3M
CVE: NA
Reference: https://gitee.com/openeuler/kernel/pulls/41

--------------------------------

Set CONFIG_QOS_SCHED_PRIO_LB to y: this feature enables priority load balance for the QoS scheduler, which prefers migrating online tasks and migrates offline tasks only secondarily.

Signed-off-by: zhangsong <zhangsong34@huawei.com>
-
Committed by zhangsong

euleros inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5HF3M
CVE: NA
Reference: NA

--------------------------------

Add a new sysctl interface: `/proc/sys/kernel/sched_prio_load_balance_enabled`

0: default behavior
1: enable priority load balance for the QoS scheduler

For tasks co-located with the QoS scheduler, when CFS does load balance, it is reasonable to prefer migrating online (latency sensitive) tasks. So the CFS load balance is changed as follows (see the sketch after this message):

1) The `cfs_tasks` list is owned by online tasks.
2) Add a new `cfs_offline_tasks` list which is owned by offline tasks.
3) Prefer to migrate the online tasks of the `cfs_tasks` list to the dst rq.

Signed-off-by: zhangsong <zhangsong34@huawei.com>
Reviewed-by: Zhang Qiao <zhangqiao22@huawei.com>

--------------------------------

V2->V3:
- remove skip_migrate_task for load balance
V1->V2:
- remove setting cpu shares for offline cgroup

Conflicts:
	kernel/sched/sched.h
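A conceptual sketch of the two-list arrangement (names are illustrative, not the openEuler hunk): detach_tasks() drains the online list first and only falls back to the offline list when no more online tasks can be moved.

	/* illustrative helper: which list should load balance scan next? */
	static struct list_head *sketch_pick_migrate_list(struct rq *rq)
	{
		if (prio_load_balance_enabled &&		/* the new sysctl switch */
		    !list_empty(&rq->cfs_tasks))
			return &rq->cfs_tasks;			/* online tasks first */

		return &rq->cfs_offline_tasks;			/* offline tasks second */
	}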
-
Committed by openeuler-ci-bot

Merge Pull Request from: @zhangjialin11

Pull new CVEs:
CVE-2023-1513
CVE-2022-4269

net bugfixes from Zhengchao Shao
ext4 bugfixes from Zhihao Cheng and Baokun Li
mm bugfix from Ma Wupeng
timer bugfix from Yu Liao
nvme bugfix from Li Lingfeng
xfs bugfixes from Zhihao Cheng
scsi bugfix from Yu Kuai

Link: https://gitee.com/openeuler/kernel/pulls/563

Reviewed-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 04 April 2023, 11 commits
-
-
Committed by Gustavo A. R. Silva

mainline inclusion
from mainline-v5.16-rc1
commit 12929198
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6SFHJ
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=129291980f4901ae68ee3ba4344bdc38cf5f800d

--------------------------------

Make use of the struct_size() helper instead of an open-coded version, in order to avoid any potential type mistakes or integer overflows that, in the worst scenario, could lead to heap overflows.

Link: https://github.com/KSPP/linux/issues/160
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20210929201718.GA342296@embeddedor
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
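A short illustration of the helper this commit switches to; the struct and variables here are made up, while struct_size() itself is the kernel helper from <linux/overflow.h>:

	struct sample {			/* illustrative flexible-array struct */
		u32 count;
		u32 entries[];
	};

	struct sample *p;
	size_t n = 16;

	/* open-coded: the multiplication/addition can wrap for a large n */
	p = kmalloc(sizeof(*p) + n * sizeof(u32), GFP_KERNEL);

	/* helper: overflow-checked, element size derived from p->entries */
	p = kmalloc(struct_size(p, entries, n), GFP_KERNEL);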
-
Committed by Gustavo A. R. Silva

mainline inclusion
from mainline-v5.16-rc1
commit 69508d43
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6SFHJ
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=69508d43334e3b09c344f662272bcf24a5b508ed

--------------------------------

Make use of the struct_size() and flex_array_size() helpers instead of an open-coded version, in order to avoid any potential type mistakes or integer overflows that, in the worst scenario, could lead to heap overflows.

Link: https://github.com/KSPP/linux/issues/160
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20210928193107.GA262595@embeddedor
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Conflicts:
	net/sched/sch_api.c

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Liu Jian <liujian56@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Submitted by Zhang Yi
mainline inclusion
from mainline-v6.3-rc1
commit 240930fb
category: perf
bugzilla: https://gitee.com/openeuler/kernel/issues/I6S63P
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=240930fb7e6b52229bdee5b1423bfeab0002fed2

--------------------------------

In the dio write path, we only take the shared inode lock for the case of
aligned overwrites of initialized blocks inside EOF. But overwriting
preallocated blocks may only need to split unwritten extents, and that
procedure is already protected by the i_data_sem lock, so it is safe to
release the exclusive inode lock and take the shared inode lock instead.
This can give a significant speed up for multi-threaded writes.

Tested on an Intel Xeon Gold 6140 and an NVMe SSD with the fio parameters
below.

  direct=1
  ioengine=libaio
  iodepth=10
  numjobs=10
  runtime=60
  rw=randwrite
  size=100G

Test results:

  Before:
   bs=4k  IOPS=11.1k, BW=43.2MiB/s
   bs=16k IOPS=11.1k, BW=173MiB/s
   bs=64k IOPS=11.2k, BW=697MiB/s

  After:
   bs=4k  IOPS=41.4k, BW=162MiB/s
   bs=16k IOPS=41.3k, BW=646MiB/s
   bs=64k IOPS=13.5k, BW=843MiB/s

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20221226062015.3479416-1-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

Conflicts:
	fs/ext4/file.c
[ 2f632965("iomap: pass a flags argument to iomap_dio_rw") is not applied. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
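A compact illustration of the lock-mode decision the entry describes, reduced to booleans: before the patch only aligned overwrites of initialized blocks inside EOF qualified for the shared inode lock, afterwards aligned overwrites of preallocated (unwritten) blocks qualify too, since splitting unwritten extents is serialized by i_data_sem. The `struct dio_write` flags and `shared_ilock_ok()` are invented for the example and are not the ext4 code.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical flags summarising the cases discussed in the commit text. */
struct dio_write {
	bool aligned;          /* I/O is block-aligned                  */
	bool inside_eof;       /* range does not extend the file        */
	bool maps_initialized; /* target blocks are already written     */
	bool maps_unwritten;   /* target blocks are preallocated        */
};

/* Sketch of the decision: `patched` selects the behaviour after this change. */
static bool shared_ilock_ok(const struct dio_write *w, bool patched)
{
	if (!w->aligned || !w->inside_eof)
		return false;
	if (w->maps_initialized)
		return true;
	return patched && w->maps_unwritten;
}

int main(void)
{
	struct dio_write prealloc = { true, true, false, true };

	/* Overwrite of preallocated blocks: exclusive before, shared after. */
	printf("shared lock before: %d, after: %d\n",
	       shared_ilock_ok(&prealloc, false),
	       shared_ilock_ok(&prealloc, true));
	return 0;
}
```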
-
Submitted by Baokun Li
hulk inclusion
category: bugfix
bugzilla: 188500, https://gitee.com/openeuler/kernel/issues/I6RJ0V
CVE: NA

--------------------------------

We got a WARNING in ext4_add_complete_io:

==================================================================
WARNING: at fs/ext4/page-io.c:231 ext4_put_io_end_defer+0x182/0x250
CPU: 10 PID: 77 Comm: ksoftirqd/10 Tainted: 6.3.0-rc2 #85
RIP: 0010:ext4_put_io_end_defer+0x182/0x250 [ext4]
[...]
Call Trace:
 <TASK>
 ext4_end_bio+0xa8/0x240 [ext4]
 bio_endio+0x195/0x310
 blk_update_request+0x184/0x770
 scsi_end_request+0x2f/0x240
 scsi_io_completion+0x75/0x450
 scsi_finish_command+0xef/0x160
 scsi_complete+0xa3/0x180
 blk_complete_reqs+0x60/0x80
 blk_done_softirq+0x25/0x40
 __do_softirq+0x119/0x4c8
 run_ksoftirqd+0x42/0x70
 smpboot_thread_fn+0x136/0x3c0
 kthread+0x140/0x1a0
 ret_from_fork+0x2c/0x50
==================================================================

The above issue may happen as follows:

            cpu1                                cpu2
----------------------------|----------------------------
mount -o dioread_lock
ext4_writepages
 ext4_do_writepages
  *if (ext4_should_dioread_nolock(inode))*
  // rsv_blocks is not assigned here
                              mount -o remount,dioread_nolock
  ext4_journal_start_with_reserve
   __ext4_journal_start
    __ext4_journal_start_sb
     jbd2__journal_start
      *if (rsv_blocks)*
      // h_rsv_handle is not initialized here
  mpage_map_and_submit_extent
   mpage_map_one_extent
    dioread_nolock = ext4_should_dioread_nolock(inode)
    if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN))
     mpd->io_submit.io_end->handle = handle->h_rsv_handle
     ext4_set_io_unwritten_flag
      io_end->flag |= EXT4_IO_END_UNWRITTEN
      // now io_end->handle is NULL but has EXT4_IO_END_UNWRITTEN flag

scsi_finish_command
 scsi_io_completion
  scsi_io_completion_action
   scsi_end_request
    blk_update_request
     req_bio_endio
      bio_endio
       bio->bi_end_io > ext4_end_bio
        ext4_put_io_end_defer
         ext4_add_complete_io
         // trigger WARN_ON(!io_end->handle && sbi->s_journal);

The immediate cause of this problem is that ext4_should_dioread_nolock()
returns inconsistent values in ext4_do_writepages() and
mpage_map_one_extent(). There are four conditions in this function that
can be changed at mount time to cause this problem, and they fall into
two categories:

(1) journal_data and EXT4_EXTENTS_FL, which can be changed by ioctl
(2) DELALLOC and DIOREAD_NOLOCK, which can be changed by remount

The two in the first category have been fixed by commit c8585c6f ("ext4:
fix races between changing inode journal mode and ext4_writepages") and
commit cb85f4d2 ("ext4: fix race between writepages and enabling
EXT4_EXTENTS_FL") respectively. The two cases in the second category have
not yet been fixed, and the above issue is caused by that situation.

Following the fix for the first category, grab s_writepages_rwsem while
applying options during remount to avoid racing with writepages operations.

Fixes: 6b523df4 ("ext4: use transaction reservation for extent conversion in ext4_end_io")
Cc: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
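A user-space analogue of the fix idea, using a pthread rwlock in place of ext4's s_writepages_rwsem: the remount path takes it exclusively while flipping the option, so a writeback pass holding it shared observes one consistent dioread_nolock value from start to finish. Names and structure are illustrative only, not the ext4 patch itself.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_rwlock_t writepages_lock = PTHREAD_RWLOCK_INITIALIZER;
static bool dioread_nolock;

/* Remount path: option changes happen under the exclusive lock. */
static void remount_set_dioread_nolock(bool val)
{
	pthread_rwlock_wrlock(&writepages_lock);
	dioread_nolock = val;
	pthread_rwlock_unlock(&writepages_lock);
}

/* Writeback pass: holds the lock shared for the whole pass, so the
 * option cannot flip between the reservation and the extent mapping. */
static void writepages_pass(void)
{
	pthread_rwlock_rdlock(&writepages_lock);
	bool nolock = dioread_nolock;   /* sampled exactly once per pass */

	printf("writepages with dioread_nolock=%d\n", nolock);
	pthread_rwlock_unlock(&writepages_lock);
}

int main(void)
{
	writepages_pass();
	remount_set_dioread_nolock(true);
	writepages_pass();
	return 0;
}
```

Build with `-pthread`; the point is only that the option change and the writeback pass are mutually exclusive.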
-
Submitted by Ma Wupeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6RKHX
CVE: NA

--------------------------------

After a fork operation, it is wrong for the child process to report a
reliable memory usage twice that of its parent. Examining mm_struct shows
that reliable_nr_page should be initialized to 0 in mm_init(), just as
the RSS counters are. The fork case is merely one example of the problem.
Fix it by setting reliable_nr_page to 0 during mm_init().

Fixes: d81e9624 ("proc: Count reliable memory usage of reliable tasks")
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
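A small stand-alone illustration of why the counter must be re-zeroed: duplicating the parent's mm copies the counter along with everything else, so unless an mm_init()-style step resets it, the child starts life already charged with the parent's reliable pages. The `struct mm` here is a cut-down stand-in, not the kernel's mm_struct.

```c
#include <stdio.h>
#include <string.h>

/* Cut-down stand-in for mm_struct; only the counter discussed above. */
struct mm {
	long reliable_nr_page;
	/* ... rss counters, etc. ... */
};

/* Like mm_init(): per-mm counters start at zero, as the fix recommends. */
static void mm_init_counters(struct mm *mm)
{
	mm->reliable_nr_page = 0;
}

/* Fork duplicates the parent's mm; without the re-init the child would
 * inherit the parent's count and the accounted total would double. */
static struct mm dup_mm(const struct mm *parent)
{
	struct mm child;

	memcpy(&child, parent, sizeof(child));
	mm_init_counters(&child);          /* the fix: zero it here */
	return child;
}

int main(void)
{
	struct mm parent = { .reliable_nr_page = 128 };
	struct mm child = dup_mm(&parent);

	printf("parent=%ld child=%ld\n", parent.reliable_nr_page,
	       child.reliable_nr_page);
	return 0;
}
```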
-
Submitted by Yang Guo
mainline inclusion
from mainline-v6.1-rc1
commit af246cc6
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6RQVI
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=af246cc6d0ed11318223606128bb0b09866c4c08

--------------------------------

CNTPCT_LO and CNTVCT_LO are defined by mistake in commit '8b82c4f8', so
fix them according to the Arm ARM DDI 0487I.a, Table I2-4 "CNTBaseN memory
map" as follows:

  Offset  Register        Type  Description
  0x000   CNTPCT[31:0]    RO    Physical Count register.
  0x004   CNTPCT[63:32]   RO
  0x008   CNTVCT[31:0]    RO    Virtual Count register.
  0x00C   CNTVCT[63:32]   RO

Fixes: 8b82c4f8 ("clocksource/drivers/arm_arch_timer: Move MMIO timer programming over to CVAL")
Cc: stable@vger.kernel.org
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Signed-off-by: Yang Guo <guoyang2@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Link: https://lore.kernel.org/r/20220927033221.49589-1-zhangshaokun@hisilicon.com
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
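The corrected low-word offsets implied by the memory map quoted above, written out as C defines. The macro names follow the CNTPCT_LO/CNTVCT_LO naming from the commit subject area; treat this as an illustration of the fixed values rather than a copy of the driver source.

```c
/*
 * Low 32-bit word offsets in the CNTBaseN frame, per Arm ARM DDI 0487I.a,
 * Table I2-4: the physical count starts at 0x000, the virtual count at 0x008.
 */
#define CNTPCT_LO	0x00	/* CNTPCT[31:0]: physical count, low word */
#define CNTVCT_LO	0x08	/* CNTVCT[31:0]: virtual count, low word  */
```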
-
Submitted by Greg Kroah-Hartman
stable inclusion
from stable-v5.10.169
commit 6416c2108ba54d569e4c98d3b62ac78cb12e7107
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6OOP3
CVE: CVE-2023-1513
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=6416c2108ba54d569e4c98d3b62ac78cb12e7107

--------------------------------

commit 2c10b614 upstream.

When calling the KVM_GET_DEBUGREGS ioctl, on some configurations there
might be some uninitialized portions of the kvm_debugregs structure that
could be copied to userspace. Prevent this, as is done in the other kvm
ioctls, by setting the whole structure to 0 before copying anything into
it. As a bonus, this reduces the lines of code, since the explicit flag
setting and reserved-space zeroing can be removed.

Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: <x86@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: stable <stable@kernel.org>
Reported-by: Xingyuan Mo <hdthky0@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Message-Id: <20230214103304.3689213-1-gregkh@linuxfoundation.org>
Tested-by: Xingyuan Mo <hdthky0@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Guo Mengqi <guomengqi3@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
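The pattern the fix applies, shown on a local mirror of the x86 `struct kvm_debugregs` layout so the example compiles on its own: zeroing the whole structure up front means the flags field and the reserved[] tail can never carry uninitialized stack bytes back to userspace. The fill values below are examples, not the KVM code.

```c
#include <stdio.h>
#include <string.h>

/* Local mirror of the x86 uapi struct kvm_debugregs layout, reproduced
 * here only so the example is self-contained. */
struct kvm_debugregs {
	unsigned long long db[4];
	unsigned long long dr6;
	unsigned long long dr7;
	unsigned long long flags;
	unsigned long long reserved[9];
};

/* Zero the whole structure first, then fill the fields that matter; no
 * per-field clearing of flags or reserved[] is needed afterwards. */
static void fill_debugregs(struct kvm_debugregs *dbgregs)
{
	memset(dbgregs, 0, sizeof(*dbgregs));
	dbgregs->dr6 = 0xffff0ff0ULL;   /* example: architectural reset value */
	dbgregs->dr7 = 0x400ULL;
}

int main(void)
{
	struct kvm_debugregs d;

	fill_debugregs(&d);
	printf("dr6=%#llx dr7=%#llx flags=%llu\n", d.dr6, d.dr7, d.flags);
	return 0;
}
```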
-
Submitted by Li Lingfeng
hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6QMTU

--------------------------------

Recently, a null-ptr-deref problem occurred when submitting IO to an nvme
disk:

[34432.226539] ==========================================================
[34432.226579] BUG: KASAN: null-ptr-deref in trace_event_raw_event_nvme_complete_rq+0x13c/0x270 [nvme_core]
[34432.226584] Read of size 2 at addr 0000000000000002 by task loop0/32
[34432.226586]
[34432.226594] CPU: 0 PID: 3242729 Comm: loop0 Kdump: loaded Tainted:
[34432.226598] Hardware name: Huawei TaiShan 2280 V2/BC82AMDD
[34432.226602] Call trace:
[34432.226610]  dump_backtrace+0x0/0x2fc
[34432.226615]  show_stack+0x20/0x30
[34432.226623]  dump_stack+0x104/0x17c
[34432.226630]  __kasan_report+0x138/0x140
[34432.226634]  kasan_report+0x44/0xdc
[34432.226639]  __asan_load2+0x90/0xd0
[34432.226662]  trace_event_raw_event_nvme_complete_rq+0x13c/0x270
[34432.226684]  nvme_complete_rq+0x228/0x480 [nvme_core]
[34432.226698]  nvme_pci_complete_rq+0x184/0x1b4 [nvme]
[34432.226706]  nvme_irq+0x270/0x500 [nvme]
[34432.226714]  __handle_irq_event_percpu+0x8c/0x324
[34432.226719]  handle_irq_event_percpu+0x88/0x11c
[34432.226724]  handle_irq_event+0x110/0x2b0
[34432.226729]  handle_fasteoi_irq+0x1e4/0x3f4
[34432.226734]  __handle_domain_irq+0xbc/0x130
[34432.226739]  gic_handle_irq+0x78/0x460
[34432.226743]  el1_irq+0xb8/0x140
[34432.226750]  __slab_alloc+0x38/0x70
[34432.226756]  kmem_cache_alloc+0x6b8/0x904
[34432.226762]  mempool_alloc_slab+0x3c/0x60
[34432.226766]  mempool_alloc+0xf0/0x440
[34432.226772]  bio_alloc_bioset+0x208/0x2f0
[34432.226899]  io_submit_init_bio+0x3c/0x190 [ext4]
[34432.226991]  ext4_bio_write_page+0x540/0xbd0 [ext4]
[34432.227082]  mpage_submit_page+0xb0/0x120 [ext4]
[34432.227173]  mpage_process_page_bufs+0x25c/0x2b4 [ext4]
[34432.227265]  mpage_prepare_extent_to_map+0x3b8/0x75c [ext4]
[34432.227356]  ext4_writepages+0x454/0xcb4 [ext4]
[34432.227361]  do_writepages+0xc4/0x1c0
...

This can be reproduced by the following steps:
1) modprobe nvme
2) echo nvme:* > /sys/kernel/debug/tracing/set_event
3) dd if=/dev/random of=/dev/nvmexxx bs=1M count=1024

Generating the command_id in the trace event with nvme_cid() instead of
nvme_req(req)->cmd->common.command_id fixes this, since nvme_req(req)->cmd
can sometimes be NULL.

Fixes: bbd033cb ("nvme: use command_id instead of req->tag in trace_nvme_complete_rq()")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
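A stand-alone sketch of the difference the fix makes: the pre-fix path derives the command id by dereferencing the request's command pointer, which is NULL in the failing completion path above, while the fixed path (the entry points to nvme_cid()) derives it from the request itself. All structures here are cut-down stand-ins for illustration, not the nvme driver types.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Cut-down stand-ins for the structures in the trace path above. */
struct nvme_command { uint16_t command_id; };
struct nvme_request { struct nvme_command *cmd; };
struct request { int tag; struct nvme_request nreq; };

/* Pre-fix behaviour: dereferences nreq.cmd, which can be NULL here. */
static uint16_t cid_from_cmd(const struct request *rq)
{
	return rq->nreq.cmd->command_id;   /* NULL deref when cmd == NULL */
}

/* Post-fix idea: derive the id from the request itself, so the command
 * pointer is never touched in the trace path. */
static uint16_t cid_from_request(const struct request *rq)
{
	return (uint16_t)rq->tag;
}

int main(void)
{
	struct request rq = { .tag = 42, .nreq = { .cmd = NULL } };

	if (rq.nreq.cmd)        /* NULL here, so the unsafe path is skipped */
		printf("cmd cid=%u\n", cid_from_cmd(&rq));
	printf("cid=%u\n", cid_from_request(&rq));
	return 0;
}
```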
-
Submitted by Darrick J. Wong
mainline inclusion
from mainline-v5.18-rc1
commit 85bcfa26
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=85bcfa26f9a3782be37d4feafd49668b98b8bdbe

--------------------------------

On a modern filesystem, we don't allow userspace to allocate blocks for
data storage from the per-AG space reservations, the user-controlled
reservation pool that prevents ENOSPC in the middle of internal
operations, or the internal per-AG set-aside that prevents unwanted
filesystem shutdowns due to ENOSPC during a bmap btree split.

Since we now consider freespace btree blocks as unavailable for
allocation for data storage, we shouldn't report those blocks via statfs
either. This makes the numbers that we return via the statfs f_bavail and
f_bfree fields a more conservative estimate of actual free space.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
-
Submitted by Darrick J. Wong
mainline inclusion
from mainline-v5.18-rc1
commit c8c56825
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c8c568259772751a14e969b7230990508de73d9d

--------------------------------

xfs_reserve_blocks controls the size of the user-visible free space
reserve pool. Given the difference between the current and requested pool
sizes, it will try to reserve free space from fdblocks. However, the
amount requested from fdblocks is also constrained by the amount of space
that we think xfs_mod_fdblocks will give us. If we forget to subtract
m_allocbt_blks before calling xfs_mod_fdblocks, it will return ENOSPC and
we'll hang the kernel at mount due to the infinite loop.

In commit fd43cf60, we decided that xfs_mod_fdblocks should not hand out
the "free space" used by the free space btrees, because some portion of
the free space btrees hold in reserve space for future btree expansion.
Unfortunately, xfs_reserve_blocks' estimation of the number of blocks
that it could request from xfs_mod_fdblocks was not updated to include
m_allocbt_blks, so if space is extremely low, the caller hangs.

Fix this by creating a function to estimate the number of blocks that can
be reserved from fdblocks, which needs to exclude the set-aside and
m_allocbt_blks.

Found by running xfs/306 (which formats a single-AG 20MB filesystem) with
an fstests configuration that specifies a 1k blocksize and a specially
crafted log size that will consume 7/8 of the space (17920 blocks,
specifically) in that AG.

Cc: Brian Foster <bfoster@redhat.com>
Fixes: fd43cf60 ("xfs: set aside allocation btree blocks from block reservation")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>

Conflicts:
	fs/xfs/xfs_mount.h
[ 15f04fdc("xfs: remove infinite loop when reserving free block pool") applied earlier. ]

Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
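A toy version of the estimate this fix introduces: the number of blocks xfs_reserve_blocks may ask for has to exclude both the set-aside and the in-use allocation btree blocks, otherwise the request can exceed what xfs_mod_fdblocks() will grant and the mount loops. The helper name and the set-aside value are invented; the 17920-block figure echoes the xfs/306 setup mentioned above.

```c
#include <stdint.h>
#include <stdio.h>

/* Cut-down view of the counters discussed above; illustrative only. */
struct mount_counters {
	uint64_t fdblocks;        /* free data blocks                      */
	uint64_t alloc_set_aside; /* set-aside for btree splits (example)  */
	uint64_t allocbt_blks;    /* blocks currently used by alloc btrees */
};

/* Hypothetical helper: blocks the reserve pool may actually request. */
static uint64_t fdblocks_available_for_reservation(const struct mount_counters *m)
{
	uint64_t unavailable = m->alloc_set_aside + m->allocbt_blks;

	return m->fdblocks > unavailable ? m->fdblocks - unavailable : 0;
}

int main(void)
{
	struct mount_counters m = { .fdblocks = 17920,
				    .alloc_set_aside = 4096,
				    .allocbt_blks = 300 };

	printf("reservable: %llu blocks\n",
	       (unsigned long long)fdblocks_available_for_reservation(&m));
	return 0;
}
```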
-
Submitted by Brian Foster
mainline inclusion
from mainline-v5.13-rc1
commit fd43cf60
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fd43cf600cf61c66ae0a1021aca2f636115c7fcb

--------------------------------

The blocks used for allocation btrees (bnobt and countbt) are technically
considered free space. This is because, as free space is used, allocbt
blocks are removed and naturally become available for traditional
allocation. However, this means that a significant portion of free space
may consist of in-use btree blocks if free space is severely fragmented.

On large filesystems with large perag reservations, this can lead to a
rare but nasty condition where a significant amount of physical free
space is available, but the majority of actual usable blocks consist of
in-use allocbt blocks. We have a record of a (~12TB, 32 AG) filesystem
with multiple AGs in a state with ~2.5GB or so free blocks tracked across
~300 total allocbt blocks, but effectively at 100% full because the free
space is entirely consumed by refcountbt perag reservation. Such a large
perag reservation is by design on large filesystems. The problem is that
because the free space is so fragmented, this AG contributes the 300 or
so allocbt blocks to the global counters as free space. If this pattern
repeats across enough AGs, the filesystem lands in a state where global
block reservation can outrun physical block availability.

For example, a streaming buffered write on the affected filesystem
continues to allow delayed allocation beyond the point where writeback
starts to fail due to physical block allocation failures. The expected
behavior is for the delalloc block reservation to fail gracefully with
-ENOSPC before physical block allocation failure is a possibility.

To address this problem, set aside in-use allocbt blocks at reservation
time and thus ensure they cannot be reserved until truly available for
physical allocation. This allows alloc btree metadata to continue to
reside in free space, but dynamically adjusts reservation availability
based on internal state. Note that the logic requires that the allocbt
counter is fully populated at reservation time before it is fully
effective. We currently rely on the mount time AGF scan in the perag
reservation initialization code for this dependency on filesystems where
it's most important (i.e. with active perag reservations).

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
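A self-contained sketch of the reservation rule this patch adds: blocks held by the allocation btrees still look like free space on disk, but they are subtracted at reservation time, so a delalloc reservation fails with -ENOSPC up front instead of letting writeback hit physical allocation failures later. Counter names and values (other than the ~300 allocbt blocks quoted above) are illustrative, and the real logic lives in xfs_mod_fdblocks().

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Illustrative global counters; in XFS these live in struct xfs_mount. */
static int64_t fdblocks = 2560;       /* example free-block count       */
static int64_t alloc_set_aside = 64;  /* example per-mount set-aside    */
static int64_t allocbt_blks = 300;    /* in-use bnobt/cntbt blocks      */

/* Hypothetical helper: allocbt blocks are treated as unavailable, so a
 * reservation can never claim them even though they count as free space. */
static int reserve_fdblocks(int64_t delta)
{
	int64_t available = fdblocks - alloc_set_aside - allocbt_blks;

	if (delta > available)
		return -ENOSPC;        /* fail gracefully, up front */
	fdblocks -= delta;
	return 0;
}

int main(void)
{
	printf("reserve 2000: %d\n", reserve_fdblocks(2000));
	printf("reserve 2500: %d\n", reserve_fdblocks(2500));
	return 0;
}
```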
-