1. 29 Mar 2023: 1 commit
    • xfs, iomap: limit individual ioend chain lengths in writeback · c5883137
      Authored by Dave Chinner
      mainline inclusion
      from mainline-v5.17-rc3
      commit ebb7fb15
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ebb7fb1557b1d03b906b668aa2164b51e6b7d19a
      
      --------------------------------
      
      Trond Myklebust reported soft lockups in XFS IO completion such as
      this:
      
       watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [kworker/12:1:3106]
       CPU: 12 PID: 3106 Comm: kworker/12:1 Not tainted 4.18.0-305.10.2.el8_4.x86_64 #1
       Workqueue: xfs-conv/md127 xfs_end_io [xfs]
       RIP: 0010:_raw_spin_unlock_irqrestore+0x11/0x20
       Call Trace:
        wake_up_page_bit+0x8a/0x110
        iomap_finish_ioend+0xd7/0x1c0
        iomap_finish_ioends+0x7f/0xb0
        xfs_end_ioend+0x6b/0x100 [xfs]
        xfs_end_io+0xb9/0xe0 [xfs]
        process_one_work+0x1a7/0x360
        worker_thread+0x1fa/0x390
        kthread+0x116/0x130
        ret_from_fork+0x35/0x40
      
      Ioends are processed as an atomic completion unit when all the
      chained bios in the ioend have completed their IO. Logically
      contiguous ioends can also be merged and completed as a single,
      larger unit.  Both of these things can be problematic because the
      bio chains per ioend and the size of the merged ioends processed as
      a single completion are unbound.
      
      If we have a large sequential dirty region in the page cache,
      write_cache_pages() will keep feeding us sequential pages and we
      will keep mapping them into ioends and bios until we get a dirty
      page at a non-sequential file offset. These large sequential runs
      will result in bio and ioend chaining to optimise the io
      patterns. The pages under writeback are pinned within these chains
      until the submission chaining is broken, allowing the entire chain
      to be completed. This can result in huge chains being processed
      in IO completion context.
      
      We get deep bio chaining if we have large contiguous physical
      extents. We will keep adding pages to the current bio until it is
      full, then we'll chain a new bio to keep adding pages for writeback.
      Hence we can build bio chains that map millions of pages and tens of
      gigabytes of RAM if the page cache contains big enough contiguous
      dirty file regions. This long bio chain pins those pages until the
      final bio in the chain completes and the ioend can iterate all the
      chained bios and complete them.
      
      OTOH, if we have a physically fragmented file, we end up submitting
      one ioend per physical fragment that each have a small bio or bio
      chain attached to them. We do not chain these at IO submission time,
      but instead we chain them at completion time based on file
      offset via iomap_ioend_try_merge(). Hence we can end up with unbound
      ioend chains being built via completion merging.
      
      XFS can then do COW remapping or unwritten extent conversion on that
      merged chain, which involves walking an extent fragment at a time
      and running a transaction to modify the physical extent information.
      IOWs, we merge all the discontiguous ioends together into a
      contiguous file range, only to then process them individually as
      discontiguous extents.
      
      This extent manipulation is computationally expensive and can run in
      a tight loop, so merging logically contiguous but physically
      discontiguous ioends gains us nothing except hiding the fact that
      we broke the ioends up into individual physical extents at
      submission and then need to loop over those individual physical
      extents at completion.
      
      Hence we need to have mechanisms to limit ioend sizes and
      to break up completion processing of large merged ioend chains:
      
      1. bio chains per ioend need to be bound in length. Pure overwrites
      go straight to iomap_finish_ioend() in softirq context with the
      exact bio chain attached to the ioend by submission. Hence the only
      way to prevent long holdoffs here is to bound ioend submission
      sizes because we can't reschedule in softirq context.
      
      2. iomap_finish_ioends() has to handle unbound merged ioend chains
      correctly. This relies on any one call to iomap_finish_ioend() being
      bound in runtime so that cond_resched() can be issued regularly as
      the long ioend chain is processed. i.e. this relies on mechanism #1
      to limit individual ioend sizes to work correctly.
      
      3. filesystems have to loop over the merged ioends to process
      physical extent manipulations. This means they can loop internally,
      and so we break merging at physical extent boundaries so the
      filesystem can easily insert reschedule points between individual
      extent manipulations. (A hedged code sketch of mechanisms #1 and #2
      follows below.)
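      
      For illustration only, here is a minimal sketch of mechanisms #1 and #2.
      It is not the actual patch: the cap name IOEND_BATCH_SIZE, the io_pages
      counter and the sketch_* helper names are assumptions introduced here;
      only cond_resched(), the list helpers and the iomap_ioend fields
      io_list/io_offset/io_size are real kernel identifiers.
      
      	/*
      	 * Mechanism #1 (sketch): stop growing an ioend once it spans a
      	 * bounded number of pages, so submission starts a new ioend and
      	 * each completion unit stays small.
      	 */
      	#define IOEND_BATCH_SIZE	4096	/* assumed cap, in pages */
      
      	static bool
      	sketch_can_add_to_ioend(struct iomap_ioend *ioend, loff_t offset)
      	{
      		if (ioend->io_offset + ioend->io_size != offset)
      			return false;
      		if (ioend->io_pages >= IOEND_BATCH_SIZE)	/* io_pages: assumed counter */
      			return false;
      		return true;
      	}
      
      	/*
      	 * Mechanism #2 (sketch): because each ioend is now bounded, the
      	 * completion path can reschedule between ioends while walking an
      	 * arbitrarily long merged chain.
      	 */
      	void
      	sketch_finish_ioends(struct iomap_ioend *ioend, int error)
      	{
      		struct list_head tmp;
      
      		list_replace_init(&ioend->io_list, &tmp);
      		iomap_finish_ioend(ioend, error);
      
      		while (!list_empty(&tmp)) {
      			ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
      			list_del_init(&ioend->io_list);
      			iomap_finish_ioend(ioend, error);
      			cond_resched();	/* each ioend is bounded, so this is cheap */
      		}
      	}
      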
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reported-and-tested-by: Trond Myklebust <trondmy@hammerspace.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Conflicts:
      	include/linux/iomap.h
      	fs/iomap/buffered-io.c
      	fs/xfs/xfs_aops.c
      
      	[ 6e552494 ("iomap: remove unused private field from ioend")
      	  is not applied.
      	  95c4cd05 ("iomap: Convert to_iomap_page to take a folio") is
      	  not applied.
      	  8ffd74e9 ("iomap: Convert bio completions to use folios") is
      	  not applied.
      	  044c6449 ("xfs: drop unused ioend private merge and
      	  setfilesize code") is not applied. ]
      Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
  2. 15 Mar 2023: 1 commit
  3. 08 Feb 2023: 1 commit
  4. 31 Jan 2023: 1 commit
    • xfs: Fix deadlock on xfs_inodegc_worker · b6396746
      Authored by Wu Guanghao
      mainline inclusion
      from mainline-v6.2-rc1
      commit 4da11251
      category: bugfix
      bugzilla: 187874,https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4da112513c01d7d0acf1025b8764349d46e177d6
      
      --------------------------------
      
      While testing deletion of a large number of files under low-memory
      conditions, we found a deadlock problem.
      
      [ 1240.279183] -> #1 (fs_reclaim){+.+.}-{0:0}:
      [ 1240.280450]        lock_acquire+0x197/0x460
      [ 1240.281548]        fs_reclaim_acquire.part.0+0x20/0x30
      [ 1240.282625]        kmem_cache_alloc+0x2b/0x940
      [ 1240.283816]        xfs_trans_alloc+0x8a/0x8b0
      [ 1240.284757]        xfs_inactive_ifree+0xe4/0x4e0
      [ 1240.285935]        xfs_inactive+0x4e9/0x8a0
      [ 1240.286836]        xfs_inodegc_worker+0x160/0x5e0
      [ 1240.287969]        process_one_work+0xa19/0x16b0
      [ 1240.289030]        worker_thread+0x9e/0x1050
      [ 1240.290131]        kthread+0x34f/0x460
      [ 1240.290999]        ret_from_fork+0x22/0x30
      [ 1240.291905]
      [ 1240.291905] -> #0 ((work_completion)(&gc->work)){+.+.}-{0:0}:
      [ 1240.293569]        check_prev_add+0x160/0x2490
      [ 1240.294473]        __lock_acquire+0x2c4d/0x5160
      [ 1240.295544]        lock_acquire+0x197/0x460
      [ 1240.296403]        __flush_work+0x6bc/0xa20
      [ 1240.297522]        xfs_inode_mark_reclaimable+0x6f0/0xdc0
      [ 1240.298649]        destroy_inode+0xc6/0x1b0
      [ 1240.299677]        dispose_list+0xe1/0x1d0
      [ 1240.300567]        prune_icache_sb+0xec/0x150
      [ 1240.301794]        super_cache_scan+0x2c9/0x480
      [ 1240.302776]        do_shrink_slab+0x3f0/0xaa0
      [ 1240.303671]        shrink_slab+0x170/0x660
      [ 1240.304601]        shrink_node+0x7f7/0x1df0
      [ 1240.305515]        balance_pgdat+0x766/0xf50
      [ 1240.306657]        kswapd+0x5bd/0xd20
      [ 1240.307551]        kthread+0x34f/0x460
      [ 1240.308346]        ret_from_fork+0x22/0x30
      [ 1240.309247]
      [ 1240.309247] other info that might help us debug this:
      [ 1240.309247]
      [ 1240.310944]  Possible unsafe locking scenario:
      [ 1240.310944]
      [ 1240.312379]        CPU0                    CPU1
      [ 1240.313363]        ----                    ----
      [ 1240.314433]   lock(fs_reclaim);
      [ 1240.315107]                                lock((work_completion)(&gc->work));
      [ 1240.316828]                                lock(fs_reclaim);
      [ 1240.318088]   lock((work_completion)(&gc->work));
      [ 1240.319203]
      [ 1240.319203]  *** DEADLOCK ***
      ...
      [ 2438.431081] Workqueue: xfs-inodegc/sda xfs_inodegc_worker
      [ 2438.432089] Call Trace:
      [ 2438.432562]  __schedule+0xa94/0x1d20
      [ 2438.435787]  schedule+0xbf/0x270
      [ 2438.436397]  schedule_timeout+0x6f8/0x8b0
      [ 2438.445126]  wait_for_completion+0x163/0x260
      [ 2438.448610]  __flush_work+0x4c4/0xa40
      [ 2438.455011]  xfs_inode_mark_reclaimable+0x6ef/0xda0
      [ 2438.456695]  destroy_inode+0xc6/0x1b0
      [ 2438.457375]  dispose_list+0xe1/0x1d0
      [ 2438.458834]  prune_icache_sb+0xe8/0x150
      [ 2438.461181]  super_cache_scan+0x2b3/0x470
      [ 2438.461950]  do_shrink_slab+0x3cf/0xa50
      [ 2438.462687]  shrink_slab+0x17d/0x660
      [ 2438.466392]  shrink_node+0x87e/0x1d40
      [ 2438.467894]  do_try_to_free_pages+0x364/0x1300
      [ 2438.471188]  try_to_free_pages+0x26c/0x5b0
      [ 2438.473567]  __alloc_pages_slowpath.constprop.136+0x7aa/0x2100
      [ 2438.482577]  __alloc_pages+0x5db/0x710
      [ 2438.485231]  alloc_pages+0x100/0x200
      [ 2438.485923]  allocate_slab+0x2c0/0x380
      [ 2438.486623]  ___slab_alloc+0x41f/0x690
      [ 2438.490254]  __slab_alloc+0x54/0x70
      [ 2438.491692]  kmem_cache_alloc+0x23e/0x270
      [ 2438.492437]  xfs_trans_alloc+0x88/0x880
      [ 2438.493168]  xfs_inactive_ifree+0xe2/0x4e0
      [ 2438.496419]  xfs_inactive+0x4eb/0x8b0
      [ 2438.497123]  xfs_inodegc_worker+0x16b/0x5e0
      [ 2438.497918]  process_one_work+0xbf7/0x1a20
      [ 2438.500316]  worker_thread+0x8c/0x1060
      [ 2438.504938]  ret_from_fork+0x22/0x30
      
      When memory is low, a memory allocation inside xfs_inodegc_worker can
      enter direct reclaim; reclaim may evict inodes and call flush_work()
      to wait for the very inodegc work item that is currently running.
      This causes a deadlock.
      
      So use memalloc_nofs_save() to keep allocations in xfs_inodegc_worker
      from triggering filesystem reclaim.
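      
      For illustration, a minimal sketch of the shape of the fix.
      memalloc_nofs_save()/memalloc_nofs_restore() are the real APIs from
      <linux/sched/mm.h>; the worker body below is simplified (the
      delayed-work plumbing and per-cpu bookkeeping are elided), so treat
      the surrounding details as assumptions rather than the exact patch.
      
      	#include <linux/sched/mm.h>	/* memalloc_nofs_save/restore */
      
      	/* Sketch only: simplified from fs/xfs/xfs_icache.c */
      	void
      	xfs_inodegc_worker(
      		struct work_struct	*work)
      	{
      		struct xfs_inodegc	*gc = container_of(work,
      						struct xfs_inodegc, work);
      		struct xfs_inode	*ip, *n;
      		unsigned int		nofs_flag;
      
      		/*
      		 * Allocations below (e.g. xfs_trans_alloc()) must not recurse
      		 * into filesystem reclaim, because reclaim may flush_work()
      		 * on this very work item and deadlock.
      		 */
      		nofs_flag = memalloc_nofs_save();
      
      		llist_for_each_entry_safe(ip, n, llist_del_all(&gc->list), i_gclist)
      			xfs_inodegc_inactivate(ip);	/* inactivate and free the inode */
      
      		memalloc_nofs_restore(nofs_flag);
      	}
      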
      Signed-off-by: Wu Guanghao <wuguanghao3@huawei.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
  5. 18 Jan 2023: 4 commits
  6. 11 Jan 2023: 1 commit
    • xfs: fix use-after-free in xattr node block inactivation · a8a4df88
      Authored by Darrick J. Wong
      mainline inclusion
      from mainline-v5.19-rc5
      commit 95ff0363
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=95ff0363f3f6ae70c21a0f2b0603e54438e5988b
      
      --------------------------------
      
      The kernel build robot reported a UAF error while running xfs/433
      (edited somewhat for brevity):
      
       BUG: KASAN: use-after-free in xfs_attr3_node_inactive (fs/xfs/xfs_attr_inactive.c:214) xfs
       Read of size 4 at addr ffff88820ac2bd44 by task kworker/0:2/139
      
       CPU: 0 PID: 139 Comm: kworker/0:2 Tainted: G S                5.19.0-rc2-00004-g7cf2b0f9 #1
       Hardware name: Hewlett-Packard p6-1451cx/2ADA, BIOS 8.15 02/05/2013
       Workqueue: xfs-inodegc/sdb4 xfs_inodegc_worker [xfs]
       Call Trace:
        <TASK>
       dump_stack_lvl (lib/dump_stack.c:107 (discriminator 1))
       print_address_description+0x1f/0x200
       print_report.cold (mm/kasan/report.c:430)
       kasan_report (mm/kasan/report.c:162 mm/kasan/report.c:493)
       xfs_attr3_node_inactive (fs/xfs/xfs_attr_inactive.c:214) xfs
       xfs_attr3_root_inactive (fs/xfs/xfs_attr_inactive.c:296) xfs
       xfs_attr_inactive (fs/xfs/xfs_attr_inactive.c:371) xfs
       xfs_inactive (fs/xfs/xfs_inode.c:1781) xfs
       xfs_inodegc_worker (fs/xfs/xfs_icache.c:1837 fs/xfs/xfs_icache.c:1860) xfs
       process_one_work
       worker_thread
       kthread
       ret_from_fork
        </TASK>
      
       Allocated by task 139:
       kasan_save_stack (mm/kasan/common.c:39)
       __kasan_slab_alloc (mm/kasan/common.c:45 mm/kasan/common.c:436 mm/kasan/common.c:469)
       kmem_cache_alloc (mm/slab.h:750 mm/slub.c:3214 mm/slub.c:3222 mm/slub.c:3229 mm/slub.c:3239)
       _xfs_buf_alloc (include/linux/instrumented.h:86 include/linux/atomic/atomic-instrumented.h:41 fs/xfs/xfs_buf.c:232) xfs
       xfs_buf_get_map (fs/xfs/xfs_buf.c:660) xfs
       xfs_buf_read_map (fs/xfs/xfs_buf.c:777) xfs
       xfs_trans_read_buf_map (fs/xfs/xfs_trans_buf.c:289) xfs
       xfs_da_read_buf (fs/xfs/libxfs/xfs_da_btree.c:2652) xfs
       xfs_da3_node_read (fs/xfs/libxfs/xfs_da_btree.c:392) xfs
       xfs_attr3_root_inactive (fs/xfs/xfs_attr_inactive.c:272) xfs
       xfs_attr_inactive (fs/xfs/xfs_attr_inactive.c:371) xfs
       xfs_inactive (fs/xfs/xfs_inode.c:1781) xfs
       xfs_inodegc_worker (fs/xfs/xfs_icache.c:1837 fs/xfs/xfs_icache.c:1860) xfs
       process_one_work
       worker_thread
       kthread
       ret_from_fork
      
       Freed by task 139:
       kasan_save_stack (mm/kasan/common.c:39)
       kasan_set_track (mm/kasan/common.c:45)
       kasan_set_free_info (mm/kasan/generic.c:372)
       __kasan_slab_free (mm/kasan/common.c:368 mm/kasan/common.c:328 mm/kasan/common.c:374)
       kmem_cache_free (mm/slub.c:1753 mm/slub.c:3507 mm/slub.c:3524)
       xfs_buf_rele (fs/xfs/xfs_buf.c:1040) xfs
       xfs_attr3_node_inactive (fs/xfs/xfs_attr_inactive.c:210) xfs
       xfs_attr3_root_inactive (fs/xfs/xfs_attr_inactive.c:296) xfs
       xfs_attr_inactive (fs/xfs/xfs_attr_inactive.c:371) xfs
       xfs_inactive (fs/xfs/xfs_inode.c:1781) xfs
       xfs_inodegc_worker (fs/xfs/xfs_icache.c:1837 fs/xfs/xfs_icache.c:1860) xfs
       process_one_work
       worker_thread
       kthread
       ret_from_fork
      
      I reproduced this for my own satisfaction, and got the same report,
      along with an extra morsel:
      
       The buggy address belongs to the object at ffff88802103a800
        which belongs to the cache xfs_buf of size 432
       The buggy address is located 396 bytes inside of
        432-byte region [ffff88802103a800, ffff88802103a9b0)
      
      I tracked this code down to:
      
      	error = xfs_trans_get_buf(*trans, mp->m_ddev_targp,
      			child_blkno,
      			XFS_FSB_TO_BB(mp, mp->m_attr_geo->fsbcount), 0,
      			&child_bp);
      	if (error)
      		return error;
      	error = bp->b_error;
      
      That doesn't look right -- I think this should be dereferencing
      child_bp, not bp.  Looking through the codebase history, I think this
      was added by commit 2911edb6 ("xfs: remove the mappedbno argument to
      xfs_da_get_buf"), which replaced a call to xfs_da_get_buf with the
      current call to xfs_trans_get_buf.  Not sure why we trans_brelse'd @bp
      earlier in the function, but I'm guessing it's to avoid pinning too many
      buffers in memory while we inactivate the bottom of the attr tree.
      Hence we now have to get the buffer back.
      
      I /think/ this was supposed to check child_bp->b_error and fail the rest
      of the invalidation if child_bp had experienced any kind of IO or
      corruption error.  I bet the xfs_da3_node_read earlier in the loop will
      catch most cases of incoming on-disk corruption which makes this check
      mostly moot unless someone corrupts the buffer and the AIL pushes it out
      to disk while the buffer's unlocked.
      
      In the first case we'll never get to the bad check, and in the second
      case the AIL will shut down the log, at which point there's no reason to
      check b_error.  Remove the check, and null out @bp to avoid this problem
      in the future.
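      
      For illustration, a hedged sketch of what the fixed hunk in
      xfs_attr3_node_inactive() could look like: the stale bp->b_error check
      is dropped and @bp is NULLed once it has been released, so it cannot be
      dereferenced again by mistake. This shows the shape of the change, not
      a verbatim copy of the upstream diff.
      
      	/* Sketch: immediately after the earlier xfs_trans_brelse(*trans, bp) */
      	bp = NULL;	/* bp has been released; make later use impossible */
      
      	error = xfs_trans_get_buf(*trans, mp->m_ddev_targp,
      			child_blkno,
      			XFS_FSB_TO_BB(mp, mp->m_attr_geo->fsbcount), 0,
      			&child_bp);
      	if (error)
      		return error;
      	/* The old "error = bp->b_error;" check is removed entirely. */
      	xfs_trans_binval(*trans, child_bp);	/* invalidate the child buffer as before */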
      
      Cc: hch@lst.de
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Fixes: 2911edb6 ("xfs: remove the mappedbno argument to xfs_da_get_buf")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Long Li <leo.lilong@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
  7. 06 Jan 2023: 3 commits
    • xfs: fix super block buf log item UAF during force shutdown · 5a5e896a
      Authored by Guo Xuenan
      mainline inclusion
      from mainline-v6.1-rc4
      commit 575689fc
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=575689fc0ffa6c4bb4e72fd18e31a6525a6124e0
      
      --------------------------------
      
      An XFS log I/O error triggers an xlog shutdown, and the end_io worker
      calls xlog_state_shutdown_callbacks() to unpin and release the buf log
      items. The race is that a thread may be in the middle of a transaction
      commit and happen not to be intercepted by xlog_is_shutdown(); its log
      items are then inserted into the CIL, so when the shutdown path unpins
      and releases those buf log items, a use-after-free occurs. Adding a
      delay before `xlog_cil_commit` increases the reproduction probability
      (a hedged sketch of that reproduction aid follows the call graph below).
      
      The following call graph shows the sequence that actually hit this
      situation.
      fsstress                    io end worker kworker/0:1H-216
                                  xlog_ioend_work
                                    ->xlog_force_shutdown
                                      ->xlog_state_shutdown_callbacks
                                        ->xlog_cil_process_committed
                                          ->xlog_cil_committed
                                            ->xfs_trans_committed_bulk
      ->xfs_trans_apply_sb_deltas             ->li_ops->iop_unpin(lip, 1);
        ->xfs_trans_getsb
          ->_xfs_trans_bjoin
            ->xfs_buf_item_init
              ->if (bip) { return 0;} //relog
      ->xlog_cil_commit
        ->xlog_cil_insert_items //insert into CIL
                                                 ->xfs_buf_ioend_fail(bp);
                                                   ->xfs_buf_ioend
                                                     ->xfs_buf_item_done
                                                       ->xfs_buf_item_relse
                                                         ->xfs_buf_item_free
      
      The UAF then occurs when the CIL push worker gathers the per-cpu CIL
      and inserts the freed super block buf log item into ctx->log_items.
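      
      For illustration, a hedged sketch of the reproduction aid mentioned
      above: widen the race window by stalling the committing thread just
      before its items are handed to the CIL. The exact placement and the
      xlog_cil_commit() argument list are taken from recent mainline and may
      differ in this backport; mdelay() is the real kernel delay helper.
      
      	#include <linux/delay.h>	/* mdelay() */
      
      	/*
      	 * Sketch: in __xfs_trans_commit(), just before the CIL insert,
      	 * stall the committer so the shutdown path can win the race and
      	 * free the buf log item first.
      	 */
      	mdelay(100);			/* illustrative delay, not a fix */
      	xlog_cil_commit(log, tp, &commit_seq, regrant);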
      
      ==================================================================
      BUG: KASAN: use-after-free in xlog_cil_push_work+0x1c8f/0x22f0
      Write of size 8 at addr ffff88801800f3f0 by task kworker/u4:4/105
      
      CPU: 0 PID: 105 Comm: kworker/u4:4 Tainted: G W
      6.1.0-rc1-00001-g274115149b42 #136
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      1.13.0-1ubuntu1.1 04/01/2014
      Workqueue: xfs-cil/sda xlog_cil_push_work
      Call Trace:
       <TASK>
       dump_stack_lvl+0x4d/0x66
       print_report+0x171/0x4a6
       kasan_report+0xb3/0x130
       xlog_cil_push_work+0x1c8f/0x22f0
       process_one_work+0x6f9/0xf70
       worker_thread+0x578/0xf30
       kthread+0x28c/0x330
       ret_from_fork+0x1f/0x30
       </TASK>
      
      Allocated by task 2145:
       kasan_save_stack+0x1e/0x40
       kasan_set_track+0x21/0x30
       __kasan_slab_alloc+0x54/0x60
       kmem_cache_alloc+0x14a/0x510
       xfs_buf_item_init+0x160/0x6d0
       _xfs_trans_bjoin+0x7f/0x2e0
       xfs_trans_getsb+0xb6/0x3f0
       xfs_trans_apply_sb_deltas+0x1f/0x8c0
       __xfs_trans_commit+0xa25/0xe10
       xfs_symlink+0xe23/0x1660
       xfs_vn_symlink+0x157/0x280
       vfs_symlink+0x491/0x790
       do_symlinkat+0x128/0x220
       __x64_sys_symlink+0x7a/0x90
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Freed by task 216:
       kasan_save_stack+0x1e/0x40
       kasan_set_track+0x21/0x30
       kasan_save_free_info+0x2a/0x40
       __kasan_slab_free+0x105/0x1a0
       kmem_cache_free+0xb6/0x460
       xfs_buf_ioend+0x1e9/0x11f0
       xfs_buf_item_unpin+0x3d6/0x840
       xfs_trans_committed_bulk+0x4c2/0x7c0
       xlog_cil_committed+0xab6/0xfb0
       xlog_cil_process_committed+0x117/0x1e0
       xlog_state_shutdown_callbacks+0x208/0x440
       xlog_force_shutdown+0x1b3/0x3a0
       xlog_ioend_work+0xef/0x1d0
       process_one_work+0x6f9/0xf70
       worker_thread+0x578/0xf30
       kthread+0x28c/0x330
       ret_from_fork+0x1f/0x30
      
      The buggy address belongs to the object at ffff88801800f388
       which belongs to the cache xfs_buf_item of size 272
      The buggy address is located 104 bytes inside of
       272-byte region [ffff88801800f388, ffff88801800f498)
      
      The buggy address belongs to the physical page:
      page:ffffea0000600380 refcount:1 mapcount:0 mapping:0000000000000000
      index:0xffff88801800f208 pfn:0x1800e
      head:ffffea0000600380 order:1 compound_mapcount:0 compound_pincount:0
      flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
      raw: 001fffff80010200 ffffea0000699788 ffff88801319db50 ffff88800fb50640
      raw: ffff88801800f208 000000000015000a 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88801800f280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88801800f300: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
      >ffff88801800f380: fc fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                                   ^
       ffff88801800f400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88801800f480: fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc fc
      ==================================================================
      Disabling lock debugging due to kernel taint
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • xfs: wait iclog complete before tearing down AIL · 1146fdf4
      Authored by Guo Xuenan
      mainline inclusion
      from mainline-v6.1-rc4
      commit 1eb52a6a
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1eb52a6a71981b80f9acbd915acd6a05a5037196
      
      --------------------------------
      
      Fix a use-after-free in xfs_trans_ail_delete during xlog force shutdown.
      Commit cd6f79d1 ("xfs: run callbacks before waking waiters in
      xlog_state_shutdown_callbacks") changed the order of running callbacks
      and waiting for iclog completion so that the unmount path would not tear
      down the AIL too early. That alone is not enough to guarantee it; adding
      an mdelay() in `xfs_buf_item_unpin` demonstrates the remaining window.
      
      The reproduction is shown below. To tear down the AIL safely, we should
      wait for all xlog ioend workers to finish and then sync the AIL.
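      
      For illustration, a minimal sketch of the "wait for the ioend workers,
      then sync the AIL" idea stated above. The upstream fix is structured
      differently; the field names l_ioend_workqueue and l_ailp are taken
      from mainline struct xlog but should be treated as assumptions here,
      while flush_workqueue() and xfs_ail_push_all_sync() are real APIs.
      
      	/* Sketch only: quiesce log I/O completion before AIL teardown. */
      	static void
      	sketch_quiesce_log_before_ail_teardown(struct xlog *log)
      	{
      		/* Wait for any in-flight xlog_ioend_work() to finish. */
      		flush_workqueue(log->l_ioend_workqueue);
      
      		/* Push everything out of the AIL and wait for it to empty. */
      		xfs_ail_push_all_sync(log->l_ailp);
      	}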
      
      ==================================================================
      BUG: KASAN: use-after-free in xfs_trans_ail_delete+0x240/0x2a0
      Read of size 8 at addr ffff888023169400 by task kworker/1:1H/43
      
      CPU: 1 PID: 43 Comm: kworker/1:1H Tainted: G        W
      6.1.0-rc1-00002-gc28266863c4a #137
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      1.13.0-1ubuntu1.1 04/01/2014
      Workqueue: xfs-log/sda xlog_ioend_work
      Call Trace:
       <TASK>
       dump_stack_lvl+0x4d/0x66
       print_report+0x171/0x4a6
       kasan_report+0xb3/0x130
       xfs_trans_ail_delete+0x240/0x2a0
       xfs_buf_item_done+0x7b/0xa0
       xfs_buf_ioend+0x1e9/0x11f0
       xfs_buf_item_unpin+0x4c8/0x860
       xfs_trans_committed_bulk+0x4c2/0x7c0
       xlog_cil_committed+0xab6/0xfb0
       xlog_cil_process_committed+0x117/0x1e0
       xlog_state_shutdown_callbacks+0x208/0x440
       xlog_force_shutdown+0x1b3/0x3a0
       xlog_ioend_work+0xef/0x1d0
       process_one_work+0x6f9/0xf70
       worker_thread+0x578/0xf30
       kthread+0x28c/0x330
       ret_from_fork+0x1f/0x30
       </TASK>
      
      Allocated by task 9606:
       kasan_save_stack+0x1e/0x40
       kasan_set_track+0x21/0x30
       __kasan_kmalloc+0x7a/0x90
       __kmalloc+0x59/0x140
       kmem_alloc+0xb2/0x2f0
       xfs_trans_ail_init+0x20/0x320
       xfs_log_mount+0x37e/0x690
       xfs_mountfs+0xe36/0x1b40
       xfs_fs_fill_super+0xc5c/0x1a70
       get_tree_bdev+0x3c5/0x6c0
       vfs_get_tree+0x85/0x250
       path_mount+0xec3/0x1830
       do_mount+0xef/0x110
       __x64_sys_mount+0x150/0x1f0
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Freed by task 9662:
       kasan_save_stack+0x1e/0x40
       kasan_set_track+0x21/0x30
       kasan_save_free_info+0x2a/0x40
       __kasan_slab_free+0x105/0x1a0
       __kmem_cache_free+0x99/0x2d0
       kvfree+0x3a/0x40
       xfs_log_unmount+0x60/0xf0
       xfs_unmountfs+0xf3/0x1d0
       xfs_fs_put_super+0x78/0x300
       generic_shutdown_super+0x151/0x400
       kill_block_super+0x9a/0xe0
       deactivate_locked_super+0x82/0xe0
       deactivate_super+0x91/0xb0
       cleanup_mnt+0x32a/0x4a0
       task_work_run+0x15f/0x240
       exit_to_user_mode_prepare+0x188/0x190
       syscall_exit_to_user_mode+0x12/0x30
       do_syscall_64+0x42/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      The buggy address belongs to the object at ffff888023169400
       which belongs to the cache kmalloc-128 of size 128
      The buggy address is located 0 bytes inside of
       128-byte region [ffff888023169400, ffff888023169480)
      
      The buggy address belongs to the physical page:
      page:ffffea00008c5a00 refcount:1 mapcount:0 mapping:0000000000000000
      index:0xffff888023168f80 pfn:0x23168
      head:ffffea00008c5a00 order:1 compound_mapcount:0 compound_pincount:0
      flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
      raw: 001fffff80010200 ffffea00006b3988 ffffea0000577a88 ffff88800f842ac0
      raw: ffff888023168f80 0000000000150007 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff888023169300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff888023169380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      >ffff888023169400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                         ^
       ffff888023169480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff888023169500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      ==================================================================
      Disabling lock debugging due to kernel taint
      
      Fixes: cd6f79d1 ("xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks")
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • xfs: get rid of assert from xfs_btree_islastblock · be18cd15
      Authored by Guo Xuenan
      mainline inclusion
      from mainline-v6.1-rc4
      commit 8c25febf
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c25febf23963431686f04874b96321288504127
      
      --------------------------------
      
      xfs_btree_check_block() contains debugging knobs. With XFS_DEBUG enabled,
      turning on the debugging knob can trigger the assert in
      xfs_btree_islastblock(); the test script is as follows:
      
      while true
      do
          mount $disk $mountpoint
          fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null
          echo 1 > /sys/fs/xfs/sda/errortag/btree_chk_sblk
          sleep 10
          umount $mountpoint
      done
      
      Kick off fsstress and only *then* turn on the debugging knob. If it
      happens that the knob gets turned on after the cntbt lookup succeeds
      but before the call to xfs_btree_islastblock, then we *can* end up in
      the situation where a previously checked btree block suddenly starts
      returning EFSCORRUPTED from xfs_btree_check_block. Kaboom.
      
      Darrick gave a very detailed explanation as follows:
      Looking back at commit 27d9ee57, I think the point of all this was
      to make sure that the cursor has actually performed a lookup, and that
      the btree block at whatever level we're asking about is ok.
      
      If the caller hasn't ever done a lookup, the bc_levels array will be
      empty, so cur->bc_levels[level].bp pointer will be NULL.  The call to
      xfs_btree_get_block will crash anyway, so the "ASSERT(block);" part is
      pointless.
      
      If the caller did a lookup but the lookup failed due to block
      corruption, the corresponding cur->bc_levels[level].bp pointer will also
      be NULL, and we'll still crash.  The "ASSERT(xfs_btree_check_block);"
      logic is also unnecessary.
      
      If the cursor level points to an inode root, the block buffer will be
      incore, so it had better always be consistent.
      
      If the caller ignores a failed lookup after a successful one and calls
      this function, the cursor state is garbage and the assert wouldn't have
      tripped anyway. So get rid of the assert.
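      
      For illustration, a hedged sketch of what xfs_btree_islastblock() looks
      like once the assert is removed. The helper is a real inline in
      fs/xfs/libxfs/xfs_btree.h, but the body below is reconstructed from
      memory of mainline and should be treated as approximate.
      
      	/* Sketch: xfs_btree_islastblock() with the assert dropped. */
      	static inline bool
      	xfs_btree_islastblock(
      		struct xfs_btree_cur	*cur,
      		int			level)
      	{
      		struct xfs_btree_block	*block;
      		struct xfs_buf		*bp;
      
      		/*
      		 * If the caller never did a lookup, the level's buffer pointer
      		 * is NULL and this will crash regardless, so asserting on the
      		 * block being valid buys nothing (see the reasoning above).
      		 */
      		block = xfs_btree_get_block(cur, level, &bp);
      
      		if (cur->bc_flags & XFS_BTREE_LONG_PTRS)
      			return block->bb_u.l.bb_rightsib == cpu_to_be64(NULLFSBLOCK);
      		return block->bb_u.s.bb_rightsib == cpu_to_be32(NULLAGBLOCK);
      	}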
      
      Fixes: 27d9ee57 ("xfs: actually check xfs_btree_check_block return in xfs_btree_islastblock")
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  8. 07 Dec 2022: 28 commits