1. 26 April 2023, 2 commits
    • xfs: drop async cache flushes from CIL commits. · 6e0919be
      Dave Chinner authored
      mainline inclusion
      from mainline-v5.17-rc6
      commit 919edbad
      category: bugfix
      bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=919edbadebe17a67193533f531c2920c03e40fa4
      
      --------------------------------
      
      Jan Kara reported a performance regression in dbench that he
      bisected down to commit bad77c37 ("xfs: CIL checkpoint
      flushes caches unconditionally").
      
      Whilst developing the journal flush/fua optimisations this cache
      flush was part of, it appeared to make a significant difference
      to performance. However, now that this patchset has settled and
      all the correctness issues have been fixed, there does not appear
      to be any significant performance benefit to asynchronous cache
      flushes.
      
      In fact, the opposite is true on some storage types and workloads,
      where additional cache flushes that can occur from fsync heavy
      workloads have measurable and significant impact on overall
      throughput.
      
      Local dbench testing shows little difference on dbench runs with
      sync vs async cache flushes on either fast or slow SSD storage, and
      no difference in streaming concurrent async transaction workloads
      like fs-mark.
      
      Fast NVMe storage:
      
      From `dbench -t 30`, CIL scale:
      
      clients		async			sync
      		BW	Latency		BW	Latency
      1		 935.18   0.855		 915.64   0.903
      8		2404.51   6.873		2341.77   6.511
      16		3003.42   6.460		2931.57   6.529
      32		3697.23   7.939		3596.28   7.894
      128		7237.43  15.495		7217.74  11.588
      512		5079.24  90.587		5167.08  95.822
      
      fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize
      
      	create		chown		unlink
      async   1m41s		1m16s		2m03s
      sync	1m40s		1m19s		1m54s
      
      Slower SATA SSD storage:
      
      From `dbench -t 30`, CIL scale:
      
      clients		async			sync
      		BW	Latency		BW	Latency
      1		  78.59  15.792		  83.78  10.729
      8		 367.88  92.067		 404.63  59.943
      16		 564.51  72.524		 602.71  76.089
      32		 831.66 105.984		 870.26 110.482
      128		1659.76 102.969		1624.73  91.356
      512		2135.91 223.054		2603.07 161.160
      
      fsmark, 16 threads, create w/32k logbsize
      
      	create		unlink
      async   5m06s		4m15s
      sync	5m00s		4m22s
      
      And on Jan's test machine:
      
                         5.18-rc8-vanilla       5.18-rc8-patched
      Amean     1        71.22 (   0.00%)       64.94 *   8.81%*
      Amean     2        93.03 (   0.00%)       84.80 *   8.85%*
      Amean     4       150.54 (   0.00%)      137.51 *   8.66%*
      Amean     8       252.53 (   0.00%)      242.24 *   4.08%*
      Amean     16      454.13 (   0.00%)      439.08 *   3.31%*
      Amean     32      835.24 (   0.00%)      829.74 *   0.66%*
      Amean     64     1740.59 (   0.00%)     1686.73 *   3.09%*
      
      Performance and cache flush behaviour is restored to pre-regression
      levels.
      
      As such, we can now consider the async cache flush mechanism an
      unnecessary exercise in premature optimisation and hence we can
      now remove it and the infrastructure it requires completely.
      
      Fixes: bad77c37 ("xfs: CIL checkpoint flushes caches unconditionally")
      Reported-and-tested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Yang Erkun <yangerkun@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
    • xfs: limit iclog tail updates · c4f626f7
      Dave Chinner authored
      mainline inclusion
      from mainline-v5.14-rc1
      commit 9d110014
      category: bugfix
      bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I6WKVJ
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9d110014205cb1129fa570d8de83d486fa199354
      
      --------------------------------
      
      From the department of "generic/482 keeps on giving", we bring you
      another tail update race condition:
      
      iclog:
      	S1			C1
      	+-----------------------+-----------------------+
      				 S2			EOIC
      
      Two checkpoints in a single iclog. One is complete, the other just
      contains the start record and overruns into a new iclog.
      
      Timeline:
      
      Before S1:	Cache flush, log tail = X
      At S1:		Metadata stable, write start record and checkpoint
      At C1:		Write commit record, set NEED_FUA
      		Single iclog checkpoint, so no need for NEED_FLUSH
      		Log tail still = X, so no need for NEED_FLUSH
      
      After C1,
      Before S2:	Cache flush, log tail = X
      At S2:		Metadata stable, write start record and checkpoint
      After S2:	Log tail moves to X+1
      At EOIC:	End of iclog, more journal data to write
      		Releases iclog
      		Not a commit iclog, so no need for NEED_FLUSH
      		Writes log tail X+1 into iclog.
      
      At this point, the iclog has tail X+1 and NEED_FUA set. There has
      been no cache flush for the metadata between X and X+1, and the
      iclog writes the new tail permanently to the log. This is
      sufficient to violate on-disk metadata/journal ordering.
      
      We have two options here. The first is to detect this case in some
      manner and ensure that the partial checkpoint write sets NEED_FLUSH
      when the iclog is already marked NEED_FUA and the log tail changes.
      This seems somewhat fragile and quite complex to get right, and it
      doesn't actually make it obvious what underlying problem it is
      actually addressing from reading the code.
      
      The second option seems much cleaner to me, because it is derived
      directly from the requirements of the C1 commit record in the iclog.
      That is, when we write this commit record to the iclog, we've
      guaranteed that the metadata/data ordering is correct for tail
      update purposes. Hence if we only write the log tail into the iclog
      for the *first* commit record rather than the log tail at the last
      release, we guarantee that the log tail does not move past where
      the first commit record in the log expects it to be.
      
      IOWs, taking the first option means that replay of C1 becomes
      dependent on future operations doing the right thing, not just the
      C1 checkpoint itself doing the right thing. This makes log recovery
      almost impossible to reason about because now we have to take into
      account what might or might not have happened in the future when
      looking at checkpoints in the log rather than just having to
      reconstruct the past...
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Yang Erkun <yangerkun@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
  2. 19 April 2023, 1 commit
  3. 12 April 2023, 1 commit
    • xfs: log worker needs to start before intent/unlink recovery · e5870eee
      Dave Chinner authored
      mainline inclusion
      from mainline-v5.17-rc6
      commit a9a4bc8c
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a9a4bc8c76d747aa40b30e2dfc176c781f353a08
      
      --------------------------------
      
      After 963 iterations of generic/530, it deadlocked during recovery
      on a pinned inode cluster buffer like so:
      
      XFS (pmem1): Starting recovery (logdev: internal)
      INFO: task kworker/8:0:306037 blocked for more than 122 seconds.
            Not tainted 5.17.0-rc6-dgc+ #975
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      task:kworker/8:0     state:D stack:13024 pid:306037 ppid:     2 flags:0x00004000
      Workqueue: xfs-inodegc/pmem1 xfs_inodegc_worker
      Call Trace:
       <TASK>
       __schedule+0x30d/0x9e0
       schedule+0x55/0xd0
       schedule_timeout+0x114/0x160
       __down+0x99/0xf0
       down+0x5e/0x70
       xfs_buf_lock+0x36/0xf0
       xfs_buf_find+0x418/0x850
       xfs_buf_get_map+0x47/0x380
       xfs_buf_read_map+0x54/0x240
       xfs_trans_read_buf_map+0x1bd/0x490
       xfs_imap_to_bp+0x4f/0x70
       xfs_iunlink_map_ino+0x66/0xd0
       xfs_iunlink_map_prev.constprop.0+0x148/0x2f0
       xfs_iunlink_remove_inode+0xf2/0x1d0
       xfs_inactive_ifree+0x1a3/0x900
       xfs_inode_unlink+0xcc/0x210
       xfs_inodegc_worker+0x1ac/0x2f0
       process_one_work+0x1ac/0x390
       worker_thread+0x56/0x3c0
       kthread+0xf6/0x120
       ret_from_fork+0x1f/0x30
       </TASK>
      task:mount           state:D stack:13248 pid:324509 ppid:324233 flags:0x00004000
      Call Trace:
       <TASK>
       __schedule+0x30d/0x9e0
       schedule+0x55/0xd0
       schedule_timeout+0x114/0x160
       __down+0x99/0xf0
       down+0x5e/0x70
       xfs_buf_lock+0x36/0xf0
       xfs_buf_find+0x418/0x850
       xfs_buf_get_map+0x47/0x380
       xfs_buf_read_map+0x54/0x240
       xfs_trans_read_buf_map+0x1bd/0x490
       xfs_imap_to_bp+0x4f/0x70
       xfs_iget+0x300/0xb40
       xlog_recover_process_one_iunlink+0x4c/0x170
       xlog_recover_process_iunlinks.isra.0+0xee/0x130
       xlog_recover_finish+0x57/0x110
       xfs_log_mount_finish+0xfc/0x1e0
       xfs_mountfs+0x540/0x910
       xfs_fs_fill_super+0x495/0x850
       get_tree_bdev+0x171/0x270
       xfs_fs_get_tree+0x15/0x20
       vfs_get_tree+0x24/0xc0
       path_mount+0x304/0xba0
       __x64_sys_mount+0x108/0x140
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
       </TASK>
      task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
      Call Trace:
       <TASK>
       __schedule+0x30d/0x9e0
       schedule+0x55/0xd0
       io_schedule+0x4b/0x80
       xfs_buf_wait_unpin+0x9e/0xf0
       __xfs_buf_submit+0x14a/0x230
       xfs_buf_delwri_submit_buffers+0x107/0x280
       xfs_buf_delwri_submit_nowait+0x10/0x20
       xfsaild+0x27e/0x9d0
       kthread+0xf6/0x120
       ret_from_fork+0x1f/0x30
      
      We have the mount process waiting on an inode cluster buffer read,
      inodegc doing unlink waiting on the same inode cluster buffer, and
      the AIL push thread blocked in writeback waiting for the inode
      cluster buffer to become unpinned.
      
      What has happened here is that the AIL push thread has raced with
      the inodegc process modifying, committing and pinning the inode
      cluster buffer in xfs_buf_delwri_submit_buffers():
      
      	blk_start_plug(&plug);
      	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
      		if (!wait_list) {
      			if (xfs_buf_ispinned(bp)) {
      				pinned++;
      				continue;
      			}
      Here >>>>>>
      			if (!xfs_buf_trylock(bp))
      				continue;
      
      Basically, the AIL has found the buffer wasn't pinned and got the
      lock without blocking, but then the buffer was pinned. This implies
      the processing here was pre-empted between the pin check and the
      lock, because the pin count can only be increased while holding the
      buffer locked. Hence when it has gone to submit the IO, it has
      blocked waiting for the buffer to be unpinned.
      
      With all executing threads now waiting on the buffer to be unpinned,
      we normally get out of situations like this via the background log
      worker issuing a log force, which will unpin stuck buffers like
      this. But at this point in recovery, we haven't started the log
      worker. In fact, the first thing we do after processing intents and
      unlinked inodes is *start the log worker*. IOWs, we start it too
      late to have it break deadlocks like this.
      
      Avoid this and any other similar deadlock vectors in intent and
      unlinked inode recovery by starting the log worker before we recover
      intents and unlinked inodes. This part of recovery runs as though
      the filesystem is fully active, so we really should have the same
      infrastructure running as we normally do at runtime.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Long Li <leo.lilong@huawei.com>
      Reviewed-by: Yang Erkun <yangerkun@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
  4. 06 January 2023, 1 commit
    • xfs: wait iclog complete before tearing down AIL · fabfebe7
      Guo Xuenan authored
      mainline inclusion
      from mainline-v6.1-rc4
      commit 1eb52a6a
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1eb52a6a71981b80f9acbd915acd6a05a5037196
      
      --------------------------------
      
      Fix a use-after-free in xfs_trans_ail_delete during xlog force
      shutdown. Commit cd6f79d1 ("xfs: run callbacks before waking
      waiters in xlog_state_shutdown_callbacks") changed the order of
      running callbacks and waiting for iclog completion, to prevent
      the unmount path from tearing down the AIL prematurely. However,
      that alone is not enough to guarantee it, which can be
      demonstrated by adding an mdelay() in `xfs_buf_item_unpin`.
      
      The reproduction is as follows. To tear down the AIL safely, we
      should wait for all xlog ioend workers to finish and then sync
      the AIL.
      
      ==================================================================
      BUG: KASAN: use-after-free in xfs_trans_ail_delete+0x240/0x2a0
      Read of size 8 at addr ffff888023169400 by task kworker/1:1H/43
      
      CPU: 1 PID: 43 Comm: kworker/1:1H Tainted: G        W
      6.1.0-rc1-00002-gc28266863c4a #137
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      1.13.0-1ubuntu1.1 04/01/2014
      Workqueue: xfs-log/sda xlog_ioend_work
      Call Trace:
       <TASK>
       dump_stack_lvl+0x4d/0x66
       print_report+0x171/0x4a6
       kasan_report+0xb3/0x130
       xfs_trans_ail_delete+0x240/0x2a0
       xfs_buf_item_done+0x7b/0xa0
       xfs_buf_ioend+0x1e9/0x11f0
       xfs_buf_item_unpin+0x4c8/0x860
       xfs_trans_committed_bulk+0x4c2/0x7c0
       xlog_cil_committed+0xab6/0xfb0
       xlog_cil_process_committed+0x117/0x1e0
       xlog_state_shutdown_callbacks+0x208/0x440
       xlog_force_shutdown+0x1b3/0x3a0
       xlog_ioend_work+0xef/0x1d0
       process_one_work+0x6f9/0xf70
       worker_thread+0x578/0xf30
       kthread+0x28c/0x330
       ret_from_fork+0x1f/0x30
       </TASK>
      
      Allocated by task 9606:
       kasan_save_stack+0x1e/0x40
       kasan_set_track+0x21/0x30
       __kasan_kmalloc+0x7a/0x90
       __kmalloc+0x59/0x140
       kmem_alloc+0xb2/0x2f0
       xfs_trans_ail_init+0x20/0x320
       xfs_log_mount+0x37e/0x690
       xfs_mountfs+0xe36/0x1b40
       xfs_fs_fill_super+0xc5c/0x1a70
       get_tree_bdev+0x3c5/0x6c0
       vfs_get_tree+0x85/0x250
       path_mount+0xec3/0x1830
       do_mount+0xef/0x110
       __x64_sys_mount+0x150/0x1f0
       do_syscall_64+0x35/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      Freed by task 9662:
       kasan_save_stack+0x1e/0x40
       kasan_set_track+0x21/0x30
       kasan_save_free_info+0x2a/0x40
       __kasan_slab_free+0x105/0x1a0
       __kmem_cache_free+0x99/0x2d0
       kvfree+0x3a/0x40
       xfs_log_unmount+0x60/0xf0
       xfs_unmountfs+0xf3/0x1d0
       xfs_fs_put_super+0x78/0x300
       generic_shutdown_super+0x151/0x400
       kill_block_super+0x9a/0xe0
       deactivate_locked_super+0x82/0xe0
       deactivate_super+0x91/0xb0
       cleanup_mnt+0x32a/0x4a0
       task_work_run+0x15f/0x240
       exit_to_user_mode_prepare+0x188/0x190
       syscall_exit_to_user_mode+0x12/0x30
       do_syscall_64+0x42/0x80
       entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      The buggy address belongs to the object at ffff888023169400
       which belongs to the cache kmalloc-128 of size 128
      The buggy address is located 0 bytes inside of
       128-byte region [ffff888023169400, ffff888023169480)
      
      The buggy address belongs to the physical page:
      page:ffffea00008c5a00 refcount:1 mapcount:0 mapping:0000000000000000
      index:0xffff888023168f80 pfn:0x23168
      head:ffffea00008c5a00 order:1 compound_mapcount:0 compound_pincount:0
      flags: 0x1fffff80010200(slab|head|node=0|zone=1|lastcpupid=0x1fffff)
      raw: 001fffff80010200 ffffea00006b3988 ffffea0000577a88 ffff88800f842ac0
      raw: ffff888023168f80 0000000000150007 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff888023169300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff888023169380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      >ffff888023169400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                         ^
       ffff888023169480: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff888023169500: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      ==================================================================
      Disabling lock debugging due to kernel taint
      
      Fixes: cd6f79d1 ("xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks")
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      (cherry picked from commit 1146fdf4)
  5. 21 November 2022, 2 commits
    • xfs: prevent a UAF when log IO errors race with unmount · 02133c58
      Darrick J. Wong authored
      mainline inclusion
      from mainline-v5.16-rc3
      commit 7561cea5
      category: bugfix
      bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I4KIAO
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7561cea5dbb97fecb952548a0fb74fb105bf4664
      
      --------------------------------
      
      KASAN reported the following use after free bug when running
      generic/475:
      
       XFS (dm-0): Mounting V5 Filesystem
       XFS (dm-0): Starting recovery (logdev: internal)
       XFS (dm-0): Ending recovery (logdev: internal)
       Buffer I/O error on dev dm-0, logical block 20639616, async page read
       Buffer I/O error on dev dm-0, logical block 20639617, async page read
       XFS (dm-0): log I/O error -5
       XFS (dm-0): Filesystem has been shut down due to log error (0x2).
       XFS (dm-0): Unmounting Filesystem
       XFS (dm-0): Please unmount the filesystem and rectify the problem(s).
       ==================================================================
       BUG: KASAN: use-after-free in do_raw_spin_lock+0x246/0x270
       Read of size 4 at addr ffff888109dd84c4 by task 3:1H/136
      
       CPU: 3 PID: 136 Comm: 3:1H Not tainted 5.19.0-rc4-xfsx #rc4 8e53ab5ad0fddeb31cee5e7063ff9c361915a9c4
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
       Workqueue: xfs-log/dm-0 xlog_ioend_work [xfs]
       Call Trace:
        <TASK>
        dump_stack_lvl+0x34/0x44
        print_report.cold+0x2b8/0x661
        ? do_raw_spin_lock+0x246/0x270
        kasan_report+0xab/0x120
        ? do_raw_spin_lock+0x246/0x270
        do_raw_spin_lock+0x246/0x270
        ? rwlock_bug.part.0+0x90/0x90
        xlog_force_shutdown+0xf6/0x370 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
        xlog_ioend_work+0x100/0x190 [xfs 4ad76ae0d6add7e8183a553e624c31e9ed567318]
        process_one_work+0x672/0x1040
        worker_thread+0x59b/0xec0
        ? __kthread_parkme+0xc6/0x1f0
        ? process_one_work+0x1040/0x1040
        ? process_one_work+0x1040/0x1040
        kthread+0x29e/0x340
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x1f/0x30
        </TASK>
      
       Allocated by task 154099:
        kasan_save_stack+0x1e/0x40
        __kasan_kmalloc+0x81/0xa0
        kmem_alloc+0x8d/0x2e0 [xfs]
        xlog_cil_init+0x1f/0x540 [xfs]
        xlog_alloc_log+0xd1e/0x1260 [xfs]
        xfs_log_mount+0xba/0x640 [xfs]
        xfs_mountfs+0xf2b/0x1d00 [xfs]
        xfs_fs_fill_super+0x10af/0x1910 [xfs]
        get_tree_bdev+0x383/0x670
        vfs_get_tree+0x7d/0x240
        path_mount+0xdb7/0x1890
        __x64_sys_mount+0x1fa/0x270
        do_syscall_64+0x2b/0x80
        entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
       Freed by task 154151:
        kasan_save_stack+0x1e/0x40
        kasan_set_track+0x21/0x30
        kasan_set_free_info+0x20/0x30
        ____kasan_slab_free+0x110/0x190
        slab_free_freelist_hook+0xab/0x180
        kfree+0xbc/0x310
        xlog_dealloc_log+0x1b/0x2b0 [xfs]
        xfs_unmountfs+0x119/0x200 [xfs]
        xfs_fs_put_super+0x6e/0x2e0 [xfs]
        generic_shutdown_super+0x12b/0x3a0
        kill_block_super+0x95/0xd0
        deactivate_locked_super+0x80/0x130
        cleanup_mnt+0x329/0x4d0
        task_work_run+0xc5/0x160
        exit_to_user_mode_prepare+0xd4/0xe0
        syscall_exit_to_user_mode+0x1d/0x40
        entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      This appears to be a race between the unmount process, which frees the
      CIL and waits for in-flight iclog IO; and the iclog IO completion.  When
      generic/475 runs, it starts fsstress in the background, waits a few
      seconds, and substitutes a dm-error device to simulate a disk falling
      out of a machine.  If the fsstress encounters EIO on a pure data write,
      it will exit but the filesystem will still be online.
      
      The next thing the test does is unmount the filesystem, which tries to
      clean the log, free the CIL, and wait for iclog IO completion.  If an
      iclog was being written when the dm-error switch occurred, it can race
      with log unmounting as follows:
      
      Thread 1				Thread 2
      
      					xfs_log_unmount
      					xfs_log_clean
      					xfs_log_quiesce
      xlog_ioend_work
      <observe error>
      xlog_force_shutdown
      test_and_set_bit(XLOG_IOERROR)
      					xfs_log_force
      					<log is shut down, nop>
      					xfs_log_umount_write
      					<log is shut down, nop>
      					xlog_dealloc_log
      					xlog_cil_destroy
      					<wait for iclogs>
      spin_lock(&log->l_cilp->xc_push_lock)
      <KABOOM>
      
      Therefore, free the CIL after waiting for the iclogs to complete.  I
      /think/ this race has existed for quite a few years now, though I don't
      remember the ~2014 era logging code well enough to know if it was a real
      threat then or if the actual race was exposed only more recently.
      
      Fixes: ac983517 ("xfs: don't sleep in xlog_cil_force_lsn on shutdown")
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks · 2052264c
      Dave Chinner authored
      mainline inclusion
      from mainline-v5.16-rc3
      commit cd6f79d1
      category: bugfix
      bugzilla: 187526,https://gitee.com/openeuler/kernel/issues/I4KIAO
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cd6f79d1fb324968a3bae92f82eeb7d28ca1fd22
      
      --------------------------------
      
      Brian reported a null pointer dereference failure during unmount in
      xfs/006. He tracked the problem down to the AIL being torn down
      before a log shutdown had completed and removed all the items from
      the AIL. The failure occurred in this path while unmount was
      proceeding in another task:
      
       xfs_trans_ail_delete+0x102/0x130 [xfs]
       xfs_buf_item_done+0x22/0x30 [xfs]
       xfs_buf_ioend+0x73/0x4d0 [xfs]
       xfs_trans_committed_bulk+0x17e/0x2f0 [xfs]
       xlog_cil_committed+0x2a9/0x300 [xfs]
       xlog_cil_process_committed+0x69/0x80 [xfs]
       xlog_state_shutdown_callbacks+0xce/0xf0 [xfs]
       xlog_force_shutdown+0xdf/0x150 [xfs]
       xfs_do_force_shutdown+0x5f/0x150 [xfs]
       xlog_ioend_work+0x71/0x80 [xfs]
       process_one_work+0x1c5/0x390
       worker_thread+0x30/0x350
       kthread+0xd7/0x100
       ret_from_fork+0x1f/0x30
      
      This is processing an EIO error to a log write, and it's
      triggering a force shutdown. This causes the log to be shut down,
      and then it is running attached iclog callbacks from the shutdown
      context. That means the fs and log has already been marked as
      xfs_is_shutdown/xlog_is_shutdown and so high level code will abort
      (e.g. xfs_trans_commit(), xfs_log_force(), etc) with an error
      because of shutdown.
      
      The umount would have been blocked waiting for a log force
      completion inside xfs_log_cover() -> xfs_sync_sb(). For this
      situation to occur, xfs_sync_sb() must have exited without
      waiting for the iclog buffer to be committed to disk. The
      above trace is the completion routine for the iclog buffer, and
      it is shutting down the filesystem.
      
      xlog_state_shutdown_callbacks() does this:
      
      {
              struct xlog_in_core     *iclog;
              LIST_HEAD(cb_list);
      
              spin_lock(&log->l_icloglock);
              iclog = log->l_iclog;
              do {
                      if (atomic_read(&iclog->ic_refcnt)) {
                              /* Reference holder will re-run iclog callbacks. */
                              continue;
                      }
                      list_splice_init(&iclog->ic_callbacks, &cb_list);
      >>>>>>           wake_up_all(&iclog->ic_write_wait);
      >>>>>>           wake_up_all(&iclog->ic_force_wait);
              } while ((iclog = iclog->ic_next) != log->l_iclog);
      
              wake_up_all(&log->l_flush_wait);
              spin_unlock(&log->l_icloglock);
      
      >>>>>>  xlog_cil_process_committed(&cb_list);
      }
      
      This wakes any thread waiting on IO completion of the iclog (in this
      case the umount log force) before shutdown processes all the pending
      callbacks.  That means the xfs_sync_sb() waiting on a sync
      transaction in xfs_log_force() on iclog->ic_force_wait will get
      woken before the callbacks attached to that iclog are run. This
      results in xfs_sync_sb() returning an error, and so unmount unblocks
      and continues to run whilst the log shutdown is still in progress.
      
      Normally this is just fine because the force waiter has nothing to
      do with AIL operations. But in the case of this unmount path, the
      log force waiter goes on to tear down the AIL because the log is now
      shut down and so nothing ever blocks it again from the wait point in
      xfs_log_cover().
      
      Hence it's a race to see who gets to the AIL first - the unmount
      code or xlog_cil_process_committed() killing the superblock buffer.
      
      To fix this, we just have to change the order of processing in
      xlog_state_shutdown_callbacks() to run the callbacks before it wakes
      any task waiting on completion of the iclog.
      Reported-by: Brian Foster <bfoster@redhat.com>
      Fixes: aad7272a ("xfs: separate out log shutdown callback processing")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  6. 29 September 2022, 1 commit
  7. 09 March 2022, 13 commits
  8. 07 January 2022, 3 commits
    • xfs: AIL needs asynchronous CIL forcing · 854c6d59
      Dave Chinner authored
      mainline inclusion
      from mainline-v5.14-rc4
      commit 0020a190
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0020a190cf3eac16995143db41b21b82bacdcbe3
      
      -------------------------------------------------
      
      The AIL pushing is stalling on log forces when it comes across
      pinned items. This is happening on removal workloads where the AIL
      is dominated by stale items that are removed from AIL when the
      checkpoint that marks the items stale is committed to the journal.
      This results in relatively few items in the AIL, but those that
      remain are often pinned, because the directories the items are
      being removed from are still being logged.
      
      As a result, many push cycles through the CIL will first issue a
      blocking log force to unpin the items. This can take some time to
      complete, with tracing regularly showing push delays of half a
      second and sometimes up into the range of several seconds. Sequences
      like this aren't uncommon:
      
      ....
       399.829437:  xfsaild: last lsn 0x11002dd000 count 101 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 270ms delay>
       400.099622:  xfsaild: target 0x11002f3600, prev 0x11002f3600, last lsn 0x0
       400.099623:  xfsaild: first lsn 0x11002f3600
       400.099679:  xfsaild: last lsn 0x1100305000 count 16 stuck 11 flushing 0 tout 50
      <wanted 50ms, got 500ms delay>
       400.589348:  xfsaild: target 0x110032e600, prev 0x11002f3600, last lsn 0x0
       400.589349:  xfsaild: first lsn 0x1100305000
       400.589595:  xfsaild: last lsn 0x110032e600 count 156 stuck 101 flushing 30 tout 50
      <wanted 50ms, got 460ms delay>
       400.950341:  xfsaild: target 0x1100353000, prev 0x110032e600, last lsn 0x0
       400.950343:  xfsaild: first lsn 0x1100317c00
       400.950436:  xfsaild: last lsn 0x110033d200 count 105 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 200ms delay>
       401.142333:  xfsaild: target 0x1100361600, prev 0x1100353000, last lsn 0x0
       401.142334:  xfsaild: first lsn 0x110032e600
       401.142535:  xfsaild: last lsn 0x1100353000 count 122 stuck 101 flushing 8 tout 10
      <wanted 10ms, got 10ms delay>
       401.154323:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x1100353000
       401.154328:  xfsaild: first lsn 0x1100353000
       401.154389:  xfsaild: last lsn 0x1100353000 count 101 stuck 101 flushing 0 tout 20
      <wanted 20ms, got 300ms delay>
       401.451525:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
       401.451526:  xfsaild: first lsn 0x1100353000
       401.451804:  xfsaild: last lsn 0x1100377200 count 170 stuck 22 flushing 122 tout 50
      <wanted 50ms, got 500ms delay>
       401.933581:  xfsaild: target 0x1100361600, prev 0x1100361600, last lsn 0x0
      ....
      
      In each of these cases, every AIL pass saw 101 log items stuck on
      the AIL (pinned) with very few other items being found. Each pass, a
      log force was issued, and the delay between the "last" and "first"
      trace lines is the sleep time plus the sync log force time.
      
      Some of these 101 items pinned the tail of the log. The tail of the
      log does slowly creep forward (first lsn), but the problem is that
      the log is actually out of reservation space because it has been
      running so many transactions that create stale items which never
      reach the AIL but still consume log space. Hence we have a largely
      empty AIL, with long term pins on items that hold the tail of the
      log in place and don't get pushed frequently enough to keep log
      space available.
      
      The problem is the hundreds of milliseconds that we block in the log
      force pushing the CIL out to disk. The AIL should not be stalled
      like this - it needs to run and flush items that are at the tail of
      the log with minimal latency. What we really need to do is trigger a
      log flush, but then not wait for it at all - we've already done our
      waiting for stuff to complete when we backed off prior to the log
      force being issued.
      
      Even if we remove the XFS_LOG_SYNC from the xfs_log_force() call, we
      still do a blocking flush of the CIL and that is what is causing the
      issue. Hence we need a new interface for the CIL to trigger an
      immediate background push of the CIL to get it moving faster but not
      to wait on that to occur. While the CIL is pushing, the AIL can also
      be pushing.
      
      We already have an internal interface to do this -
      xlog_cil_push_now() - but we need a wrapper for it to be used
      externally. xlog_cil_force_seq() can easily be extended to do what
      we need as it already implements the synchronous CIL push via
      xlog_cil_push_now(). Add the necessary flags and "push current
      sequence" semantics to xlog_cil_force_seq() and convert the AIL
      pushing to use it.
      
      One of the complexities here is that the CIL push does not guarantee
      that the commit record for the CIL checkpoint is written to disk.
      The current log force ensures this by submitting the current ACTIVE
      iclog that the commit record was written to. We need the CIL to
      actually write this commit record to disk for an async push to
      ensure that the checkpoint actually makes it to disk and unpins the
      pinned items in the checkpoint on completion. Hence we need to tell
      the CIL push that we are doing an async flush so that, if necessary,
      it can switch out the commit_iclog for submission to disk when the
      commit iclog is finally released.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NAllison Henderson <allison.henderson@oracle.com>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NLihong Kou <koulihong@huawei.com>
      Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      854c6d59
    • D
      xfs: add iclog state trace events · 5dc41107
      Dave Chinner authored
      mainline inclusion
      from mainline-v5.13-rc4
      commit 956f6daa
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=956f6daa84bf50dd5bd13a64b57cae446bca3899
      
      -------------------------------------------------
      
      For the DEBUGS!
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Signed-off-by: NLihong Kou <koulihong@huawei.com>
      Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      5dc41107
    • D
      xfs: set WQ_SYSFS on all workqueues in debug mode · 469a470c
      Darrick J. Wong authored
      mainline inclusion
      from mainline-v5.11-rc4
      commit 05a302a1
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=05a302a17062ca73dc91b508cf2a0b25724db15d
      
      -------------------------------------------------
      
      When CONFIG_XFS_DEBUG=y, set WQ_SYSFS on all workqueues that we create
      so that we (developers) have a means to monitor cpu affinity and whatnot
      for background workers.  In the next patchset we'll expose knobs for
      more of the workqueues publicly and document it, but not now.
      Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NLihong Kou <koulihong@huawei.com>
      Reviewed-by: NZhang Yi <yi.zhang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      469a470c
  9. 27 Dec, 2021 16 commits